Research and Implementation of Web Vertical Information Retrieval Technology and Algorithms

Published: 2018-06-20 07:33

Topics: vertical information retrieval + Chinese word segmentation. Source: Guangdong Polytechnic Normal University, 2017 master's thesis

[Abstract]: As computer hardware has advanced, the Internet has grown at an unprecedented pace. In the current era of data explosion, vast amounts of information pervade society, giving rise to big data and related computing technologies. In this setting, an information retrieval system helps people find the data they need accurately. Such a system takes a user's search keywords or strategy, uses crawler technology to fetch relevant data from the Internet, processes the fetched data with Chinese word segmentation, web page deduplication, ranking optimization, and related techniques, and finally presents the information the user requested. Baidu and 360 in China and Google and Yahoo abroad are the most representative examples. Although all of them focus on search, each has its own characteristics, and they have become indispensable tools in daily life. Because these general-purpose engines cover a very wide scope and a large volume of information, searching within a specific domain can still be difficult. To meet the need for specialized retrieval within a specific domain, the concept of a vertical information retrieval system was introduced: an information retrieval system developed for one professional domain, such as document, travel, or shopping vertical information retrieval systems.
This project studies a news vertical information retrieval system and optimizes existing techniques. First, Heritrix is extended through secondary development so that the optimized crawler fetches web resources more efficiently. Next, the fetched pages are converted from HTML to plain TXT text with HTMLParser, and an optimized segmenter based on IK Analyzer segments the text and filters dirty data out of it. The TF-IDF weighting algorithm is then improved to effectively remove duplicated parts of web pages. Finally, with Struts + Spring + Hibernate as the architecture and MySQL as the storage database, the PageRank algorithm is used to improve Lucene's ranking algorithm, indexes are created and queried, and the news vertical information retrieval system is implemented.
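The deduplication step can be illustrated with a minimal sketch. The abstract does not specify the thesis's improved TF-IDF weighting, so the code below substitutes plain TF-IDF with a smoothed IDF and cosine similarity. The function names (`tfidf_vectors`, `cosine`, `deduplicate`) and the 0.9 threshold are assumptions made for illustration, and the input is assumed to be text already split into tokens by a segmenter.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector ({term: weight}) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        vec = {}
        for term, count in tf.items():
            # Smoothed IDF (an assumption, not the thesis's variant): keeps
            # the weight positive even for terms found in every document.
            idf = math.log((1 + n) / (1 + df[term])) + 1.0
            vec[term] = (count / total) * idf
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(docs, threshold=0.9):
    """Return indices of documents kept after dropping near-duplicates.

    A document is dropped when its similarity to any already-kept
    document reaches the (hypothetical) threshold.
    """
    vectors = tfidf_vectors(docs)
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Given three token lists in which the second repeats the first and the third shares no terms with either, `deduplicate` keeps indices 0 and 2. The pairwise comparison is quadratic in the number of documents, which is acceptable for a sketch but would need an index or hashing scheme at crawl scale.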
[Degree-granting institution]: Guangdong Polytechnic Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.3



Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2043496.html
