Research and Implementation of Web Page Collection Technology Based on Meta-Search Engines

Published: 2018-10-19 14:35

[Abstract]: With the rapid development of the Internet, the volume of online information has grown sharply, and government departments, enterprises, and public institutions that are sensitive to Internet information can no longer rely on manual monitoring alone to follow what is happening online. To help users monitor and analyze online information in real time, a large number of Internet information processing platforms have emerged in recent years. Running on high-performance computers, these platforms collect online information in a timely, accurate, and comprehensive manner and then provide users with valuable analysis results. However, existing web page collection techniques still fall short in the timeliness, comprehensiveness, and effective rate of the data they collect; they are also complex to design, hard to maintain, and costly in manpower and material resources. To overcome these shortcomings, this thesis carries meta-search technology over into Internet information collection systems and proposes a web page collection technique based on meta-search engines, called collection-oriented meta-search. Experimental results show that, compared with existing web page collection techniques, the new technique can ensure the timeliness, comprehensiveness, and effective rate of the collected data. The main work of this thesis is as follows:
1) Traditional web page collection techniques are studied and analyzed in detail, the strengths and weaknesses of various web crawlers in meeting the collection needs of Internet information processing platforms are described, and a web page collection technique based on meta-search engines is proposed.
2) To address the problem that existing meta-search engines collect too few pages when used as a collection module, a query expansion technique based on local co-occurrence statistics (LCOOCS) is proposed, which obtains more relevant web pages by increasing the number of queries.
3) LCOOCS requires text analysis of the initial retrieval results, but what a meta-search engine collects is HTML page source code. To address this, a fully automatic main-text extraction algorithm, TextEx, is designed and implemented.
4) A collection-oriented meta-search system is designed and implemented. The query syntax and result-page structure of six major Internet search engines, including Baidu News and Bing News, are summarized and extracted, and query submission and result download are automated.
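The abstract does not give the scoring formula used by LCOOCS, so the following is only a minimal sketch of the general idea of query expansion by local co-occurrence, under stated assumptions. It assumes the initial retrieval results have already been reduced to plain text and tokenized, counts the terms that appear within a fixed window around any query term, and issues one extra query per top-ranked term. The function name `expand_query`, the `window` and `top_n` parameters, and the raw-count scoring are all illustrative and are not taken from the thesis.

```python
from collections import Counter


def expand_query(query_terms, docs, window=2, top_n=2):
    """Build extra queries from terms that co-occur with the query.

    query_terms: list of tokens in the original query.
    docs: list of token lists, one per initially retrieved document
          (assumed to be extracted main text, already tokenized).
    window: how many tokens on each side of a query term count as
            "co-occurring" (hypothetical parameter).
    top_n: how many expansion terms, and therefore extra queries, to return.
    """
    query_set = set(query_terms)
    scores = Counter()
    for tokens in docs:
        counted = set()  # token positions already counted in this document
        for i, tok in enumerate(tokens):
            if tok not in query_set:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if tokens[j] not in query_set and j not in counted:
                    counted.add(j)
                    scores[tokens[j]] += 1
    # Highest count first; ties broken alphabetically so output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    best = [term for term, _ in ranked[:top_n]]
    # One additional query per expansion term: more queries, more pages.
    return [list(query_terms) + [term] for term in best]


if __name__ == "__main__":
    sample_docs = [
        ["meta", "search", "engine", "crawler", "news"],
        ["web", "crawler", "meta", "search", "index"],
    ]
    for q in expand_query(["meta", "search"], sample_docs):
        print(" ".join(q))
```

A real system would likely filter stop words and weight terms by more than a raw count; the sketch leaves both out to keep the co-occurrence step visible.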
[Degree-granting institution]: Xidian University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092;TP391.3

Article ID: 2281433

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2281433.html
