Research and Implementation of Web Page Collection Technology Based on Meta Search Engines
[Abstract]: With the rapid growth of the Internet and of the information it carries, government departments and enterprises that are sensitive to online information can no longer rely on manual monitoring alone to follow trends on the Internet. To help users monitor and analyze network information in real time, a large number of Internet information processing platforms have emerged in recent years. Backed by high-performance computers, these platforms collect network information in a timely, accurate, and comprehensive manner and provide users with valuable analysis results. However, existing web page collection technology still falls short in the timeliness, comprehensiveness, and efficiency of data collection; it is also complex to design and difficult to maintain, and therefore consumes considerable manpower and material resources. To overcome these shortcomings, this thesis transfers meta-search technology to Internet information collection systems and proposes a web page collection technology based on meta search engines, referred to as collection meta-search. Experimental results show that the new technology can ensure the timeliness, comprehensiveness, and efficiency of data collection. The main work of this thesis is as follows: 1) Traditional web page collection technology is studied and analyzed in detail, the advantages and disadvantages of various web crawlers in meeting the web page collection needs of Internet information processing platforms are set out, and web page collection based on meta search engines is proposed. 2) To address the problem that the collection scale is too small when an existing meta search engine is used in the collection module, a query expansion technique based on local co-occurrence statistics (LCOOCS) is proposed, which obtains more relevant web pages by increasing the number of queries. 3) LCOOCS requires text analysis of the initial retrieval results, but the results collected by a meta search engine are all HTML page source code; to address this, an automatic main-text extraction algorithm, TextEx, is designed and implemented. 4) A collection meta-search system is designed and implemented. The query syntax and result page structure of six Internet search engines, including Baidu News and Bing Information, are summarized and extracted, and query submission and result download are automated.
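The abstract does not describe how LCOOCS works internally, so the following is only a generic sketch of query expansion by local co-occurrence counting, not the thesis's algorithm. It assumes the initial retrieval results are already plain text (for example after main-text extraction) and that whitespace/punctuation tokenization is adequate. The names `tokenize`, `expand_query`, `k`, and `window` are hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and split on non-word characters (a deliberate simplification)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def expand_query(query, top_docs, k=3, window=5):
    """Return up to k terms that co-occur most often with the query terms.

    top_docs: plain text of the initial (first-pass) retrieval results.
    window:   co-occurrence is counted within this many tokens on either
              side of each query-term occurrence.
    """
    query_terms = set(tokenize(query))
    scores = Counter()
    for doc in top_docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok not in query_terms:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for neighbour in tokens[lo:hi]:
                # Only terms that are not already in the query can expand it.
                if neighbour not in query_terms:
                    scores[neighbour] += 1
    return [term for term, _ in scores.most_common(k)]


if __name__ == "__main__":
    docs = [
        "meta search engine crawler",
        "search engine crawler index",
        "weather today",
    ]
    # Each returned term could be appended to the original query to form
    # an additional query submitted to the underlying search engines.
    print(expand_query("search engine", docs, k=2))
```

This sketch omits stop-word filtering and term weighting, and Chinese text would need a word segmenter in place of the regular-expression split.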
[Degree-granting institution]: Xidian University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092; TP391.3
Article number: 2281433
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2281433.html