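Research on Topic Crawler Technology Based on Web Page Segmentation

Published: 2018-02-22 19:46

Keywords: web page segmentation; visual information; tag attributes; topic link block; Shark-Search algorithm
Source: Shandong Normal University, 2017 master's thesis
Document type: degree thesis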

[Abstract]: As Web information diversifies and its volume grows ever faster, storage costs rise and information collection becomes increasingly difficult. A general-purpose crawler consumes a large amount of network bandwidth while it runs, which wastes system resources. It also pays little attention to whether the pages it finds match the user's search topic, so it often returns many pages the user is not interested in. Vertical search engines built around topic crawlers emerged to improve crawling efficiency and the user experience. During page fetching, a topic crawler follows a heuristic search strategy: it computes the relevance between each page and the user's search topic, filters out pages unrelated to that topic, and downloads only topic-relevant pages into the queue of pages to visit. Information on the Web is rich and varied. How to obtain and integrate topic content effectively, and how to use a crawler to download topic-relevant pages completely and accurately, are the key technical challenges. Building on existing results in topic crawler research, this thesis studies web page segmentation and candidate-link search strategies in depth. Working on a block layout derived from tag information and visual information, it proposes a candidate-link search algorithm that introduces a topic link block factor. The main work is as follows.

(1) Web page segmentation based on tag attributes and visual information. Pages are segmented using the layout patterns of table and div tags, combined with the visual information in CSS style sheets and style attributes. First, classification rules are drawn up from common web design conventions, and content blocks are divided into three categories: text blocks, link blocks and irrelevant blocks. Topic text blocks are then extracted in two passes: a preliminary filter on tag attribute values, followed by a further filter based on similarity to a reference block, which yields the final qualifying text. Topic link block extraction rules are then used to match topic blocks, filter out noise links and obtain the required topic link blocks. The segmentation method chosen here is easy to implement in practice, avoids large-scale blind matching between blocks, and has low time and space complexity.

(2) While crawling, a topic crawler must first compute the weight of each link in the queue of links to be crawled and then decide the visiting order by weight. Building on the Shark-Search algorithm, this thesis introduces the concept of topic link block weight and proposes an improved search strategy based on topic link blocks to predict the priority of the URLs in a page. The anchor text of all child links in a link block is taken as the main factor in computing link relevance, and the influence of link structure is also taken into account.

(3) To verify the effectiveness of the system, the HITS algorithm, the Shark-Search algorithm and the proposed algorithm are first implemented under different thresholds, and the results of the three are compared. The experimental data show that the proposed system outperforms the other two algorithms across multiple threshold settings. Recall and total information volume under the three algorithms are then compared in detail, and the topic drift rate is analyzed experimentally for semantically clear topics and for abstract-concept topics. The results show that the improved system performs better.
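The link-prioritization idea in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption: the term-count cosine similarity, the simplified Shark-Search score (parent-page relevance plus anchor-text relevance, without anchor context or multi-level inheritance), the definition of the block weight as the mean anchor relevance within a link block, and the parameters `delta`, `gamma` and `alpha`. It is not the thesis's actual algorithm.

```python
import heapq
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a stand-in for real tokenization."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two texts using raw term counts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def block_weight(topic, anchors):
    """Hypothetical topic link block weight: mean anchor relevance in the block."""
    if not anchors:
        return 0.0
    return sum(cosine(topic, a) for a in anchors) / len(anchors)


def link_priority(topic, parent_text, anchor, block_anchors,
                  delta=0.5, gamma=0.5, alpha=0.3):
    """Blend a simplified Shark-Search score with the block weight.

    delta, gamma and alpha are illustrative values, not taken from the thesis.
    """
    inherited = delta * cosine(topic, parent_text)  # relevance passed down from the parent page
    neighbourhood = cosine(topic, anchor)           # relevance of this link's own anchor text
    shark = gamma * inherited + (1 - gamma) * neighbourhood
    return (1 - alpha) * shark + alpha * block_weight(topic, block_anchors)


def build_frontier(topic, parent_text, blocks):
    """Return a heap of (-priority, url) so the highest-priority link pops first.

    blocks: list of link blocks, each a list of (url, anchor_text) pairs.
    """
    frontier = []
    for block in blocks:
        anchors = [a for _, a in block]
        for url, anchor in block:
            score = link_priority(topic, parent_text, anchor, anchors)
            heapq.heappush(frontier, (-score, url))
    return frontier
```

With this blend, a link inside a block whose anchors are mostly on topic is promoted even if its own anchor text is only weakly relevant, while links in navigation or footer blocks sink toward the back of the queue. Setting `alpha` to 0 reduces the score to the simplified Shark-Search term alone.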
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092; TP391.3


Article ID: 1525156


Article link: http://sikaile.net/shoufeilunwen/xixikjs/1525156.html
