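Design and Analysis of a Topic-Focused Web Crawler Based on a Dynamic Concept Graph
Topics: topic-focused web crawler + web page segmentation; Source: University of Science and Technology Liaoning, 2013 master's thesis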
[Abstract]: The arrival of the network information age has made the amount of information on the web grow exponentially, so extracting useful information from web pages efficiently has become an important problem in web information retrieval. General-purpose search engines struggle with the sheer volume and explosive growth of Internet data. At the same time, users demand more comprehensive and more frequently updated data, and they are interested not just in a single keyword but in a whole topic or domain. This has led to the emergence of topic-focused web crawlers. A topic crawler is the foundation and a key component of a topic search engine. Its design goal is to collect as many pages related to a specific topic as possible while discarding as many unrelated pages as possible, using network bandwidth effectively and saving storage space, and thereby improving crawl efficiency and topic coverage. Starting from the characteristics of topic crawlers, this thesis studies them in detail. The main work is as follows:
1. Based on two basic features of web pages, an algorithm is proposed that segments a page into blocks directly from detected separators. The concept of relative-position layout is used to represent the relative positions of blocks when the heights of some blocks are unknown. Whether a node may be split further is decided jointly by a limit on the total number of blocks and by the node's character length, width, and height. Uniformity is exploited first when segmenting, which greatly improves segmentation efficiency, and a node feature sequence tree avoids repeatedly extracting the same information from one node. The algorithm is top-down and highly efficient.
2. Three observations about websites are put forward, and several conclusions are drawn from them. Content pages are identified by combining page segmentation with the uniformity of page style. The concept of an algorithm server is proposed based on website stability. Based on the similarity in how the same topic is classified and categorized, a topic crawler based on a dynamic concept weighted directed graph is proposed, and the framework of the concept graph is given.
3. Topic relevance is computed by a weighted evaluation that combines multiple factors. The concept of a layer is introduced to represent how far a keyword is from the topic, keywords are divided more finely when layer weights are computed, and prediction nodes derived from the concept graph are incorporated into topic relevance prediction.
4. The node structure of the concept graph is given, and a dynamic update method for the concept graph is derived from it. To keep the topic extensible while avoiding topic drift, the concept of dedicated terms is proposed, and corresponding expansion methods are given for two different modes of topic expansion.
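The abstract gives no formulas or data structures, so the following is only a minimal sketch of how layer-weighted topic relevance and a dynamic update of a weighted directed concept graph might fit together. Every name here (`ConceptGraph`, `layer_weight`, `reinforce`), the exponential `decay` rule, and the fixed `step` increment are assumptions made for illustration and are not taken from the thesis.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ConceptGraph:
    """Hypothetical weighted directed graph of keywords; layer 0 is the topic."""

    layer: dict[str, int] = field(default_factory=dict)
    edges: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_keyword(self, word: str, parent: str | None = None,
                    weight: float = 1.0) -> None:
        # A keyword sits one layer below its parent; a root keyword is layer 0.
        if parent is None:
            self.layer[word] = 0
        else:
            self.layer[word] = self.layer[parent] + 1
            self.edges[(parent, word)] = weight

    def layer_weight(self, word: str, decay: float = 0.5) -> float:
        # Assumed rule: the weight shrinks by `decay` per layer from the topic.
        return decay ** self.layer[word]

    def relevance(self, tokens: list[str]) -> float:
        # Weighted sum of keyword hits, normalized by page length.
        if not tokens:
            return 0.0
        counts = Counter(t for t in tokens if t in self.layer)
        score = sum(n * self.layer_weight(w) for w, n in counts.items())
        return score / len(tokens)

    def reinforce(self, tokens: list[str], step: float = 0.1) -> None:
        # Assumed dynamic update: strengthen every edge whose two endpoints
        # both occur on a page that was judged relevant.
        present = set(tokens)
        for src, dst in self.edges:
            if src in present and dst in present:
                self.edges[(src, dst)] += step


graph = ConceptGraph()
graph.add_keyword("crawler")
graph.add_keyword("spider", parent="crawler")
graph.add_keyword("frontier", parent="spider")

page = ["a", "crawler", "keeps", "a", "frontier"]
score = graph.relevance(page)
```

In this sketch a crawler would compare `score` against a threshold of its own choosing before following a page's links, and would call `reinforce` only on pages that pass, so that edge weights drift toward keyword pairs that actually co-occur on relevant pages.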
[Degree-granting institution]: University of Science and Technology Liaoning
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3
Article ID: 2021479
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2021479.html