
Design and Implementation of a Topic-Oriented Multithreaded Web Crawler

Published: 2018-04-01 11:00

  Topic: web crawler. Focus: topic crawler. Source: Northwest Minzu University, master's thesis, 2017.


[Abstract]: A web crawler is a program that automatically retrieves web page content, usually serving as a key component of a search engine that fetches pages from the Internet. In recent years the rapid growth of the Internet has caused an explosion in online information, and general-purpose crawlers can no longer retrieve the needed information quickly and accurately from this ocean of data. Topic crawlers (also called focused crawlers) emerged in response. A topic crawler uses a page analysis algorithm to filter out URLs unrelated to the topic, keeps only the links that meet the requirements, and then fetches and stores the pages as a resource for later query and retrieval. This thesis first introduces the development of web crawlers and related technologies and analyzes the key techniques of topic crawlers. Focusing on the shortcomings of general-purpose crawlers, it analyzes the working principles and related technologies of a multithreaded topic crawler and presents the workflow and overall design of the topic crawler, including the basic functional architecture, the page-fetching module group, the front-end display module group, the database design, and the overall design of the system interface. Based on an analysis of topic relevance algorithms, page content is handled with the vector space model: each page's content is represented as a vector and a similarity measure is defined on these vectors, so that content similarity can be judged. The thesis uses the content-based Fish-Search algorithm for this purpose. URLs are handled with the link-analysis-based PageRank algorithm, whose results, computed under the quantity assumption and the quality assumption, are used to assess the importance of a page. The thesis combines the two algorithms to evaluate topic relevance, which keeps downloaded pages relevant to the topic, effectively avoids "topic drift", and maintains precision and recall. For multithreading, the thesis uses a Python thread pool, which suits IO-intensive tasks and can effectively improve efficiency.
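The vector space model and thread pool described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the in-memory `FAKE_PAGES` table, the `TOPIC` string, the helper names, and the `threshold` value are all assumptions made for illustration, plain term-frequency vectors stand in for whatever weighting the thesis uses, and neither Fish-Search nor PageRank is modeled here.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> page text.
# A real crawler would download these pages over HTTP.
FAKE_PAGES = {
    "http://example.com/a": "python web crawler threads crawler queue",
    "http://example.com/b": "cooking recipes pasta tomato sauce",
    "http://example.com/c": "focused crawler topic relevance web pages",
}

# Hypothetical topic description used as the reference vector.
TOPIC = "focused web crawler topic"


def term_vector(text):
    """Represent text as a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def fetch_and_score(url, topic_vec):
    """Stand-in for the IO-bound download step, followed by relevance scoring."""
    text = FAKE_PAGES[url]  # assumption: replaces a real HTTP request
    return url, cosine_similarity(term_vector(text), topic_vec)


def crawl(urls, topic, threshold=0.3, workers=4):
    """Score pages concurrently and keep those at or above the threshold."""
    topic_vec = term_vector(topic)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored = list(pool.map(lambda u: fetch_and_score(u, topic_vec), urls))
    return {url: score for url, score in scored if score >= threshold}


relevant = crawl(sorted(FAKE_PAGES), TOPIC)
for url, score in sorted(relevant.items()):
    print(f"{url}: {score:.3f}")
```

In this toy data set the cooking page shares no terms with the topic and is filtered out, while the two crawler pages are kept. A thread pool fits this step because page downloads spend most of their time waiting on the network, so threads can overlap that waiting despite Python's global interpreter lock.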
[Degree-granting institution]: Northwest Minzu University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092; TP391.3


Article ID: 1695247


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1695247.html

