Design and Implementation of a Crawler Log Data Information Extraction and Statistics System
Topic: information extraction + crawler metric data statistics; Source: Master's thesis, Beijing University of Posts and Telecommunications, 2012
【Abstract】: With the explosive growth of information on the Web, people rely increasingly on search engines. The crawler is an indispensable part of a search engine, and the quality of the pages it fetches directly determines the quality of the search results. Even if retrieval, indexing, and the other components work perfectly, the user experience will still suffer if most of what the crawler collects is junk pages. The crawler's scheduling and fetching strategies therefore have to be adjusted according to the actual crawling results. How, then, can the quality and effectiveness of the crawled pages be evaluated? That is the problem addressed by the crawler log data information extraction and statistics system presented in this thesis.

The work of this thesis is as follows:

1. The crawler writes logs during seed merging and scheduling and during page downloading, and these log files are spread across every node of the crawler deployment cluster. This thesis collects the crawler log data from each node, merges, archives, and compresses it, uploads the compressed files to the distributed file system HDFS, and finally builds index files for the compressed files.

2. For a distributed crawler cluster that downloads between 800 million and over one billion URLs per day, the daily crawler logs amount to at least several hundred GB, and the compressed files uploaded to HDFS each day are around 150 GB. A single machine cannot process data at this scale, so this thesis takes information extraction as its technical foundation, uses Hadoop as the computing platform, and uses Hive to structure the crawler log data. HQL statements turn the statistics the crawler team cares about into jobs submitted to the Hadoop cluster, and the metric results computed by MapReduce are finally imported into a MySQL database (see the illustrative sketch after this abstract).

3. Finally, this thesis uses CI (CodeIgniter), a lightweight PHP framework, to display the crawler metric data imported into MySQL on web pages and to send report emails.

The experimental results show that, with crawler log data as the data source and Hadoop and Hive as the massive-data processing platform, the system can extract the useful information within a limited time and provides reliable data support for adjusting the crawler's strategies.
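As an illustration of the Hive-based processing described in item 2, the sketch below shows the kind of HiveQL such a pipeline might use: an external table declared over the compressed log files on HDFS, and an aggregation query that Hive compiles into MapReduce jobs on the Hadoop cluster. The table name, columns, log format, and paths are hypothetical assumptions made for illustration, not details taken from the thesis.

-- Hypothetical external table over the compressed crawler download logs on HDFS.
-- Name, columns, delimiter, and location are illustrative assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS crawler_download_log (
    node_id     STRING,
    url         STRING,
    http_status INT,
    bytes       BIGINT,
    fetch_time  STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/crawler/logs/download';

-- Register one day's uploaded files as a partition; Hive can read
-- gzip-compressed text files in a TEXTFILE table directly.
ALTER TABLE crawler_download_log
    ADD IF NOT EXISTS PARTITION (dt = '2012-03-01')
    LOCATION '/crawler/logs/download/dt=2012-03-01';

-- A daily metric query: per-node download volume and success counts.
-- Hive compiles the statement into MapReduce jobs on the cluster; the small
-- result set is what would then be imported into MySQL for display and reports.
SELECT
    node_id,
    COUNT(*)                                           AS total_urls,
    SUM(CASE WHEN http_status = 200 THEN 1 ELSE 0 END) AS ok_urls,
    SUM(bytes)                                         AS total_bytes
FROM crawler_download_log
WHERE dt = '2012-03-01'
GROUP BY node_id;

Keeping the heavy aggregation inside the Hadoop cluster means only small summary rows ever reach MySQL. The thesis does not say which tool performs that export, so any particular loader (for example Sqoop or a custom script) would be a further assumption.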
【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master's
【Year of degree award】: 2012
【CLC number】: TP391.3