Design and Implementation of an Information Extraction and Statistics System for Crawler Log Data
Published: 2018-05-26 01:06
Topic: information extraction + statistics on crawler indicator data. Source: Beijing University of Posts and Telecommunications, 2012 master's thesis
【Abstract】: With the explosive growth of online information, people rely more and more heavily on search engines. The crawler is an indispensable component of a search engine, and the quality of the pages it fetches directly determines the quality of the engine's search results. Even if retrieval, indexing, and related work are done perfectly, the user experience is still poor if most of what the crawler collects is junk pages. Crawl scheduling and fetching strategies therefore need to be adjusted according to the measured crawl quality. How, then, can the quality and effectiveness of the crawler's page fetching be evaluated? That is the problem the crawler log data information extraction and statistics system of this thesis sets out to solve. The work of this thesis is as follows:
1. The crawler writes logs during seed merging/scheduling and page downloading, and these log files are scattered across every node of the crawler deployment cluster. This thesis collects the crawler log data from each node; merges, archives, and compresses it; uploads the compressed files to the distributed file system HDFS; and finally builds index files for the compressed files.
2. For a distributed crawler cluster that downloads between 800 million and over a billion URLs per day, the daily crawler logs amount to at least several hundred GB, and the compressed files uploaded to HDFS each day come to roughly 150 GB. A single machine cannot process data at this scale, so this thesis takes information extraction techniques as the technical foundation and Hadoop as the computing platform: Hive structures the crawler log data, HQL statements translate the statistical indicators the crawler team cares about into jobs submitted to the Hadoop cluster, and the indicator results computed by MapReduce are finally imported into a MySQL database.
3. Finally, this thesis uses CI (CodeIgniter), a lightweight PHP framework, to display the crawler indicator data imported into MySQL as web pages and to send report emails.
Experimental results show that, with crawler logs as the data source and Hadoop/Hive as the massive-data processing platform, the system completes the extraction of the effective information within the available time and provides reliable data support for adjusting the crawler's strategy.
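The indicator computation in step 2 (grouping fetch records and counting per-host statistics before loading the results into MySQL) can be illustrated with a small standalone sketch. The log line format, field layout, and the `download_stats` helper below are illustrative assumptions, not the thesis's actual log schema; in the real system the same GROUP BY-style aggregation would be expressed in HQL and executed as a MapReduce job on the Hadoop cluster.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical crawler-log lines: "timestamp status url bytes".
# The thesis abstract does not specify the real format; this is
# only an assumed layout for the sake of the example.
LOG_LINES = [
    "2012-03-01T10:00:01 200 http://example.com/a.html 5120",
    "2012-03-01T10:00:02 404 http://example.com/b.html 0",
    "2012-03-01T10:00:03 200 http://example.org/c.html 2048",
]

def download_stats(lines):
    """Aggregate per-host fetch counts and success counts —
    the kind of indicator an HQL `GROUP BY host` would compute."""
    fetched, succeeded = Counter(), Counter()
    for line in lines:
        _, status, url, _ = line.split()
        host = urlparse(url).netloc
        fetched[host] += 1
        if status == "200":          # count successful downloads
            succeeded[host] += 1
    return fetched, succeeded

fetched, succeeded = download_stats(LOG_LINES)
print(dict(fetched))     # {'example.com': 2, 'example.org': 1}
print(dict(succeeded))   # {'example.com': 1, 'example.org': 1}
```

At cluster scale, Hive would scan the HDFS-resident compressed logs and emit the same per-host counts as MapReduce output, which the system then bulk-loads into MySQL for the CodeIgniter front end to display.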
【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Master
【Year degree conferred】: 2012
【CLC number】: TP391.3