Research and Implementation of Web Crawler Performance Improvement and Feature Expansion
[Abstract]: With the rapid development of the Internet, the World Wide Web has become the carrier of vast amounts of information, and extracting and using that information effectively has become a major challenge. The web crawler emerged to meet this demand: a program or script that automatically fetches Web pages according to certain rules. This paper first reviews the history of web crawlers and their fields of application. An analysis of mainstream crawlers shows that today's crawlers mainly serve search engines and prepare data resources for topic-oriented user queries. As search engines have built their own highly extensible crawling architectures, the importance of traditional crawlers to search engines has gradually weakened, and flexibility and rich functionality have instead become a crawler's distinguishing traits. The paper then discusses several metrics for evaluating crawler performance, and introduces optimization strategies for small and medium-sized crawlers from two angles: performance improvement and feature expansion. For performance improvement, several optimizations are proposed, one per functional module: first, request gzip/deflate compressed transfer encoding, cutting network transfer time by reducing the volume of data transmitted; second, download pages with asynchronous requests, raising bandwidth and CPU utilization; third, crawl breadth-first and use a Bloom filter for large-scale URL de-duplication; fourth, extract page links with carefully designed regular expressions; fifth, strictly normalize every crawled URL, reducing the misleading effect of malformed URLs on the crawler; sixth, manage multithreading efficiently with an optimized thread pool. For feature expansion, the paper distinguishes this crawler from traditional ones in the following three respects.
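The third optimization above, breadth-first crawling with a Bloom filter for URL de-duplication, can be sketched as follows. This is a minimal illustration in Python rather than the thesis's C#, and all names (`BloomFilter`, `bfs_crawl`, `fetch_links`) are hypothetical stand-ins, not the thesis's actual implementation:

```python
import hashlib
from collections import deque

class BloomFilter:
    """Minimal Bloom filter: k hash probes over an m-bit array.
    False positives are possible; false negatives are not, so a URL
    is never crawled twice (at the cost of rarely skipping a new one)."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, item):
        # Derive k probe positions from salted MD5 digests.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(item))

def bfs_crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first traversal of the link graph. `fetch_links(url)` is a
    stand-in for downloading a page and extracting its out-links."""
    seen = BloomFilter()
    queue = deque(seed_urls)
    for url in seed_urls:
        seen.add(url)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:   # Bloom-filter membership test
                seen.add(link)
                queue.append(link)
    return visited
```

The point of the Bloom filter here is memory: with hundreds of millions of URLs, a bit array of a few hundred megabytes replaces a hash set that would need tens of gigabytes to store the URL strings themselves.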
First, static page performance analysis, which gives a website performance-improvement advice; second, operation as an automated testing tool that executes test cases against a specified page; third, customizable focused data extraction, which captures data in a format the user specifies. Verification of the above optimization strategies showed that the .NET platform is particularly well suited to lightweight crawlers. The crawler is developed in C# on the .NET platform using Visual Studio 2008; the program runs in command-line mode and is highly configurable through files.
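As a concrete illustration of the fifth performance optimization, strict URL normalization, the sketch below shows common canonicalization rules. It is written in Python rather than the thesis's C#, and since the thesis's exact rules are not given in the abstract, the rules chosen here (lower-casing, default-port and fragment removal, query-parameter sorting) are assumptions for illustration:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url):
    """Canonicalize a URL so that syntactic variants of the same page
    de-duplicate to a single key: lower-case the scheme and host, drop
    default ports and fragments, sort query parameters, and resolve an
    empty path to '/'."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    path = parts.path or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Fragments never reach the server, so they are discarded.
    return urlunsplit((scheme, host, path, query, ""))
```

Without a step like this, `HTTP://Example.COM:80/a?b=2&a=1#x` and `http://example.com/a?a=1&b=2` would be treated as two different URLs and crawled twice, inflating both the frontier and the de-duplication structure.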
【Degree-granting institution】: Jilin University
【Degree level】: Master's
【Year granted】: 2012
【Classification number】: TP391.3