Design and Implementation of a Topic Crawler for the Mining Equipment Domain
[Abstract]: With the rapid development of society and Internet technology, people increasingly obtain information through Internet search engines rather than through traditional channels. Faced with the vast amount of information on the web, users have turned to topic search engines, which return accurate and relevant information quickly. A topic search engine serves a specific industry, and the topic crawler is one of its core components: the crawler's efficiency and accuracy directly affect the quality of search results. A high-quality topic crawler can quickly and accurately collect relevant information from the Internet. This thesis analyzes and studies topic crawler technology in order to build a topic crawler system for the mining equipment domain. It introduces the structure and development of search engines, and the search strategies and working principles of web crawlers. It gives a detailed design and description of the keyword-based method for defining the topic, and classifies and studies methods for web page denoising. It also studies and designs the system's key technologies for page information extraction, namely link extraction and content extraction; summarizes the advantages and disadvantages of three word segmentation methods; and, for text similarity calculation, focuses on the vector space model and the PageRank algorithm, where the vector space model involves term weighting and feature selection. The thesis covers the whole process of implementing the topic crawler system for the mining equipment domain. Based on the theoretical study of topic crawlers, the workflow and architecture of the crawler system are designed. According to the system design requirements, the initial URLs are selected and the system database is designed.
To improve the system's accuracy, the classical vector space model algorithm is used for the relevance calculation. The thesis also describes the implementation details and shows the system's interface at run time. Finally, the topic crawler system for the mining equipment domain is implemented.
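The vector space model relevance calculation mentioned in the abstract can be sketched as follows. This is a minimal illustration rather than the thesis's implementation: the function names `tf_idf_vectors` and `cosine_similarity`, the smoothed IDF formula, the sample keywords and pages, and the `THRESHOLD` value are all assumptions made for the example. Pages are assumed to be already segmented into words.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build TF-IDF weight dictionaries for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic keywords and already-segmented page texts.
topic = ["mine", "equipment", "crusher", "conveyor"]
pages = [
    ["crusher", "equipment", "for", "mine", "sites"],
    ["weekend", "football", "results"],
]

vectors = tf_idf_vectors([topic] + pages)
topic_vec, page_vecs = vectors[0], vectors[1:]
scores = [cosine_similarity(topic_vec, v) for v in page_vecs]

THRESHOLD = 0.1  # assumed cut-off, not a value taken from the thesis
relevant = [score >= THRESHOLD for score in scores]
```

In a crawler of this kind, a page whose score falls below the threshold would typically be discarded and its outgoing links given low priority. A page that shares no terms with the topic vector scores exactly zero.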
[Degree-granting institution]: Hebei University of Engineering
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3