Design and Implementation of a Topical Crawler for the Mining Equipment Domain

Published: 2018-10-18 08:45

[Abstract]: With the rapid development of society and Internet technology, people increasingly obtain information through Internet search engines rather than traditional channels. Faced with a vast volume of online information, users have turned their attention to topical search engines, which can quickly retrieve accurate, useful information on a specific subject. A topical search engine serves one particular industry, and the topical crawler is one of its key components: the crawler's efficiency and the accuracy of the information it collects directly affect the quality of the search results. A good topical crawler can fetch relevant information from the Internet quickly and accurately. This thesis takes the topical crawler as its subject, analyzes and studies the related technologies, and aims to build a topical crawler system for the mining equipment domain. It introduces the structure, principles, and development of search engines, as well as the search strategies and working principles of web crawlers. Following the crawler's workflow, it then studies the key techniques of topical crawling: it gives a detailed design for a keyword-based topic representation; classifies and examines methods for web page denoising and web page deduplication; studies and designs link extraction and content extraction, the key steps of page information extraction in the system; summarizes the advantages and disadvantages of three word segmentation methods; and, among methods for computing text similarity, focuses on the vector space model and the PageRank algorithm, where the vector space model involves weight calculation and feature selection. The thesis covers the full process of implementing the system: based on the theory studied, it designs the system's workflow and architecture, selects the initial URLs according to the design requirements, and designs the system's database. The classical vector space model is used for relevance computation in order to improve the system's precision. The implementation part describes the relevant implementation details and shows the system's interfaces at run time. In the end, a topical crawler system for the mining equipment domain was implemented.
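The abstract names the vector space model, with weight calculation, as the basis for relevance computation but gives no formulas or code. As a rough illustration only, the sketch below scores tokenized pages against a keyword-based topic description using TF-IDF weights and cosine similarity. The function names, the smoothed IDF variant, the sample tokens, and the `THRESHOLD` value are all assumptions made for this example and are not taken from the thesis; a real system for Chinese pages would also need a word segmentation step before this one.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumed variant) stays positive for every term.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical topic keywords and already-tokenized page texts.
topic = ["mining", "equipment", "crusher", "conveyor"]
pages = [
    ["jaw", "crusher", "for", "mining", "equipment", "suppliers"],
    ["holiday", "travel", "deals", "and", "hotel", "booking"],
]

vectors = tf_idf_vectors([topic] + pages)
scores = [cosine(vectors[0], page_vec) for page_vec in vectors[1:]]

THRESHOLD = 0.2  # hypothetical cut-off; a real crawler would tune this
relevant = [score >= THRESHOLD for score in scores]
```

In a crawler of the kind the abstract describes, a score like this could decide whether a fetched page is stored and whether its outgoing links are queued; how the thesis actually combines it with PageRank is not stated in the abstract.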
[Degree-granting institution]: Hebei University of Engineering
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3

Article ID: 2278602


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2278602.html
