Research on a Topic-Focused Crawler for the Inspection and Quarantine Domain and Its System Implementation
Published: 2018-04-17 13:56
Topic: web crawlers + data retrieval. Source: master's thesis, Zhejiang University, 2017
【Abstract】: In recent years, the total volume of global data has grown dramatically, driven by the Internet. The International Data Corporation (IDC) projects that by 2020 the global data volume will reach 40 ZB, growing at roughly 50% per year, with unstructured information such as documents, video, and audio accounting for 90% of all data produced. Against this background, users' demands for the precision and depth of information retrieved from this ocean of data keep rising, especially for specialized queries within professional domains, where the information gathered by general-purpose search engines is heterogeneous and imprecise. For this reason, this thesis takes text classification as its main research object, starts from vertical search engines, investigates in depth the underlying techniques such as data collection and keyword search, and, backed by a real project, implements concrete data-collection and search subsystems for the specific topic domain of inspection and quarantine. The main contributions of this thesis are as follows:
1. It surveys the key techniques used in implementing a crawler system, such as web page denoising, main-text extraction, de-duplication of massive URL and document sets, and NoSQL databases. In addition, to handle the parsing and downloading of dynamic content in web pages, it proposes a JavaScript parsing strategy based on protocol control.
2. It enumerates and discusses crawling strategies based on network topology, page text, and user access behavior; after comparing their strengths and weaknesses, it proposes a crawling strategy based on URL density clustering, which partitions and fetches related pages by grouping them into clusters.
3. Comparing the strengths and weaknesses of traditional text classifiers, it combines Word2vec word embeddings with deep learning and proposes a hierarchical long short-term memory (LSTM) network with an attention mechanism for the text classification task, extracting structured features at both the word and sentence levels to represent the whole text as a feature vector.
4. As part of a subtopic under the "973 Program", it implements a data-collection subsystem and a data-search subsystem for the inspection and quarantine domain; the data collection, cleaning, storage, classification, and indexing services are deployed in a distributed environment of multiple servers, effectively improving computing performance and system stability.
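The abstract mentions de-duplication of massive URL sets among the crawler's key techniques but does not say how it is done. A common approach for this problem is a Bloom filter, which answers "have I seen this URL?" in constant memory; the sketch below is illustrative only (the bit-array size, hash count, and use of MD5 are assumptions, not the thesis's design).

```python
import hashlib

class BloomFilter:
    """Probabilistic set for de-duplicating crawled URLs.

    False positives are possible (a tiny fraction of unseen URLs may be
    reported as seen); false negatives are not, so no URL is fetched twice.
    """

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive num_hashes independent bit positions by salting one hash.
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{url}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))

bf = BloomFilter()
bf.add("http://example.com/page1")
print("http://example.com/page1" in bf)  # True
print("http://example.com/page2" in bf)  # False (no collision expected at this size)
```

In a crawler's frontier loop, a candidate URL is enqueued only if it is not already in the filter, which keeps the seen-set memory bounded even at hundreds of millions of URLs.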
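The abstract names the proposed URL-density-clustering crawl strategy but gives no algorithmic detail. To make the general idea concrete, here is a toy sketch that groups URLs by the similarity of their host and path segments, in the spirit of density-based clustering; the Jaccard feature, the `eps`/`min_pts` thresholds, and the example URLs are all my assumptions for illustration, not the thesis's method.

```python
from urllib.parse import urlparse

def url_tokens(url):
    # Feature set: hostname plus the non-empty path segments of the URL.
    p = urlparse(url)
    return {p.netloc} | {seg for seg in p.path.split("/") if seg}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def density_clusters(urls, eps=0.4, min_pts=2):
    """Greedy density-style grouping: a URL with at least min_pts
    neighbors within similarity eps seeds a cluster; isolated URLs
    are labeled -1 (noise) and can be crawled at lower priority."""
    feats = [url_tokens(u) for u in urls]
    n = len(urls)
    labels = [-1] * n
    cid = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        nbrs = [j for j in range(n) if jaccard(feats[i], feats[j]) >= eps]
        if len(nbrs) < min_pts:
            continue
        for j in nbrs:
            if labels[j] == -1:
                labels[j] = cid
        cid += 1
    return labels

urls = [
    "http://quarantine.example.com/news/2017/a.html",
    "http://quarantine.example.com/news/2017/b.html",
    "http://quarantine.example.com/policy/c.html",
    "http://other.example.org/blog/x",
]
labels = density_clusters(urls)
print(labels)  # [0, 0, -1, -1]: the two news pages form one cluster
```

A topic crawler can then fetch whole clusters whose members match the target topic, which is the partition-and-fetch behavior the abstract describes.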
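The hierarchical attention classifier in contribution 3 builds a document vector in two stages: attention-weighted pooling of word vectors into sentence vectors, then of sentence vectors into one document vector. The sketch below shows only this pooling step in pure Python; the LSTM encoders are omitted, and all embeddings and context vectors are toy values standing in for learned parameters.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_pool(vectors, context):
    # Score each vector against a learned context vector, then return
    # the attention-weighted average of the vectors.
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Toy document: 2 sentences of 3-dimensional word vectors (stand-ins for
# Word2vec embeddings passed through a bidirectional LSTM encoder).
doc = [
    [[0.1, 0.2, 0.3], [0.4, 0.1, 0.0]],   # sentence 1 (two words)
    [[0.0, 0.5, 0.2]],                     # sentence 2 (one word)
]
word_ctx = [1.0, 0.0, 0.0]  # word-level context vector (learned in training)
sent_ctx = [0.0, 1.0, 0.0]  # sentence-level context vector (learned in training)

sent_vecs = [attention_pool(words, word_ctx) for words in doc]
doc_vec = attention_pool(sent_vecs, sent_ctx)
print(len(doc_vec))  # 3: the whole text represented as one feature vector
```

The resulting `doc_vec` is what a final softmax layer would consume to assign the topic label, matching the abstract's description of representing the entire text as a feature vector built from word- and sentence-level structure.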
【Degree-granting institution】: Zhejiang University
【Degree level】: Master
【Year degree granted】: 2017
【CLC numbers】: TP391.1; TP393.092
Article No.: 1763880
Link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1763880.html