Research and System Implementation of a Topic-Focused Crawler for the Inspection and Quarantine Domain

Published: 2018-04-17 13:56

  Topics: web crawler + data retrieval. Source: Zhejiang University, 2017 master's thesis


[Abstract]: In recent years the Internet has driven a sharp increase in the total volume of information worldwide. International Data Corporation (IDC) projects that, growing at about 50% per year, the global data volume will reach 40 ZB by 2020, with unstructured information such as documents, video, and audio accounting for 90% of the data produced. Against this background, users demand ever more precise and in-depth information from this ocean of data. For specialized queries within a professional domain in particular, the information gathered by general-purpose search engines is mixed and imprecise. This thesis therefore takes text classification as its main research object. Starting from vertical search engines, it examines in depth the techniques involved, including data collection and keyword search, and, building on a real project, implements concrete data collection and search subsystems for the specific topic domain of inspection and quarantine. The main contributions are as follows:
1. It surveys the key techniques used in implementing a crawler system, such as web page denoising, main-text extraction, deduplication of massive numbers of URLs and documents, and NoSQL databases. In addition, to handle the parsing and downloading of dynamic content in web pages, it proposes a JavaScript parsing strategy based on protocol control.
2. It lists and discusses crawling strategies based on network topology, on page text, and on user access behavior. After comparing their strengths and weaknesses, it proposes a crawling strategy based on URL density clustering, which partitions and crawls related pages by grouping them into clusters.
3. After comparing the strengths and weaknesses of traditional text classifiers, it combines Word2vec word vectors with deep learning and proposes a hierarchical long short-term classification network based on the Attention mechanism for text classification. The network extracts structured features at the word level and at the sentence level in order to represent the whole text as a feature vector.
4. In connection with a sub-project of the "973 Program", it implements a data collection subsystem and a data search subsystem for the inspection and quarantine domain. The collection, cleaning, storage, classification, and indexing services are deployed in a distributed environment made up of multiple servers, which effectively improves computing performance and system stability.
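
The abstract names deduplication of massive URL sets as one of the crawler's key techniques but does not say how the thesis does it. The following is a minimal sketch only, assuming a simple approach: normalize each URL to a canonical form, then keep a set of fixed-size hash digests. The names `normalize_url` and `SeenUrls` and the choice of SHA-1 are illustrative assumptions, not taken from the thesis.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url):
    """Return a canonical form so trivially different URLs compare equal.

    Assumption for this sketch: user-info (user:password@) is dropped,
    and query parameter order is treated as insignificant.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    default_ports = {"http": 80, "https": 443}
    if parts.port is not None and parts.port != default_ports.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


class SeenUrls:
    """Exact-match URL filter backed by a set of fixed-size digests."""

    def __init__(self):
        self._digests = set()

    def add_if_new(self, url):
        """Record the URL; return True if it had not been seen before."""
        digest = hashlib.sha1(normalize_url(url).encode("utf-8")).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


if __name__ == "__main__":
    seen = SeenUrls()
    for candidate in (
        "http://example.com/a?a=1&b=2",
        "HTTP://Example.com:80/a?b=2&a=1#frag",
        "http://example.com/b",
    ):
        print(candidate, seen.add_if_new(candidate))
```

Storing digests instead of full strings keeps memory per URL constant. At a much larger scale, a probabilistic structure such as a Bloom filter is a common substitute, at the cost of occasional false positives.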
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1;TP393.092

Article ID: 1763880


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1763880.html

