Design and Implementation of a Network Information Collection System Based on Cluster Computing

Published: 2018-06-09 12:42

Topic: network information collection + bilingual network information updating. Source: Harbin Institute of Technology, master's thesis, 2012

[Abstract]: With the continuous development of Web information technology, network information collection technology has matured steadily. As the foundation and an important component of many Web information services, it is widely used across natural language processing applications such as search engines and machine translation. Faced with the variety of information resources on the Internet, targeted network information collection systems keep emerging and make it far easier to obtain network information; at the same time, the massive growth of network information poses new challenges for information acquisition. For research on statistical machine translation, machine-aided translation, and translation knowledge acquisition, the task of network information collection is to discover, among massive numbers of Web pages, large websites that contain multilingual parallel page text, collect that parallel text, and build a large-scale multilingual parallel corpus. This is the research goal of this thesis. The thesis studies in depth Hadoop, a distributed computing cluster framework for large-scale data processing, and on that basis designs and implements a configurable, extensible, Web-oriented distributed network information collection system. In addition, it designs and implements an incremental network information update collection system for incrementally re-collecting bilingual parallel web pages.
The thesis first introduces the research background and current state of development of network information collection systems, and surveys the currently popular distributed computing cluster framework Hadoop, examining the design principles and architecture of its subsystem, the Hadoop Distributed File System (HDFS), and of its important parallel computing model, MapReduce. It analyzes key techniques in network information collection, including URL deduplication, task scheduling, and detection of web page updates, and on that basis designs and implements the Web-oriented distributed network information collection system and the incremental update collection system for bilingual websites. Finally, analysis of the experimental results verifies that the Web-oriented distributed system is highly configurable, stable, and highly scalable and can collect large-scale multilingual web pages, and that the incremental update collection system can efficiently re-collect updated pages from bilingual websites, thereby achieving the research goals of the project.
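The abstract names URL deduplication and detection of web page updates as key techniques but does not describe how the thesis implements them. The following is only a minimal single-process sketch of one common approach, under stated assumptions: URLs are canonicalized before being checked against a "seen" set, and a page counts as updated when the hash of its body differs from the stored one. The names `normalize_url`, `Frontier`, and `has_changed` are hypothetical, and the in-memory set and dict stand in for whatever distributed storage a Hadoop-based system would actually use.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # Drop the fragment: it does not change the document the server returns.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class Frontier:
    """Hypothetical crawl frontier that rejects URLs it has already seen."""

    def __init__(self):
        self._seen = set()   # assumption: fits in memory on one machine
        self.queue = []      # URLs still waiting to be fetched

    def add(self, url):
        """Queue the URL and return True, or return False if it is a duplicate."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.queue.append(key)
        return True


def has_changed(stored, url, body):
    """Return True (and record the new hash) if the page is new or its body changed.

    `stored` maps canonical URL -> SHA-256 hex digest of the last fetched body.
    """
    key = normalize_url(url)
    digest = hashlib.sha256(body).hexdigest()
    if stored.get(key) == digest:
        return False
    stored[key] = digest
    return True
```

An exact-hash comparison treats any byte-level difference as an update, so pages with rotating ads or timestamps would always look changed; a real incremental crawler would likely hash only the extracted main text or use a similarity measure instead.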
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification numbers]: TP274.2; TP393.092


Article ID: 1999763


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1999763.html
