Design and Implementation of a Distributed Crawling System for Heterogeneous Academic Resources
Published: 2019-04-02 20:22
[Abstract]: With the rapid expansion of academic information and the rapid development of Internet technology, academic resources on the web have in recent years become large in scale, fast-growing, and inconsistent in source and organizational structure, which makes them difficult to acquire. At the same time, our project group has long carried out information mining on academic resources on the Internet, using the mined information for academic modeling and academic recommendation, and this work places higher demands on acquiring massive, timely, and valid academic resource data. It is therefore particularly important to crawl academic resources quickly and efficiently from different academic resource search websites, extract the useful information, and build a unified academic resource database. The main work of this thesis includes studying web crawler technology, the working principles of distributed computing, web page parsing methods, and mass data storage technology. On this basis, and building on the distributed crawling framework Nutch, the thesis designs and implements a distributed crawling system for heterogeneous academic resources. This covers the design and implementation of parsing and storage for heterogeneous academic resource web pages; the overall structure, physical architecture, and storage structure of the Nutch-based distributed crawling system; and the methods and scheme for extending Nutch. The system is then implemented in detail and tested according to this design. The system has been deployed and is in use in a laboratory environment.
By building the distributed crawling system for heterogeneous academic resources on Nutch and Hadoop, the thesis addresses the slow crawling speed and poor scalability of single-machine crawling, increases the speed of academic resource collection, enlarges the scale of collection, and supplies academic data for the mining and study of academic resources.
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.52
Article ID: 2452888
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2452888.html