Design and Implementation of Large-Scale Chinese Website Clustering Based on Hadoop

Published: 2018-11-04 19:11

[Abstract]: Text clustering is an important research topic in data mining. It is widely used in statistics, finance, biology, medicine, information retrieval, and document classification; other popular applications include website navigation bars, paper similarity detection, and user recommendation. With the rapid spread of the Internet, the number of Chinese websites has grown enormously, and people obtain more and more information from web pages. Because different people have different needs and standards, the data are diverse and subject to varying quality requirements. How to mine the required information from web pages quickly and efficiently has therefore become a major challenge, and text clustering offers a good way to address it. Precisely because the data are massive and diverse, traditional clustering analysis often fails to reach acceptable time and space performance when clustering text. With the rise of cloud computing, more and more researchers have studied and applied distributed parallel frameworks for clustering. Hadoop is a distributed system infrastructure developed by the Apache Software Foundation, with two core components: HDFS and MapReduce. HDFS provides storage for massive data, while MapReduce performs the computation on that data in parallel. This thesis designs a clustering analysis system for Chinese websites on the Hadoop platform. Its main work is as follows. 1. It introduces the ideas behind commonly used classical clustering algorithms and the related theory, and describes in detail the full text clustering workflow and common similarity measures. 2. It examines the two core frameworks and key technologies of the Hadoop platform, explains how they relate to each other and how they operate, and sets out their advantages over clustering experiments in a traditional single-machine environment. 3. It sets up a distributed Hadoop environment, configures the Eclipse development tool, and uses the k-means clustering algorithm in a program that is tested on page data from Chinese websites; the experiment successfully partitions all of the pages into clusters. Analysis of the experimental results demonstrates Hadoop's strong computing power on large-scale data and shows that, within a certain range, computing power increases as nodes are added to the cluster.
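
The k-means step named in item 3 fits the MapReduce model naturally: a map phase assigns each page vector to its nearest centroid, and a reduce phase averages the vectors assigned to each centroid. The abstract does not describe the thesis's actual implementation, so the following is only a minimal single-process sketch in Python. It assumes that pages have already been converted to term-weight vectors and that cosine similarity is the similarity measure. All function names are hypothetical, and no Hadoop API is used; the dictionary grouping merely stands in for the shuffle phase.

```python
from collections import defaultdict
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def kmeans_map(vector, centroids):
    """Map step (hypothetical): emit (index of most similar centroid, vector)."""
    best = max(range(len(centroids)),
               key=lambda i: cosine_similarity(vector, centroids[i]))
    return best, vector


def kmeans_reduce(vectors):
    """Reduce step (hypothetical): average the vectors assigned to one centroid."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_iteration(vectors, centroids):
    """One MapReduce-style round: map, group by key, reduce."""
    groups = defaultdict(list)
    for v in vectors:
        key, value = kmeans_map(v, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    # A centroid that received no vectors keeps its previous position.
    return [kmeans_reduce(groups[i]) if i in groups else list(centroids[i])
            for i in range(len(centroids))]


def kmeans(vectors, centroids, max_rounds=20, tol=1e-9):
    """Repeat rounds until no centroid coordinate moves by tol or more."""
    for _ in range(max_rounds):
        updated = kmeans_iteration(vectors, centroids)
        shift = max(abs(a - b)
                    for old, new in zip(centroids, updated)
                    for a, b in zip(old, new))
        centroids = updated
        if shift < tol:
            break
    return centroids
```

On a real cluster each round would typically run as a separate job, with the current centroids distributed to every mapper; calling `kmeans(pages, initial_centroids)` on a small in-memory list of vectors is enough to see how the assignments and centroids evolve from round to round.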
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13;TP393.092
