
Research and Implementation of a Web Text Mining System Based on MapReduce

Published: 2018-02-28 19:14

  Keywords: Web mining; MapReduce; MongoDB; social network analysis; named entity. Source: Beijing University of Posts and Telecommunications, master's thesis, 2013. Type: degree thesis.


[Abstract]: As Internet media has matured, more and more media information is published and transmitted through this fast, inexpensive channel. The volume of information on the network is already enormous and, as Internet applications deepen, it keeps growing at a striking rate. Search engines can help us retrieve reasonably accurate, relevant web pages, but the information obtained is preliminary and broad. It does not reveal the intrinsic relationships within that information or its entity model, so further analysis and processing are still required. One option is to borrow general network analysis methods and apply relationship mining and model analysis to heterogeneous web information once it has been converted into entities, in order to uncover latent, valuable knowledge.

This thesis studies the MongoDB distributed database and the Hadoop distributed computing framework, and designs an efficient analysis scheme for Web news entities based on MongoDB data modeling and Hadoop MapReduce computation. The specific research work includes:

1. Using an XML-parsing-based method, Web news data from Sogou Labs is analyzed as semi-structured data and the relevant information is extracted. Within the MapReduce framework, the text content is segmented into words, keyword weights are computed with the TF-IDF algorithm, and a feature representation of each text is extracted.

2. Building on MongoDB's data model and parallel processing, and combining relational network analysis algorithms, the degree centrality algorithm is used to analyze the centrality of individual entity nodes in the entity relationship network, so as to mine the core of each news topic. Cohesive subgroup analysis is then used to find small groups of closely connected entities and to build a block model among entities.

3. Using the document-oriented non-relational database MongoDB and its strong modeling capability, a data model that describes text features is designed. Combined with Hadoop's MapReduce parallel computing framework and built on a J2EE architecture, a distributed storage and computing platform for Web news is designed and set up, and the analysis results are visualized with JUNG.
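The TF-IDF weighting and degree centrality steps named in the abstract can be illustrated with a small in-memory sketch. This is not the thesis's Hadoop or MongoDB code: the function names (`map_term_counts`, `reduce_sum`, `tf_idf`, `degree_centrality`) are hypothetical, the map and reduce phases are simulated with plain Python generators, documents are assumed to be already segmented into tokens, and the TF-IDF variant (relative term frequency times the natural log of N over document frequency) is one common choice assumed here.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def map_term_counts(doc_id, tokens):
    """Map phase (simulated): emit ((term, doc_id), 1) for every token."""
    for term in tokens:
        yield (term, doc_id), 1


def reduce_sum(pairs):
    """Reduce phase (simulated): sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def tf_idf(docs):
    """docs maps doc_id -> list of tokens; returns {(term, doc_id): weight}."""
    emitted = (pair for doc_id, tokens in docs.items()
               for pair in map_term_counts(doc_id, tokens))
    term_counts = reduce_sum(emitted)
    # Document frequency: one count per distinct (term, doc_id) key.
    doc_freq = reduce_sum((term, 1) for (term, _doc_id) in term_counts)
    n_docs = len(docs)
    return {
        (term, doc_id): (count / len(docs[doc_id]))
        * math.log(n_docs / doc_freq[term])
        for (term, doc_id), count in term_counts.items()
    }


def degree_centrality(entity_sets):
    """Normalized degree centrality on an entity co-occurrence graph.

    Two entities are linked if they appear in the same document.
    Entities that never co-occur with another entity are omitted.
    """
    neighbours = defaultdict(set)
    for entities in entity_sets:
        for a, b in combinations(sorted(entities), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {entity: len(adj) / (n - 1) for entity, adj in neighbours.items()}
```

Calling `tf_idf({"d1": ["a", "b", "a"], "d2": ["b", "c"]})` gives a weight of zero to `"b"`, which appears in every document, and a positive weight to `"a"`. In a real Hadoop job each simulated phase would be a separate mapper or reducer class, and the co-occurrence edges would come from named entities extracted per news article.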
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1

