Text-Based Mining of Massive Data in a Distributed Environment

Published: 2018-06-28 15:40

  Topics: big data + data mining; source: Shanghai Jiao Tong University, 2013 master's thesis


[Abstract]: Data mining has long been a research hotspot in computer science. In recent years, with the spread of Web 2.0 applications and the development of cloud computing, the Internet has entered the big data era, and the ways data is generated, transmitted, stored, accessed, and processed have changed markedly. Traditional data mining methods face severe challenges now that data sources are heterogeneous and data volumes are growing rapidly. This thesis proposes a complete text-based data mining method for distributed environments, covering the whole pipeline for massive text data from extraction and preprocessing through data warehouse construction to data mining. The method is validated on the Weibo user recommendation problem, with good results.

Data mining in the broad sense usually comprises two parts: building a data warehouse and performing the mining itself. The data to be mined usually comes at large scale from multiple heterogeneous sources, so for reasons of consistency and access efficiency a unified management system, the data warehouse, is needed to integrate and maintain it. Building a data warehouse involves extracting, transforming, and loading data, i.e. the ETL process. Traditional data warehouse design follows RDBMS principles: the data types and structures of all sources must be reconciled into a unified schema, including table structures and foreign keys. The advantage is that the ACID properties of the data can be guaranteed. In the big data setting, however, data sources are complex and highly heterogeneous and data volumes grow quickly, which poses new challenges to the scalability, flexibility, and efficiency of RDBMS-based data warehouses.

On top of the data warehouse, traditional data mining has developed a fairly mature body of algorithms, typically for classification, clustering, association, and prediction, and its overlap with other disciplines has produced techniques such as machine learning and neural networks. These workloads have distinctive characteristics: data is written once and read frequently, computation is intensive, and updates are rare. Under these conditions the ACID guarantees of RDBMS-based design bring little benefit and instead become a performance constraint.

To address these problems, this thesis proposes a scheme for text-based data warehouse construction and data mining in a distributed environment. First, for warehouse construction, it proposes a method for quickly building a data warehouse in a distributed environment that uses MapReduce to carry out the entire ETL process and replaces the RDBMS with a NoSQL database cluster as the foundation of the warehouse, thereby ensuring scalability and runtime efficiency. Second, borrowing ideas from search engines, it proposes a MongoDB + Lucene + MapReduce data mining solution for text data, which uses parallel access to improve access efficiency for massive text data in a distributed environment and evaluates the information content of text by computing TF-IDF values rather than by traditional lexical and syntactic analysis. Finally, the complete method is applied to a data mining problem with Web 2.0 characteristics, Weibo user recommendation, which verifies its feasibility and yields good results.
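The abstract names TF-IDF computed with MapReduce but does not give the formula or any code. The sketch below is only an illustration of that idea, not the thesis's implementation: it runs the map, shuffle, and reduce steps in a single Python process, and it assumes a whitespace tokenizer, tf = count / document length, and idf = ln(N / df). The function names `map_phase`, `reduce_phase`, and `tfidf`, and the sample corpus, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def map_phase(doc_id, text):
    """Emit (term, (doc_id, term_frequency)) pairs for one document."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter(tokens)
    total = len(tokens)
    for term, count in counts.items():
        yield term, (doc_id, count / total)


def reduce_phase(term, postings, num_docs):
    """Turn all postings of one term into per-document TF-IDF scores."""
    idf = math.log(num_docs / len(postings))  # assumed variant: ln(N / df)
    return {doc_id: tf * idf for doc_id, tf in postings}


def tfidf(corpus):
    """Run map, shuffle, and reduce in-process over {doc_id: text}."""
    grouped = defaultdict(list)  # the "shuffle" step: group postings by term
    for doc_id, text in corpus.items():
        for term, posting in map_phase(doc_id, text):
            grouped[term].append(posting)

    scores = defaultdict(dict)
    for term, postings in grouped.items():
        per_doc = reduce_phase(term, postings, len(corpus))
        for doc_id, score in per_doc.items():
            scores[doc_id][term] = score
    return dict(scores)


# Hypothetical sample data: one short text per user.
corpus = {
    "u1": "big data mining on hadoop",
    "u2": "data warehouse and data mining",
    "u3": "weibo user recommendation",
}
scores = tfidf(corpus)
```

In a real cluster the grouping by term would be done by the framework's shuffle rather than by a dictionary, and Chinese text would need a proper word segmenter in place of `split()`.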
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.13
