
Spark-Based Social Topic Analysis and Its Application

Published: 2018-01-06 08:31

  Source: University of Electronic Science and Technology of China, master's thesis, 2016. Document type: degree thesis


  Keywords: natural language processing; topic model; Spark; LDA; large-scale data computing


[Abstract]: Natural language processing (NLP) is regarded as one of the key technologies of the big data era, and text analysis of user-generated content on the Internet in particular holds great commercial value. Topic models are a class of unsupervised text processing methods whose development has progressed from the LSI model to the pLSI model and then to the LDA model. Although LDA is widely used for topic mining in practice, its efficiency drops markedly as the data grows, and it becomes hard to balance effective data coverage against execution efficiency. With the development of distributed systems, large-scale data computing has come into wide use. The Spark platform, which has emerged over the last two years, is popular for large-scale machine learning because of its in-memory computing: keeping intermediate results in cache suits the repeated iterations of machine learning models. This lays a foundation for addressing the low efficiency of large-scale topic mining. However, each step of Gibbs sampling in the LDA model depends strongly on the results of the other steps. If the data is simply split into blocks and processed in parallel, concurrent modification of the same statistics directly breaks the consistency of the variables, while updating the variables asynchronously defeats the purpose of parallelization. Algorithms that depend strongly on the state of every step are therefore hard to parallelize, which is also the main reason why the machine learning library MLlib still offers so few algorithms on the rapidly developing Spark platform. Parallelizing the LDA model is accordingly quite difficult.
To solve these problems, this thesis exploits the LDA assumption that documents and words are independently distributed, together with the dependent-update behavior of the variables in Gibbs sampling, and proposes a solution that reduces the impact of inconsistency during parallelization and markedly improves the efficiency of the LDA model. The solution comprises: (1) a method for restructuring the original dataset; (2) a method for dividing the execution process into stages; (3) a strategy for computation within a stage and variable synchronization between stages. Concretely, the dataset is split into blocks according to the chosen degree of parallelism P and the constructed vocabulary, and the blocks are assigned to P stages of the computation such that each stage selects the P blocks with the least mutual dependence. Sampling runs in parallel within a stage, and the variables are synchronized between stages. The computation repeats until the model converges, yielding the topic distributions. This work addresses the theoretical bottleneck of parallelizing the LDA model, greatly reduces variable inconsistency between data blocks in parallel computation, and provides a theoretical basis for parallel LDA. The method also suggests an approach for parallelizing similar algorithms that depend strongly on the state of every step.
In addition, the thesis implements the parallel LDA topic model on the Spark platform. On this basis, taking the characteristics of Sina Weibo text into account, it applies several processing steps to improve model quality: aggregating each user's posts into one long text, cleaning reposted content, and filtering uninformative words with TF-IDF. The result is an efficient social topic analysis system whose performance is substantially better than topic analysis with the standard LDA model, and which enterprises can use for efficient topic mining of Weibo social data. The system can further be generalized to analyze data from other social platforms. Its topic output can also provide data support in brand marketing scenarios.
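The staged block scheme described in the abstract can be illustrated with a small sketch. The abstract does not state how the P least-dependent blocks are chosen for each stage, so the sketch assumes a common diagonal schedule: documents and vocabulary words are each split into P groups, and stage s processes the blocks (i, (i + s) mod P). Blocks in the same stage then share no documents and no words, so their document-topic and word-topic counts do not collide. The function names and the modulo-based grouping are hypothetical, chosen only for illustration, and the sketch uses plain Python rather than Spark.

```python
from collections import defaultdict


def block_of(doc_id, word_id, p):
    """Map a token to its (document group, word group) block.

    Assumption: groups are formed by simple modulo; a real system
    might balance block sizes instead.
    """
    return doc_id % p, word_id % p


def stage_schedule(p):
    """Return p stages, each a list of p blocks.

    Stage s holds blocks (i, (i + s) % p), so no two blocks in a
    stage share a document group or a word group.
    """
    return [[(i, (i + s) % p) for i in range(p)] for s in range(p)]


def partition(tokens, p):
    """Group (doc_id, word_id) tokens by the block they fall into."""
    blocks = defaultdict(list)
    for doc_id, word_id in tokens:
        blocks[block_of(doc_id, word_id, p)].append((doc_id, word_id))
    return blocks


if __name__ == "__main__":
    # Toy corpus: 6 documents, each containing words 0..8 once.
    tokens = [(d, w) for d in range(6) for w in range(9)]
    p = 3
    blocks = partition(tokens, p)
    for s, stage in enumerate(stage_schedule(p)):
        sizes = [len(blocks[b]) for b in stage]
        # Blocks within one stage could be sampled in parallel;
        # shared counts would be synchronized before the next stage.
        print(f"stage {s}: blocks {stage}, token counts {sizes}")
```

Under this assumed schedule, only the global per-topic totals remain shared inside a stage, which is why a synchronization step between stages is still needed.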
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1

