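Research and Implementation of a Weibo User Interest Recognition System Based on Labeled LDA
Topic of this thesis: text classification + interest recognition; Reference: Beijing Jiaotong University, master's thesis, 2014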
[Abstract]: Weibo (microblogging) is a platform for sharing, disseminating, and obtaining information based on user relationships. Its content is simple, interaction between users is strong, and the barrier to use is low; in recent years it has developed rapidly in China. Because Weibo is currently the most popular social network service, user interest mining based on Weibo has quickly become an emerging research topic. First, discovering interesting accounts and information is the most important activity of Weibo users, so the platform needs to recommend relevant information accurately based on each user's interests. Second, a user interest recognition system is the basis for precisely targeted advertising: the accuracy of interest mining directly affects advertising effectiveness and therefore the platform's profitability.
Building on an analysis of the information characteristics and user behavior characteristics of Sina Weibo, the author studied traditional text classification algorithms that represent text features with word vectors. Starting from LDA (Latent Dirichlet Allocation), an unsupervised topic model without hierarchical structure, the author extended it to implement Labeled LDA, a supervised topic model without hierarchical structure, and used it to identify the interest distribution of Weibo users. The thesis focuses on the key problems in user interest recognition and covers three pieces of work: (1) developing a custom web crawler for Sina Weibo in Python that works around the limits of the Weibo API and fetches Weibo text concurrently and quickly, providing extremely rich experimental data for the research; (2) studying text classification techniques and training the Labeled LDA model on the posts of topic-specific Weibo accounts, then using the trained model to predict the interests of other Weibo users; (3) for massive-data scenarios, using distributed frameworks such as Hadoop and Hive to implement distributed word segmentation and preprocessing of large volumes of Chinese text.
Finally, the user interest distributions produced by the system have been applied successfully in practice to generating and displaying personalized user word clouds, adjusting and optimizing search results, and targeting advertisements by personal interest.
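The training and prediction steps described in item (2) can be sketched as follows. This is a minimal illustration of Labeled LDA with collapsed Gibbs sampling, not the thesis's implementation: the function names `train_labeled_lda` and `infer_interests`, the hyperparameters `alpha` and `beta`, the iteration counts, and the toy English tokens (stand-ins for word-segmented Chinese posts) are all assumptions made for the example. The key idea is that each topic corresponds to one label, and a training document may only assign its tokens to topics in its own label set.

```python
import random
from collections import defaultdict


def train_labeled_lda(docs, labels, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for Labeled LDA (illustrative sketch).

    docs   : list of token lists (already word-segmented)
    labels : list of label lists; document d may only use topics in labels[d]
    Returns (topic-word counts, per-topic token totals, vocabulary size).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    n_kw = defaultdict(lambda: defaultdict(int))  # topic -> word -> count
    n_k = defaultdict(int)                        # topic -> total tokens
    n_dk = [defaultdict(int) for _ in docs]       # document -> topic -> count
    z = []                                        # topic assignment per token

    def update(d, k, w, delta):
        n_kw[k][w] += delta
        n_k[k] += delta
        n_dk[d][k] += delta

    # Random initialisation, restricted to each document's own labels.
    for d, doc in enumerate(docs):
        z.append([rng.choice(labels[d]) for _ in doc])
        for k, w in zip(z[d], doc):
            update(d, k, w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, z[d][i], w, -1)  # remove the current assignment
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in labels[d]
                ]
                z[d][i] = rng.choices(labels[d], weights=weights)[0]
                update(d, z[d][i], w, +1)
    return n_kw, n_k, vocab_size


def infer_interests(tokens, n_kw, n_k, vocab_size,
                    alpha=0.1, beta=0.01, iters=100, seed=0):
    """Estimate an unlabeled user's topic mixture with topic-word counts fixed."""
    rng = random.Random(seed)
    topics = sorted(n_k)
    counts = {t: 0 for t in topics}
    z = [rng.choice(topics) for _ in tokens]
    for k in z:
        counts[k] += 1
    for _ in range(iters):
        for i, w in enumerate(tokens):
            counts[z[i]] -= 1
            weights = [
                (counts[t] + alpha) * (n_kw[t].get(w, 0) + beta)
                / (n_k[t] + vocab_size * beta)
                for t in topics
            ]
            z[i] = rng.choices(topics, weights=weights)[0]
            counts[z[i]] += 1
    total = len(tokens) + alpha * len(topics)
    return {t: (counts[t] + alpha) / total for t in topics}


# Toy corpus: posts from hypothetical topic accounts, one label list per post.
train_docs = [
    ["football", "goal", "match", "team"],
    ["football", "team", "goal", "league"],
    ["phone", "chip", "software", "code"],
    ["phone", "software", "app", "chip"],
    ["match", "phone", "team", "app"],
]
train_labels = [["sports"], ["sports"], ["tech"], ["tech"], ["sports", "tech"]]

model = train_labeled_lda(train_docs, train_labels)
user_posts = ["football", "goal", "football", "goal", "league"]
print(infer_interests(user_posts, *model))
```

At prediction time the user has no labels, so the sketch lets every learned topic compete for each token while holding the trained topic-word counts fixed; the returned dictionary is a smoothed interest distribution that sums to 1.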
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092
Article ID: 1955743
Article link: http://sikaile.net/guanlilunwen/ydhl/1955743.html