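Research and Implementation of a Weibo User Interest Recognition System Based on Labeled LDA
Topic of this thesis: text classification + interest recognition. Source: Beijing Jiaotong University, master's thesis, 2014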
[Abstract]: Weibo (microblogging) is a platform for sharing, disseminating and obtaining information based on user relationships. It is characterized by simple content, strong interaction between users and a low barrier to entry, and it has grown rapidly in China in recent years. As Weibo has become the most popular social networking service today, mining user interests from Weibo has quickly become an emerging research topic. First, discovering interesting accounts and information is the most important activity of Weibo users, so the platform needs to recommend relevant information accurately based on each user's interests. Second, a user interest recognition system is the foundation of precisely targeted advertising: the accuracy of interest mining directly affects advertising effectiveness and, in turn, the platform's revenue.

After analyzing the information characteristics and user behavior characteristics of Sina Weibo, the author studied traditional text classification algorithms that represent text features with word vectors. Starting from the unsupervised, non-hierarchical topic model LDA (Latent Dirichlet Allocation), the author extended it to implement the supervised, non-hierarchical topic model Labeled LDA, which is used to identify the interest distribution of Weibo users. The thesis focuses on the key problems in user interest recognition and covers three pieces of work:

(1) A custom web crawler for Sina Weibo was developed in Python. It works around the limits of the Weibo API and fetches Weibo text concurrently and quickly, providing extremely rich experimental data for the research.
(2) Text classification techniques were studied, and the supervised, non-hierarchical topic model Labeled LDA was trained on the posts of topic-specific Weibo accounts and then used to predict the interests of other Weibo users.
(3) For massive-data scenarios, distributed frameworks such as Hadoop and Hive were used to implement distributed word segmentation and preprocessing of large volumes of Chinese text.

Finally, the user interest distributions produced by the system were successfully applied in practice to generating and displaying personalized user word clouds, adjusting and optimizing search results, and interest-targeted advertising.
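The train-then-predict use of Labeled LDA described in the abstract can be illustrated with a small collapsed Gibbs sampler. This is only a minimal sketch, not the thesis's implementation: the function names `train_labeled_lda` and `infer_interests`, the hyperparameters `alpha` and `beta`, the iteration counts and the toy English tokens are all assumptions made for illustration. It assumes posts have already been segmented into tokens and that every training document carries at least one label (for example, the theme of the topic account it came from). During training each word may only be assigned to one of its document's labels; at prediction time an unlabeled user's words may take any label while the learned topic-word distributions stay fixed.

```python
import random
from collections import defaultdict


def train_labeled_lda(docs, labels, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for Labeled LDA.

    docs:   list of token lists (already segmented text)
    labels: list of label lists; one topic exists per distinct label
    Returns phi, a dict: label -> {word: probability}.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    topics = sorted({l for ls in labels for l in ls})
    v_beta = len(vocab) * beta
    n_kw = {k: defaultdict(int) for k in topics}   # topic-word counts
    n_k = {k: 0 for k in topics}                   # words per topic
    n_dk = [defaultdict(int) for _ in docs]        # doc-topic counts
    z = []                                         # topic of every token
    for d, doc in enumerate(docs):
        allowed = list(labels[d])
        z.append([rng.choice(allowed) for _ in doc])
        for w, k in zip(doc, z[d]):
            n_kw[k][w] += 1
            n_k[k] += 1
            n_dk[d][k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            allowed = list(labels[d])  # the Labeled LDA restriction
            for i, w in enumerate(doc):
                k = z[d][i]
                n_kw[k][w] -= 1
                n_k[k] -= 1
                n_dk[d][k] -= 1
                weights = [
                    (n_dk[d][c] + alpha) * (n_kw[c][w] + beta) / (n_k[c] + v_beta)
                    for c in allowed
                ]
                k = rng.choices(allowed, weights=weights)[0]
                z[d][i] = k
                n_kw[k][w] += 1
                n_k[k] += 1
                n_dk[d][k] += 1
    return {
        k: {w: (n_kw[k][w] + beta) / (n_k[k] + v_beta) for w in vocab}
        for k in topics
    }


def infer_interests(tokens, phi, alpha=0.1, iters=100, seed=0):
    """Estimate an unlabeled user's interest distribution with phi held fixed."""
    rng = random.Random(seed)
    topics = list(phi)
    known = next(iter(phi.values()))
    tokens = [w for w in tokens if w in known]  # drop out-of-vocabulary words
    counts = {k: 0 for k in topics}
    z = [rng.choice(topics) for _ in tokens]
    for k in z:
        counts[k] += 1
    for _ in range(iters):
        for i, w in enumerate(tokens):
            counts[z[i]] -= 1
            weights = [(counts[k] + alpha) * phi[k][w] for k in topics]
            z[i] = rng.choices(topics, weights=weights)[0]
            counts[z[i]] += 1
    total = len(tokens) + alpha * len(topics)
    return {k: (counts[k] + alpha) / total for k in topics}


# Toy data standing in for posts from topic accounts (hypothetical).
docs = [
    ["football", "match", "goal", "team", "goal"],
    ["phone", "chip", "software", "code", "chip"],
    ["match", "team", "phone", "software"],
]
labels = [["sports"], ["tech"], ["sports", "tech"]]

phi = train_labeled_lda(docs, labels)
interests = infer_interests(["goal", "football", "team", "chip"], phi)
print(interests)
```

The returned dictionary is a probability distribution over labels, which is the kind of per-user interest distribution the abstract says feeds the word cloud, search and advertising applications. Using only the final Gibbs sample keeps the sketch short; averaging over several samples would give a smoother estimate.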
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092