User Preference Analysis Based on Text Classification and Topic Models

Published: 2018-06-16 20:14

Topic: user preference analysis + text classification. Source: Qingdao University of Science and Technology, 2017 master's thesis

[Abstract]: User preference refers to the rational, inclination-driven choices that users make after weighing goods or services. The main purpose of analyzing user preferences is to filter, from massive amounts of information, the information that users are interested in, so as to provide them with more personalized services. User preference analysis is therefore the foundation of personalized services. However, existing methods still have many problems. On the one hand, most of them analyze users' inherent attributes, which makes it difficult to uncover finer-grained preferences. On the other hand, when they do analyze fine-grained preferences, they fall short in both accuracy and efficiency.

User preferences can be obtained by mining user behavior: fine-grained classification and clustering of the content a user browses yields that user's fine-grained preferences. First, a tag is a finer-grained representation than a category, and one piece of content can carry multiple tags, so tagging content at different levels provides preference features at different levels. Second, clustering according to users' active intent, from the user's point of view and according to users' latent cognition, groups similar content together and provides preference features at the level of user behavior.

Based on this analysis, the thesis proposes two algorithms for tagging text and one optimized hierarchical clustering algorithm for undirected graphs.

First, it proposes a weighted supervised LDA algorithm (WLLDA). The algorithm uses the chi-square test to reduce the dimensionality of text features. It adopts a new weighted bag-of-words model that raises the weight of words in the original bag of words that are meaningful for topic classification, which widens the differences between topics and improves classification accuracy. It uses multi-model ensembling, sampling and training separately for topics of different frequencies, to resolve the mutual interference that an unbalanced corpus causes in a single model. It also introduces a new topic closeness measure that, on top of the original topic probability, takes into account keyword hit rate, keyword hit count, and tag support, thereby improving the accuracy of topic prediction.

Second, it proposes a tagging algorithm based on word2vec. The algorithm uses a CRF to extract keywords from text, then uses the word vectors produced by word2vec together with LR to cluster the keywords and build the tag set, which avoids the incomplete coverage of manually compiled tag libraries. Finally, the text is denoised to extract its main content, and tags are assigned by comparing the similarity between the word vectors of the main-content words and the word vectors of the tags.

Third, it proposes a parallelized, optimized hierarchical clustering algorithm for undirected graphs, which abstracts users' active search-intent behavior as an undirected graph. Splitting multi-edge nodes weakens the negative effect of the decay factor on those nodes and allows the graph clustering to be computed in parallel, which substantially improves both accuracy and computational efficiency.

With these three algorithms, the thesis converts users' degree of preference for content into preferences for tags, ultimately characterizing users' fine-grained preference features and thereby achieving the goal of user preference analysis.
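As a rough illustration of two standard building blocks that the abstract names, the Python sketch below computes the chi-square statistic commonly used to rank terms for feature selection, and assigns tags by cosine similarity between word vectors. It is not the thesis's implementation. The function names, the toy two-dimensional vectors, the averaging of main-content word vectors, and the 0.8 threshold are all assumptions made for illustration; the abstract does not say how similarities are aggregated or thresholded.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/topic contingency table.

    n11: documents in the topic that contain the term
    n10: documents outside the topic that contain the term
    n01: documents in the topic that lack the term
    n00: documents outside the topic that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # degenerate table: the term or topic never varies
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_tags(content_words, word_vectors, tag_vectors, threshold=0.8):
    """Return tags whose vector is close to the mean vector of the content words.

    Averaging the word vectors and using a fixed threshold are assumptions
    of this sketch, not details taken from the thesis.
    """
    vectors = [word_vectors[w] for w in content_words if w in word_vectors]
    if not vectors:
        return []
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    scored = [(cosine(mean, vec), tag) for tag, vec in tag_vectors.items()]
    # Keep tags above the threshold, most similar first.
    return [tag for score, tag in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical toy vectors; real ones would come from a trained word2vec model.
WORD_VECTORS = {"goal": [0.9, 0.1], "match": [0.8, 0.2], "stock": [0.1, 0.9]}
TAG_VECTORS = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}

if __name__ == "__main__":
    print(chi_square(10, 0, 0, 10))
    print(assign_tags(["goal", "match"], WORD_VECTORS, TAG_VECTORS))
```

In a real pipeline the counts passed to `chi_square` would come from a labeled corpus, and the terms with the highest scores per topic would be the ones kept after dimensionality reduction.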
[Degree-granting institution]: Qingdao University of Science and Technology
[Degree]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1

Article ID: 2027968

Link to this article: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2027968.html
