Research on Short-Text Classification Methods for Customer-Service Interaction Weibo Posts
Published: 2018-11-13 17:23
[Abstract]: Text classification is an important research topic in data mining. As Twitter has come to dominate social networking abroad, a growing share of research has turned to short microblog texts. Classifying microblog posts matters for public opinion analysis, spam filtering, and microblog community applications. In China, microblog services led by Sina Weibo have likewise become part of people's daily lives, and the particular characteristics of the Chinese language make classifying Chinese microblog short texts an even greater challenge.
The main work of this thesis is as follows:
1. It surveys the technologies involved in text classification (including data preprocessing, feature selection, text representation, and classification algorithms) and improves on the shortcomings of the information gain feature selection method.
2. It uses LDA to represent Weibo short texts as a document-semantic distribution matrix.
3. It designs a Weibo short-text classification method that combines information gain with LDA, and implements a classification system for customer-service interaction Weibo posts.
The method is validated on customer-service interaction Weibo data with category labels. With information gain alone and LDA alone as baselines, the experimental results show that the proposed method achieves some improvement in classification accuracy, indicating that it is suitable for classifying customer-service interaction Weibo posts.
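To make the feature selection step more concrete, the following is a minimal sketch of the standard information gain score for a term, IG(t) = H(C) - P(t)·H(C|t) - P(¬t)·H(C|¬t). It is not the improved variant proposed in the thesis, whose details the abstract does not give. The function names, the toy posts, and the labels `complaint` and `praise` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(term, docs, labels):
    """Standard information gain of `term` with respect to the class labels.

    `docs` is a list of token sets, one per post; `labels` is the parallel
    list of class labels. Only presence or absence of the term is used.
    """
    with_term = [y for tokens, y in zip(docs, labels) if term in tokens]
    without_term = [y for tokens, y in zip(docs, labels) if term not in tokens]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Return the vocabulary sorted by descending information gain."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(t, docs, labels), t))


# Hypothetical toy corpus: already tokenized posts with made-up labels.
docs = [
    {"refund", "order"},
    {"refund", "late"},
    {"thanks", "order"},
    {"thanks", "great"},
]
labels = ["complaint", "complaint", "praise", "praise"]

if __name__ == "__main__":
    for term in rank_terms(docs, labels):
        print(term, round(information_gain(term, docs, labels), 4))
```

In this toy corpus, terms that appear in only one class (such as `refund`) score highest, while a term spread evenly across both classes (such as `order`) scores zero. In a pipeline like the one the abstract describes, the top-ranked terms would presumably be kept as features alongside the LDA topic distribution of each post, though how the thesis combines the two is not stated here.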
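[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.1; TP393.092
Article ID: 2329834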
Article link: http://sikaile.net/guanlilunwen/ydhl/2329834.html