Feature Extension Methods for Short Text Classification

Published: 2018-08-31 20:35

[Abstract]: In recent years, a wide variety of network applications (such as Facebook, QQ, Twitter, and Sina Weibo) have emerged, and with them a flood of text. Much of the text these applications produce is short; we call it short text. The volume of short text data is extremely large. Research on short text matters in many fields, including recommendation systems for social networks, Internet information security, web data mining, topic detection and tracking, discovery of new online words, and monitoring of online public opinion. This thesis studies the problem of feature extension for short text classification. Short texts have brief content and sparse features, and they are strongly affected by noise. Traditional statistical text classification algorithms are based on the bag-of-words paradigm, so they perform relatively poorly on short text. To address these problems, this thesis designs and implements a feature extension method based on a search engine: the short text is used as a query to retrieve information from the web, the retrieved information is used to expand the short text, and a suitable text classifier is then applied. Three commonly used fully supervised classifiers are the main choices, and a semi-supervised classifier is also tried on the short text classification problem. However, feature-extension approaches to short text share a common problem: the retrieved web information often contains ambiguous content, and ambiguous information is clearly unsuitable for feature extension.

To solve this problem, the thesis proposes a graph-based method for constraining feature extension. The expansion information for a short text is filtered iteratively until only high-quality information remains for use as extended features. The thesis also proposes a keyword extraction algorithm for short text that combines statistical information, semantic information, and the position and order in which keywords appear. The system uses this algorithm to extract reliable keywords, which are then used to retrieve web information. The experimental data is a Sina Weibo corpus. The experiments implement the short text feature extension method, the keyword extraction algorithm, and the extension constraint method, and on this basis combine several classifiers into a Chinese short text classification system. Several groups of comparative results were obtained on this platform. The final results show that the proposed feature extension method and the method for removing noise from the extension improve short text classification performance and achieve the intended goal.
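The abstract does not give the details of the graph-based constraint, so the following is only a minimal sketch of the general idea of expansion followed by iterative filtering, under stated assumptions. The cosine similarity measure, the `threshold` value, the whitespace tokenizer, and the function names `bow`, `cosine`, `filter_snippets`, and `expand` are all hypothetical choices made for illustration; the retrieved snippets are passed in as plain strings rather than fetched from a real search engine, and Chinese text would first need a word segmenter.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: counts of lowercase whitespace tokens (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def filter_snippets(short_text, snippets, threshold=0.25):
    """Iteratively drop the snippet least connected to the rest of the graph.

    Nodes are the short text plus the retrieved snippets; edge weights are
    cosine similarities. Each round removes the snippet with the lowest mean
    similarity to all other nodes, and stops once the weakest remaining
    snippet reaches `threshold` (an assumed cutoff, not from the thesis).
    """
    kept = list(snippets)
    query = bow(short_text)
    while kept:
        vecs = [bow(s) for s in kept]
        scores = []
        for i, v in enumerate(vecs):
            others = [query] + vecs[:i] + vecs[i + 1:]
            scores.append(sum(cosine(v, o) for o in others) / len(others))
        worst = min(range(len(kept)), key=scores.__getitem__)
        if scores[worst] >= threshold:
            break
        kept.pop(worst)
    return kept


def expand(short_text, snippets, threshold=0.25):
    """Bag of words of the short text, enriched with the surviving snippets."""
    features = bow(short_text)
    for s in filter_snippets(short_text, snippets, threshold):
        features.update(bow(s))
    return features


if __name__ == "__main__":
    short = "apple releases new phone"
    retrieved = [
        "apple phone launch event new iphone",
        "new apple phone review camera battery",
        "apple pie recipe with cinnamon sugar",  # ambiguous sense of "apple"
    ]
    # The recipe snippet is the least similar to the other nodes and is dropped.
    print(filter_snippets(short, retrieved))
    print(expand(short, retrieved))
```

The expanded feature counts could then be passed to any standard classifier. Removing one snippet per round and recomputing the scores is what makes the filtering iterative: once an off-topic snippet is gone, it no longer lowers the scores of the snippets that remain.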
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1

Article ID: 2216087

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2216087.html
