Research on a Semi-Supervised Short Text Classification Method Based on Co-Training

Published: 2019-06-10 06:06

【Abstract】: With the rapid development of the Internet, the volume of information is growing exponentially. People can easily obtain large amounts of information online, and that information plays an important role in guiding their behavior. Short text is one of the most important information carriers on the Internet. In the early days, the information in short texts was obtained directly through manual labeling, but manual labeling requires many trained specialists, consumes a great deal of manpower and material resources, and can cover only a small number of texts. Because the number of texts on the Internet is enormous, manual labeling cannot meet the need to classify text at Internet scale. Using machine learning to label unlabeled samples has gradually become a trend in Internet text processing, and improving labeling efficiency has become a current research focus. Compared with manual labeling, labeling unlabeled samples with machine learning is both highly accurate and very stable. Semi-supervised co-training is currently one of the most important text classification methods in machine learning. This thesis studies semi-supervised short text classification based on co-training and covers the following:

1. The short text classification problem is analyzed, and a system model for semi-supervised short text classification based on co-training is given. The model consists of three functional modules: a preprocessing module, a training module, and a testing module. The preprocessing module turns unstructured short text into a structured data set through a series of steps: removing format markup, word segmentation, stop-word removal, feature extraction, term-frequency counting, and text vectorization. The training module, on the one hand, constructs classifiers according to the diversity principle and uses them to label unlabeled samples; on the other hand, it co-trains the classifiers on the training set so that they are continually improved. The testing module evaluates the classifiers on a test set to verify the feasibility and effectiveness of the co-training method.

2. A short text classification method built on semi-supervised co-training is given, with improvements to both the feature extraction method and the co-training method.

(1) Improved feature extraction. Because a short text contains few words, an adjacency matrix between the words of the short text is constructed from the semantic relations between words. An undirected graph is then built by computing similarities over the adjacency matrix, a feature degree is computed from the adjacency degree of the undirected graph, and the words with a high feature degree are extracted as features. Unlike traditional methods, this approach takes the semantic similarity between words into account, which helps classify short texts effectively.

(2) Improved co-training algorithm. To label unlabeled samples, the classifiers are trained in a multi-classifier "mutual help" fashion. In a binary classification problem, if all three classifiers assign the same label to an unlabeled sample, the label is considered to have high confidence and the sample is added to the labeled set. If the labels differ, two of the classifiers must agree, and their shared label is used to train the third classifier. The classifiers are retrained repeatedly during labeling, which finally yields classifiers with better performance.

3. Comparative experiments on short texts collected from Internet websites verify the effectiveness of the co-training-based semi-supervised short text classification method. Short text posts collected from major websites such as Sina, Sohu, and NetEase serve as the data set. The improved method is compared with traditional short text classification methods and evaluated by precision, recall, and F1 score, which verifies its feasibility and effectiveness.

In summary, this thesis builds a semi-supervised short text classification model based on co-training, gives the corresponding classification method, improves both short text feature extraction and semi-supervised co-training, and compares the improved method with traditional methods experimentally. The experimental results show that the proposed method effectively improves the efficiency of short text classification.
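
The preprocessing steps listed in the abstract (markup removal, segmentation, stop-word removal, feature selection, term-frequency counting, vectorization) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the regex tokenizer stands in for a real Chinese word segmenter, `STOP_WORDS` is a made-up list, and frequency-based vocabulary selection is a plain substitute for the thesis's graph-based feature extraction, which the abstract does not specify in enough detail to reproduce.

```python
import re
from collections import Counter
from typing import List, Optional

TAG_RE = re.compile(r"<[^>]+>")          # crude format-markup pattern (assumption)
STOP_WORDS = {"the", "a", "is", "of"}    # hypothetical stop-word list


def preprocess(text: str) -> List[str]:
    """Strip markup, split into lowercase word tokens, and drop stop words."""
    no_markup = TAG_RE.sub(" ", text)
    tokens = re.findall(r"\w+", no_markup.lower())  # stand-in for a segmenter
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(docs: List[List[str]], max_features: Optional[int] = None) -> List[str]:
    """Keep the most frequent tokens; ties are broken alphabetically."""
    counts = Counter(t for doc in docs for t in doc)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:max_features]


def vectorize(doc: List[str], vocab: List[str]) -> List[int]:
    """Represent one document as a term-frequency vector over the vocabulary."""
    tf = Counter(doc)
    return [tf[w] for w in vocab]


if __name__ == "__main__":
    docs = [preprocess("<p>The price of the phone is great</p>"),
            preprocess("Great phone, great camera")]
    vocab = build_vocabulary(docs, max_features=3)
    for doc in docs:
        print(vectorize(doc, vocab))
```

The vectors produced this way are the "structured data set" that a training module would consume; a real system would swap in a proper segmenter and the feature-degree ranking described in item 2(1).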
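
The three-classifier voting rule described in item 2(2) of the abstract can be sketched in Python as follows. This is only one reading of the abstract, not the thesis's implementation: `ThresholdClassifier` is a hypothetical toy learner in which each classifier looks at a different feature (a simple way to get diversity), the fixed number of rounds is an assumption, and so is the choice to leave a disputed sample in the unlabeled pool while its majority label is lent to the dissenting classifier.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[float, ...]


class ThresholdClassifier:
    """Toy binary classifier on a single feature (hypothetical stand-in)."""

    def __init__(self, feature: int) -> None:
        self.feature = feature
        self.threshold = 0.0

    def fit(self, xs: Sequence[Sample], ys: Sequence[int]) -> None:
        # Threshold = midpoint between the class means of the chosen feature.
        # Assumes both classes are present and class 1 has the larger values.
        pos = [x[self.feature] for x, y in zip(xs, ys) if y == 1]
        neg = [x[self.feature] for x, y in zip(xs, ys) if y == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def predict(self, x: Sample) -> int:
        return int(x[self.feature] > self.threshold)


def vote(labels: Sequence[int]) -> Tuple[int, List[int]]:
    """Return the majority label and the indices of classifiers that disagree."""
    majority = Counter(labels).most_common(1)[0][0]
    dissenters = [i for i, label in enumerate(labels) if label != majority]
    return majority, dissenters


def co_train(classifiers: Sequence[ThresholdClassifier],
             xs: Sequence[Sample], ys: Sequence[int],
             unlabeled: Sequence[Sample], rounds: int = 3):
    """Grow the labeled set by unanimous votes; two agreeing classifiers teach the third."""
    shared_x, shared_y = list(xs), list(ys)
    pool = list(unlabeled)
    for clf in classifiers:
        clf.fit(shared_x, shared_y)
    for _ in range(rounds):
        extra = [([], []) for _ in classifiers]  # per-classifier teaching samples
        remaining = []
        for x in pool:
            majority, dissenters = vote([clf.predict(x) for clf in classifiers])
            if not dissenters:
                # Unanimous: treat the label as high confidence.
                shared_x.append(x)
                shared_y.append(majority)
            else:
                # Two agree: their label becomes training data for the third only.
                ex, ey = extra[dissenters[0]]
                ex.append(x)
                ey.append(majority)
                remaining.append(x)
        pool = remaining
        for clf, (ex, ey) in zip(classifiers, extra):
            clf.fit(shared_x + ex, shared_y + ey)
    return shared_x, shared_y, pool
```

With three binary labels a majority always exists, so `vote` never has to break a tie; that is the property the abstract relies on when it says two classifiers "must" agree. Rebuilding the teaching samples each round keeps a disputed sample from being counted twice.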
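
For the evaluation metrics named in item 3, the standard binary definitions are precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). The short Python sketch below computes them from label lists; the function name, the `positive` parameter, and the convention of returning 0.0 when a denominator is zero are assumptions for illustration, and the abstract does not say how the thesis averages the scores across classes.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Return 0.0 instead of dividing by zero (a convention chosen here).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so a classifier scores well only when both are high.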
【Degree-granting institution】: Southwest University
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP391.1

