Research on a Semi-Supervised Short Text Classification Method Based on Co-Training
[Abstract]: With the rapid development of the Internet, the volume of information is growing exponentially. People can easily obtain large amounts of information online, and that information plays an important role in shaping their behavior. Short text is an important information carrier on the Internet. The information contained in short texts can be obtained directly by manual labeling, but manual labeling requires many trained annotators, consumes considerable manpower and material resources, and can cover only a small amount of text. Because the number of texts on the Internet is very large, manual labeling is not suitable for large-scale text classification. Using machine learning to label unlabeled samples has therefore gradually become a trend in online text processing, and the efficiency of sample labeling has become a focus of current research. Compared with manual labeling, labeling unlabeled samples with machine learning offers high accuracy and stable algorithms. Semi-supervised co-training is an important text classification method in current machine learning. This thesis studies semi-supervised short text classification based on co-training, covering the following aspects: 1. It analyzes the short text classification problem and presents a system model for semi-supervised short text classification based on co-training. The model is divided into three functional modules: a pre-processing module, a training module and a test module.
The pre-processing module processes unstructured short texts and produces a structured data set through a series of steps: removing format markup, word segmentation, stop-word removal, feature extraction, word-frequency counting and text vectorization. The training module, on the one hand, constructs the classifiers according to the diversity principle and uses them to label unlabeled samples; on the other hand, it uses the training sample set to co-train the classifiers so that they are continuously optimized. The test module evaluates the classifiers on the test sample set to verify the feasibility and effectiveness of the co-training method.
2. Building on semi-supervised co-training, the thesis presents a short text classification method and further improves both the feature extraction method and the co-training method. (1) Improved feature extraction. Because short texts contain few words, an adjacency matrix between the words of a short text is built from the semantic relations between words; an undirected graph is then constructed by computing similarities from the adjacency matrix, a feature degree is computed for each word from its adjacency in the undirected graph, and the words with a high feature degree are extracted as feature words. Compared with traditional methods, this approach takes the semantic similarity between words into account, which allows short texts to be classified effectively. (2) Improved co-training algorithm. To label unlabeled samples, multiple classifiers are trained through "mutual help".
In the binary classification problem, if the three classifiers give the same label for a sample, the labeled sample is added to the labeled sample set; if their labels differ, two of the classifiers must still agree, and the label they share is used to train the third classifier. The classifiers are retrained repeatedly during this labeling process, which finally yields classifiers with better performance.
3. Comparative experiments were carried out on short texts collected from Internet websites to verify the effectiveness of the co-training-based semi-supervised short text classification method. Using short text posts collected from major websites such as Sina, Sohu and NetEase as the data set, the improved method is compared with traditional short text classification methods, and the methods are evaluated by accuracy, recall and F1 value to verify the feasibility and effectiveness of the approach.
In summary, this thesis constructs a semi-supervised short text classification model based on co-training and presents the corresponding classification method. It also improves the feature extraction method and the semi-supervised co-training algorithm and compares the improved method with traditional methods. The experimental results show that the proposed method can effectively improve the efficiency of short text classification.
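The three-classifier labeling rule described in the abstract can be sketched roughly as follows. This is only an illustrative reading of the abstract, not the thesis's actual implementation: the function names `route_sample` and `labeling_round`, the representation of classifiers as plain callables, and the keyword-based toy classifiers are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Label = int  # binary labels, assumed here to be 0 or 1


def route_sample(preds: Sequence[Label]) -> Tuple[Label, Optional[int]]:
    """Return (agreed_label, index_of_dissenting_classifier or None).

    With three classifiers and two classes, at least two predictions
    always coincide, so an agreed label always exists.
    """
    if len(preds) != 3:
        raise ValueError("expected predictions from exactly three classifiers")
    a, b, c = preds
    if a == b == c:
        return a, None  # unanimous: the sample joins the labeled set
    if a == b:
        return a, 2     # classifier 2 dissents and learns from 0 and 1
    if a == c:
        return a, 1     # classifier 1 dissents and learns from 0 and 2
    if b == c:
        return b, 0     # classifier 0 dissents and learns from 1 and 2
    raise ValueError("no two classifiers agree; labels are not binary")


def labeling_round(
    classifiers: Sequence[Callable[[str], Label]],
    unlabeled: Sequence[str],
) -> Tuple[List[Tuple[str, Label]], List[List[Tuple[str, Label]]]]:
    """Label each sample once and decide where it should be used.

    Returns the unanimously labeled samples and, per classifier, the extra
    samples on which that classifier was outvoted by the other two.
    """
    shared: List[Tuple[str, Label]] = []
    extra: List[List[Tuple[str, Label]]] = [[], [], []]
    for text in unlabeled:
        label, dissenter = route_sample([clf(text) for clf in classifiers])
        if dissenter is None:
            shared.append((text, label))
        else:
            extra[dissenter].append((text, label))
    return shared, extra


# Toy usage with hypothetical keyword classifiers (1 = sports, 0 = other).
toy_classifiers = [
    lambda t: int("goal" in t),
    lambda t: int("match" in t),
    lambda t: int("team" in t),
]
toy_texts = [
    "team wins match with late goal",
    "stock market falls",
    "match postponed, team waits",
]
shared, extra = labeling_round(toy_classifiers, toy_texts)
```

In a full training loop, each classifier would be retrained on the labeled set plus its own `extra` samples, and the round would be repeated until no new samples are labeled; that retraining step is omitted here because the abstract does not specify the underlying classifiers.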
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1