Research on Several Problems in Web Information Text Mining
Published: 2018-04-12 06:34
Topic: text mining + feature clustering. Source: PhD dissertation, Beijing Institute of Technology, 2015
【Abstract】: Faced with text information of enormous scale and extremely high dimensionality, designing sound and easily extensible text mining algorithms has become a hot topic in data mining. This dissertation studies several problems involved in text mining; its main contributions fall into five areas:

1. To address the problems that the traditional vector space model has excessively high dimensionality and cannot handle synonyms and near-synonyms, this dissertation proposes a vector space model based on feature clustering. The model first represents each feature as a vector, then clusters these features and treats each resulting cluster as a single feature; in addition, it recognizes the discontinuous phrases of proper nouns, making the feature information in the text representation vector richer and more precise. This approach not only effectively reduces the dimensionality of text vectors but also better captures the semantic relations between text features, thereby improving the quality of text mining. Experiments show that text vectors obtained with this method achieve a high feature reduction rate, and the clustering F-measure improves markedly over traditional methods.

2. The traditional K-means algorithm chooses its initial centers randomly, which easily causes fluctuations in the results. To address this, the dissertation proposes a K-means algorithm based on a similarity matrix. Instead of selecting initial cluster centers at random, the method uses the similarity matrix to deliberately choose more effective initial centers, giving the whole clustering process a good start and reducing the instability that the initial centers impose on the final result, thus achieving better clustering quality. Experiments show that the improved algorithm clearly raises the clustering F-measure and produces more stable results.

3. To address the shortage of labeled data in text mining applications, the dissertation proposes a semi-supervised K-means algorithm. The method uses labeled and unlabeled data together, fully exploiting the characteristics of the labeled data to assist in labeling the unlabeled data. When choosing initial points, it takes some from the class centers of the labeled data and the rest from unlabeled points far from the already selected labeled data, ensuring that the initial points belong to different clusters and yielding more accurate results. Experiments show the algorithm is effective and alleviates, to some extent, the problem of insufficient labeled data.

4. Imbalanced training corpora are a common phenomenon and degrade classification quality. To counter this, the dissertation proposes a hybrid weighted KNN algorithm. By analyzing the distribution of the training samples and weighting by the inverse of each class's proportion, the method equalizes the probability that each training sample falls into the region of the sample to be classified, removing the influence of the imbalanced class distribution; combined with distance weighting, it ensures that the closer a training sample is to the sample being classified, the larger its weight, yielding good classification results. Experiments show the algorithm achieves good classification accuracy and is an effective solution to classification over imbalanced training corpora.

5. To improve efficiency and handle large data sets, the text clustering and classification algorithms proposed in this dissertation are parallelized with MapReduce and integrated as modules into a complete text mining system that automates the entire text mining pipeline. Experiments show that parallelizing the improved algorithms does not affect text mining accuracy while greatly improving running speed.
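The feature-clustering representation in point 1 can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the dissertation's implementation: it assumes each feature already has a vector representation (here, toy 2-D embeddings) and substitutes a simple greedy grouping for a full clustering algorithm; the names `cluster_features` and `doc_to_cluster_vector` are invented for this sketch.

```python
import math

def cluster_features(word_vectors, threshold=0.5):
    """Greedily group feature vectors: a word joins the first existing
    cluster whose representative lies within `threshold` of its vector,
    otherwise it starts a new cluster.  Returns (word -> cluster id, k)."""
    reps = []           # one representative vector per cluster
    assignment = {}
    for word, vec in word_vectors.items():
        for cid, rep in enumerate(reps):
            if math.dist(vec, rep) <= threshold:
                assignment[word] = cid
                break
        else:
            assignment[word] = len(reps)
            reps.append(vec)
    return assignment, len(reps)

def doc_to_cluster_vector(tokens, assignment, n_clusters):
    """Represent a document by counts over feature clusters, so that
    near-synonyms mapped to the same cluster share one dimension."""
    v = [0] * n_clusters
    for t in tokens:
        if t in assignment:
            v[assignment[t]] += 1
    return v

# Toy embeddings: "car" and "automobile" are near-synonyms.
vectors = {"car": (1.0, 0.0), "automobile": (1.1, 0.0), "banana": (5.0, 5.0)}
assignment, k = cluster_features(vectors)
doc = ["car", "automobile", "banana", "car"]
print(doc_to_cluster_vector(doc, assignment, k))   # 4 tokens -> 2 dimensions
```

The document vector has one dimension per cluster rather than per word, which is the dimensionality reduction the abstract describes.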
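The abstract does not spell out how point 2's similarity matrix picks the initial centers, so the following is one plausible reading, sketched as a farthest-point-style heuristic: the first center is the most "central" point, and each later center is the point least similar to every center chosen so far. The function name and the choice of negative squared distance as similarity are assumptions of this sketch.

```python
def init_centers_by_similarity(points, k):
    """Pick k initial K-means centers from a pairwise similarity matrix
    instead of at random, so seeds land in distinct regions of the data."""
    n = len(points)
    def sim(a, b):                      # similarity = -squared distance
        return -sum((x - y) ** 2 for x, y in zip(a, b))
    S = [[sim(points[i], points[j]) for j in range(n)] for i in range(n)]

    # First center: the point with the largest total similarity.
    chosen = [max(range(n), key=lambda i: sum(S[i]))]
    while len(chosen) < k:
        # Next center: the point whose best similarity to any chosen
        # center is lowest, i.e. the one farthest from all seeds so far.
        rest = [i for i in range(n) if i not in chosen]
        nxt = min(rest, key=lambda i: max(S[i][c] for c in chosen))
        chosen.append(nxt)
    return [points[i] for i in chosen]

centers = init_centers_by_similarity(
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)
print(centers)   # one seed from each of the two natural groups
```

Because the selection is deterministic, repeated runs start from the same seeds, which is exactly the stability property the abstract claims for this initialization.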
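The seeding strategy of point 3's semi-supervised K-means can be sketched as follows. This is a hedged illustration of the described idea, not the dissertation's code: labeled class means provide the first seeds, and the remaining slots are filled with unlabeled points far from every seed chosen so far; the helper name `seed_centers` is invented here.

```python
import math
from collections import defaultdict

def seed_centers(labeled, unlabeled, k):
    """Seed K-means with the class means of the labeled data, then fill
    the remaining slots with unlabeled points farthest from all seeds
    chosen so far, so the initial points fall in distinct clusters.

    `labeled` is a list of (vector, label) pairs; `unlabeled` of vectors."""
    by_class = defaultdict(list)
    for vec, label in labeled:
        by_class[label].append(vec)

    centers = []
    for vecs in by_class.values():          # one seed per labeled class
        dim = len(vecs[0])
        centers.append(tuple(sum(v[d] for v in vecs) / len(vecs)
                             for d in range(dim)))

    pool = list(unlabeled)
    while len(centers) < k and pool:        # fill up with far-away points
        far = max(pool, key=lambda p: min(math.dist(p, c) for c in centers))
        pool.remove(far)
        centers.append(far)
    return centers
```

A run with one labeled class and three unlabeled points places the second seed at the unlabeled point farthest from the labeled class mean, keeping the two seeds in different clusters.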
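Point 4's hybrid weighting combines two factors the abstract names explicitly: the inverse of each class's proportion in the training set and the inverse of the neighbour's distance. A minimal sketch, assuming Euclidean distance and a vote-summing decision rule (the function name is invented for illustration):

```python
import math
from collections import Counter

def hybrid_weighted_knn(train, query, k=3):
    """Classify `query` by its k nearest neighbours, weighting each vote
    by the inverse of its class proportion (to counter class imbalance)
    and by the inverse of its distance (closer neighbours count more).

    `train` is a list of (feature_vector, label) pairs."""
    counts = Counter(label for _, label in train)     # class frequencies
    total = len(train)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    neighbours = sorted(train, key=lambda s: dist(s[0], query))[:k]

    votes = Counter()
    for vec, label in neighbours:
        class_weight = total / counts[label]          # inverse class proportion
        distance_weight = 1.0 / (dist(vec, query) + 1e-9)  # inverse distance
        votes[label] += class_weight * distance_weight
    return votes.most_common(1)[0][0]
```

With an imbalanced training set, a single nearby minority-class neighbour can outvote two distant majority-class neighbours, whereas unweighted majority voting would always side with the majority class.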
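Point 5 parallelizes the algorithms with MapReduce. As a single-process sketch of how one K-means iteration decomposes into the map (assign), shuffle (group), and reduce (average) phases - not the dissertation's actual distributed code - consider:

```python
import math
from collections import defaultdict

def kmeans_step(points, centers):
    """One K-means iteration phrased as MapReduce.
    Map:     emit (nearest-center-index, point) for each point.
    Shuffle: group points by center index.
    Reduce:  new center = mean of its group."""
    mapped = [(min(range(len(centers)),
                   key=lambda c: math.dist(p, centers[c])), p)
              for p in points]                       # map phase

    groups = defaultdict(list)                       # shuffle phase
    for key, p in mapped:
        groups[key].append(p)

    new_centers = list(centers)                      # reduce phase
    for key, pts in groups.items():
        dim = len(pts[0])
        new_centers[key] = tuple(sum(p[d] for p in pts) / len(pts)
                                 for d in range(dim))
    return new_centers
```

Because the map phase touches each point independently and the reduce phase only aggregates per-center sums, both parallelize across machines without changing the result, which is consistent with the abstract's claim that parallelization leaves accuracy unaffected.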
【Degree-granting institution】: Beijing Institute of Technology
【Degree level】: Doctorate
【Year conferred】: 2015
【CLC number】: TP391.1
Document ID: 1738575
Link: http://sikaile.net/shoufeilunwen/xxkjbs/1738575.html