Text Feature Evaluation and Selection Algorithm Based on Multi-Index Fusion
Published: 2019-03-13 09:20
[Abstract]: In text classification, several indicators are used to judge feature quality, chiefly the relevance of a feature to the categories, the redundancy of the feature itself, and the sparsity of the feature in the corpus. Because feature quality directly affects classification performance, all of these factors need to be considered together. Feature selection is commonly carried out in three steps that measure relevance, redundancy, and sparsity separately, and the weighting and filtering in each step take considerable time, so this step-by-step approach is hard to apply when both real-time performance and accuracy are required. To address this, a coordinate model is first established in which relevance, redundancy, and sparsity are mapped to a coordinate system. Text features are then weighted and filtered according to the angle between the vector from the origin to each feature's point and a coordinate plane or coordinate axis. This merges the multiple evaluation indicators into a single one, greatly reducing the time spent on repeated weighting and filtering and improving the efficiency of feature selection. Experiments on the Fudan University Chinese text corpus and the NetEase text corpus show that, compared with the step-by-step method, the multi-index fusion algorithm selects word and n-gram features faster and more accurately, and the effectiveness of the selected features for classification is verified with a support vector machine (SVM).
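The angle-based fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulas, so everything here is an assumption made for illustration: the three indicators are taken to be already normalized to [0, 1], higher redundancy and sparsity are treated as worse, and the hypothetical functions `angle_to_relevance_axis` and `select_features` rank features by the angle between the vector (relevance, redundancy, sparsity) and the relevance axis. This is not the authors' method, only one way a single angle can stand in for three separate filtering passes.

```python
import math


def angle_to_relevance_axis(relevance, redundancy, sparsity):
    """Angle in radians between (relevance, redundancy, sparsity) and the relevance axis.

    Assumption: all three inputs are non-negative and already normalized.
    A smaller angle means relevance dominates redundancy and sparsity.
    """
    norm = math.sqrt(relevance ** 2 + redundancy ** 2 + sparsity ** 2)
    if norm == 0.0:
        # A feature with no signal at all is treated as the worst case.
        return math.pi / 2
    cosine = relevance / norm
    # Clamp to guard against floating-point drift outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine)


def select_features(scores, k):
    """Return the k feature names with the smallest angle (hypothetical helper).

    `scores` maps a feature name to a (relevance, redundancy, sparsity) tuple.
    """
    ranked = sorted(scores, key=lambda name: angle_to_relevance_axis(*scores[name]))
    return ranked[:k]


# Made-up indicator values, used only to exercise the functions.
features = {
    "football": (0.9, 0.1, 0.2),
    "the": (0.1, 0.8, 0.1),
    "goalkeeper": (0.7, 0.3, 0.6),
}
top_features = select_features(features, 2)
```

With a single score per feature, one sort replaces three successive weighting and filtering passes, which is the efficiency argument the abstract makes. The choice of axis and the direction of each indicator would need to follow the paper's own definitions.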
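[Affiliation]: School of Software, Liaoning Technical University
[Funding]: National Natural Science Foundation of China (No.70971059); Liaoning Province Innovation Team Project (No.2009T045); Liaoning Province Higher Education Outstanding Young Scholars Growth Program (No.LJQ2012027)
[Classification number]: TP391.1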
Article ID: 2439266
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2439266.html