A Text Classification Algorithm Based on Feature Library Projection
Published: 2018-10-23 18:44
[Abstract]: Mainstream KNN-based text classification strategies suit automatic classification over large sample sets, but they suffer from high time complexity and from information loss caused by feature dimensionality reduction and sample pruning. This paper proposes a classification algorithm based on feature library projection (FLP). The algorithm first builds a feature library from the features of all training samples according to a weighting strategy, so that the library retains the feature information of every sample. Then, using a projection function, the feature library of each class is mapped to a projection sample according to the feature set of the sample to be classified, and classification is completed by computing the similarity between the new sample and each class's projection sample. The proposed algorithm is validated on the corpus compiled by the Natural Language Processing Group of the International Database Center of Fudan University, tested in two scenarios (a small number of training texts and a large number of training texts), and compared with a clustering-based KNN algorithm. Experimental results show that the FLP algorithm loses no classification features and achieves high classification accuracy, and that its classification efficiency is not directly tied to the growth of the sample size, yielding low time complexity.
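The abstract's pipeline (accumulate per-class feature libraries, project each library onto the new sample's feature set, rank classes by similarity) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weighting strategy here is plain summed term frequency and the similarity is cosine, whereas the paper's actual weight strategy and similarity measure may differ; all function names are hypothetical.

```python
from collections import Counter
import math

def build_feature_libraries(docs_by_class):
    """Build one feature library per class: term -> accumulated weight.
    Assumed weight strategy: term frequency summed over the class's
    training documents, so no sample's features are discarded."""
    return {label: sum((Counter(doc) for doc in docs), Counter())
            for label, docs in docs_by_class.items()}

def project(library, features):
    """Map a class's feature library onto the feature set of the new
    sample, yielding that class's 'projection sample'."""
    return {t: float(library.get(t, 0)) for t in features}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return num / (nu * nv) if nu and nv else 0.0

def classify(doc_terms, libs):
    """Assign the class whose projection sample is most similar
    to the new sample."""
    sample = {t: float(c) for t, c in Counter(doc_terms).items()}
    feats = set(sample)
    best, best_sim = None, -1.0
    for label, lib in libs.items():
        sim = cosine(sample, project(lib, feats))
        if sim > best_sim:
            best, best_sim = label, sim
    return best
```

Note that classification cost depends on the new sample's feature count and the number of classes, not on the number of training samples, which matches the abstract's claim that efficiency is decoupled from sample-set growth.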
[Author Affiliations]: Office of Campus Informatization Construction and Management, Hunan University; School of Tourism Management, Hunan University of Commerce; College of Information Engineering and Science, Hunan University
[Funding]: National Natural Science Foundation of China (61672221, 61304184, 61672156)
[CLC Number]: TP391.1
【相似文獻(xiàn)】
相關(guān)期刊論文 前10條
1 Jing Ning, Liu Yu, Peng Fuyang. A practical external sorting algorithm: research and implementation of the quicksort with binary insertion algorithm [J]. Mini-Micro Systems, 1988(09).
2 Zheng Zhijie. A magic-order merge sorting algorithm [J]. Chinese Journal of Computers, 1984(05).
Article ID: 2290153
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2290153.html