An Improved Method for Computing Term Weights in Documents
Published: 2018-11-04 15:38
[Abstract]: The formal representation of text is a fundamental problem in information retrieval fields such as text retrieval, automatic summarization, and search engines. The tf.idf text representation in the vector space model (Vector Space Model) is widely used in this field and performs well. Differences in the proportions with which a word is distributed across a text collection are an important factor in how well the word expresses text content, but the existing tf.idf method cannot capture this factor. To address this problem, this paper introduces the concept of information gain from information theory and proposes tf.idf.IG, an improved text representation based on tf.idf. The method uses a word's information gain as a factor in the text representation to measure the quantitative differences in the word's distribution proportions across the text collection. In text classification experiments, the vector space model with the tf.idf.IG representation classifies better than the tf.idf method, which verifies the effectiveness and feasibility of the improved tf.idf.IG method.
[Author affiliation]: Software Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
[Funding]: 973 Program (G1998030510); National Natural Science Foundation of China (69773008); National 863 Program (863-306-2D02-01-3)
[Classification number]: TP391
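The abstract describes tf.idf.IG only in words, so the following is a minimal sketch of the idea rather than the paper's exact formulation. It assumes the weight is the plain product tf × idf × IG, that idf = log(N / df), and that IG is the standard presence/absence information gain of a term with respect to class labels, as commonly used in text classification. The function names, the toy documents, and the labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(term, docs, labels):
    """Assumed IG: H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]."""
    classes = sorted(set(labels))
    with_t = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_t = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)
    n = len(docs)
    n_t = sum(with_t.values())
    h_c = entropy([labels.count(c) for c in classes])
    h_given = (n_t / n) * entropy([with_t[c] for c in classes]) \
        + ((n - n_t) / n) * entropy([without_t[c] for c in classes])
    return h_c - h_given


def tf_idf_ig(docs, labels):
    """Return one {term: weight} dict per document, plus the IG table."""
    n = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    ig = {t: information_gain(t, docs, labels) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed combination: raw term frequency * log(N/df) * IG.
        vectors.append({t: tf[t] * math.log(n / df[t]) * ig[t] for t in tf})
    return vectors, ig


# Hypothetical toy collection: tokenized documents with class labels.
docs = [
    ["ball", "game", "the"],
    ["game", "score", "the"],
    ["stock", "market", "the"],
    ["market", "price", "the"],
]
labels = ["sports", "sports", "finance", "finance"]

vectors, ig = tf_idf_ig(docs, labels)
for term in ("the", "ball", "game"):
    print(term, round(ig[term], 4))
```

In this toy collection, "ball" and "game" get different idf values because they occur in different numbers of documents, but idf says nothing about how each term is spread across the classes. The IG factor adds that: "game" occurs in exactly the sports documents and receives a larger IG than "ball", which occurs in only one of them. A term such as "the", which occurs in every document, receives zero weight from both idf and IG.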
Article ID: 2310289
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2310289.html