

Further Research on the Co-occurrence Latent Semantic Vector Space Model

Published: 2018-01-26 05:58

  Keywords: vector space model; CLSVSM; TCLSVSM; co-occurrence analysis; clustering  Source: 《情報雜志》 (Journal of Intelligence), 2017, No. 12  Type: journal article


[Abstract]: [Purpose/Significance] Representing documents as vectors is the first task in document clustering. The co-occurrence latent semantic vector space model (CLSVSM) adds semantic information to the vector space model (VSM) by using co-occurrence analysis to mine the maximum latent semantic information between feature-word pairs, and it clearly improves clustering performance on Chinese documents compared with the VSM. The model nevertheless needs further study: its applicability to clustering English documents has yet to be tested; it is unclear whether statistics other than max can be used to build the model, and what clustering quality would result; and because large document collections make the model high-dimensional and computationally expensive, the model needs to be optimized. [Method/Process] CLSVSM is first applied to topic clustering of an English document set (data from Web of Science, abbreviated WOS) and the results are compared with those of the VSM. Corresponding CLSVSM models are then built from three other common statistics (min, ave, and med), and the models built from all four statistics are compared on clustering Chinese and English documents. More importantly, a truncated co-occurrence latent semantic vector space model (TCLSVSM) is proposed and its clustering performance is tested. [Results/Conclusions] The experiments show that CLSVSM is equally applicable to English document clustering; that among the models built from the four statistics, CLSVSM-max gives the best clustering results on both Chinese and English documents; and that TCLSVSM preserves clustering performance while significantly reducing computational cost.
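The abstract names the ingredients of the model (a VSM, co-occurrence analysis over feature-word pairs, one of four statistics, and a truncation step) but does not give its formulas. The sketch below is therefore only one plausible reading, not the authors' definition: it assumes a Jaccard-style co-occurrence strength, assumes that only the zero entries of a document vector are supplemented, and uses a hypothetical `cutoff` parameter to stand in for truncation. The function names `cooccurrence_strength` and `supplement` are illustrative.

```python
from statistics import mean, median

# Aggregators named after the four statistics listed in the abstract.
STATS = {"max": max, "min": min, "ave": mean, "med": median}


def cooccurrence_strength(doc_vectors):
    """Return a term-by-term matrix of co-occurrence strengths.

    Assumption: strength is the Jaccard ratio of the document sets of two
    terms. The abstract does not say which measure the paper uses.
    """
    n_terms = len(doc_vectors[0])
    # For each term, the set of documents in which it has a nonzero weight.
    postings = [
        {d for d, vec in enumerate(doc_vectors) if vec[t] > 0}
        for t in range(n_terms)
    ]
    strength = [[0.0] * n_terms for _ in range(n_terms)]
    for i in range(n_terms):
        for j in range(n_terms):
            union = postings[i] | postings[j]
            if i != j and union:
                strength[i][j] = len(postings[i] & postings[j]) / len(union)
    return strength


def supplement(doc_vectors, stat="max", cutoff=0.0):
    """Fill zero entries of each document vector from co-occurrence strengths.

    For a term absent from a document, the chosen statistic is taken over
    the strengths between that term and every term present in the document.
    Values below the hypothetical `cutoff` are reset to zero, which keeps
    the vectors sparse (a stand-in for the truncation idea).
    """
    agg = STATS[stat]
    strength = cooccurrence_strength(doc_vectors)
    result = []
    for vec in doc_vectors:
        present = [t for t, w in enumerate(vec) if w > 0]
        new_vec = list(vec)
        for t, w in enumerate(vec):
            if w == 0 and present:
                value = agg([strength[t][p] for p in present])
                new_vec[t] = value if value >= cutoff else 0.0
        result.append(new_vec)
    return result


# Toy corpus: three documents over the terms (a, b, c), binary weights.
docs = [[1, 1, 0], [0, 1, 1], [1, 1, 0]]
for name in STATS:
    print(name, supplement(docs, stat=name))
```

Under these assumptions, the max statistic fills every gap with the strongest available association, while min, ave, and med give progressively different weightings of weak associations. Raising `cutoff` trades supplemented information for sparsity, which is the kind of cost reduction the abstract attributes to TCLSVSM.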
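[Affiliation]: School of Mathematical Sciences, Shanxi University; Institute of Management and Decision, Shanxi University
[Funding]: A research output of the National Natural Science Foundation of China project "Research on the Construction and Application of the Co-occurrence Latent Semantic Vector Space Model and Its Semantic Kernel" (No. 71503151) and of the Shanxi Province higher-education innovative talent support program "Research on Deep Topic Clustering of Text Information Based on Latent Semantics" (No. 2016052006)
[CLC number]: G353.1; TP391.1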
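[Text snapshot]: 0 Introduction. The big-data era has made information resources unprecedentedly abundant, and the vast majority of them are text. How to process this information effectively is a key research problem in text mining, information retrieval, and related fields. Text resources differ from ordinary data resources in two ways: first, text data is semi-structured or unstructured; second, text data carries a large amount of semantic information. Traditional data mining algorithms cannot…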

