Construction of a Semantic Kernel Based on the Co-occurrence Latent Semantic Vector Space Model
Published: 2018-10-29 20:28
[Abstract]: Knowledge discovery for resource aggregation in digital libraries depends on an effective representation of knowledge. As classical text representation models, the vector space model (VSM) and its derivatives play an important role in information retrieval and knowledge discovery research, but they still have shortcomings. The co-occurrence latent semantic vector space model (CLSVSM), a newer text representation model, clearly improves text clustering accuracy compared with VSM. For large-scale text data, however, the co-occurrence matrix tends to be high-dimensional, which makes the model computationally expensive. This paper therefore constructs a semantic kernel (CLSVSM_K) on top of CLSVSM, following the idea of latent semantic analysis (LSA). CLSVSM_K not only reduces the dimension of the co-occurrence matrix but also merges synonymy information among text feature words. The semantic kernel model is applied to topic clustering of documents. The experimental results show that the method does reduce the dimension of the feature-word space and the computational complexity, improves the performance of the clustering algorithm, and improves the precision of document topic clustering. Applying the model should help with information resource organization, knowledge discovery, and knowledge optimization in digital libraries.
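[Author affiliations]: School of Mathematical Sciences, Shanxi University; Institute of Management and Decision, Shanxi University
[Funding]: National Natural Science Foundation of China, "Research on the Construction and Application of the Co-occurrence Latent Semantic Vector Space Model and Its Semantic Kernel" (71503151); Shanxi Province Higher Education Innovative Talent Support Program, "Research on Deep Topic Clustering of Text Information Based on Latent Semantics" (2016052006)
[Classification number]: TP391.1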
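The abstract gives no formulas for CLSVSM_K, so the following is only a minimal sketch of the general LSA idea it builds on, not the authors' construction. It assumes a plain term-by-document count matrix in place of the paper's co-occurrence matrix, and the function names (`lsa_semantic_kernel`, `cosine_from_kernel`), the toy vocabulary, and the choice `k = 2` are all hypothetical. Documents are projected onto the top-k left singular vectors, and the kernel is the inner product in that reduced space.

```python
import numpy as np


def lsa_semantic_kernel(X, k):
    """Generic LSA-style kernel (illustrative; not the paper's CLSVSM_K).

    X : (n_terms, n_docs) term-by-document matrix.
    k : number of latent dimensions to keep.
    Returns (K, Z): the (n_docs, n_docs) kernel matrix and the
    (k, n_docs) reduced document representation.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]      # top-k left singular vectors (latent term directions)
    Z = U_k.T @ X       # project every document into the k-dim latent space
    K = Z.T @ Z         # kernel: X^T U_k U_k^T X
    return K, Z


def cosine_from_kernel(K, i, j):
    """Cosine similarity of documents i and j implied by a kernel matrix."""
    return K[i, j] / np.sqrt(K[i, i] * K[j, j])


# Hypothetical toy corpus. Rows: car, automobile, engine, library, book.
# Columns: four documents; d0 and d1 use different words for the same topic.
X = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 0],   # library
    [0, 0, 1, 1],   # book
], dtype=float)

raw_K = X.T @ X                      # ordinary VSM inner products
K, Z = lsa_semantic_kernel(X, k=2)   # 5 term dimensions reduced to 2

print("raw cosine(d0, d1):   ", round(cosine_from_kernel(raw_K, 0, 1), 3))
print("latent cosine(d0, d1):", round(cosine_from_kernel(K, 0, 1), 3))
print("latent cosine(d0, d2):", round(cosine_from_kernel(K, 0, 2), 3))
```

On this toy data the two vehicle documents share only the word "engine", so their raw cosine is 0.5. After reduction to two latent dimensions, "car" and "automobile" fall on the same latent direction and the cosine rises to about 1, while the vehicle and library documents stay orthogonal. This is the kind of dimension reduction and synonym merging the abstract attributes to the semantic kernel, shown here under the simplifying assumptions stated above.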
Article ID: 2298727
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2298727.html