Research on a Topic Identification Method Based on Clustering Clique Subgroups in Multi-Relational Text Graphs
Published: 2018-12-11 11:29
[Abstract]: Now that the web has become the primary channel for scholarly communication and information dissemination, more and more institutions publish their research results electronically, and these electronic text resources carry rich semantic information. Faced with such a volume of material, researchers find it hard to capture the main content of the texts in a short time. How to present the core topics of text resources efficiently and accurately, help researchers focus on the important related information in a text collection, and thereby improve research efficiency has long been an important problem in text mining. Building on existing work and on the characteristics of terms and term relations in text, the paper proposes the following overall approach: represent the terms in a text and the co-occurrence, syntactic, and semantic relations between them as a graph; identify the closely connected subgroups in this text relation graph; and cluster those subgroups to reveal the sub-topics of the text. Two pieces of work were carried out: (1) the terms in the text collection and the various relation attributes between them were overlaid and merged to construct a multi-relational text overlay model; (2) clique subgroups were clustered, based on the similarity distance between them and on their semantic identifiers, to identify the important sub-topics contained in the text collection. The paper builds a text collection from the last five years of literature on the topic "migraine disorders" and runs two validation experiments on the proposed method. Experiment 1 compares the output with the indexing terms assigned by domain experts, grouped by semantic type; the results show that the proposed method is consistent with the experts' semantic-type grouping. Experiment 2 compares the output with that of the widely used LDA method; the proposed method improves on LDA in both precision and recall. Both experiments support the effectiveness of the proposed method.
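The pipeline outlined in the abstract (overlay several relation types into one term graph, find tightly connected clique subgroups, then cluster the cliques into sub-topics) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the toy migraine-related term pairs, the function names, the use of Bron-Kerbosch enumeration for maximal cliques, Jaccard overlap as the clique similarity, the single-linkage merge rule, and the 0.3 threshold. The abstract does not say how the similarity distance or the semantic identifiers are computed, so the semantic-identifier step is omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(relations):
    """Overlay relation layers into one undirected term graph.

    relations: iterable of (term_a, term_b, relation_type) triples.
    Returns (adjacency dict, dict mapping each edge to its set of relation types).
    """
    edge_types = defaultdict(set)
    for a, b, rel in relations:
        if a != b:
            edge_types[frozenset((a, b))].add(rel)
    adj = defaultdict(set)
    for pair in edge_types:
        a, b = tuple(pair)
        adj[a].add(b)
        adj[b].add(a)
    return adj, edge_types


def maximal_cliques(adj, min_size=3):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def jaccard(a, b):
    """Share of terms two cliques have in common (assumed similarity measure)."""
    return len(a & b) / len(a | b)


def cluster_cliques(cliques, threshold=0.3):
    """Single-linkage grouping: merge cliques whose Jaccard overlap >= threshold."""
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if jaccard(cliques[i], cliques[j]) >= threshold:
            parent[find(i)] = find(j)
    groups = defaultdict(set)
    for i, clique in enumerate(cliques):
        groups[find(i)] |= clique
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: (term, term, relation type) triples.
relations = [
    ("migraine", "aura", "cooccurrence"),
    ("migraine", "headache", "cooccurrence"),
    ("aura", "headache", "syntactic"),
    ("migraine", "headache", "semantic"),
    ("migraine", "photophobia", "cooccurrence"),
    ("aura", "photophobia", "semantic"),
    ("triptan", "sumatriptan", "semantic"),
    ("triptan", "serotonin", "cooccurrence"),
    ("sumatriptan", "serotonin", "syntactic"),
    ("migraine", "triptan", "cooccurrence"),
]

adj, edge_types = build_graph(relations)
cliques = maximal_cliques(adj)
topics = cluster_cliques(cliques)
for topic in topics:
    print(topic)
```

In this toy graph the two symptom cliques share two of four terms and are merged into one candidate sub-topic, while the drug-related clique stays separate. The set of relation types stored per edge is where a weighting scheme for overlaid relations could be attached.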
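[Author affiliations]: National Science Library, Chinese Academy of Sciences; Wuhan Library, Chinese Academy of Sciences
[Funding]: Young Talents Frontier Project of the National Science Library, Chinese Academy of Sciences, "Research on Graph-Pattern-Based Topic Semantic Annotation Methods for Scientific Literature" (G160081001)
[Classification number]: G254
Article ID: 2372467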
Article link: http://sikaile.net/tushudanganlunwen/2372467.html