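A Keyword Extraction Method for Sparse Geographical Entity Relations
Posted: 2018-04-01 22:27
Topic: geographic information retrieval  Focus: geographical entity relations  Source: 《地球信息科學學報》 (Journal of Geo-information Science), 2016, Issue 11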
[Abstract]: Techniques for extracting the geographical entity relations contained in web text require keyword extraction methods that are both timely and robust. Compared with supervised learning methods, unsupervised learning methods can capture the dynamically changing characteristics of text and discover newly emerging relation types, and have therefore attracted much attention. Among them, frequency-based keyword extraction methods have been widely studied. However, geographical entity relations are sparsely distributed in web text, so frequency-based methods are difficult to apply directly to keyword extraction for geographical entity relations. To address this problem, this paper proposes a context-enhanced keyword extraction method based on publicly accessible web resources. First, using online encyclopedias and an open synonym dictionary, enhanced contexts are created through context merging and semantic fusion, which reduces the sparsity of the words in each context. Next, the Domain Frequency and Entropy frequency statistics are used to automatically build a large-scale corpus from the enhanced contexts. Then, lexical features are selected based on this corpus and their weights are computed, in order to widen the differences between words in a context. Finally, the selected lexical features are used to measure the importance of the words in an enhanced context, and the word with the largest weight is taken as the keyword describing the geographical entity relation. Experiments were conducted on large-scale real web text. The results show that, for keyword identification of geographical entity relations, the average precision of the proposed method is 85.5%, which is 41% and 36% higher than the Domain Frequency and Entropy methods, respectively; for the identification of newly emerging keywords, the precision of the proposed method reaches 60.3%. The context-enhanced keyword extraction method can effectively handle the sparse distribution of geographical entity relations and can support the extraction of geographical entity relations contained in web text.
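The general idea in the abstract (merge sparse contexts, fuse synonyms, then weight words by a frequency statistic) can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the formulas for Domain Frequency or Entropy, so the weighting below (in-relation frequency scaled by one minus the normalized entropy of a word's distribution across relation types) is an assumption chosen for illustration. The function names, relation labels, and the toy synonym dictionary are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def enhance_contexts(contexts, synonyms):
    """Merge contexts by relation type and map each word to a canonical synonym.

    contexts: iterable of (relation_label, list_of_words) pairs.
    synonyms: dict mapping a word to its canonical form (hypothetical dictionary).
    """
    merged = defaultdict(list)
    for relation, words in contexts:
        merged[relation].extend(synonyms.get(w, w) for w in words)
    return dict(merged)


def entropy_weights(merged):
    """Assumed weighting: frequency * (1 - normalized entropy across relations).

    A word spread evenly over all relation types gets a weight near 0;
    a word concentrated in one relation type keeps its full frequency.
    """
    counts = {rel: Counter(words) for rel, words in merged.items()}
    n_rel = len(counts)
    weights = {}
    for rel, counter in counts.items():
        weights[rel] = {}
        for word, freq in counter.items():
            dist = [c[word] for c in counts.values() if c[word] > 0]
            total = sum(dist)
            h = -sum((x / total) * math.log(x / total) for x in dist)
            h_norm = h / math.log(n_rel) if n_rel > 1 else 0.0
            weights[rel][word] = freq * (1.0 - h_norm)
    return weights


def top_keyword(weights, relation):
    """Return the highest-weighted word for one relation type."""
    return max(weights[relation], key=weights[relation].get)


# Toy data (invented for illustration only).
contexts = [
    ("located_in", ["city", "lies", "in", "province"]),
    ("located_in", ["town", "situated", "in", "county"]),
    ("adjacent_to", ["city", "borders", "in", "north"]),
    ("adjacent_to", ["county", "borders", "in", "east"]),
]
synonyms = {"lies": "located", "situated": "located"}

merged = enhance_contexts(contexts, synonyms)
weights = entropy_weights(merged)
print(top_keyword(weights, "located_in"))
print(top_keyword(weights, "adjacent_to"))
```

In the toy data, "lies" and "situated" each occur once, so neither would stand out on raw frequency; after synonym fusion they count as a single word occurring twice, which is the sparsity-reduction effect the abstract describes.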
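[Author affiliations]: State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Key Laboratory of Virtual Geographic Environment (Ministry of Education), Nanjing Normal University
[Funding]: National "863" Program of China (2013AA120305); National Natural Science Foundation of China (41401460, 41271408, 41601421)
[Classification number]: TP391.1; P209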
Article ID: 1697595
Article link: http://sikaile.net/kejilunwen/dizhicehuilunwen/1697595.html