A Study of the Characteristics and Identification of Cited Text Spans in Literature
Published: 2017-12-28 03:35
Source: Data Analysis and Knowledge Discovery (《數據分析與知識發現》), 2017, Issue 11. Paper type: journal article
Related keywords: cited text span; identification method; citation context; citation object
[Abstract]: [Objective] To analyze the characteristics of cited text spans in scientific and technical literature, and to compare how well different identification methods perform. [Methods] Using the annotated cited-span data from the CL-SciSumm 2016 competition as an example, the study examines the length, position, and importance of cited text spans, and analyzes how they correlate in length and position with their corresponding citation contexts. It then compares similarity algorithms based on the bag-of-words model, a topic model, and the WordNet semantic dictionary for identifying cited text spans. [Results] 96% of the annotated cited text spans are shorter than three sentences, and they appear more often in the early part of an article and in the early part of a section. The mean TextRank weight of cited text spans is significantly higher than that of other spans. Cited text spans and citation contexts are significantly correlated in length, but show no clear correlation in position. Whether measured by MMR or by matching at the sentence and word level, the bag-of-words method outperforms the semantic-dictionary method, which in turn clearly outperforms the topic-model method. [Limitations] The analysis of the concept and properties of cited text spans remains at the theoretical level, and both the feature analysis and the comparison of identification methods were carried out only on the CL-SciSumm 2016 annotated data. [Conclusion] Wording in scientific and technical literature is relatively standardized and rigorous, so lexical features play a key role in identifying cited text spans.
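The bag-of-words comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the use of raw term-frequency cosine similarity, the function names (`tokenize`, `cosine`, `rank_cited_spans`), and the example sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_cited_spans(citation_context, candidate_sentences, top_k=2):
    """Rank sentences of the cited paper by similarity to the citation context.

    Returns up to top_k tuples of (score, sentence_index, sentence),
    highest score first; ties are broken by earlier position.
    """
    ctx = Counter(tokenize(citation_context))
    scored = [
        (cosine(ctx, Counter(tokenize(s))), i, s)
        for i, s in enumerate(candidate_sentences)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


# Hypothetical example data, not taken from CL-SciSumm 2016.
context = "Smith et al. use a bag of words model to match citation sentences."
sentences = [
    "We thank the reviewers for their comments.",
    "Our approach uses a bag of words model to match sentences.",
    "Experiments were run on a small corpus.",
]
ranked = rank_cited_spans(context, sentences, top_k=2)
for score, index, sentence in ranked:
    print(f"{score:.3f}  [{index}]  {sentence}")
```

The default `top_k=2` is an arbitrary choice, loosely motivated by the reported finding that most annotated spans are shorter than three sentences. A semantic-dictionary or topic-model variant would replace only the `cosine` scoring step.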
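[Affiliations]: Center for Studies of Information Resources, Wuhan University; School of Information Management, Central China Normal University
[Classification number]: G353.1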
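[Text snapshot]: 1 Introduction. The citation count of a document reflects, to some extent, its contribution to and influence on the academic community. However, citation count only indicates the influence and value of the document as a whole; only a deeper analysis of citing behavior can reveal which parts of a cited document are influential in the field. As the full text of academic papers becomes easier to obtain, the identification and extraction of citation context,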
Article ID: 1344441
Article link: http://sikaile.net/tushudanganlunwen/1344441.html