A Large-Scale Text Reprint Identification Algorithm Based on Cluster Words
Published: 2018-03-28 07:03
Topic: reprint identification. Entry point: cluster words. Source: Journal of Computer Applications (《计算机应用》), 2010, Issue 06
[Abstract]: Text reprint identification is the task of detecting sets of documents with identical or near-identical content in a large-scale text collection. It is widely needed in applications such as hot-topic detection, condensing search engine results, and plagiarism detection in academic articles. To cope with the increasingly diverse forms of reprinting in web text and to further improve the efficiency of practical systems, the authors study and analyze a range of text features and comparison algorithms and propose a large-scale text reprint identification algorithm based on cluster words. The algorithm identifies and extracts high-scoring cluster words, according to the distribution properties of words, to represent each text. It then applies two operations to the text collection, an extended linear comparison and a multidimensional comparison, to filter out the final reprint identification results. Comparative experiments show that the algorithm achieves strong overall performance in precision, recall, and efficiency.
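The abstract names three stages (cluster-word extraction, a first comparison pass, and a second comparison pass) but gives no formulas, so the following Python sketch is only an illustration of that overall shape, not the authors' method. Everything concrete in it is an assumption: the scoring rule in `clump_scores` (occurrence count divided by mean gap between occurrences), the signature size `k`, the inverted-index bucketing used as a stand-in for the extended linear comparison, the Jaccard overlap used as a stand-in for the multidimensional comparison, and the `threshold` value.

```python
import re
from collections import defaultdict


def clump_scores(tokens):
    """Score each repeated word by how tightly its occurrences clump together.

    Assumption: score = count / mean gap between consecutive occurrences.
    The paper's actual scoring function is not given in the abstract.
    """
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    scores = {}
    for tok, pos in positions.items():
        if len(pos) < 2:
            continue  # a word seen once has no distribution to measure
        mean_gap = (pos[-1] - pos[0]) / (len(pos) - 1)  # always >= 1
        scores[tok] = len(pos) / mean_gap
    return scores


def signature(text, k=5):
    """Represent a text by its k highest-scoring words (hypothetical feature set)."""
    tokens = re.findall(r"\w+", text.lower())
    scores = clump_scores(tokens)
    # Sort by descending score, breaking ties alphabetically for determinism.
    top = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return frozenset(top)


def find_reprints(docs, k=5, threshold=0.6):
    """Two-pass comparison: cheap candidate generation, then overlap check."""
    sigs = {doc_id: signature(text, k) for doc_id, text in docs.items()}

    # Pass 1 (stand-in for the first comparison): bucket documents by each
    # signature word so that only documents sharing a word become candidates.
    buckets = defaultdict(set)
    for doc_id, sig in sigs.items():
        for word in sig:
            buckets[word].add(doc_id)
    candidates = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                candidates.add((a, b))

    # Pass 2 (stand-in for the second comparison): keep candidate pairs whose
    # signatures have a Jaccard overlap at or above the threshold.
    pairs = []
    for a, b in sorted(candidates):
        overlap = len(sigs[a] & sigs[b]) / len(sigs[a] | sigs[b])
        if overlap >= threshold:
            pairs.append((a, b))
    return pairs
```

Calling `find_reprints` on a dictionary that maps document IDs to text returns a sorted list of ID pairs judged to be reprints of each other. The regular-expression tokenizer only suits space-delimited text; Chinese input would need a word segmenter in its place, which is outside this sketch.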
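[Author affiliations]: Joint Research Institute of Computer Science, Capital Normal University; Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science, Beijing Institute of Technology
[Funding]: National 863 Program project (2007AA01Z438); 2008 Knowledge Innovation Fund project of the Institute of Computing Technology, Chinese Academy of Sciences
[Classification number]: TP391.1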
Article ID: 1675242
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1675242.html