An Improved Web Page Deduplication Algorithm Based on Character-Set Feature Vectors
Published: 2019-06-29 07:45
[Abstract]: Web page deduplication that computes a digital fingerprint with the MD5 algorithm is simple and efficient, and it is widely used in the field. However, MD5 is a strict message-digest (cryptographic hash) algorithm: even a very small change in an article's content yields a completely different fingerprint, so deduplication techniques built on it have a low recall rate. This paper proposes two improved web page deduplication algorithms based on character-set feature vectors. They map an article's content into a character-set space and compute the distance in that space to judge whether two articles are similar. The proposed algorithms generalize well: reordering the text within a paragraph, or adding, deleting, or changing a few individual characters, does not prevent similar paragraphs from being recognized, which greatly improves the recall rate of web page deduplication. Experimental results show that the algorithms have O(n) time complexity and O(1) space complexity, making them suitable for large-scale web page deduplication.
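The contrast the abstract draws between an MD5 fingerprint and a distance in character-set space can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual feature vector or distance measure, so everything here is an assumption for illustration only: the helper names (`md5_fingerprint`, `char_vector`, `cosine_similarity`, `is_near_duplicate`), the use of character counts as the feature vector, cosine similarity as the distance, and the 0.9 threshold are all hypothetical and are not the authors' method.

```python
import hashlib
import math
from collections import Counter


def md5_fingerprint(text: str) -> str:
    """Exact fingerprint: any change to the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def char_vector(text: str) -> Counter:
    """Assumed feature vector: occurrence count of each non-space character.

    Built in one pass over the text; its size is bounded by the number of
    distinct characters, not by the length of the text.
    """
    return Counter(ch for ch in text if not ch.isspace())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two character-count vectors."""
    dot = sum(count * b[ch] for ch, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(x: str, y: str, threshold: float = 0.9) -> bool:
    """Hypothetical decision rule; the threshold is an illustrative value."""
    return cosine_similarity(char_vector(x), char_vector(y)) >= threshold


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    reordered = "over the lazy dog the quick brown fox jumps"
    # The digests differ even though only the word order changed.
    print(md5_fingerprint(original) == md5_fingerprint(reordered))
    # The character counts are identical, so the texts are judged similar.
    print(is_near_duplicate(original, reordered))
```

Because the vector ignores character order, a reordered paragraph maps to the same point in the space, which is the property the abstract credits for the higher recall. The same property means that unrelated texts with similar character distributions could also score as similar, so a real system would need to choose the threshold against labeled data.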
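[Author affiliation]: Department of Computer Science, China University of Petroleum (Beijing)
[Funding]: National "Tenth Five-Year Plan" Key Science and Technology Project (No.2001BA605A09)
[Classification number]: TP393.092
Article ID: 2507666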
Article link: http://sikaile.net/guanlilunwen/ydhl/2507666.html