A Density-Clustering-Based Mixed-Encoding Detection Algorithm for Search Engines
Published: 2018-10-31 19:31
[Abstract]: Search engines rely on many key technologies. This paper focuses on Chinese HTML files on the Internet that mix several character encodings. It first studies how character encodings are composed within Chinese HTML files, then clusters the content of mixed-encoding files with DBSCAN, a classical data-mining algorithm, dividing each HTML file into a few major classes, and finally runs feature-based encoding detection on each class separately. Experimental results show that, with suitably chosen parameters, each class obtained by clustering a mixed-encoding file matches the characteristic encoding of Chinese characters at a rate of 100%, so the method can be widely applied in the search field.
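[Affiliations]: College of Computer Science and Technology, Zhejiang University; Unit 73610, Nanjing Military Region, Chinese People's Liberation Army
[Funding]: National Science and Technology Support Program (2008BAH21B03); Zhejiang Province Public Welfare Technology Applied Research Program (2010C31003)
[CLC Number]: TP391.3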
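The abstract gives no implementation details, so the following is only a minimal sketch of the pipeline it describes (cluster the content of a mixed-encoding file, then detect the encoding of each cluster), written under several assumptions. The two byte-level features, the line-based segmentation, the candidate list `("utf-8", "gbk")`, the parameter defaults, and all function names are hypothetical choices made for illustration and are not taken from the paper. DBSCAN is written out in plain Python to keep the example self-contained.

```python
import math

CANDIDATES = ("utf-8", "gbk")  # assumed candidate encodings, tried in this order


def features(segment: bytes) -> tuple[float, float]:
    """Two illustrative features for a segment that contains high bytes."""
    high = sum(b >= 0x80 for b in segment)
    # Share of high bytes that fall in 0x80-0xBF (UTF-8 continuation range).
    continuation = sum(0x80 <= b < 0xC0 for b in segment) / high
    # Share of high bytes that survive a lenient UTF-8 decode, i.e. that
    # sit inside well-formed UTF-8 sequences.
    kept = segment.decode("utf-8", errors="ignore").encode("utf-8")
    well_formed = sum(b >= 0x80 for b in kept) / high
    return (continuation, well_formed)


def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:      # not a core point (count includes i)
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:       # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # expand only from core points
                queue.extend(reachable)
    return labels


def first_strict_decoder(data: bytes):
    """Name of the first candidate that decodes data without error."""
    for name in CANDIDATES:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


def detect_line_encodings(page: bytes, eps: float = 0.3, min_pts: int = 2):
    """Return one encoding name (or None) per line of a mixed-encoding page."""
    lines = page.split(b"\n")
    # Pure-ASCII lines are valid in every candidate, so only lines with
    # high bytes are clustered.
    idx = [i for i, ln in enumerate(lines) if any(b >= 0x80 for b in ln)]
    labels = dbscan([features(lines[i]) for i in idx], eps, min_pts)
    result = ["ascii"] * len(lines)
    for i in idx:
        result[i] = None              # stays None for noise or no match
    for cluster in set(labels) - {-1}:
        members = [i for i, lab in zip(idx, labels) if lab == cluster]
        name = first_strict_decoder(b"\n".join(lines[i] for i in members))
        for i in members:
            result[i] = name
    return result
```

Calling `detect_line_encodings(page)` on the raw bytes of a page yields one codec name per line, which can then be passed to `bytes.decode` line by line. Trying candidates in a fixed order is a simplification: telling GBK apart from Big5, for example, would need character-frequency statistics that this sketch does not attempt.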
Article ID: 2303317
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2303317.html