A Chinese Fuzzy Matching Algorithm Based on a Pinyin Index
Published: 2019-08-05 11:19
[Abstract]: Mainstream commercial search engines rely mainly on exact keyword matching. To improve retrieval efficiency when the user's input contains errors, an indexed Chinese fuzzy matching algorithm is proposed. The algorithm uses three different measures of similarity between Chinese character strings (Chinese characters, pinyin, and a pinyin-improved edit distance) to expand the user's query, turning one fuzzy match into multiple exact matches, and then ranks the exact-match results by their similarity to the query string. In the experiments, the method was applied to a corpus of web page text. With the pinyin-improved edit distance measure, the method achieved 60.42% precision and 50.41% recall with only a small increase in time and space complexity.
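The idea of turning a fuzzy match into exact matches can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy `PINYIN` table, the function names, and the use of plain character-level Levenshtein distance for ranking are all assumptions made for illustration (the paper's pinyin-improved edit distance is not specified in the abstract). Terms are indexed by their toneless pinyin key, a query is expanded to every indexed term sharing that key through one exact lookup, and the candidates are ranked by similarity to the query.

```python
from collections import defaultdict

# Hypothetical toy pinyin table (toneless); a real system would load a full dictionary.
PINYIN = {
    "清": "qing", "青": "qing", "情": "qing",
    "华": "hua", "花": "hua", "话": "hua",
    "大": "da", "学": "xue",
}


def to_pinyin(text):
    """Map each character to its pinyin syllable; unknown characters map to themselves."""
    return [PINYIN.get(ch, ch) for ch in text]


def edit_distance(a, b):
    """Plain Levenshtein distance between two sequences (an assumed stand-in metric)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def build_index(terms):
    """Index every term under its space-joined pinyin key."""
    index = defaultdict(set)
    for term in terms:
        index[" ".join(to_pinyin(term))].add(term)
    return index


def fuzzy_search(query, index):
    """Expand the query to all terms with the same pinyin key (an exact lookup),
    then rank them by character-level edit distance to the query."""
    key = " ".join(to_pinyin(query))
    candidates = index.get(key, set())
    return sorted(candidates, key=lambda t: (edit_distance(query, t), t))


if __name__ == "__main__":
    idx = build_index(["清华大学", "青花大学", "清华", "情话"])
    # "青华" is a homophone typo; both indexed homophones are returned, closest first.
    print(fuzzy_search("青华", idx))
```

Because the lookup itself is an exact key match, the cost of fuzziness is paid in the size of the index rather than in a scan over the corpus, which is consistent with the abstract's claim of a small increase in time and space complexity.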
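[Author affiliation]: Department of Computer Science and Technology, Tsinghua University; Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology
[Funding]: National Natural Science Foundation of China (60703051)
[Classification number]: TP391.1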