Research on Similarity Comparison of Traffic Terms Based on Vertical Topic Search

Published: 2018-07-14 15:48

[Abstract]: Computing the similarity between nouns and standard terms within a research field is a prerequisite and foundation for data mining and natural language processing in that field. Web-PMI is an algorithm that computes term similarity from search-engine hit counts: the number of hits a search engine returns for term queries is enough to quantify the similarity of a term pair. However, large general-purpose search engines return too few hits for terms restricted to a specific domain, which often degrades the similarity calculation. This thesis aims to improve term retrieval hits, and thereby the accuracy of term similarity calculation, by building a vertical search engine system for the traffic domain.
The thesis first studies and implements the construction of a traffic-themed vertical search engine. The main task is crawling web pages in the traffic domain that contain traffic terms. Within the architecture of the open-source crawler project Heritrix, the thesis develops its own traffic-themed page crawling program, which implements crawling restricted to the traffic topic.
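One simple way to picture topic-restricted crawling is a relevance check applied to each fetched page before it is kept. The abstract does not state the rule the thesis uses, and the sketch below is not Heritrix code; the function name `is_on_topic`, the term list, and the threshold `min_distinct_terms` are all hypothetical and chosen only for illustration.

```python
def is_on_topic(page_text, domain_terms, min_distinct_terms=2):
    """Return True if the page mentions enough distinct domain terms.

    Assumption: plain case-insensitive substring matching stands in for
    whatever relevance rule the thesis actually applies.
    """
    text = page_text.lower()
    found = {term for term in domain_terms if term.lower() in text}
    return len(found) >= min_distinct_terms


# Hypothetical term list; a real one would come from a traffic terminology standard.
TRAFFIC_TERMS = ["traffic flow", "intersection", "signal timing", "road network"]

on_topic_page = "Signal timing at a busy intersection affects traffic flow."
off_topic_page = "A recipe for chocolate cake."

keep_first = is_on_topic(on_topic_page, TRAFFIC_TERMS)
keep_second = is_on_topic(off_topic_page, TRAFFIC_TERMS)
```

Requiring more than one distinct term is a cheap guard against pages that mention a single traffic word in passing; a production crawler would more likely use a trained classifier or link-context scoring.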
Second, the crawled web pages are parsed, redundant information in the pages is filtered out, and the index library of the retrieval system is built. The index library is built by an indexing program written on top of open-source Lucene. It creates an ordered index over the parsed traffic-themed pages, supports full-text retrieval of traffic terms in the index library, and returns the specific hit count of each term in the index library.
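The role the index plays here, returning how many documents contain a term or a pair of terms, can be illustrated with a toy inverted index. This is a minimal Python sketch and not Lucene code: the class `TinyIndex` and its methods are hypothetical, tokens are single whitespace-separated words, and real traffic terms (often multi-word, and in Chinese) would need proper word segmentation and phrase queries.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each token to the ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._doc_count = 0

    def add(self, text):
        """Index one document and return its id."""
        doc_id = self._doc_count
        self._doc_count += 1
        for token in text.lower().split():
            self._postings[token].add(doc_id)
        return doc_id

    def hits(self, *tokens):
        """Number of documents containing every given token (boolean AND)."""
        if not tokens:
            return 0
        posting_lists = [self._postings.get(t.lower(), set()) for t in tokens]
        return len(set.intersection(*posting_lists))

    @property
    def size(self):
        """Total number of indexed documents."""
        return self._doc_count


index = TinyIndex()
for text in [
    "roundabout intersection design",
    "intersection signal control",
    "signal timing plan",
]:
    index.add(text)

hits_intersection = index.hits("intersection")
hits_signal = index.hits("signal")
hits_both = index.hits("intersection", "signal")
```

The three counts (each term alone, and both together) plus the index size are exactly the inputs a hit-count-based similarity measure needs.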
Finally, the Web-PMI algorithm is used in experiments on similarity calculation for standard traffic terms. Within the algorithm, the query expression for traffic terms is reconstructed and search operators are added, which reduces ambiguity in the retrieval results, raises their domain relevance, and improves the algorithm's performance. Analysis of the experimental results shows that the improved query expression increases the hit counts of terms and removes some of the effect that chance co-occurrence of terms has on the similarity calculation.
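For readers unfamiliar with Web-PMI, the base score can be sketched from hit counts alone. The version below follows a common formulation of pointwise mutual information over page counts, with a co-occurrence threshold that zeroes out pairs seen together too rarely. The function name `web_pmi`, the parameter names, and the default threshold of 5 are assumptions made for illustration; the abstract does not give the reconstructed query expression or the operators the thesis adds, so those are not modeled.

```python
import math


def web_pmi(hits_p, hits_q, hits_pq, total_docs, min_cooccurrence=5):
    """Pointwise mutual information of two terms, estimated from hit counts.

    hits_p, hits_q: documents matching each term alone.
    hits_pq: documents matching both terms.
    total_docs: number of documents in the index.
    Pairs that co-occur in no more than min_cooccurrence documents score 0,
    which damps the effect of chance co-occurrence.
    """
    if hits_pq <= min_cooccurrence or hits_p == 0 or hits_q == 0:
        return 0.0
    p_joint = hits_pq / total_docs
    p_p = hits_p / total_docs
    p_q = hits_q / total_docs
    return math.log2(p_joint / (p_p * p_q))


# Hypothetical counts: the pair co-occurs far more often than independence predicts.
related_score = web_pmi(hits_p=200, hits_q=100, hits_pq=50, total_docs=10_000)

# Hypothetical counts: only 3 co-occurrences, treated as noise.
noisy_score = web_pmi(hits_p=200, hits_q=100, hits_pq=3, total_docs=10_000)
```

With a small domain index, `total_docs` is the size of that index rather than an estimate of the whole web, which is one reason hit counts from a vertical engine can be easier to interpret for rare terms.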
The proposed method was applied in the "traffic information consistency detection research" project. The application results show that the traffic vertical search engine system built in this thesis works well when computing similarity for rare terms in the traffic domain, and that its calculation accuracy is also slightly higher than that obtained with the commercial search engine Alta Vista. The method is equally applicable to term similarity calculation in other specialized fields, and it can also effectively support terminology standardization, identification of synonyms and near-synonyms, semantic retrieval, and analogical detection of terminology standards.
[Degree-granting institution]: Chang'an University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1;U11-61

Article ID: 2122141

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2122141.html
