Research on Similarity Comparison of Traffic Terminology Based on Vertical Topic Search
[Abstract]: Computing the similarity between the nouns used in a field and that field's standard terms is a prerequisite and foundation for data mining and natural language processing in professional domains. Algorithms that compute term similarity from search-engine hit counts quantify similarity using the number of hits a search engine returns for term queries. With a large general-purpose search engine, however, the number of hits for terms confined to a specific domain is limited, which often degrades the similarity calculation. This thesis builds a vertical search engine system for traffic topics in order to improve term retrieval and thereby raise the precision of term similarity calculation.
The thesis first studies and implements a vertical search engine for the traffic theme. The main task at this stage is to crawl web pages in the traffic domain that contain traffic terms. The traffic-topic web crawler is developed on the framework of the open-source crawler project Heritrix.
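As an illustration of the topic filter such a crawler needs, the sketch below keeps only fetched pages whose text mentions at least one traffic term. It is plain Python rather than Heritrix code, and the function names and the term list are hypothetical; the thesis's actual crawl rules are not described here.

```python
def contains_traffic_term(page_text, traffic_terms):
    """Return True if the page text mentions at least one of the terms.

    Matching is a case-insensitive substring test, which is an assumption
    made for this sketch and not the thesis's actual relevance rule.
    """
    text = page_text.lower()
    return any(term.lower() in text for term in traffic_terms)


def filter_pages(pages, traffic_terms):
    """Keep only the (url, text) pairs whose text mentions a traffic term."""
    return [
        (url, text)
        for url, text in pages
        if contains_traffic_term(text, traffic_terms)
    ]
```

Calling `filter_pages` on a list of `(url, text)` pairs with a term list such as `["highway", "toll"]` returns the subset of pages that would be passed on to parsing and indexing.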
Secondly, the web page content is parsed into a uniform format, redundant information is filtered out, and the index of the retrieval system is built. The indexing program is written on top of the open-source Lucene library. It builds an ordered index over the parsed traffic-topic web pages, supports full-text retrieval of traffic terms in the index, and returns the specific hit count of each term in the index.
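To make the notion of a "hit count" concrete, the following minimal sketch builds a toy inverted index and counts the documents that contain all of the queried terms. It is a plain-Python stand-in and does not use the Lucene API; whitespace tokenization, single-word terms, and the names `build_index` and `hit_count` are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it.

    `docs` is a dict of {doc_id: text}. Tokens are split on whitespace,
    which is a simplification (Chinese text would need word segmentation).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def hit_count(index, *terms):
    """Return the number of documents that contain every given term."""
    if not terms:
        return 0
    postings = [index.get(term.lower(), set()) for term in terms]
    return len(set.intersection(*postings))
```

With one term, `hit_count` gives that term's hit count; with two terms, it gives the co-occurrence count that hit-count-based similarity measures rely on.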
Finally, experiments on similarity calculation for standard traffic terms are carried out with the Web-PMI algorithm. Within the algorithm, the retrieval formula (query) for traffic terms is reconstructed, and retrieval operators are added to reduce ambiguity in the retrieval results, improve their domain relevance, and improve the performance of the algorithm. The experimental results are analyzed and an improved retrieval formula is proposed, which increases the number of hits for terms and eliminates the effect of term overlap on the similarity calculation.
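The abstract does not give the Web-PMI formula. A commonly used definition computes pointwise mutual information from hit counts as log2((H(P AND Q)/N) / ((H(P)/N)(H(Q)/N))), returning 0 when the co-occurrence count does not exceed a small threshold. The sketch below follows that definition; the parameter names, the total document count `total_docs` (N), and the threshold `min_cooccurrence` are assumptions, and the thesis's improved retrieval formula is not reproduced.

```python
import math


def web_pmi(hits_p, hits_q, hits_pq, total_docs, min_cooccurrence=0):
    """Pointwise mutual information of two terms from hit counts.

    hits_p, hits_q: hit counts of each term alone.
    hits_pq: hit count of the conjunctive query (both terms).
    total_docs: number of indexed documents (assumed known).
    Returns 0.0 when a term has no hits or the co-occurrence count
    is at or below min_cooccurrence.
    """
    if hits_p <= 0 or hits_q <= 0 or hits_pq <= min_cooccurrence:
        return 0.0
    p_p = hits_p / total_docs
    p_q = hits_q / total_docs
    p_pq = hits_pq / total_docs
    return math.log2(p_pq / (p_p * p_q))
```

Under this definition, two terms that co-occur exactly as often as independence predicts score 0, and terms that nearly always appear together score higher, which is why a larger and more domain-specific hit count matters for rare terms.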
The method proposed in this thesis is applied to the "traffic information consistency detection research" project. The application results show that, when its calculation accuracy is compared with that obtained using the commercial search engine AltaVista, the traffic-topic vertical search engine system built in this thesis performs very well on similarity calculation for uncommon terminology in the traffic field. The method is also applicable to terminology similarity calculation in other specialized fields, and it can effectively support terminology standardization, identification of synonyms and near-synonyms, semantic retrieval, and analogical detection for terminology standards.
[Degree-granting institution]: Chang'an University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.1;U11-61