Research and Implementation of a Vertical Search Engine for Electronic Information
Published: 2018-10-14 12:19
[Abstract]: With the rapid growth of the Internet, the volume of online information is increasing exponentially, and search engines have long held a central place among Internet applications. General-purpose search engines are convenient, but they also make retrieval tedious in two ways: they return a very large number of results, so users spend considerable time picking out the information they care about from a cluttered result set; and they take no account of a user's domain-specific needs, returning the same undifferentiated results to everyone.
Vertical search engines, one direction in which search engines are expected to develop, concentrate on a single domain and are becoming more important as industrial and social divisions of labor grow finer. Users have strong demand for information in their own specialty, and a vertical search engine is built to serve that kind of domain-specific retrieval. Relying mainly on techniques such as focused (topic-specific) crawling, it can be more practical than a general-purpose search engine for certain specialized queries.
After introducing search engines and vertical search engines, this thesis analyzes the Heritrix web crawler in detail. Heritrix is customized to crawl topic-specific web content; the ELFHash algorithm is introduced so that Heritrix can fetch pages with multiple threads; and the robots.txt restriction is removed to raise the crawl rate.
Lucene is used for indexing and retrieval. Based on an analysis of Lucene's basic architecture, its built-in Chinese word segmentation and ranking are modified. To meet the requirements of an electronic-information search engine, the thesis designs a Chinese word segmentation algorithm that combines a domain dictionary of electronic-information terms with statistical methods, and modifies Lucene's built-in ranking algorithm so that results better match users' needs. The downloaded web pages are also parsed and processed so that Lucene can index them.
Finally, experiments compare the vertical search engine with a general-purpose search engine and their respective strengths and weaknesses, and verify the efficiency of the crawler and the effectiveness of the Chinese text analysis. The overall test demonstration shows that the system is reasonably reliable and practical, and that it can serve as a reference for building vertical search engines.
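The abstract states that ELFHash is what lets Heritrix fetch pages with multiple threads, but it does not say how the hash is wired in. One plausible reading, offered here only as an assumption, is that each URL is hashed to a queue key so that pages from a single site are spread across several work queues instead of sharing one queue per host. The sketch below shows the classic ELF hash and a hypothetical `queueKeyFor` helper; the class name, the helper, the sample URLs, and the queue count of 8 are all illustrative and are not taken from the thesis or from the Heritrix API.

```java
public class ElfHashQueueSketch {

    /** Classic ELF hash over the characters of the key; the result is never negative. */
    static int elfHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = (h << 4) + key.charAt(i);   // shift in the next character
            int g = h & 0xF0000000;         // isolate the top four bits
            if (g != 0) {
                h ^= g >>> 24;              // fold the top bits back into the low byte
            }
            h &= ~g;                        // clear the top four bits
        }
        return h;
    }

    /**
     * Hypothetical helper (an assumption, not a Heritrix method): maps a URL
     * to one of queueCount queue keys using the ELF hash.
     */
    static String queueKeyFor(String url, int queueCount) {
        return Integer.toString(elfHash(url) % queueCount);
    }

    public static void main(String[] args) {
        // Placeholder URLs on one host, used only to show how keys are assigned.
        String[] urls = {
            "http://example.com/news/1.html",
            "http://example.com/news/2.html",
            "http://example.com/products/list.html"
        };
        for (String url : urls) {
            System.out.println(queueKeyFor(url, 8) + " <- " + url);
        }
    }
}
```

Because the key depends on the whole URL rather than on the host name alone, URLs from the same host can land in different queues, which is what would allow several worker threads to crawl one site concurrently. The trade-off is that per-host politeness is no longer enforced by the queue structure itself.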
[Degree-granting institution]: Xihua University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
Article ID: 2270431
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2270431.html