Research and Design of Key Technologies for a Soybean-Topic Vertical Search Engine
Topic: soybean topic + vertical search engine; Source: Northeast Agricultural University (東北農業大學), master's thesis, 2013
[Abstract]: With the rapid development of Internet technology, online information resources are growing explosively, and quickly finding information that meets users' needs has become an increasingly important problem. Search engines are now among the most important Internet applications. Traditional general-purpose search engines offer the same interface to all users, but as the volume of information keeps growing, they can no longer meet the individual needs of users in specific fields for accurate, timely, and in-depth information. Vertical search engines, which are dedicated to querying a particular discipline or topic, emerged in response and have been rapidly developed and widely applied.
This work originates from a Spark Program (星火計劃) project. Grounded in the actual state of agriculture in the main grain-producing areas, it addresses the low degree of information-resource sharing that is common in agricultural informatization, particularly in the soybean industry, and aims to provide shared data resources for people working in soybean production and processing, research, and distribution. The thesis uses vertical search technology to collect and filter soybean-related agricultural information on the Internet and builds a soybean information base for portal websites represented by the "China Soybean Network" (中國大豆網). It also designs a vertical search engine architecture for the soybean topic, studies its key technologies, and implements a prototype system. The main research content is as follows:
(1) First, the purpose and significance of the research are stated, and the current state and trends of research on vertical search engines and their application in agriculture are analyzed. Second, the development, structure, principles, and respective strengths and weaknesses of general-purpose and vertical search engines are analyzed and compared, and the system structure of a topic-specific search engine is designed for the soybean topic.
(2) The core of web information collection is the web spider, which automatically crawls the Internet according to a given search strategy and stores the collected information locally. The main difference between a topic-focused spider and a general-purpose spider is that the former selectively fetches topic-relevant pages, whereas the latter fetches every page it encounters. The thesis studies the structure, principles, search strategies, and topic-relevance analysis algorithms of topic-focused spiders in depth, and improves an existing link analysis algorithm by taking into account the influence of link anchor text and page titles on relevance, as well as the link trap problem.
(3) Indexing improves retrieval efficiency; in this work the index also effectively speeds up data loading in the management and review module. The indexed objects are web documents that have undergone Chinese word segmentation, which is the process of splitting a continuous sequence of characters into a sequence of words. The thesis examines existing segmentation algorithms, inverted index techniques, and the indexing and search processes of the open-source Lucene framework. Because the Chinese word segmentation bundled with Lucene is not precise enough, a Lucene indexing framework based on IKAnalyzer segmentation is adopted.
(4) Building on the above research, a prototype of the soybean-topic vertical search engine is implemented following software engineering principles. The implementation mainly covers the web information collection module, the indexing module, and the management and review module, which ultimately supply soybean-related data to the soybean portal website.
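Item (2) mentions using link anchor text and page titles when judging topic relevance, and guarding against link traps. The abstract does not give the thesis's actual algorithm, so the following is only a minimal Python sketch of the general idea. The topic vocabulary, the weights `w_anchor` and `w_title`, the `max_depth` limit, and all function names are assumptions made for illustration; `page_title` is taken here to be the title of the page on which the link appears.

```python
import re
from urllib.parse import urlsplit

# Hypothetical topic vocabulary; the thesis's real term list is not given.
TOPIC_TERMS = {"soybean", "soybeans", "soymeal", "soy"}


def term_score(text):
    """Share of topic terms that appear as whole words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & TOPIC_TERMS) / len(TOPIC_TERMS)


def link_relevance(anchor_text, page_title, w_anchor=0.6, w_title=0.4):
    """Weighted mix of the anchor-text score and the page-title score.

    The weights are illustrative assumptions, not values from the thesis.
    """
    return w_anchor * term_score(anchor_text) + w_title * term_score(page_title)


def looks_like_trap(url, seen_urls, max_depth=8):
    """Crude link-trap guard: skip already-seen URLs and very deep paths."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return url in seen_urls or len(segments) > max_depth
```

A crawler built along these lines would queue a link only if `link_relevance` exceeds some threshold and `looks_like_trap` returns `False`; a real system would likely need a richer relevance model and stricter URL normalization.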
In summary, this thesis takes major Chinese soybean websites as the initial crawl targets (such as the China Agricultural Products Trading Network (中國農產品交易網), the China Grain and Oil Information Network (中國糧油信息網), the Heilongjiang Agricultural Information Network (黑龍江省農業信息網), and Tianxia Liangcang (天下糧倉)) and implements a prototype vertical search engine for the soybean topic using Java, providing data support for soybean portal websites. It also provides a theoretical basis for querying soybean-topic information, and the research can serve as a reference for other agriculture-topic search engines.
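Item (3) of the abstract describes segmenting Chinese text into words and then building an inverted index over the result. The prototype itself uses Lucene with IKAnalyzer in Java; the Python sketch below uses neither library. It swaps in forward maximum matching, a classic dictionary-based segmentation method, and a plain in-memory inverted index, purely to illustrate the pipeline. The dictionary, the sample documents, and the function names are made up for this example.

```python
from collections import defaultdict

# Toy dictionary (an assumption for illustration only).
DICTIONARY = {"大豆", "大豆油", "价格", "上涨", "市场"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def segment(text):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in segment(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the segmented query."""
    postings = [index.get(word, set()) for word in segment(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "大豆价格上涨", 2: "大豆油市场", 3: "大豆市场价格"}
index = build_inverted_index(docs)
matches = search(index, "大豆价格")
```

Note that document 2 is segmented as 大豆油 / 市场, so it is not indexed under 大豆. This kind of granularity decision is one reason the choice of analyzer affects search results.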
[Degree-granting institution]: Northeast Agricultural University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3