Development of a Recombinant BCG Vaccine Carrying the Toxoplasma gondii Rhomboid Gene
Published: 2018-05-08 09:15
Topic of this thesis: vertical search engine + Lucene. Source: Jilin University, master's thesis, 2012
[Abstract]: Given how rapidly the Internet has developed, the emergence of search engines was inevitable. The Internet is like an enormous library that holds, and continuously produces, vast amounts of information. The volume far exceeds what we can imagine or manage; without search engines we might be unable to find the information we are looking for at all.

Web data scraping is a software technique for extracting information from websites quickly and in bulk. A scraping program simulates the behavior of a browser and can extract any data that a browser can display. Its ultimate purpose is to extract unstructured information from large numbers of web pages and store it in structured form.

Traditional search technology has the following shortcomings, which make it hard to meet users' needs. First, it places high demands on keyword selection, so poorly chosen keywords hold back inexperienced users. Second, the results such an engine can show on a results page are very limited, uniform, and often full of redundant information. The cause is that the technique is a simple query over one-dimensional keywords: the search engine does not actively "understand" documents but only passively matches keywords. As a result, users often cannot obtain valuable information. This is especially evident in job hunting, where information is time-sensitive and highly structured.

Information redundancy on the Internet is enormous: a single article may be reposted hundreds or thousands of times. Current techniques can identify some of this duplication, but they remain fairly weak.

Put simply, a vertical search engine is, in contrast to a general-purpose search engine, a specialized search engine for a specific industry. It refines, integrates, and classifies the information in a specialized page repository and extracts specific data to return to the user. What it captures is structured data and metadata, which is its biggest difference from general search. It usually consists of three major parts: a crawling system, an indexing system, and a search system.

This thesis analyzes theoretically the development of vertical search engines and the problems they face, introduces the key technologies of vertical search systems, and describes in detail the classification of vertical search engines and related background. It designs the operating rules of the web spider; proposes a framework for a vertical search engine system for educational information; analyzes the role of each functional module; gives the system's architecture; constructs its processing flow; and studies in detail the implementation of information crawling, Chinese text extraction, and retrieval within the framework. The management module, page crawling, data processing, and index building are designed so as to construct a vertical search framework for the education domain. The thesis sets out the processing flow for the overall structure, the front end, and the back end, and concludes with a UML use-case analysis.
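The three-part structure described in the abstract (crawling, indexing, searching) can be illustrated with a minimal Python sketch. It is not the thesis's Lucene-based implementation: the in-memory `PAGES` dictionary stands in for crawled pages, and the names `FieldExtractor`, `extract`, `build_index`, and `search` are hypothetical, chosen only for illustration. The sketch extracts structured fields from raw HTML, builds an inverted index over them, and answers boolean AND queries.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical stand-in for pages fetched by a crawler.
PAGES = {
    "page1": "<html><head><title>Graduate admissions 2012</title></head>"
             "<body><p>Application deadline for master's programs.</p></body></html>",
    "page2": "<html><head><title>Course catalog</title></head>"
             "<body><p>Undergraduate programs and admissions contacts.</p></body></html>",
}


class FieldExtractor(HTMLParser):
    """Collects the text of <title> and <p> elements as structured fields."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "body": ""}
        self._current = None  # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "p":
            self._current = "body"

    def handle_endtag(self, tag):
        if tag in ("title", "p"):
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data.strip() + " "


def extract(html):
    """Turn unstructured HTML into a dict of named text fields."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (no Chinese segmentation)."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def build_index(docs):
    """Build an inverted index: token -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for text in fields.values():
            for token in tokenize(text):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {page_id: extract(html) for page_id, html in PAGES.items()}
index = build_index(docs)
```

With this setup, `search(index, "graduate admissions")` matches only `page1`, because `page2` contains the token `undergraduate` but not `graduate`. The whitespace-based `tokenize` is an assumption that works for the English sample text; Chinese pages would need a word-segmentation step in its place.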
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3
Article ID: 1860819
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1860819.html