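Research and Implementation of a Nutch-Based Vertical Search Engine System for the Medical Domain
Published: 2018-04-26 00:39
Topics: vertical search engine + topic crawler; Source: East China University of Technology, master's thesis, 2015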
[Abstract]: With the rapid growth of the Internet in recent years, people have more and more ways to obtain information. The flood of information brings great convenience, but it also leaves people at a loss when faced with so much of it. Search engines have greatly eased this problem. However, as the number of web pages grows exponentially, general-purpose search engines find it increasingly hard to improve search efficiency, and vertical search engines, with their high concentration of information and strong domain knowledge, have become a hot research topic. Vertical search platforms have therefore appeared in many fields, yet the medical and health field, which bears directly on people's lives and health, still lacks a good search platform. People mostly learn about the prevention and treatment of diseases only from doctors, so the information channel is narrow, and high-quality medical resources are unevenly distributed because of geography, economic development, and other factors. A vertical search engine for the medical field would let people obtain medical information without leaving home, which would help ease China's current problems of weak medical awareness and infrastructure. Based on the open-source Nutch search framework, this thesis analyzes and designs the topic crawler module and the information retrieval module of a vertical search engine, and finally implements a vertical search engine for the medical domain. Building the topic crawler module is a focus of current research on vertical search; this thesis analyzes and experiments with the Fish-Search algorithm, one of the crawling strategies used by topic crawlers.
Web pages are given a combined relevance score based on their links and their content, and an elastic threshold mechanism is used so that medical-related pages are crawled and downloaded while the "tunnel phenomenon" is kept in check. After the medical pages are fetched, they are parsed with web page parsing tools and page segmentation techniques, the extracted text is segmented into Chinese words, and the posting lists of an inverted index are built. For the page ranking problem in information retrieval, this thesis analyzes the scoring mechanism of Lucene search results, improves on the equal distribution of weight in PageRank's weight propagation, and adds a time feedback factor to reduce the inherent advantage of older pages. The optimized PageRank algorithm is then combined with the vector space model in Lucene to improve the topical relevance and authority of pages while suppressing "topic drift". Finally, the ranked result pages are returned to the user, completing the whole workflow of a medical vertical search engine. With this system, users can obtain fairly authoritative medical information quickly and efficiently, which encourages healthy personal behavior and supports a more sensible and healthy way of life.
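The abstract does not give the formulas behind the modified PageRank, so the following is only an illustrative sketch of the general idea, not the thesis's actual method. It assumes that a page's outgoing weight is split in proportion to each target's topical relevance times a freshness factor (instead of equally), and that freshness halves every `half_life_days`. The names `weighted_pagerank`, `combined_score`, `relevance`, `age_days`, `half_life_days`, and `alpha` are all hypothetical.

```python
def weighted_pagerank(links, relevance, age_days,
                      damping=0.85, half_life_days=365.0, iterations=50):
    """PageRank variant with non-uniform weight propagation (illustrative).

    links:     dict mapping page -> list of pages it links to
    relevance: dict mapping page -> topical relevance score (assumed input)
    age_days:  dict mapping page -> days since last update (assumed input)
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)

    # Assumed time feedback factor: 1.0 for a new page, 0.5 after one half-life.
    freshness = {p: 0.5 ** (age_days.get(p, 0.0) / half_life_days) for p in pages}
    # How strongly a page attracts weight from the pages that link to it.
    attraction = {p: relevance.get(p, 0.0) * freshness[p] for p in pages}

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            total = sum(attraction[q] for q in outs)
            if not outs or total == 0.0:
                # Dangling page (or no attractive targets): spread evenly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                # Split weight in proportion to attraction, not equally.
                for q in outs:
                    new_rank[q] += damping * rank[p] * attraction[q] / total
        rank = new_rank
    return rank


def combined_score(text_score, link_score, alpha=0.7):
    """Blend a text-similarity score with a link-based score (assumed linear mix)."""
    return alpha * text_score + (1.0 - alpha) * link_score


if __name__ == "__main__":
    demo_links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    demo_relevance = {"a": 0.5, "b": 0.9, "c": 0.1}
    demo_age = {"a": 0.0, "b": 0.0, "c": 0.0}
    for page, score in weighted_pagerank(demo_links, demo_relevance, demo_age).items():
        print(page, round(score, 4))
```

In this sketch, `text_score` stands in for whatever the vector space model returns for a query and document pair; a linear blend is just one simple way to combine it with the link score, and a real system would need both scores normalized to comparable ranges.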
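[Degree-granting institution]: East China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.3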