Research on Search Strategies and Algorithm Design for Topic-Specific Search Engines

Published: 2018-05-03 03:37

Topic: search engine + topical crawler; Source: Lanzhou University, 2017 master's thesis

[Abstract]: Site search is becoming increasingly common in Internet applications. A website that wants to grow must offer rich content, and users need a way to find that content whether it is new or old (for example, a news report seen some time ago that no longer appears on the front page because it is no longer current). Search engines meet this need: they give users fast access to resources and let people obtain information from the Internet more effectively, almost without leaving home, so the quality of a search engine directly shapes people's online lives. This thesis analyzes mainstream search strategies and algorithms and examines in depth the classification, technical architecture, and working principles of search engines. It also studies the design and modeling of a topical (focused) crawler system, integrates machine learning algorithms into the existing techniques, discusses the ideas behind document feature selection algorithms in detail, and describes the current mainstream improved TF-IDF algorithms. Using Python 2.7 as the development platform, a topical crawler system based on Context Graph is designed and implemented. Finally, taking major domestic automobile websites as an example, "automobile" is set as the topic word for classified crawling, and the system's performance is evaluated by recall, precision, and F1 value. The experimental results indicate that the proposed algorithm performs well in topic-word classification of documents and in the efficiency of web crawling.
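The abstract names TF-IDF weighting and an evaluation by precision, recall, and F1 but gives no formulas. The sketch below is only an illustration of those two ideas under stated assumptions: it uses the classic TF-IDF definition with a smoothed IDF (not the thesis's improved variant, and not its Context Graph crawler), a hypothetical hand-labelled toy corpus, and an arbitrary threshold. The names `tf_idf`, `precision_recall_f1`, `pages`, and `THRESHOLD` are invented for this example.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Classic TF-IDF weight of `term` in one tokenized document."""
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens equal to `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: number of documents that contain `term`.
    df = sum(1 for tokens in corpus if term in tokens)
    # Smoothed IDF, so a term absent from the corpus cannot divide by zero.
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf


def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 for two parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical, hand-labelled toy pages (not data from the thesis).
# Each entry is (tokens, is_really_on_topic).
pages = [
    ("new automobile engine review automobile".split(), True),
    ("automobile dealer price list".split(), True),
    ("football match result tonight".split(), False),
    ("electric car battery range test".split(), True),
    ("stock market automobile shares fall".split(), False),
]
corpus = [tokens for tokens, _ in pages]
actual = [label for _, label in pages]

TOPIC = "automobile"
THRESHOLD = 0.2  # assumed cut-off, chosen only for this toy data

# A page is predicted on-topic when the topic word's weight reaches the cut-off.
scores = [tf_idf(TOPIC, tokens, corpus) for tokens in corpus]
predicted = [score >= THRESHOLD for score in scores]
precision, recall, f1 = precision_recall_f1(predicted, actual)

print(predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

On this toy data the single-keyword rule makes one false positive (the stock-market page mentions the topic word) and one false negative (the electric-car page never uses it), so precision, recall, and F1 each come out at 2/3. That kind of gap is one reason a topical crawler would be expected to use richer features than a single term weight.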
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3



Article ID: 1836830


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1836830.html
