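Design and Implementation of a Nutch-Based Vertical Search Engine for Security Vulnerabilities
Published: 2018-05-09 00:35
Thesis topic: Nutch + vertical search engine; Source: Beijing University of Posts and Telecommunications, master's thesis, 2017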
[Abstract]: More and more people obtain information through the Internet, and given the sheer volume of online information, they rely on search engines to retrieve what they need quickly. Traditional general-purpose search engines crawl the entire Internet: their coverage is broad, but the results contain a great deal of information the user does not need, which makes for a poor user experience. A vertical search engine instead retrieves only information from one specific professional domain that the user cares about. Its search scope is narrower, but its results are more precise and better match domain-specific retrieval needs. Today study, work, and daily life all depend on the Internet, leaks of personal and corporate information are common, and Internet security is drawing increasing attention. The large number of security vulnerabilities on the Internet is a major source of network security threats: problems such as large-scale DDoS attacks that crash enterprise hosts and leaks of users' personal information are mostly caused by security vulnerabilities. Because the risks are so great, and so that people can learn about the latest vulnerabilities, it is necessary to build a vertical search engine for security vulnerability information. Based on a study of vertical search engine technology and the open-source search engine framework Nutch, this thesis designs and implements a Nutch-based vertical search engine system for security vulnerabilities. The main functional modules of the system are the web crawler, topic-specific information filtering, indexing, retrieval ranking, and a third-party Chinese word segmenter.
The main work of this thesis is as follows:
1. Surveyed the development of search engines and the state of research on vertical search engines, studied the techniques behind each module of a vertical search engine, and examined the working principles and plug-in mechanism of the open-source Nutch framework.
2. Focused on the topic filtering module of the vertical search engine. The thesis uses a classifier to categorize information and thereby restrict search to a specific domain. Because the naive Bayes classifier is inherently limited by its conditional independence assumption, the thesis studies the second-order AODE classifier and, building on it, implements an improved WAODE classification algorithm that weights by the mutual information between attribute variables and the class variable. The WAODE algorithm is then combined with Nutch's plug-in mechanism to implement the topic filtering module.
3. Improved Nutch's retrieval ranking model: considering content relevance, page authority from hyperlink analysis, and a time factor, a new page-ranking scoring model was derived and validated experimentally.
4. Added the third-party Chinese word segmenter mmseg4j to Nutch to provide Chinese word segmentation.
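The mutual-information-weighted classifier described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's exact formulation, so the code below assumes the commonly cited weighted AODE form: each attribute in turn acts as the "super-parent" of a one-dependence estimator, and the estimators are averaged with weights equal to the mutual information between that attribute and the class. The class name `WAODE`, the Laplace smoothing, the omission of a minimum-frequency threshold for super-parents, and the toy term-presence features are all illustrative assumptions, not details taken from the thesis or from Nutch.

```python
import math
from collections import Counter


class WAODE:
    """Sketch of a weighted averaged one-dependence estimator (categorical features)."""

    def fit(self, X, y):
        self.n = len(X)
        self.d = len(X[0])
        self.classes = sorted(set(y))
        # Distinct values observed for each attribute (used for Laplace smoothing).
        self.values = [sorted({row[i] for row in X}) for i in range(self.d)]
        self.c_count = Counter(y)
        self.cv = Counter()   # counts of (class, attr i, value)
        self.cvv = Counter()  # counts of (class, attr i, value, attr j, value)
        for row, c in zip(X, y):
            for i, vi in enumerate(row):
                self.cv[(c, i, vi)] += 1
                for j, vj in enumerate(row):
                    if j != i:
                        self.cvv[(c, i, vi, j, vj)] += 1
        self.weights = [self._mutual_information(i) for i in range(self.d)]
        if sum(self.weights) == 0:
            # No attribute is informative: fall back to plain (unweighted) AODE.
            self.weights = [1.0] * self.d
        return self

    def _mutual_information(self, i):
        """Empirical mutual information I(A_i; C) in nats."""
        mi = 0.0
        for v in self.values[i]:
            n_v = sum(self.cv[(c, i, v)] for c in self.classes)
            for c in self.classes:
                n_cv = self.cv[(c, i, v)]
                if n_cv:
                    mi += (n_cv / self.n) * math.log(
                        n_cv * self.n / (n_v * self.c_count[c])
                    )
        return mi

    def predict_proba(self, x):
        k = len(self.classes)
        scores = {}
        for c in self.classes:
            total = 0.0
            for i, vi in enumerate(x):  # attribute i is the super-parent
                n_cvi = self.cv[(c, i, vi)]
                # Smoothed joint estimate P(c, x_i).
                p = (n_cvi + 1) / (self.n + k * len(self.values[i]))
                for j, vj in enumerate(x):
                    if j != i:
                        # Smoothed conditional estimate P(x_j | c, x_i).
                        p *= (self.cvv[(c, i, vi, j, vj)] + 1) / (
                            n_cvi + len(self.values[j])
                        )
                total += self.weights[i] * p
            scores[c] = total
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

    def predict(self, x):
        proba = self.predict_proba(x)
        return max(proba, key=proba.get)


# Hypothetical term-presence features per page: (has_cve, has_exploit, has_recipe).
X = [(1, 1, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
     (0, 0, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
y = ["vuln", "vuln", "vuln", "vuln", "other", "other", "other", "other"]

model = WAODE().fit(X, y)
print(model.weights)
print(model.predict((1, 1, 0)), model.predict_proba((1, 1, 0)))
```

In a topic filter, a page would be kept for indexing only when the predicted class is the target topic. How the thesis wires this decision into a Nutch plug-in is not stated in the abstract, so that step is left out of the sketch.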
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.3
Article ID: 1863811
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1863811.html