Research and Implementation of a Nutch-Based Clustering Search Engine
Published: 2018-11-11 14:14
[Abstract]: With the rapid growth of the Internet, the volume of online information is increasing exponentially. Faced with this mass of information, obtaining what one needs as quickly and accurately as possible is perhaps the most pressing need of every Internet user. Search engines such as Google, Baidu, and Yahoo emerged in response and opened a path for users to reach information. Traditional search engines are far from perfect, however: they present results as a linear list, which makes it hard for users to find information quickly and accurately. Researchers have therefore introduced text clustering into the analysis of the results returned by search engines, to help users find what they are looking for quickly.
This thesis focuses on improving clustering quality and the computational efficiency of the clustering algorithm. It studies the key techniques of Chinese clustering algorithms in depth from four angles: the non-negative matrix factorization algorithm, the vector space model, suffix array sorting, and the Chinese word segmentation module. Taking the Lingo clustering algorithm as a prototype, it proposes Rlingo, a Chinese clustering algorithm for small and medium-sized document sets.
The main contributions are as follows. First, non-negative matrix factorization based on the Itakura-Saito divergence is introduced into cluster analysis for the first time, improving the readability of cluster labels and the overall quality of the clustering results. Second, position and part-of-speech factors are introduced into the traditional vector space model, further improving the quality of the clustering results. Third, building on the skew algorithm, a linear-time suffix array sorting algorithm, an improved skew suffix array sorting algorithm is proposed that eliminates the interference of feature words with no real meaning on feature extraction quality, reducing the processing time for clustering small and medium-sized document sets. Fourth, a tourism-oriented clustering system is implemented on top of Nutch using Rlingo, and its performance largely meets expectations.
Finally, controlled experiments compare the overall performance of Rlingo, Lingo, K-means, and STC. The results show that Rlingo's clustering results on small and medium-sized document sets are clearly better than those of the other three algorithms, and that the improved algorithm largely achieves the expected results.
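The abstract names non-negative matrix factorization under the Itakura-Saito divergence as the first contribution but gives no formulas. The sketch below is not the thesis's implementation; it only illustrates the general technique on a toy term-document matrix. The function names, the toy counts, and the `EPSILON` smoothing (the divergence is undefined for zero entries) are all assumptions made for illustration. The square-root exponent in the updates is the majorization-minimization variant of the multiplicative rule, chosen here because it keeps the divergence from increasing.

```python
import math
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*a)]


def is_divergence(v, v_hat):
    """Itakura-Saito divergence: sum over entries of v/v_hat - log(v/v_hat) - 1."""
    total = 0.0
    for row_v, row_hat in zip(v, v_hat):
        for x, y in zip(row_v, row_hat):
            ratio = x / y
            total += ratio - math.log(ratio) - 1.0
    return total


def is_nmf(v, rank, iterations=100, seed=0):
    """Factor a strictly positive matrix v (terms x documents) as w times h."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(n_cols)] for _ in range(rank)]
    history = []
    for _ in range(iterations):
        # h <- h * sqrt( (w^T (v / wh^2)) / (w^T (1 / wh)) ), element-wise
        wh = matmul(w, h)
        wt = transpose(w)
        num = matmul(wt, [[x / (y * y) for x, y in zip(rv, rh)] for rv, rh in zip(v, wh)])
        den = matmul(wt, [[1.0 / y for y in rh] for rh in wh])
        h = [[val * math.sqrt(n / d) for val, n, d in zip(r, rn, rd)]
             for r, rn, rd in zip(h, num, den)]
        # w <- w * sqrt( ((v / wh^2) h^T) / ((1 / wh) h^T) ), element-wise
        wh = matmul(w, h)
        ht = transpose(h)
        num = matmul([[x / (y * y) for x, y in zip(rv, rh)] for rv, rh in zip(v, wh)], ht)
        den = matmul([[1.0 / y for y in rh] for rh in wh], ht)
        w = [[val * math.sqrt(n / d) for val, n, d in zip(r, rn, rd)]
             for r, rn, rd in zip(w, num, den)]
        history.append(is_divergence(v, matmul(w, h)))
    return w, h, history


# Hypothetical term counts: rows are terms, columns are documents.
EPSILON = 1e-3  # assumed smoothing so every entry is strictly positive
counts = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 3],
]
v = [[c + EPSILON for c in row] for row in counts]
w, h, history = is_nmf(v, rank=2, iterations=50)

# Assign each document to the component with the largest coefficient in h.
assignments = [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(v[0]))]
```

In a Lingo-style pipeline the columns of `w` would play the role of candidate topic vectors that are later matched against frequent phrases to produce labels; that matching step is outside this sketch.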
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3; TP311.13
Article ID: 2325074
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2325074.html