Research and Implementation of a Clustering Search Engine Based on Nutch
Published: 2018-11-11 14:14
[Abstract]: With the Internet developing rapidly, the volume of online information is growing exponentially. Faced with this mass of information, obtaining what one needs as quickly and accurately as possible is perhaps every Internet user's greatest need. Search engines such as Google, Baidu and Yahoo emerged in response and opened a path for users to reach information. Traditional search engines are far from perfect, however: they present search results as a linear list, which makes it hard for users to obtain information quickly and accurately. Researchers have therefore introduced text clustering into the analysis of the results returned by search engines, to help users quickly find what they are looking for.
This thesis focuses on improving clustering quality and the computational efficiency of the clustering algorithm. It studies the key techniques of Chinese clustering algorithms from four aspects: the non-negative matrix factorization algorithm, the vector space model, suffix array sorting, and the Chinese word segmentation module. Taking the Lingo clustering algorithm as a prototype, it proposes Rlingo, a Chinese clustering algorithm for clustering small and medium-sized document sets.
The main contributions are as follows. First, non-negative matrix factorization based on the Itakura-Saito divergence is introduced into clustering analysis for the first time, which improves the readability of cluster labels and the overall quality of the clustering results. Second, a position factor and a part-of-speech factor are introduced into the traditional vector space model, which further improves the quality of the clustering results. Third, building on skew, a linear-time suffix array sorting algorithm, an improved skew suffix array sorting algorithm is proposed that removes the interference of feature words with no real meaning on the quality of feature extraction, which reduces the time the clustering algorithm needs to process small and medium-sized document sets. Fourth, a tourism-oriented clustering system is implemented on top of Nutch using Rlingo, and its performance basically meets expectations.
Finally, controlled experiments compare the overall performance of Rlingo, Lingo, K-means and STC. The experiments show that the clustering results of Rlingo on small and medium-sized document sets are clearly better than those of the other three algorithms, and that the improved algorithm basically achieves the expected results.
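The abstract names non-negative matrix factorization under the Itakura-Saito (IS) divergence but does not show how Rlingo implements it. As a rough illustration only, the sketch below factors a small term-document matrix V into W and H by minimizing the sum over all entries of V/(WH) - log(V/(WH)) - 1, using the generic square-root multiplicative updates that are commonly used for this divergence. The function names, the `EPS` smoothing constant (the IS divergence is undefined for zero entries) and the toy matrix are all assumptions made for this example; none of them comes from the thesis.

```python
import math
import random

EPS = 1e-9  # assumed smoothing: the IS divergence needs strictly positive entries


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def elementwise(f, a, b):
    """Apply f to matching entries of two equally shaped matrices."""
    return [[f(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def is_divergence(v, v_hat):
    """Sum of v/v_hat - log(v/v_hat) - 1 over all entries (both strictly positive)."""
    return sum(
        x / y - math.log(x / y) - 1.0
        for rv, rh in zip(v, v_hat)
        for x, y in zip(rv, rh)
    )


def is_nmf(v, rank, iterations=50, seed=0):
    """Approximate v (terms x documents) by w @ h under the IS divergence."""
    rng = random.Random(seed)
    v = [[x + EPS for x in row] for row in v]
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    history = [is_divergence(v, matmul(w, h))]  # divergence before any update
    for _ in range(iterations):
        # h <- h * sqrt((W^T (V / WH^2)) / (W^T (1 / WH)))
        wh = matmul(w, h)
        wt = transpose(w)
        num = matmul(wt, elementwise(lambda x, y: x / (y * y), v, wh))
        den = matmul(wt, elementwise(lambda x, y: 1.0 / y, v, wh))
        h = [[e * math.sqrt(n / d) for e, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(h, num, den)]
        # w <- w * sqrt(((V / WH^2) H^T) / ((1 / WH) H^T))
        wh = matmul(w, h)
        ht = transpose(h)
        num = matmul(elementwise(lambda x, y: x / (y * y), v, wh), ht)
        den = matmul(elementwise(lambda x, y: 1.0 / y, v, wh), ht)
        w = [[e * math.sqrt(n / d) for e, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(w, num, den)]
        history.append(is_divergence(v, matmul(w, h)))
    return w, h, history


if __name__ == "__main__":
    # Invented toy term-document counts: two topics with disjoint vocabularies.
    counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
    w, h, history = is_nmf(counts, rank=2)
    print("divergence before and after:", history[0], history[-1])
    # Assign each document (column) to the factor with the largest weight in h.
    print([max(range(2), key=lambda k: h[k][j]) for j in range(4)])
```

In a Lingo-style pipeline the columns of W would play the role of candidate topics to be matched against frequent phrases for labelling; how Rlingo does this is not described in the abstract, so that step is left out here. The pure-Python matrix helpers keep the sketch dependency-free and are only suitable for very small inputs.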
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3;TP311.13
Article ID: 2325073
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2325073.html