Research and Implementation of a Clustering Search Engine Based on Nutch

Published: 2018-11-11 14:14
[Abstract]: With the rapid growth of the Internet, the volume of online information is increasing exponentially. Faced with this mass of information, obtaining what one needs as quickly and accurately as possible is perhaps the greatest need of every Internet user. Search engines such as Google, Baidu, and Yahoo emerged in response and opened a path for users to access information. Traditional search engines are far from perfect, however: they present results as a linear list, which makes it difficult for users to find information quickly and accurately. Researchers have therefore introduced text clustering into the analysis of search engine results to help users quickly find what they are looking for.

This thesis focuses on improving clustering quality and the computational efficiency of the clustering algorithm. It studies the key technologies of Chinese clustering algorithms from four aspects: the non-negative matrix factorization algorithm, the vector space model, suffix array sorting, and the Chinese word segmentation module. Taking the Lingo clustering algorithm as a prototype, it proposes Rlingo, a Chinese clustering algorithm for cluster analysis of small and medium-sized document sets.

The main contributions are as follows. First, non-negative matrix factorization based on the Itakura-Saito divergence is introduced into cluster analysis for the first time, which improves the readability of the cluster labels and the overall quality of the clustering results. Second, position and part-of-speech factors are introduced to improve the traditional vector space model, which further improves the quality of the clustering results. Third, building on the linear-time suffix array sorting algorithm skew, an improved skew suffix array sorting algorithm is proposed that eliminates the interference of meaningless feature words with the quality of feature extraction and reduces the time the clustering algorithm needs to process small and medium-sized document sets. Fourth, a tourism-oriented clustering system is implemented on top of Nutch using Rlingo, and its performance basically meets expectations.

Finally, a controlled experiment compares the overall performance of Rlingo, Lingo, K-means, and STC. The results show that the clustering results of Rlingo on small and medium-sized document sets are clearly better than those of the other three algorithms, and that the improved clustering algorithm basically achieves the expected results.
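As an illustration of the first contribution listed in the abstract, the following is a minimal sketch of non-negative matrix factorization under the Itakura-Saito divergence. This page does not give the thesis's actual update rules, so the sketch assumes the standard multiplicative update with exponent 1/2; the function names, the smoothing constant, and the toy term-document matrix are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def is_divergence(v, v_hat):
    """Itakura-Saito divergence: sum over entries of x/y - log(x/y) - 1."""
    return sum(
        x / y - math.log(x / y) - 1.0
        for row_v, row_hat in zip(v, v_hat)
        for x, y in zip(row_v, row_hat)
    )


def ratio_terms(v, v_hat):
    """Return the element-wise matrices v / v_hat**2 and 1 / v_hat."""
    a = [[x / (y * y) for x, y in zip(rv, rh)] for rv, rh in zip(v, v_hat)]
    b = [[1.0 / y for y in rh] for rh in v_hat]
    return a, b


def scale(factor, num, den):
    """Multiplicative update: factor * sqrt(num / den), element-wise."""
    return [
        [f * math.sqrt(n / d) for f, n, d in zip(rf, rn, rd)]
        for rf, rn, rd in zip(factor, num, den)
    ]


def is_nmf(v, rank, iterations=100, smoothing=0.01, seed=0):
    """Approximate v (terms x documents) by the product w * h.

    Returns (w, h, history); history[0] is the divergence at the random
    starting point and history[-1] the divergence after the last iteration.
    """
    rng = random.Random(seed)
    # Assumption: zero counts are smoothed, because the Itakura-Saito
    # divergence is only defined for strictly positive entries.
    v = [[x + smoothing for x in row] for row in v]
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    history = [is_divergence(v, matmul(w, h))]
    for _ in range(iterations):
        # Update the document factors h with w held fixed.
        a, b = ratio_terms(v, matmul(w, h))
        wt = [list(col) for col in zip(*w)]
        h = scale(h, matmul(wt, a), matmul(wt, b))
        # Update the term factors w with h held fixed.
        a, b = ratio_terms(v, matmul(w, h))
        ht = [list(col) for col in zip(*h)]
        w = scale(w, matmul(a, ht), matmul(b, ht))
        history.append(is_divergence(v, matmul(w, h)))
    return w, h, history


# Hypothetical 4-term x 4-document count matrix with two rough topics.
counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
w, h, history = is_nmf(counts, rank=2)
# Assign each document to the factor with the largest weight in h.
labels = [max(range(2), key=lambda k: h[k][j]) for j in range(4)]
print(history[0], history[-1], labels)
```

In a Lingo-style pipeline the columns of `w` would play the role of candidate cluster topics and `h` would drive the assignment of documents to them; how Rlingo derives readable labels from the factors is not described on this page.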
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3; TP311.13

Article number: 2325073



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2325073.html

