Research on Web Page Clustering Methods Based on the Hadoop Platform

Published: 2018-03-01 06:04

Keywords: Normalized Cuts, Multiclass spectral clustering, web page clustering, Hadoop, MapReduce. Source: South China University of Technology, master's thesis, 2012. Document type: degree thesis.

[Abstract]: Web pages are the main form in which information exists on the Internet; people publish and look up information through them. As the information age advances, the number of web pages has grown explosively. Among hundreds of millions of pages, how can knowledge be mined more effectively, spam be identified quickly, and data be categorized with ease? Data mining is a powerful tool for these problems, and web page clustering is one of its techniques. Clustering partitions web pages by semantics in an unsupervised or semi-supervised way. It has many practical applications. A search engine can use web page clustering to offer users more related information. Clustering search results gives users navigation over those results, so they can jump directly to the content they want by following the cluster labels. Web page clustering can also help separate out spam pages. For these reasons it has long been a research focus in data mining, yet many questions remain open. The web page clustering problem can be divided into several subproblems: denoising pages, extracting content, defining similarity, reducing dimensionality, applying a clustering algorithm, determining the number of clusters, and generating cluster labels. Each subproblem has been studied before, but each still leaves room for improvement. This thesis studies the application of clustering algorithms within web page clustering, applying the Multiclass spectral clustering algorithm to both web page clustering and search result clustering. It also implements a web search engine that clusters its search results; the system integrates several clustering methods, including the Multiclass spectral clustering algorithm and the Normalized Cuts algorithm.
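As a rough illustration of the Normalized Cuts algorithm named above, the following Python sketch splits a handful of pages into two groups. It is not the thesis's implementation: the toy term-count vectors, the choice of cosine similarity, and the function names are all assumptions made for the example. It uses the standard relaxation that thresholds the second-smallest generalized eigenvector at zero.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the N x N cosine similarity matrix of the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def normalized_cut_bipartition(W):
    """Split N objects into two groups with a two-way Normalized Cut.

    W is a symmetric, non-negative N x N similarity matrix whose graph
    is connected (every object has positive total similarity).
    """
    d = W.sum(axis=1)                        # degree of each object
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    # Map the second-smallest eigenvector back to the generalized
    # problem (D - W) y = lambda * D y, then threshold at zero.
    y = d_inv_sqrt * eigvecs[:, 1]
    return (y > 0).astype(int)

# Hypothetical term counts: pages 0-2 share one topic, pages 3-5 another.
# The last column is a term common to all pages, which keeps the graph connected.
X = np.array([
    [3, 2, 0, 0, 0, 1],
    [2, 3, 1, 0, 0, 1],
    [3, 1, 2, 0, 0, 1],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 1, 2, 3, 1],
    [0, 0, 0, 1, 3, 1],
], dtype=float)

W = cosine_similarity_matrix(X)
np.fill_diagonal(W, 0.0)                     # ignore self-similarity
labels = normalized_cut_bipartition(W)
print(labels)
```

The sign of an eigenvector is arbitrary, so which group receives label 0 may vary between runs or platforms; only the grouping itself is meaningful.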
Although spectral clustering gives good clustering results on web pages, the algorithm represents the similarity between clustering objects with an N×N matrix, where N is the number of objects. The matrix grows quadratically with the number of objects, so it eventually no longer fits in memory and spectral clustering loses its scalability. This thesis therefore studies ways to improve the scalability of spectral clustering. It proposes extending the Normalized Cuts algorithm with the MapReduce mechanism of the Hadoop platform and implements a Hadoop-based web page clustering method. The method is scalable and runs in parallel, which solves the problem that a single machine cannot hold the entire similarity matrix in memory.
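To make the memory problem concrete: with N = 1,000,000 pages and 8-byte floats, a dense N×N matrix needs 8 × 10^12 bytes, about 8 TB. The abstract does not describe how the MapReduce jobs are designed, so the sketch below is only one plausible illustration of the idea, not the thesis's method. It simulates map and reduce steps in a single Python process (no Hadoop API is used) to compute the degree vector d_i = Σ_j W_ij needed by Normalized Cuts, one tile of the similarity matrix at a time. The function names, the block size, and the cosine similarity measure are assumptions.

```python
from collections import defaultdict
import numpy as np

def map_block_pair(block_a, block_b, offset_a):
    """Map step: emit (row index, partial degree) pairs for one block pair.

    Only a len(block_a) x len(block_b) tile of the similarity matrix
    is held in memory at a time.
    """
    tile = block_a @ block_b.T               # cosine similarities of unit-length rows
    for local_i, partial in enumerate(tile.sum(axis=1)):
        yield offset_a + local_i, float(partial)

def reduce_by_key(pairs):
    """Shuffle and reduce step: sum the partial degrees for each row index."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def degrees_mapreduce(unit_rows, block_size):
    """Compute d_i = sum_j W_ij tile by tile, never building the full W."""
    n = len(unit_rows)
    starts = range(0, n, block_size)
    emitted = []
    for a in starts:
        for b in starts:
            emitted.extend(map_block_pair(unit_rows[a:a + block_size],
                                          unit_rows[b:b + block_size], a))
    totals = reduce_by_key(emitted)
    return np.array([totals[i] for i in range(n)])

# Hypothetical data: 10 random page vectors with 4 features each.
rng = np.random.default_rng(0)
X = rng.random((10, 4))
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)

d_tiled = degrees_mapreduce(unit_rows, block_size=3)
d_dense = (unit_rows @ unit_rows.T).sum(axis=1)   # reference: full matrix in memory
print(np.allclose(d_tiled, d_dense))
```

In an actual Hadoop job, each call to `map_block_pair` would roughly correspond to one map task and `reduce_by_key` to the reducers, so no single machine would need the whole matrix.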
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1550698.html
