
Research on Web Page Clustering Methods Based on the Hadoop Platform

Published: 2018-03-01 06:04

  Keywords: Normalized Cuts; Multiclass spectral clustering; web page clustering; Hadoop; MapReduce. Source: South China University of Technology, master's thesis, 2012. Document type: degree thesis


[Abstract]: Web pages are the main form in which information exists on the Internet, and people publish and look up information through them. As the information age has advanced, the number of web pages has grown explosively. Among hundreds of millions of pages, how can knowledge be mined more effectively? How can spam be identified quickly? How can data be categorized with less effort? Data mining is a powerful tool for these problems, and web page clustering is one of its techniques. Clustering can partition web pages by their semantics in an unsupervised or semi-supervised way.

Web page clustering has many practical applications. Search engines can use it to offer users more relevant information. Clustering search engine results gives users a way to navigate those results: guided by the cluster labels, they can go directly to the content they want. Web page clustering can also be used to separate out spam pages, among other uses. It has therefore long been a research focus in data mining, but many problems remain worth further study.

The web page clustering problem can be divided into several sub-problems: denoising the pages, extracting content, defining similarity, reducing dimensionality, applying a clustering algorithm, determining the number of clusters, and generating cluster labels. Each of these sub-problems has been studied before, but each still leaves room for improvement. This thesis studies the application of clustering algorithms within the web page clustering problem and applies the Multiclass spectral clustering algorithm to both web page clustering and search result clustering. It also implements a web search engine that clusters its search results. The system integrates several clustering methods, including the Multiclass spectral clustering algorithm and the Normalized Cuts algorithm.

Web page clustering based on spectral clustering can achieve good clustering quality, but the algorithm represents the similarity between the objects being clustered with an N*N matrix, where N is the number of objects. As the number of objects increases, the matrix grows quadratically until it no longer fits in memory, and spectral clustering stops being scalable. This thesis therefore studies how to make spectral clustering more scalable. It proposes extending the Normalized Cuts algorithm with the MapReduce mechanism of the Hadoop platform and implements a web page clustering method on Hadoop. The method is scalable and runs in parallel, which solves the problem that a single machine cannot hold the entire similarity matrix in memory.
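To make the memory argument concrete: a dense similarity matrix for 1,000,000 pages stored as 8-byte floats needs 8 × 10^12 bytes, far more than a single machine's RAM. The sketch below is not the thesis's Hadoop implementation. It is a minimal single-process illustration of the idea, with the map and reduce steps simulated by ordinary Python functions. It assumes cosine similarity between row-normalized, nonnegative term vectors, and all function names (`map_block`, `reduce_rows`, `blockwise_matvec`, `ncut_bipartition`) are hypothetical. Each map step builds only a block of rows of the similarity matrix W, and a two-way Normalized Cuts split is obtained by power iteration on D^{-1/2} W D^{-1/2} after removing its known top eigenvector D^{1/2}·1.

```python
import numpy as np

def dense_similarity_bytes(n, bytes_per_entry=8):
    """Bytes needed to hold a dense n x n float64 similarity matrix."""
    return n * n * bytes_per_entry

def map_block(X, start, stop, v):
    """Map step (simulated): build only rows [start, stop) of W = X X^T
    and emit (row_index, (W v)[row_index]) pairs."""
    block = X[start:stop] @ X.T               # shape (stop - start, N)
    return [(start + i, float(val)) for i, val in enumerate(block @ v)]

def reduce_rows(pairs, n):
    """Reduce step (simulated): assemble the emitted pairs into one vector."""
    out = np.zeros(n)
    for i, val in pairs:
        out[i] = val
    return out

def blockwise_matvec(X, v, block_size):
    """Compute W v without ever materializing the full N x N matrix W."""
    n = X.shape[0]
    pairs = []
    for start in range(0, n, block_size):
        pairs.extend(map_block(X, start, min(start + block_size, n), v))
    return reduce_rows(pairs, n)

def ncut_bipartition(X, block_size=2, iters=200):
    """Two-way Normalized Cuts split via deflated power iteration on
    M = D^{-1/2} W D^{-1/2}; W is touched only through blockwise_matvec."""
    n = X.shape[0]
    d = blockwise_matvec(X, np.ones(n), block_size)    # degrees d = W 1
    s = 1.0 / np.sqrt(d)
    top = np.sqrt(d) / np.linalg.norm(np.sqrt(d))      # top eigenvector of M
    z = np.random.default_rng(0).standard_normal(n)
    for _ in range(iters):
        z -= top * (top @ z)                           # remove top component
        z = s * blockwise_matvec(X, s * z, block_size) # z <- M z
        z /= np.linalg.norm(z)
    z -= top * (top @ z)
    return (s * z > 0).astype(int)                     # sign gives the split

# Toy data (made up for illustration): six "pages" over four terms,
# three about one topic and three about another.
docs = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
])
docs /= np.linalg.norm(docs, axis=1, keepdims=True)    # cosine similarity
labels = ncut_bipartition(docs)
print(labels)
```

In this sketch only a `block_size` × N slice of the similarity matrix exists at any time, which is the property a distributed version would rely on. On a real cluster each call to `map_block` would run as a separate map task, and the iteration would be driven by repeated jobs. How the thesis actually partitions the work is not stated in the abstract.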
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP393.092



