Research on Web Structure Mining and High-Dimensional Data Mining

Published: 2018-11-12 18:56

[Abstract]: Data mining is a frontier research area in artificial intelligence, machine learning, pattern recognition, and decision support. With the rapid growth of the Web and improvements in data acquisition, Web mining and high-dimensional data mining have become two important data mining tasks.

The Web is the most important platform through which people publish and obtain information in modern society. The number of pages on the Web has reached the order of one billion and is still growing daily, and the volume of information it holds is growing explosively. Because information on the Web is unstructured and self-organized, traditional information retrieval techniques are hard to apply effectively in practice. Besides pages, the Web contains a large number of hyperlinks. Hyperlinks carry implicit judgments about the importance of the pages they point to, so Web structure mining (i.e., Web link analysis) has become the most important way to improve the quality of Web information retrieval.

Cluster analysis is one of the basic methods of data mining and is widely used in many fields. In recent years, the data in many clustering problems have been high-dimensional. Classical clustering methods assume a low-dimensional data space and cannot cluster high-dimensional data effectively, so high-dimensional data clustering has become a focus of current clustering research. Manifold clustering is a high-dimensional clustering approach that has been developed and widely studied in recent years.

This thesis addresses two typical data mining problems, Web structure mining and high-dimensional data clustering. It studies link-analysis-based page ranking algorithms for search engines, Web community discovery algorithms, effective dissimilarity measures for manifold clustering, and low-rank approximation for manifold clustering of large-scale high-dimensional data. The main contributions are as follows:

(1) The characteristics of the link-analysis-based page ranking algorithms PageRank and HITS are analyzed. A PageRank framework based on a multilevel attenuation model is proposed, in which the attenuation model assigns the weights of direct and indirect links between pages; this improves query precision. An improved HITS algorithm based on page similarity and link popularity is also proposed, in which link weights are assigned according to the similarity of the linked pages with respect to the query topic and the popularity of the links between them; this effectively alleviates the topic drift problem of HITS.

(2) The relationship between edge capacity and community size in maximum-flow-based community discovery is studied in depth, and the characteristics of the link structure are analyzed from the perspective of community discovery. A method is proposed that assigns edge capacities using the probability distributions of the in-degree and out-degree of web pages, which reduces the likelihood that noise pages are extracted and improves the quality of the discovered Web communities.

(3) An effective dissimilarity based on neighborhood paths is proposed. It strengthens the class structure of the low-dimensional representation obtained by manifold learning algorithms and improves clustering via manifold learning. The relationship between the sampled points and the accuracy with which the Nyström extension approximates the eigenvectors of a large-scale kernel matrix is analyzed. Based on this analysis, an incremental sampling strategy is proposed that improves clustering quality when the Nyström extension is used to accelerate manifold clustering.
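For readers unfamiliar with the baseline that contribution (1) builds on, the following is a minimal sketch of standard PageRank computed by power iteration. It is not the multilevel attenuation framework proposed in the thesis, whose details the abstract does not give. The function name `pagerank`, the damping factor 0.85, the convergence tolerance, and the toy link graph are all illustrative assumptions.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             tol: float = 1e-10, max_iter: int = 200) -> Dict[str, float]:
    """Baseline PageRank by power iteration (illustrative, not the thesis method).

    `links` maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a link source or a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its damped rank to each page it links to.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical four-page web: "c" receives links from "a", "b" and "d".
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

In this baseline every out-link of a page receives the same weight and only direct links contribute. According to the abstract, the thesis instead assigns weights to direct and indirect links through an attenuation model, which would replace the uniform `share` used here.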
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP311.13

[Citation]: Yu Hong; Research on Web Structure Mining and High-Dimensional Data Mining [D]; Dalian University of Technology; 2012

Article ID: 2327928


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2327928.html
