Research on Web Text Clustering Based on the Shuffled Frog Leaping Algorithm

Published: 2018-05-07 11:54

Topic: Web text clustering + shuffled frog leaping algorithm; Source: Jiangnan University, master's thesis, 2013

[Abstract]: With the rapid spread and continued development of Internet technology, the volume of text on web pages is growing explosively. Mining this information effectively has become a major challenge for computer science, and users urgently need to extract the knowledge they care about from massive Web resources quickly, accurately, and effectively. Text clustering offers an effective route to the classified management and visualization of massive text collections. As a technical foundation for information filtering, information retrieval, search engines, text databases, and digital libraries, it has been widely applied and developed. Because Web text data are massive, high-dimensional, dynamic, and unpredictable, Web-based clustering has gradually become a new research focus.

This thesis concentrates on Web text clustering algorithms. K-means and FCM (fuzzy C-means) are partition-based clustering algorithms. Because they are simple, fast, and effective, they are widely used in Web text clustering, but in practice they often become trapped in local minima and are sensitive to their initial values. The thesis studies the application of the shuffled frog leaping algorithm (SFLA) to Web text clustering. Combining SFLA with K-means and with FCM alleviates, to a certain extent, both the tendency to fall into local minima and the sensitivity to initial values, and improves the convergence accuracy of the two algorithms.

The thesis first introduces the concept, characteristics, and application areas of text clustering, describes how several classical clustering methods are implemented, and analyzes their strengths and weaknesses. It then presents SFLA in detail and, to address the shortcomings of the traditional algorithm, proposes an improved SFLA that optimizes the initial solutions by chaotic search, generates new individuals by a mutation operation, and uses a newly designed search strategy, which effectively improves the algorithm's optimization ability. Finally, the improved SFLA is combined with K-means and with FCM. In the SFLA-based K-means algorithm, the fitness variance of the frog population determines when the K-means operation is applied, which suppresses premature convergence; its effectiveness is verified on UCI data sets and on randomly generated data. In the SFLA-based FCM algorithm, the SFLA optimization process replaces FCM's gradient-descent-based iteration, which improves the global search ability. Comparative tests on a real corpus show that the improved algorithm raises clustering accuracy and has an advantage in global optimization.
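The K-means hybrid described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it uses the standard SFLA update (move the worst frog of each memeplex toward the memeplex best, then toward the global best, otherwise replace it randomly) and omits the chaotic initialization, mutation operator, and new search strategy, whose details the abstract does not give. The function names, the parameter defaults, the normalized form of the fitness variance, and the rule "hand over to K-means once the variance falls below `var_threshold`" are all assumptions made for this example.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sse(centroids, points):
    # Clustering cost: squared distance from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def kmeans_step(centroids, points):
    # One Lloyd iteration: assign points to the nearest centroid, then average.
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        groups[nearest].append(p)
    # An empty cluster keeps its previous centroid.
    return [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for c, g in zip(centroids, groups)]

def fitness_variance(fits):
    # Normalized population variance (assumed form); a small value means
    # the frogs have gathered around similar fitness values.
    mean = sum(fits) / len(fits)
    scale = max(1.0, max(abs(f - mean) for f in fits))
    return sum(((f - mean) / scale) ** 2 for f in fits) / len(fits)

def sfla_kmeans(points, k, memeplexes=3, frogs_per_memeplex=4,
                generations=20, local_steps=5, var_threshold=1e-3, seed=0):
    rng = random.Random(seed)
    n_frogs = memeplexes * frogs_per_memeplex
    # Each frog is a candidate solution: a list of k centroids.
    frogs = [rng.sample(points, k) for _ in range(n_frogs)]
    cost = lambda frog: sse(frog, points)
    for _ in range(generations):
        frogs.sort(key=cost)              # shuffle step: rank all frogs
        global_best = frogs[0]
        for m in range(memeplexes):
            idx = list(range(m, n_frogs, memeplexes))  # deal frogs into memeplex m
            for _ in range(local_steps):
                idx.sort(key=lambda i: cost(frogs[i]))
                worst = frogs[idx[-1]]
                for target in (frogs[idx[0]], global_best):
                    r = rng.random()
                    cand = [tuple(w + r * (t - w) for w, t in zip(wc, tc))
                            for wc, tc in zip(worst, target)]
                    if cost(cand) < cost(worst):
                        break
                else:
                    cand = rng.sample(points, k)  # no improvement: random frog
                frogs[idx[-1]] = cand
        # Assumed hand-over rule: stop leaping once the population has converged.
        if fitness_variance([cost(f) for f in frogs]) < var_threshold:
            break
    best = min(frogs, key=cost)
    for _ in range(100):                  # K-means refinement of the best frog
        refined = kmeans_step(best, points)
        if refined == best:
            break
        best = refined
    return best

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(sfla_kmeans(points, k=2))
```

The global search supplies a starting point that is less dependent on a single random initialization, and the K-means iterations then do the cheap local refinement. For text data the points would be document vectors (for example TF-IDF weights) rather than 2-D tuples.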
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1

Article ID: 1856817

Link to this article: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1856817.html
