Research on Chinese Web Page Classification Algorithms

Published: 2018-01-30 23:12

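Keywords: Chinese web page classification, vector space model, word co-occurrence graph, KNN　Source: Jiangsu University of Science and Technology, 2013 master's thesis　Document type: degree thesis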

[Abstract]: With the rapid development of the Internet and its related technologies, a massive and heterogeneous body of Web information resources has appeared online. How to extract and generate knowledge from this unstructured data, and how to find the content people are interested in, has become an important problem that urgently needs solving. Chinese web page classification is one of the key technologies for addressing this problem and has increasingly become a research focus. It is applied more and more widely in search engines, information push, information filtering, automatic question answering, and related fields.
This thesis describes the key technologies of Chinese web page classification in detail, including web page preprocessing, feature extraction, and the mainstream web page classification algorithms. It discusses feature extraction methods such as TF-IDF, mutual information, the χ² statistic, information gain, and expected cross-entropy. It also analyzes the basic ideas and the main advantages and disadvantages of mainstream classification algorithms, including the minimum distance algorithm, the KNN algorithm, the naive Bayes (NB) algorithm, and the support vector machine (SVM) algorithm.
In feature extraction for web pages, the traditional vector space model (VSM) ignores the fact that terms depend on one another and are semantically related. Word co-occurrence graph methods handle this problem better, but most current mainstream co-occurrence graph methods compute feature term weights in a mechanical and simplistic way. The improved word co-occurrence graph method proposed in this thesis takes the semantic information between words into account without neglecting the important influence of high-frequency words on topic representation. Experiments show that the method is simple to implement and achieves relatively high accuracy.
Among web page classification algorithms, KNN is very widely used. A notable drawback of KNN, however, is that its computational cost grows linearly with the size of the training set, so the algorithm becomes very time-consuming when the training set is large. To address this, the thesis proposes an improved KNN algorithm whose main idea is to improve the strategy for finding the nearest neighbors of the text to be classified, thereby raising the running efficiency of KNN and lowering its complexity. Finally, experiments verify the respective performance of the KNN, NB, and SVM algorithms, and comparative experimental data for the improved KNN algorithm show that it does improve classification efficiency and reduce algorithmic complexity.
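The baseline that the abstract criticizes can be sketched in a few lines. The following is a minimal illustration, not the thesis's improved algorithm: it builds TF-IDF vectors in a vector space model and classifies a query with brute-force KNN, comparing the query against every training document. The function names (`tfidf_vectors`, `vectorize`, `cosine`, `knn_classify`), the smoothed IDF formula, the use of cosine similarity, and the toy English tokens standing in for segmented Chinese terms are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch), always positive.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (f / len(doc)) * idf[t] for t, f in tf.items()})
    return vectors, idf


def vectorize(tokens, idf):
    """Vectorize an unseen document, ignoring terms outside the training vocabulary."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    return {t: (f / total) * idf[t] for t, f in tf.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query, train_vectors, train_labels, k=3):
    """Majority vote among the k most similar training documents."""
    # Brute force: one similarity computation per training document,
    # so the cost of a single query grows linearly with the training set.
    scored = sorted(
        ((cosine(query, v), label) for v, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy corpus: English tokens stand in for segmented Chinese terms.
train_docs = [
    ["football", "match", "goal", "team"],
    ["team", "coach", "league", "goal"],
    ["stock", "market", "bank", "price"],
    ["bank", "loan", "interest", "market"],
]
train_labels = ["sports", "sports", "finance", "finance"]

train_vectors, idf = tfidf_vectors(train_docs)
query = vectorize(["goal", "league", "team"], idf)
predicted = knn_classify(query, train_vectors, train_labels, k=3)
print(predicted)
```

The loop inside `knn_classify` is where the linear cost arises. The abstract says only that the improved algorithm changes the neighbor search strategy and does not give the details, so the sketch stops at the baseline.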
[Degree-granting institution]: Jiangsu University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP393.092; TP391.1


Article ID: 1477494


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1477494.html
