Research on Chinese Web Page Classification Based on KNN and Related Links
Published: 2018-03-07 00:20
Topic: Chinese web page classification. Entry point: web page extraction. Source: Harbin Engineering University, master's thesis, 2008. Document type: degree thesis.
[Abstract]: With the rapid development of the Internet, online information is growing exponentially. Faced with disorganized web resources, people need to classify and organize massive amounts of web page information so that they can quickly retrieve a desired target and its associated information. Automatic web page classification provides key technology for processing and organizing web pages at large scale, and it is an important way to organize information resources reasonably and effectively. Improving the precision and recall of web page classification remains a goal that researchers continue to pursue.

This thesis uses a body-text extraction method for Chinese web pages that combines three aspects: processing of page markup, filtering of noise information, and extraction of the body text. The method extracts the body text of Chinese web pages reasonably well. Links in a web page fall into two main classes: links related to the topic of the page are called related links, and links unrelated to the topic of the page, such as navigation bars and advertising links, are called irrelevant links. The related-link extraction algorithm proposed in this thesis extracts the related links in Chinese web pages reasonably well; it has low time complexity, and its precision and recall are both satisfactory. Based on the vector space model, the thesis selects feature words by term frequency, classifies Chinese web pages with the KNN machine learning algorithm, and designs and implements a Chinese web page classifier. It compares five settings: classification by page title, by body text, by related links, by body text combined with related links, and by title combined with related links. The comparison confirms the hypothesis that related links in Chinese web pages have a positive effect on classification, and it also yields a classification method.

In open tests, the experimental data show that the proposed method, which combines body text with related links, needs a relatively small training set and reaches an F1 value above 92% in every category, an improvement over traditional web page classification.
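The classification pipeline summarized in the abstract (vector space model, term-frequency feature selection, KNN, F1 evaluation) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the cosine similarity measure, the similarity-weighted vote, and the toy data are all assumptions made for illustration, and the input is assumed to be already segmented into words. Combining body text with related links is modeled here simply as concatenating the two token lists.

```python
import math
from collections import Counter


def select_features(train, top_n):
    """Keep the top_n most frequent terms in the training set (term-frequency selection)."""
    totals = Counter()
    for tokens, _label in train:
        totals.update(tokens)
    return {term for term, _count in totals.most_common(top_n)}


def tf_vector(tokens, vocabulary):
    """Represent a document as a sparse term-frequency vector over the vocabulary."""
    return Counter(t for t in tokens if t in vocabulary)


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(tokens, train, vocabulary, k=3):
    """Label a document by a similarity-weighted vote among its k nearest neighbors."""
    query = tf_vector(tokens, vocabulary)
    scored = sorted(
        ((cosine(query, tf_vector(doc, vocabulary)), label) for doc, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy training set: (segmented tokens, category label).
train = [
    (["football", "match", "goal"], "sports"),
    (["goal", "team", "league"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
vocabulary = select_features(train, top_n=10)

# Body-text tokens and related-link anchor tokens are concatenated before classification.
body_tokens = ["team", "goal"]
related_link_tokens = ["league", "match"]
print(knn_classify(body_tokens + related_link_tokens, train, vocabulary, k=3))
```

Weighting each neighbor's vote by its similarity is one common KNN variant; a plain majority vote would be an equally valid reading of the abstract, which does not say which scheme was used.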
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree awarded]: 2008
[Classification number]: TP393.092
[Citing literature]
Related master's theses (1 entry)
1 Bai Fan; Application of an Improved K-Nearest Neighbor Algorithm in Web Page Text Classification [D]; Anhui University; 2010
Article ID: 1577140
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1577140.html