Research and Implementation of a Web Text Classification Algorithm Based on Web Page Block Partitioning
Published: 2018-08-13 13:05
[Abstract]: The Internet has become an important channel through which people obtain information. As the volume of Web information keeps growing, extracting useful information from so much data has become an important problem. To organize and analyze massive Web text resources effectively, data mining techniques for Web text are becoming increasingly important, and Web text classification is a major research topic within Web text mining. Because Web text contains noise and is semi-structured, classification techniques for Web text differ from traditional plain-text classification techniques.

Machine-learning-based text classification consists of three parts: text representation, classification method, and evaluation. The vector space model is the most commonly used document representation, and feature selection and dimensionality reduction are the two main factors that affect it. Machine learning methods such as Bayes' theorem and support vector machines are often used to build text classifiers.

Most template-based commercial web pages contain content blocks related to the page topic alongside noise such as advertisements, navigation bars, and copyright notices. This noise degrades web-page-based information processing tasks such as information retrieval and web page classification. This thesis partitions a page into blocks using certain HTML tags that serve as block-boundary heuristics, decides whether each block is noise by computing how often it occurs across the whole page collection, and on this basis presents a page-blocking algorithm, ContentDiscoverer. Experiments show that, compared with similar algorithms, ContentDiscoverer runs faster and identifies topical content blocks more accurately.

The ContentDiscoverer algorithm is then applied to web page classification, and a Chinese web page classifier is designed and implemented. Experimental results show that classification accuracy improves considerably after page block partitioning.
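
The abstract does not state which tags ContentDiscoverer splits on, what frequency threshold it uses, or how it matches blocks across pages. The following is therefore only a minimal Python sketch of the general idea (split each page at block-level tags, then treat blocks that recur across many pages as noise), not the thesis's implementation. The names `BLOCK_TAGS`, `BlockSplitter`, `split_blocks`, `content_blocks` and the `max_share` threshold are hypothetical, and exact text matching stands in for whatever block comparison the thesis uses.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags treated as block boundaries. This set is an assumption for
# illustration; the abstract does not list the tags the thesis uses.
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul", "ol", "li",
              "h1", "h2", "h3", "hr", "br"}


class BlockSplitter(HTMLParser):
    """Collects the text between block-boundary tags as separate blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty blocks.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text after the last tag


def split_blocks(html):
    """Return the list of text blocks found in one HTML page."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def content_blocks(pages, max_share=0.5):
    """Drop blocks whose text occurs in more than max_share of the pages."""
    per_page = [split_blocks(page) for page in pages]
    # Count each distinct block once per page (document frequency).
    doc_freq = Counter(b for blocks in per_page for b in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if doc_freq[b] <= limit] for blocks in per_page]
```

Calling `content_blocks(pages)` on a list of HTML strings from the same site returns, for each page, the blocks that survive the frequency filter. A navigation bar or copyright line repeated on every page exceeds the threshold and is dropped, while page-specific text is kept and could then be passed to a classifier.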
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2007
[Classification number]: TP391.1
Article ID: 2181087
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/2181087.html