Research on Web Text Classification Methods and System Implementation
[Abstract]: In recent years the Web has grown rapidly into the largest public information source in the world. Enabling users to locate the information they need quickly and conveniently within this vast resource depends on one core problem: the correct classification of Web text. Web text classification derives from automatic classification technology and is an important part of Web text mining. It improves search efficiency and helps users locate target knowledge quickly and accurately. It can also capture the interest profiles of different users, which provides a basis for personalized services. Most current classification studies treat document categories as flat and disjoint and ignore the hierarchical relationships between categories. When the number of categories is large, training a flat classifier is very expensive, and classifying an unknown document requires comparing it against every category model, which scales poorly. Building on a study of Web text mining and automatic classification technology, this thesis implements a multi-level Web text classification system that exploits the hierarchical relationships between categories. The innovations and key technologies of this thesis are as follows:
1. A hierarchical training and classification model. Web pages are rich in content, cover many fields, and fall into many categories. The thesis analyzes the problems of flat classification when there are many categories, proposes hierarchical classification instead, and establishes a hierarchical training and classification model.
2. An automatic Web text extractor. Noise such as advertisements and hyperlinks in Web pages seriously interferes with Web text classification. The thesis designs and implements an automatic extractor that reduces a Web page to plain text consisting of a title and body text.
3. A keyword extraction method suited to Web pages. Words in different positions and with different parts of speech contribute differently to what a page expresses. Based on this, the thesis proposes a keyword extraction method weighted by part of speech, position, and word frequency, which further filters out noise words in the page and achieves good results.
4. A classification method based on χ² statistic weighting. The χ² statistic reflects the correlation between features and categories well. The thesis applies the χ² statistic to text classification, which simplifies the classification process and yields better classification speed and accuracy in practical use.
Based on the characteristics of Web text, the thesis proposes a set of implementation schemes for large-scale, multi-category Web text classification and designs a multi-level classification system for Web text. The results show that in practice the system outperforms a general flat classifier.
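The abstract does not say how the χ² statistic is turned into classification weights, so the following is only a minimal sketch of the standard term-category χ² statistic that such a method would presumably start from. The function names `chi_square` and `chi_square_scores` and the toy documents are hypothetical and do not come from the thesis.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column of the table is empty, so the statistic
        # is undefined; treat the term as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_scores(docs, labels, category):
    """Score every term in the corpus against one category.

    docs: list of token lists; labels: one category label per document.
    Returns a dict mapping each term to its chi-square value.
    """
    vocab = {term for doc in docs for term in doc}
    scores = {}
    for term in vocab:
        a = b = c = d = 0
        for doc, label in zip(docs, labels):
            has_term = term in doc
            in_cat = label == category
            if has_term and in_cat:
                a += 1
            elif has_term:
                b += 1
            elif in_cat:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    return scores


# Toy corpus (hypothetical data, for illustration only).
docs = [["goal", "match"], ["match", "team"],
        ["stock", "market"], ["market", "bank"]]
labels = ["sports", "sports", "finance", "finance"]
scores = chi_square_scores(docs, labels, "sports")
```

In this toy corpus, a term that appears in every "sports" document and in no other document receives the highest score, while a term distributed independently of the category scores 0. In a hierarchical setting, one plausible use would be to compute these scores separately at each node of the category tree, but that is an assumption and is not stated in the abstract.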
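[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2010
[Classification number]: TP391.1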
Article number: 2367860
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/2367860.html