Research on Web Text Classification Methods and System Implementation

Published: 2018-12-07 20:55

[Abstract]: In recent years, the Web has rapidly grown into the world's largest public information source. Enabling Web users to locate the information they need conveniently and quickly within this vast pool of resources has become an urgent problem, and the correct classification of Web text lies at its core. Web text classification derives from automatic classification technology and is an important part of Web text mining. It not only improves users' search efficiency and helps them locate target knowledge quickly and accurately, but also captures the category interests of different users, which provides a reference for personalized services. Most current classification research treats document categories as flat and disjoint and ignores the hierarchical relationships between categories. When the number of categories is large, training a flat classifier is time-consuming, and classifying an unknown document requires comparing it against every class model, which is clearly inappropriate. Building on an in-depth study of Web text mining and automatic classification technology, and taking the hierarchical relationships between categories into account, this thesis implements a multi-level Web text classification system. Its innovations and key techniques are as follows:

1. A hierarchical training and classification model. Web pages are rich in content and span many categories across many fields. The thesis analyzes the problems that flat classification faces when there are many categories, proposes hierarchical classification, and builds a hierarchical training and classification model.
2. An automatic Web text extractor. Noise such as advertisements and hyperlinks in web pages greatly hampers Web text classification. The thesis implements an automatic extractor that converts a web page into relatively clean plain text containing the title and the body.
3. A keyword extraction method suited to web pages. Words in different positions and with different parts of speech contribute differently to expressing a page's content. The thesis therefore proposes a keyword extraction method weighted by part of speech, position, and term frequency to further filter out noise words, and reports good results.
4. A classification method based on χ² statistic weighting. The χ² statistic reflects the correlation between features and categories well. The thesis applies the χ² statistic to text classification, which simplifies the classification process and yields good classification speed and accuracy in practical use.

Based on the characteristics of Web text, the thesis proposes an implementation scheme for classifying large-scale, multi-category Web text and designs a multi-level Web text classification system. The results show that, in practice, the system outperforms general flat classifiers.
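The abstract does not give the thesis's exact formulas, so the following is only a minimal sketch of how χ²-weighted classification might look. It uses the standard 2×2 contingency form of the χ² statistic for a term and a category. The scoring rule (summing the χ² weights of the terms found in a document), the function names `chi_square`, `train`, and `classify`, and the toy data are all assumptions for illustration, not the author's implementation.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic of a term/category pair from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def train(docs):
    """docs: list of (category, terms). Returns {category: {term: weight}}."""
    n_docs = len(docs)
    cat_count = Counter(cat for cat, _ in docs)
    term_count = Counter()
    term_cat_count = defaultdict(Counter)
    for cat, terms in docs:
        for term in set(terms):
            term_count[term] += 1
            term_cat_count[cat][term] += 1

    weights = {}
    for cat in cat_count:
        weights[cat] = {}
        for term, a in term_cat_count[cat].items():
            b = term_count[term] - a
            c = cat_count[cat] - a
            d = n_docs - a - b - c
            # Assumption: keep only terms positively correlated with the category.
            if a * d > b * c:
                weights[cat][term] = chi_square(a, b, c, d)
    return weights


def classify(weights, terms):
    """Assumed scoring rule: sum the weights of the document's known terms."""
    present = set(terms)
    scores = {
        cat: sum(w.get(term, 0.0) for term in present)
        for cat, w in weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus, for illustration only.
training_docs = [
    ("sports", ["football", "goal"]),
    ("sports", ["football", "match"]),
    ("finance", ["stock", "market"]),
    ("finance", ["stock", "bank"]),
]
weights = train(training_docs)
print(classify(weights, ["football", "goal", "news"]))
```

In a hierarchical setting, one such weight table could be trained per internal node of the category tree, so that a document is compared only with the children of the node it has reached instead of with every class model. That arrangement is likewise an assumption about how the pieces fit together.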
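[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2010
[Classification number]: TP391.1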



Article ID: 2367860


Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/2367860.html
