Research on Chinese Text Classification Based on Hybrid Features

Published: 2018-01-20 20:45

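Keywords: text classification; feature weighting algorithm; hybrid features; support vector machine　Source: Northeastern University (東北大學), master's thesis, 2012　Type: degree thesis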

[Abstract]: With the rapid development of information technology and the arrival of the Internet self-media era, more and more information exists on the Internet as electronic text. Extracting accurate, valuable knowledge from massive volumes of web page text has become a major goal of information processing. Automatic text classification, a research hotspot in information processing, organizes and processes documents automatically by category, which goes a long way toward resolving the disorder of information resources. As a technical foundation for information retrieval, information filtering, and search engines, it has broad application prospects. This thesis takes topic-oriented retrieval of web page text in vertical search as its application background and accurate topic classification of web page text as its main task. Because vertical search places higher demands on how directly the classification result set leads to the relevant content, the thesis designs and implements a Chinese text classification system based on hybrid features, which effectively addresses the weak directness of the result sets produced by traditional web page text classification. The main research topics are the mechanism for acquiring structured information from web pages, the method for building the hybrid feature model, and the training strategy for the classifiers. For structured information acquisition, an automatic web page text extraction method is designed and implemented: by analyzing page structure, it filters out noise such as advertisements, images, and hyperlinks, and extracts the plain text of the page, including the title and body. For hybrid feature modeling, the text undergoes natural language processing such as Chinese word segmentation, a feature dimensionality reduction algorithm is used to obtain the feature term set, and the feature weighting algorithm is improved, which completes the content feature model; the improved algorithm's ability to raise classification performance is verified. In addition, a page feature set consisting of linguistic features and network features of web pages is proposed, and the page features are modeled through statistical normalization, yielding the hybrid feature vector space model of the thesis. For classifier training, the supervised classification approach from machine learning is adopted, the support vector machine (SVM) algorithm is studied, and a parameter-optimized SVM is trained on the hybrid feature model, producing a topic classifier and a page filter with better recognition performance. The system cascades the topic classifier and the page filter to form the hybrid-feature Chinese text classification system. It first fetches web page resources by their network addresses and extracts the specific text information from the fetched pages algorithmically; it then builds the hybrid feature model and constructs the classification system from that text; finally, performance tests show that the system achieves high classification accuracy and strong page filtering capability.
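The hybrid feature vector space model described in the abstract can be pictured as content features concatenated with normalized page features. The following is a minimal sketch under stated assumptions, not the thesis's implementation: it uses standard smoothed TF-IDF in place of the improved weighting algorithm (which the abstract does not specify), takes min-max scaling as one possible form of "statistical normalization", and uses two hypothetical page features (hyperlink count and body length). All function and variable names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Standard smoothed TF-IDF over pre-segmented documents (lists of terms).

    Assumption: this stands in for the thesis's improved weighting scheme.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vec = []
        for t in vocabulary:
            tf = counts[t] / length
            idf = math.log((1 + n_docs) / (1 + df[t])) + 1.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


def min_max_normalize(rows):
    """Scale each column of raw page features to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def hybrid_vectors(docs, vocabulary, page_features):
    """Concatenate content (TF-IDF) features with normalized page features."""
    content = tfidf_vectors(docs, vocabulary)
    page = min_max_normalize(page_features)
    return [c + p for c, p in zip(content, page)]


# Hypothetical toy data: three segmented pages and a two-term feature set.
docs = [
    ["股票", "市场", "上涨"],
    ["球队", "比赛", "胜利"],
    ["股票", "基金", "比赛"],
]
vocabulary = ["股票", "比赛"]
# Hypothetical page features per page: [hyperlink count, body length].
page_features = [[12, 800], [40, 300], [5, 1500]]

vectors = hybrid_vectors(docs, vocabulary, page_features)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

Each resulting vector has one weight per feature term followed by one scaled value per page feature, so it could be passed as a training sample to any SVM implementation. Scaling the page features to [0, 1] keeps large raw counts from dominating the much smaller TF-IDF weights.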
[Degree-granting institution]: Northeastern University (東北大學)
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1


Article ID: 1449419


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1449419.html
