
Research on Network Text Classification Technology

Published: 2018-05-03 23:39

  Topics: web page text extraction + Chinese word segmentation; Source: North China University of Technology, 2012 master's thesis


[Abstract]: With the development of network technology, the Internet has become the main repository from which people obtain information. The openness of the network, however, fills it with information of every kind. To let people quickly find the information that interests them, using network text classification technology to process disordered network information and put these resources in order is becoming increasingly important. Network text classification underlies fields such as information filtering and search engines, and it has therefore gradually become an active research topic.

This thesis first introduces network text extraction and the theory related to text classification: the HTML language, Chinese word segmentation, similarity calculation, weight calculation, feature extraction, and common text classification methods. It then describes a network text classification system designed and implemented on the basis of these methods.

The thesis covers the following work. For network text extraction, web page text extraction is designed and implemented by analyzing the characteristics of HTML and the structure of typical web pages. For text classification, the KNN and naive Bayes text classification algorithms are analyzed in detail and used to classify texts by domain. Building on the analysis of naive Bayes, and to address its independence assumption, the Bayesian network TAN (tree-augmented naive Bayes) model is adopted to improve the classifier: it takes relationships between pairs of words into account, which relaxes the independence assumption to some extent. A method for judging text attitude is also proposed: sentiment feature words are extracted from the text, their weights are analyzed, and the text's attitude is evaluated, which gives a second-level classification of the text. Finally, the system is tested. Experiments on corpus texts show that the system achieves a certain level of accuracy, and experiments on text extracted from web pages show that it has a certain practical value.
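The naive Bayes classification step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`train_naive_bayes`, `classify`), the toy training data, and the use of Laplace (add-one) smoothing are all assumptions made for illustration. The sketch also assumes that documents have already been segmented into words, for example by a Chinese word segmenter.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Collect counts from (tokens, label) pairs; tokens are pre-segmented words.

    Hypothetical helper, not taken from the thesis.
    """
    class_doc_counts = Counter()        # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for tokens, label in docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the class with the highest log posterior under naive Bayes."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    vocab_size = len(vocabulary)
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_doc_counts.items():
        # Log prior: fraction of training documents in this class.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Laplace smoothing (an assumption) avoids zero probabilities.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data: each document is a list of segmented words.
training_docs = [
    (["match", "team", "goal"], "sports"),
    (["team", "coach", "season"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
model = train_naive_bayes(training_docs)
print(classify(["team", "goal", "season"], model))  # sports
print(classify(["bank", "loan"], model))            # finance
```

Each word contributes to the score independently of the others, which is the independence assumption that the abstract says the TAN model relaxes by taking relationships between pairs of words into account.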
[Degree-granting institution]: North China University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1840635.html

