Research on Automatic Classification of Multiple Document Types Based on Wikipedia

Published: 2018-07-08 17:44

  Topics: digital library + text classification; Source: Wuhan University, master's thesis, 2017


[Abstract]: As the Internet has spread, online text resources have grown extremely quickly. Traditional manual classification is too inefficient to classify and manage these digital resources in a timely and effective way, so automatic text classification is needed to organize them. Current research on automatic text classification mostly targets text drawn from a single document type. A digital library, however, must classify documents that come from books, journals, web pages and other sources, covering many domains and many document types. Research on automatic classification across multiple document types is still incomplete, so improving classification algorithms for this setting can significantly support the growth and development of digital libraries. After introducing and analyzing related work and the main techniques of text classification, this thesis proposes a classification method that uses Wikipedia-based feature expansion to improve classification across different document types. To address the feature mismatch caused by differing document types, the thesis argues that a third-party corpus can link feature words that would otherwise not match, which makes it possible to compute semantic relatedness between texts of different types even when their feature words do not overlap. On the one hand, this enriches the semantic features of the text to be classified so that they match the features of a classifier trained on a different document type; at the same time, it alleviates the feature sparsity that is common in text classification.

The main research contents of this thesis are as follows:

(1) Against the background of the explosive growth of text content on the Internet, the thesis discusses the difficulty that future digital libraries face in classifying and managing network text that grows geometrically, and from this derives the need for research on automatic classification of multiple document types. It then puts forward the idea of solving this problem through feature expansion, and argues for the feasibility and applicability of the proposed classification method by reviewing the results and progress of current related research.

(2) The thesis proposes a text classification method for multiple document types based on feature expansion. Feature expansion is the core step in eliminating semantic differences between texts when classifying documents of different types. Before feature expansion, a subset of feature words must be extracted from the training text to form the candidate word set for expansion. The thesis discusses the shortcomings of traditional feature selection methods and illustrates them with examples, then presents the principle and method of an improved feature selection approach and shows by calculation that the new approach addresses those shortcomings. Finally, the improved feature selection method is used to extract the candidate word set for feature expansion, and comparative experiments demonstrate its effectiveness.

(3) To address feature mismatch in automatic classification across different document types, the thesis proposes a text classification method based on feature expansion that uses semantic relatedness computed from Wikipedia to measure how strongly feature words are related. After feature expansion of the text to be classified, the thesis uses the LDA topic model to represent the data. Because the standard LDA model cannot properly handle weighted feature words, the thesis modifies it and proposes a weighted LDA model that can model and solve for weighted feature words in the same way. Since the feature words carry different weights, this also improves the precision and accuracy of the LDA model itself.
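The feature expansion step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract gives no formulas, so the toy relatedness table, the threshold, the weighting rule (source weight times relatedness), and the names `RELATEDNESS`, `relatedness` and `expand_features` are all assumptions made for illustration. In the thesis the relatedness scores would be computed from Wikipedia.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical relatedness scores in [0, 1]. These are made-up toy values
# standing in for Wikipedia-derived semantic relatedness.
RELATEDNESS: Dict[Tuple[str, str], float] = {
    ("neural network", "deep learning"): 0.8,
    ("neural network", "classifier"): 0.5,
    ("library", "catalog"): 0.7,
}


def relatedness(a: str, b: str) -> float:
    """Symmetric lookup in the toy table; unknown pairs score 0.0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get((a, b), RELATEDNESS.get((b, a), 0.0))


def expand_features(
    doc: Dict[str, float],
    candidates: Iterable[str],
    threshold: float = 0.6,
) -> Dict[str, float]:
    """Add candidate feature words that are related to words in the document.

    A candidate missing from the document is added when at least one document
    word is related to it with a score >= threshold. Its weight is the largest
    (source weight * relatedness) over those words (an assumed rule).
    """
    expanded = dict(doc)  # keep the original features and weights
    for cand in candidates:
        if cand in doc:
            continue
        scores = [
            weight * relatedness(term, cand)
            for term, weight in doc.items()
            if relatedness(term, cand) >= threshold
        ]
        if scores:
            expanded[cand] = max(scores)
    return expanded


if __name__ == "__main__":
    web_page = {"neural network": 1.0, "library": 0.5}
    classifier_features = ["deep learning", "classifier", "catalog", "library"]
    print(expand_features(web_page, classifier_features))
```

The expanded document carries weighted feature words, which is why the abstract goes on to describe a weighted variant of LDA: a standard bag-of-words topic model expects integer word counts rather than fractional weights.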
[Degree-granting institution]: Wuhan University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1


Article ID: 2108218



Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2108218.html

