Research on Patent Text Classification and Evolution Based on the LDA Model
[Abstract]: Patent documents are carriers of technical information. Their text contains a large amount of such information, which makes them one of the best sources of technical information. With China's rapid development, the number of patent applications in China has increased year by year, and in 2016 it was the highest in the world for the fifth consecutive year. Research on information mining techniques for this massive body of patent documents has therefore become a shared focus of government and enterprise research. The LDA model is a typical probabilistic topic model that has been widely used in natural language processing, data mining, and artificial intelligence to analyze the classification and evolution of text. Probabilistic topic models, however, have seldom been applied to patent text, so this thesis builds on the existing technical framework for patent text information mining and uses the LDA model to study the classification and evolution of patent text. The specific contents are as follows:

(1) Several traditional probabilistic topic models are summarized and briefly described, and the LDA model used in this work is then described in detail, including the related probability distributions and parameter inference algorithms. Finally, typical classification algorithms and evolution analysis methods for patent text are reviewed.

(2) To address the problems of vector space model (VSM) text representation in traditional automatic patent text classification, a patent text classification method based on the LDA model is proposed. The method uses the LDA topic model to model the patent text corpus and extracts the document-topic and topic-word matrices of the patent text, which reduces dimensionality and captures the semantic relations between documents. It also introduces a class-topic matrix to extend the topic semantics of each class. Topic similarity is used for classification at the sub-layer level, and the KNN classification method is used for the small classes. Experimental results show that, compared with a KNN patent text classification method based on the vector space text representation model, this method achieves higher classification evaluation scores.

(3) Probabilistic topic models are used to study the topic evolution of patent literature and to identify development trends in patent technology. The LDA model is applied to patent text by time window, the optimal number of topics is determined by perplexity, topic vectors are extracted according to the structural characteristics of patent text, and the correlation between topics is measured with JS divergence. The IPC classification is introduced to calculate technical topic strength, and the evolution of topic strength, topic content, and technical topic strength is then analyzed. The experimental results show that this method can analyze how patent technology evolves over time and where it is trending. The method can mine the topics of patent literature in depth and help practitioners understand the evolution process and trends of patent technology.
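The topic-correlation step in (3) can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in the thesis, so the following are assumptions: the standard Jensen-Shannon divergence with base-2 logarithms, topics represented as word-probability vectors over a shared vocabulary, and a simple nearest-topic rule for linking adjacent time windows. The function names and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same vocabulary.

    With base-2 logarithms the value lies in [0, 1]; 0 means identical topics.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same vocabulary")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def link_topics(prev_topics, next_topics):
    """For each topic in the later window, find the closest earlier topic.

    Returns a list of (prev_index, next_index, divergence) tuples.
    The nearest-topic rule is an assumption made for illustration.
    """
    links = []
    for j, q in enumerate(next_topics):
        i, d = min(
            ((i, js_divergence(p, q)) for i, p in enumerate(prev_topics)),
            key=lambda pair: pair[1],
        )
        links.append((i, j, d))
    return links


# Hypothetical topic-word distributions over a 4-word vocabulary,
# one list of topics per time window.
window_a = [[0.70, 0.20, 0.05, 0.05], [0.05, 0.05, 0.60, 0.30]]
window_b = [[0.10, 0.05, 0.50, 0.35], [0.60, 0.30, 0.05, 0.05]]

for prev_idx, next_idx, d in link_topics(window_a, window_b):
    print(f"earlier topic {prev_idx} -> later topic {next_idx}: JS = {d:.4f}")
```

A small divergence between a topic in one window and a topic in the next suggests the same technical theme continuing over time, while a later topic with no close predecessor may indicate a newly emerging theme. Any threshold for deciding "close" would have to be chosen empirically.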
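[Degree-granting institution]: Jiangxi University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1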