Research on Patent Text Classification and Evolution Based on the LDA Model

Published: 2018-10-23 12:03

[Abstract]: Patent documents are carriers of technical intelligence. Their text contains a large amount of technical information, which makes them one of the best sources of such intelligence. With China's rapid development, the number of domestic patent applications has risen year by year, and by 2016 China had ranked first in the world in patent applications for the fifth consecutive year. Developing information mining techniques for this mass of patent documents has therefore become a shared research focus for government and industry. The LDA (Latent Dirichlet Allocation) model is a typical probabilistic topic model that is widely used in natural language processing, data mining, and artificial intelligence to analyze text classification and evolution. Probabilistic topic models, however, have rarely been applied to patent text. Building on the existing technical framework for patent text mining, this thesis uses the LDA model to study the classification and evolution of patent text. The specific contents are as follows.

(1) Several traditional probabilistic topic models are first summarized briefly. The LDA model used in this thesis is then described in detail, including the relevant probability distributions and parameter inference algorithms. Finally, some typical classification algorithms and evolution analysis methods for patent text are reviewed.

(2) To address the problems of the vector space model (VSM) text representation used in traditional automatic patent classification, a patent text classification method based on the LDA model is proposed. The method models the patent corpus with LDA and extracts the document-topic and topic-word matrices, which reduces dimensionality and captures semantic relations between documents. It introduces a class-topic matrix to extend each class with topic semantics, builds a hierarchical classification from topic similarity, and applies KNN classification at the subclass level. Experimental results: compared with a KNN patent classifier based on the vector space representation, the method achieves higher classification evaluation scores.

(3) Probabilistic topic models are used to study the topic evolution of patent literature and to identify trends in patent technology. The LDA model is fitted to patent text in separate time windows, perplexity is used to determine the optimal number of topics, topic vectors are extracted according to the structural characteristics of patent text, and JS (Jensen-Shannon) divergence measures the association between topics. IPC classification codes are introduced to compute technical topic strength. The study then covers evolution in three respects: topic strength, topic content, and technical topic strength. Experimental results show that the method can analyze how patent technology evolves over time reasonably well. It can mine the topics of patent literature in depth and help practitioners understand the evolution and trends of patent technology.
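The topic-association step in part (3) can be illustrated with a small sketch. It assumes each topic is represented as a probability distribution over a shared vocabulary and that a topic in a later time window is linked to the earlier topic with the smallest JS divergence. The function names, the base-2 logarithm, the `threshold` parameter, and the toy distributions are all illustrative assumptions; the abstract does not specify them.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions (base 2, range [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def link_topics(prev_topics, next_topics, threshold=0.5):
    """Link each later-window topic to its closest earlier-window topic.

    Returns (prev_index, next_index, divergence) tuples. prev_index is None
    when even the closest earlier topic exceeds the (hypothetical) threshold,
    which this sketch treats as a newly emerged topic.
    """
    links = []
    for j, q in enumerate(next_topics):
        i, d = min(
            ((i, js_divergence(p, q)) for i, p in enumerate(prev_topics)),
            key=lambda pair: pair[1],
        )
        links.append((i if d <= threshold else None, j, d))
    return links

# Toy topic-word distributions over a 4-word vocabulary (made-up numbers).
window_1 = [[0.70, 0.20, 0.05, 0.05], [0.05, 0.05, 0.20, 0.70]]
window_2 = [[0.10, 0.10, 0.20, 0.60], [0.60, 0.30, 0.05, 0.05]]

for prev_idx, next_idx, d in link_topics(window_1, window_2):
    print(f"window-2 topic {next_idx} <- window-1 topic {prev_idx} (JS = {d:.4f})")
```

JS divergence is symmetric and bounded, unlike KL divergence, which is a common reason to prefer it when comparing topics from models that were fitted independently on different time windows.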
[Degree-granting institution]: Jiangxi University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1



Article ID: 2289173


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2289173.html
