Research on a Sub-topic Partitioning Method for News Documents Based on Full-Covering Granular Computing
Published: 2018-05-12 07:33
Topic: full-covering granular computing + topic models; Source: master's thesis, Taiyuan University of Technology, 2017
【Abstract】: In today's era of information explosion, the volume of information is expanding rapidly and data pours into daily life from every direction. Faced with such massive data, users who want to find the news topics that interest them quickly and accurately confront an enormous challenge. For large numbers of news events, how to organize and classify them by topic, so that information on related topics can be aggregated automatically, has become an important research problem in natural language processing. Topic detection and partitioning techniques arose in response; they aim to organize, search, and structure documents drawn from different text collections effectively. Full-covering granular computing is a new approach to information processing and data mining, offering a fresh way to mine large-scale data containing uncertain or incomplete information. It comprises full-covering theory, the granulation of different granularities, and operations on granules, and it provides a new way to solve the sub-topic partitioning problem. The main contributions of this thesis are:
1. The LDA (Latent Dirichlet Allocation) topic model is applied to massive news corpora for semantic analysis and modeling, extracting the latent topics of each news document to obtain the "document-topic" probability matrix (conventionally the θ matrix in LDA). Through repeated experiments, a suitable threshold is set on the probabilities in this matrix, converting the "document-topic" matrix into a full-covering model. On the basis of full-covering granular computing, granule reduction is then used to delete redundant cover elements and obtain a minimal covering.
2. From the viewpoint of set theory, a derived partition algorithm DP (Derived Partition) for full-covering granular computing is proposed. Its theoretical basis is discussed, its concrete procedure is presented, and its time complexity is analyzed. The structure and flow of the algorithm are then optimized, and extensive experiments confirm that the improvements do raise its performance; finally, the algorithm is further explained with a worked example.
3. On the basis of the LDA topic model and the derived partition algorithm, a sub-topic partitioning method for news documents based on full-covering granular computing is designed. Comparative experiments on the Sogou news corpus against three conventional methods (a Baseline method, the VSM method, and the classic Single-Pass method) verify the applicability, feasibility, and extensibility of the proposed method from different angles, showing that it achieves good sub-topic partitioning.
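The pipeline described in the abstract (threshold the LDA document-topic matrix into a covering, delete redundant cover elements, then derive a partition from the covering) can be sketched in a few lines of Python. This is a hypothetical illustration, not the thesis's implementation: the toy matrix, the threshold value, the redundancy criterion (a cover element is redundant if the union of the other elements still contains it), and the rule that two documents share a class iff they lie in exactly the same cover elements are all assumptions made for the sketch.

```python
def covering_from_theta(theta, tau):
    """Each topic t covers the documents d with P(topic t | doc d) >= tau."""
    n_topics = len(theta[0])
    cover = []
    for t in range(n_topics):
        block = frozenset(d for d, row in enumerate(theta) if row[t] >= tau)
        if block:
            cover.append(block)
    return cover

def reduce_covering(cover):
    """Granule reduction: drop a cover element when the union of the
    remaining elements still contains it (one plausible redundancy test)."""
    result = list(cover)
    for block in sorted(cover, key=len):  # try to discard small blocks first
        others = [b for b in result if b != block]
        if others and block <= set().union(*others):
            result = others
    return result

def derived_partition(cover, universe):
    """Induced partition: two documents fall in the same class iff they
    belong to exactly the same cover elements (same membership signature)."""
    classes = {}
    for d in universe:
        sig = frozenset(i for i, b in enumerate(cover) if d in b)
        classes.setdefault(sig, set()).add(d)
    return list(classes.values())

# Toy "document-topic" matrix: 4 documents x 3 topics (rows sum to 1).
theta = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.4, 0.5],
]
cover = covering_from_theta(theta, tau=0.3)   # [{0,1}, {1,2,3}, {3}]
cover = reduce_covering(cover)                # {3} is redundant -> [{0,1}, {1,2,3}]
parts = derived_partition(cover, range(4))    # [{0}, {1}, {2,3}]
```

Here documents 2 and 3 end up in one class because, after reduction, they are covered by exactly the same cover element, while document 1 sits in the overlap of two elements and forms its own class.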
【Degree-granting institution】: Taiyuan University of Technology
【Degree level】: Master
【Year conferred】: 2017
【CLC number】: TP391.1
Article ID: 1877710
Link: http://sikaile.net/shoufeilunwen/xixikjs/1877710.html