Streaming Text Data Mining Based on Hierarchical Semantic Structure

Published: 2018-10-16 10:56

[Abstract]: Text, as a basic means of human communication, occupies an extremely important place among unstructured data. Compared with other forms of data, text data is usually of high value, so methods for automatically analyzing and mining it have long been a hot topic in computer science. Text data on the Internet is growing rapidly and is generated continuously, so it can be viewed as a set of continuous text streams. Compared with traditional text data, streaming text data has two new characteristics: 1) much of the data in a text stream is of low quality, which makes it hard to extract useful semantic information; 2) the patterns in a text stream change dynamically, so mining techniques must capture these changes accurately. These characteristics pose new challenges to existing text data mining techniques. Streaming text data mining is not yet mature, and algorithms that address these challenges are urgently needed.

As a common way of organizing data, a hierarchical structure not only reflects the inherent relationships among data more precisely, but is also an important route to adaptive methods, which can automatically match the constantly changing patterns in streaming data. This thesis applies hierarchical structure to streaming text data mining. Approaching the problem from three directions (concept hierarchy construction, rare category detection, and online topic detection), it proposes three methods intended to improve the performance of streaming text data mining. Building on these methods, it then proposes a semi-supervised online hierarchical topic model for streaming text data mining.

The specific contributions of the thesis are as follows:

1) To address the low precision of semantic relation extraction by existing concept hierarchy construction methods on informal short texts such as microblogs and user reviews, a multi-way concept hierarchy construction method based on a composite semantic distance is proposed. The composite semantic distance combines the advantages of semantic dictionary distance and context distance, and ensures both the method's range of applicability and the precision of the extracted semantic relations. The thesis also proposes an improved multi-way agglomerative clustering algorithm for building the concept hierarchy. Compared with traditional agglomerative clustering, multi-way agglomerative clustering preserves the relative distances between concept pairs. In addition, an improved concept hierarchy similarity measure is proposed, which resolves the multiple-matching problem that can occur in the original form of the measure. Experimental results show that the concept hierarchy generated by this method has the highest similarity to the ground-truth concept hierarchy among all compared methods.

2) To address the problem of discovering new concepts or topics in the concept hierarchy or topic hierarchy of a text stream, a rare category detection method based on hierarchical density clustering is proposed. In social networks and news streams, discovering novel documents or emerging topics is valuable, and anomaly detection can play a key role in detecting novel data. To improve on existing detection methods, the thesis first proposes a semi-supervised density clustering algorithm based on relative distance constraints and kernel functions (Relative Comparison Kernel Mean Shift, RKMS). Compared with its original form, RKMS is more scalable and better suited to hierarchical clustering. The thesis then builds a hierarchy-based rare category detection method on RKMS. Compared with existing methods of the same kind, this method does not require the number of categories to be specified in advance, and it can refine the model step by step by combining active learning and semi-supervised learning. Experimental results show that the method outperforms the other methods under both linear and non-linear mappings.

3) To address the problem of detecting and tracking topics in a continuously arriving text stream, an online hierarchical topic model (Hierarchical Online Non-negative Matrix Factorization, HONMF) is proposed. Most existing online topic models organize the discovered topics in a flat manner, but treating each topic as an independent individual ignores the latent relationships among topics and therefore limits the expressive power of these models. To address this, the thesis first extends online dictionary learning and proposes a hierarchical online sparse matrix factorization method that generates topics organized as a hierarchy. Borrowing the idea of mean shift clustering, the thesis also proposes a mechanism based on topic bandwidth for controlling the structure of the topic hierarchy, which adaptively determines the number of topic nodes and the depth of the hierarchy. In addition, the thesis proposes criteria for detecting emerging and fading topics within an existing topic hierarchy, and uses these criteria to drive the dynamic evolution of the hierarchy. Experimental results show that HONMF finds higher-quality topics in less running time and can track changes in topic structure.

4) To verify the overall effect of the research approach and to further improve the performance of HONMF, a semi-supervised hierarchical online topic detection framework based on semantic relations (Semantic Relation based Semi-supervised Hierarchical Online Non-negative Matrix Factorization, SSHONMF) is proposed, which integrates the preceding work into a single pipeline. The pipeline first generates a concept hierarchy for the specific text mining task from a semantic dictionary and training documents, and adjusts the original document matrix based on the semantic relations in that hierarchy. It then uses HONMF to detect the topic hierarchy in the text stream, while selecting clue documents from the topic hierarchy using the selection criterion from the thesis's rare category detection method. Finally, it learns a new similarity measure from the clue documents and applies it in subsequent HONMF processing. Experimental results show that, by combining the preceding methods, SSHONMF improves on HONMF, which supports the soundness and effectiveness of the research approach.
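The abstract says the topic-bandwidth mechanism borrows from mean shift clustering, in which a bandwidth, rather than a preset cluster count, determines how many groups emerge. The following is a minimal sketch of plain Gaussian-kernel mean shift in NumPy. It is not the thesis's RKMS or HONMF; the function name `mean_shift`, the toy data, and the bandwidth values are assumptions chosen purely for illustration.

```python
import numpy as np


def mean_shift(points, bandwidth, n_iter=50, tol=1e-6):
    """Plain Gaussian-kernel mean shift (illustrative only).

    Returns (modes, labels); the number of modes is not specified in
    advance but follows from the bandwidth.
    """
    points = np.asarray(points, dtype=float)
    shifted = points.copy()
    for _ in range(n_iter):
        # Squared distances between current positions and the fixed data.
        diff = shifted[:, None, :] - points[None, :, :]
        sq_dist = np.sum(diff ** 2, axis=2)
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        # Move each position to the kernel-weighted mean of the data.
        new_shifted = weights @ points / weights.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(new_shifted - shifted)) < tol
        shifted = new_shifted
        if converged:
            break

    # Merge converged positions lying within one bandwidth of a known mode.
    modes = []
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(shifted):
        for k, m in enumerate(modes):
            if np.linalg.norm(p - m) < bandwidth:
                labels[i] = k
                break
        else:
            modes.append(p)
            labels[i] = len(modes) - 1
    return np.array(modes), labels


rng = np.random.default_rng(0)
# Hypothetical toy data: three well-separated 2-D blobs of 40 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
data = np.vstack([c + 0.3 * rng.standard_normal((40, 2)) for c in centers])

modes_small, labels_small = mean_shift(data, bandwidth=1.0)
modes_large, labels_large = mean_shift(data, bandwidth=20.0)
print(len(modes_small), len(modes_large))
```

On this toy data the small bandwidth should recover the three blobs, while the large bandwidth merges everything into a single mode. That is the sense in which a bandwidth can control the number of nodes without a preset count.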
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP391.1


Related doctoral dissertations (top 1)

1 Tu Ding (涂鼎); Streaming Text Data Mining Based on Hierarchical Semantic Structure [D]; Zhejiang University; 2016


Article ID: 2274145


Article link: http://sikaile.net/shoufeilunwen/xxkjbs/2274145.html
