Streaming Text Data Mining Based on Hierarchical Semantic Structure
[Abstract]: As a basic medium of human communication, text occupies an important position among unstructured data. Compared with other forms of data, text data usually carries high value, so automatic analysis and mining of text data has long been an active research topic. Text data on the Internet is growing rapidly and is generated continuously, so it can be viewed as a continuous text stream. Compared with traditional text data, streaming text data has two new features: 1) much of the data in a text stream is of low quality, which makes effective semantic information harder to extract; 2) the patterns in a text stream change dynamically, so mining techniques must capture these changes accurately. These features pose new challenges for existing text data mining techniques. Streaming text data mining is not yet mature, and algorithms that address these challenges are urgently needed. As a common way of organizing data, a hierarchical structure not only reflects the inherent relations in the data more accurately, but is also an important means of building adaptive methods, which in turn can automatically match the changing patterns in streaming data. This dissertation applies hierarchical structure to streaming text data mining. It proposes three methods, addressing concept hierarchy construction, rare category detection, and online topic detection, in order to improve the performance of streaming text data mining. Finally, building on these methods, it presents a semi-supervised online hierarchical topic model for streaming text data mining.
The specific contributions of this dissertation are as follows:
1) A multi-path concept hierarchy construction method based on composite semantic distance is proposed, addressing the problem that existing concept hierarchy construction methods extract semantic relations with insufficient precision from non-standard short texts such as microblog posts and user comments. The composite semantic distance combines the advantages of semantic dictionary distance and context distance, preserving both the method's range of application and the accuracy of the acquired semantic relations. An improved multi-path agglomerative clustering algorithm is also proposed to construct the concept hierarchy. In contrast to traditional agglomerative clustering, multi-path agglomerative clustering can preserve the relative near-far relations between concept pairs. In addition, an improved concept hierarchy similarity criterion is proposed, which solves the multiple-matching problem that may occur in its original form. The experimental results show that the concept hierarchy generated by this method is closer to the real concept hierarchy than those of all compared methods.
2) A rare category detection method based on hierarchical density clustering is proposed, addressing the problem of discovering new concepts or topics in the concept hierarchy or topic hierarchy of a text stream. In social networks and news streams, discovering new documents or emerging topics is valuable, and anomaly detection plays a key role in detecting such new data. To improve on existing detection methods, this dissertation proposes a semi-supervised density clustering algorithm based on relative distance constraints and kernel functions (RKMS). Compared with the algorithm it extends, RKMS is more extensible and better suited to hierarchical clustering scenarios. Based on RKMS, the dissertation then presents a rare category detection method built on a hierarchical structure. Compared with prior similar methods, it does not require the number of categories to be specified in advance, and it can optimize the model step by step by combining active learning and semi-supervised learning. The experimental results show that this method outperforms the compared methods under both linear and nonlinear mappings.
3) An online hierarchical topic model (HONMF) is proposed, addressing the problem of detecting and tracking topics in a continuously arriving text stream. Most existing online topic models organize the discovered topics in a flat manner, treating each topic as an independent entity and ignoring the potential relationships among topics, which limits the expressive power of these models. To solve this problem, the dissertation first extends the online dictionary learning method and proposes a hierarchical online sparse matrix factorization method that generates topics organized in hierarchical form. It also proposes a topic hierarchy control mechanism based on topic bandwidth (Mean Shift), which adaptively determines the number and depth of topic nodes. In addition, it puts forward criteria for detecting emerging and disappearing topics in the existing topic hierarchy, and realizes the dynamic evolution of the topic hierarchy based on these criteria. The experimental results show that HONMF finds higher-quality topics in less running time and can track changes in topic structure.
4) To verify the overall effectiveness of this research route and to further improve the performance of HONMF, a semi-supervised hierarchical online topic detection framework based on semantic relations (SSHONMF) is proposed, which combines the work described above into a single pipeline. The pipeline first generates a concept hierarchy for the specific text mining task from a semantic dictionary and training documents, and adjusts the original document matrix based on the semantic relations. It then uses HONMF to detect the topic hierarchy in the text stream, while selecting clue documents from the topic hierarchy using the selection criterion of the rare category detection method described above. Finally, it learns a new similarity measure from the clue documents, which is used in subsequent HONMF processing. The experimental results show that SSHONMF, by combining the above methods, outperforms HONMF, which supports the soundness and validity of the research route.
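The abstract notes that the topic hierarchy control mechanism is based on Mean Shift and that the rare category detection method does not need the number of categories in advance. As a rough illustration of that one property, here is a minimal sketch of plain flat-kernel mean shift in Python. It is not RKMS or HONMF; the function name `mean_shift`, its parameters, the half-bandwidth merge rule, and the toy points are all assumptions made for this example.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Plain flat-kernel mean shift (illustrative only, not RKMS/HONMF).

    Returns (modes, labels): the cluster centres found and, for each input
    point, the index of the mode it converged to.
    """
    def shift(p):
        # Move p to the mean of all data points within `bandwidth` of it.
        near = [q for q in points if math.dist(p, q) <= bandwidth]
        return tuple(sum(c) / len(near) for c in zip(*near))

    converged = []
    for p in points:
        cur = tuple(p)
        for _ in range(max_iter):
            nxt = shift(cur)
            if math.dist(cur, nxt) < tol:
                break
            cur = nxt
        converged.append(cur)

    # Assumed merge rule: converged positions closer than half a bandwidth
    # are treated as the same mode.
    modes, labels = [], []
    for c in converged:
        for i, m in enumerate(modes):
            if math.dist(c, m) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            modes.append(c)
            labels.append(len(modes) - 1)
    return modes, labels


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

modes_small, labels_small = mean_shift(points, bandwidth=1.0)
modes_large, labels_large = mean_shift(points, bandwidth=10.0)
print(len(modes_small), labels_small)  # 2 [0, 0, 0, 1, 1, 1]
print(len(modes_large), labels_large)  # 1 [0, 0, 0, 0, 0, 0]
```

The number of clusters is never passed in; it falls out of the bandwidth. A small bandwidth separates the two groups, while a large one merges them into a single mode, which is the kind of knob a bandwidth-based hierarchy control mechanism could tune at each level.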
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP391.1
Related doctoral dissertations (top 1)
1 Tu Ding (涂鼎); Streaming Text Data Mining Based on Hierarchical Semantic Structure [D]; Zhejiang University; 2016