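Research and Implementation of Automatic Detection and Summarization Algorithms for Microblog News Events Based on Heterogeneous Networks
Published: 2018-06-29 23:13
Topics: heterogeneous information network + cross-modal fusion; Source: Southwest Jiaotong University, 2017 master's thesis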
[Abstract]: Microblog platforms now play an important role in spreading information in real time. However, because microblog data is large in scale, highly real-time, and unstructured, common data mining methods are no longer suitable for processing it. Traditional microblog event detection and summarization methods ignore the rich visual and social information available on these platforms. To overcome this shortcoming and help people quickly grasp the essential meaning of large volumes of microblog posts, this thesis takes about 1 million data items covering multiple hot topics on the well-known social network Twitter as its main research object and studies cross-modal microblog event detection and summarization. Considering text, visual, social, temporal, and other features, it proposes an event detection and summarization framework based on heterogeneous networks. First, in the data preprocessing stage, strict filtering patterns are defined to remove meaningless posts and images. Next, in the event detection stage, a heterogeneous network is used to model the heterogeneous nature of microblog data, a late multimodal fusion entity similarity model combines the heterogeneous features of the Twitter data, and an approximate similarity algorithm generates a homogeneous graph from the fused features. An improved DBSCAN algorithm is then applied to the homogeneous similarity graph, incorporating a probabilistic model to address the sub-topic fragmentation problem, and the resulting clusters are ranked by the popularity and novelty of their sub-topics. Finally, a text summary and a visual summary are generated for each topic. The contributions of this thesis are as follows:
1. Multimodal information is used to build a dynamic heterogeneous information network, addressing the inability of traditional methods to exploit the rich auxiliary information in microblogs. An AFF function fuses the multimodal features, taking their semantic similarity and spatio-temporal proximity into account to distinguish events. The heterogeneous network is converted into a homogeneous network, which retains the key information while simplifying the structure for the later detection and summarization steps.
2. To improve the diversity of detection and summarization and to reduce topic fragmentation, the HRDBSCAN algorithm is proposed for the clustering stage; it builds on the original clustering algorithm and uses probabilistic and statistical methods to merge similar clusters. In the summarization stage, the sub-topic summary results are clustered again to ensure that each sub-topic appears only once in the summary.
3. Experiments are carried out on a Twitter dataset containing a number of real events. The experimental results demonstrate the novelty and superiority of the proposed framework compared with existing methods.
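The detection pipeline described in the abstract (late fusion of per-modality similarities, a homogeneous similarity graph, then density-based clustering) can be illustrated with a minimal sketch. This is not the thesis's AFF function or its HRDBSCAN algorithm, neither of which is specified here. The `Post` fields, the fusion weights, the time-decay constant `tau`, the similarity threshold, and `min_pts` are all hypothetical choices, and the clustering step is plain DBSCAN without the probabilistic cluster merging.

```python
import math
from dataclasses import dataclass


@dataclass
class Post:
    words: frozenset   # text tokens
    visual: tuple      # image feature vector (hypothetical representation)
    tags: frozenset    # hashtags or mentions, used as the social signal
    time: float        # timestamp in hours


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def fused_similarity(p, q, weights=(0.5, 0.2, 0.3), tau=6.0):
    """Late fusion: score each modality separately, then combine the scores.

    The weights and the exponential time decay are illustrative assumptions.
    """
    w_text, w_visual, w_social = weights
    content = (w_text * jaccard(p.words, q.words)
               + w_visual * cosine(p.visual, q.visual)
               + w_social * jaccard(p.tags, q.tags))
    return content * math.exp(-abs(p.time - q.time) / tau)


def similarity_graph(posts, threshold=0.3):
    """Homogeneous graph: i and j are linked when fused similarity >= threshold."""
    neighbors = [[] for _ in posts]
    for i in range(len(posts)):
        for j in range(i + 1, len(posts)):
            if fused_similarity(posts[i], posts[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def dbscan_on_graph(neighbors, min_pts=2):
    """Plain DBSCAN over the graph; returns one label per node, -1 for noise."""
    labels = [None] * len(neighbors)
    cluster = -1
    for i in range(len(neighbors)):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) + 1 < min_pts:   # +1 counts the point itself
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) + 1 >= min_pts:
                queue.extend(neighbors[j])    # j is a core point: keep expanding
    return labels


# Toy data: three earthquake posts, two football posts, one unrelated late post.
posts = [
    Post(frozenset({"earthquake", "city", "damage"}), (1.0, 0.0), frozenset({"#quake"}), 0.0),
    Post(frozenset({"earthquake", "damage", "rescue"}), (0.9, 0.1), frozenset({"#quake"}), 1.0),
    Post(frozenset({"earthquake", "city", "rescue"}), (1.0, 0.1), frozenset({"#quake"}), 2.0),
    Post(frozenset({"match", "goal", "final"}), (0.0, 1.0), frozenset({"#cup"}), 1.0),
    Post(frozenset({"goal", "final", "win"}), (0.1, 1.0), frozenset({"#cup"}), 1.5),
    Post(frozenset({"recipe", "cake"}), (0.5, 0.5), frozenset(), 30.0),
]
labels = dbscan_on_graph(similarity_graph(posts))
print(labels)
```

Fusing after scoring each modality keeps the modalities independent, so a post with no image or no hashtags still receives a usable score from the remaining features. Multiplying by the time decay means that posts with similar content but far apart in time are less likely to be linked as the same event.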
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092
Article ID: 2083771
Article link: http://sikaile.net/guanlilunwen/ydhl/2083771.html