Research on Chinese-English Cross-Language Topic Discovery Based on the ICE-LDA Model
Published: 2018-10-17 17:05
[Abstract]: With the rapid growth of the Internet under globalization, cross-language web data mining has become a focus of public opinion analysis in China and abroad. Detecting hot topics across Chinese and English web content effectively and in real time is important for tracking public opinion and how it evolves. Online news is a major component of web public opinion and, with the widespread adoption of the Internet, has become an important and convenient source of information. First, this paper collects Chinese and English online news as its data source and proposes ICE-LDA, a model that improves on LDA, to discover co-occurring topics across the English and Chinese web environments. Topics produced by the model are vectorized, and their JS distance and the similarity of their topic-document distributions are measured. Second, from the mixed Chinese-English news data gathered by a crawler, the paper builds a comparable parallel corpus and a non-comparable corpus and models topics on each; during modeling, the TF-IDF algorithm is used to extract feature words from the documents, which removes meaningless noise words and improves the topic feature representation. Finally, two different topic vectorization methods are used to model cross-language co-occurring topic discovery. Experiments on the real dataset built with the paper's crawler show that the improved topic model can discover cross-language co-occurring topics in the comparable corpus without prior topic pairs, and can also discover co-occurring topics when the corpus is unbalanced.
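The JS-distance comparison of vectorized topics mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formula, so the sketch assumes each topic has already been mapped to a non-negative weight vector over a shared index (for example, a distribution over aligned documents), takes the distance to be the square root of the base-2 Jensen-Shannon divergence, and pairs each Chinese topic with its nearest English topic. The function names, the `threshold` parameter, and the sample vectors are all hypothetical.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded by [0, 1]."""
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    p, q = normalize(p), normalize(q)
    # The mixture m is positive wherever p or q is, so KL is well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the JS divergence (assumed definition of 'JS distance')."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


def match_topics(zh_topics, en_topics, threshold=0.5):
    """Pair each Chinese topic with its closest English topic.

    A pair is kept only if its distance is at most `threshold`
    (a hypothetical cut-off, not a value taken from the paper).
    """
    pairs = []
    for i, zh in enumerate(zh_topics):
        j, dist = min(
            ((j, js_distance(zh, en)) for j, en in enumerate(en_topics)),
            key=lambda item: item[1],
        )
        if dist <= threshold:
            pairs.append((i, j, dist))
    return pairs


# Made-up topic vectors over three shared dimensions, for illustration only.
zh_topics = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
en_topics = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
for zh_id, en_id, dist in match_topics(zh_topics, en_topics):
    print(f"zh topic {zh_id} <-> en topic {en_id}: JS distance {dist:.3f}")
```

A lower distance means the two topics spread their weight in a more similar way. If the paper's "JS distance" refers to the divergence itself rather than its square root, `js_divergence` can be used directly and the threshold adjusted accordingly.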
[Author affiliations]: Institute of Cyberspace Security, Sichuan University; College of Computer Science, Sichuan University
[Funding]: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan University Young Teachers' Start-up Fund (2015SCU11079)
[Classification number]: TP391.1
Article ID: 2277362
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2277362.html