
Research on Chinese-English Cross-Language Topic Discovery Based on the ICE-LDA Model

Published: 2018-10-17 17:05
[Abstract]: In recent years the Internet has developed rapidly against the backdrop of globalization, and cross-language web data mining has become a prominent problem in public opinion analysis both in China and abroad. Detecting hot topics in Chinese and English web environments effectively and in real time is critical to understanding public opinion and how it develops. Online news is an important component of online public opinion information and, with the widespread adoption of the Internet, has become a major source from which people obtain information quickly and conveniently. First, this paper collects Chinese and English online news as its data source and proposes ICE-LDA, a model that improves on LDA, to discover co-occurring topics across English and Chinese web environments. Topics produced by the model are vectorized and then compared using JS distance and a similarity measure over topic-document distributions. Second, from the mixed Chinese and English news data gathered by the crawler, the paper builds a comparable parallel corpus and a non-comparable corpus and performs topic modeling on each. During modeling, the TF-IDF algorithm is used to extract feature words from the documents, which removes meaningless noise words and improves the topic feature representation. Finally, two different topic vectorization methods are used to model cross-language co-occurring topic discovery. Experimental results on a real dataset built with the crawler designed in this paper show that the improved topic model can discover cross-language co-occurring topics in the comparable corpus without requiring prior topic pairs, and can also discover co-occurring topics when the corpus is unbalanced.
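The abstract mentions comparing vectorized topics by JS distance. The following is a minimal sketch of that idea, not the paper's ICE-LDA implementation. It assumes each topic has already been vectorized as a non-negative weight vector over a shared index (for example, aligned document pairs in a comparable corpus); the function names `js_distance` and `match_topics` and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance between two non-negative weight vectors.

    Both vectors are normalized to sum to 1. The result is the square
    root of the JS divergence computed with base-2 logarithms, so it
    lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        raise ValueError("topic vectors must have positive total weight")
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))


def match_topics(zh_topics, en_topics, threshold=0.5):
    """Pair each Chinese topic with its nearest English topic.

    A pair is kept only if its JS distance is at most `threshold`
    (an assumed cutoff, not a value taken from the paper).
    Returns a list of (zh_index, en_index, distance) tuples.
    """
    pairs = []
    for i, zt in enumerate(zh_topics):
        j, d = min(
            ((j, js_distance(zt, et)) for j, et in enumerate(en_topics)),
            key=lambda t: t[1],
        )
        if d <= threshold:
            pairs.append((i, j, d))
    return pairs


if __name__ == "__main__":
    # Toy topic vectors over three shared dimensions (illustrative data only).
    zh = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
    en = [[0.1, 0.1, 0.8], [0.8, 0.2, 0.0]]
    for zi, ei, dist in match_topics(zh, en):
        print(f"zh topic {zi} <-> en topic {ei} (JS distance {dist:.3f})")
```

Nearest-neighbor matching with a cutoff is only one simple way to turn pairwise distances into co-occurring topic pairs; the paper may combine the JS distance with its topic-document distribution similarity in a different way.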
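[Author affiliations]: Institute of Cyberspace Security, Sichuan University; College of Computer Science, Sichuan University
[Funding]: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan University Young Teachers Start-up Fund (2015SCU11079)
[Classification number]: TP391.1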

Article ID: 2277362


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2277362.html

