Research on Chinese-Vietnamese Bilingual News Topic Discovery
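Keywords: news elements; Hadoop; Chinese-Vietnamese comparable corpus; bilingual word similarity; Chinese-Vietnamese bilingual topics  Source: Kunming University of Science and Technology, 2017 master's thesis  Type: degree thesis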
[Abstract]: With advances in Internet information technology, exchanges between China and Vietnam and neighboring regions in politics, the economy, culture, and other areas have grown increasingly close. News is the main carrier of information exchange between the two countries, so it has become particularly important to discover, promptly and effectively, the news topics that concern both countries and to follow how those topics develop and evolve. Existing work has not fully exploited the associations among news page elements for topic discovery. In addition, Chinese-Vietnamese parallel corpora are scarce, Chinese-Vietnamese bilingual dictionaries are difficult to build, and statistical machine translation is not yet fully mature. To address these problems, this thesis proposes a Chinese news topic discovery method that incorporates the associations among page elements, and a Chinese-Vietnamese cross-language topic discovery method based on word similarity learned from comparable corpora.
(1) News stories within a topic are thematically related, and stories on the same topic also tend to have close publication times, co-occurring entities, and co-occurring event elements. The associations among these elements have an important influence on topic discovery, so a Chinese news topic discovery method that incorporates page element associations is proposed. First, TF-IDF, which is based on term-frequency statistics, is used to compute word-level feature weights and build a vector space representation of each document; cosine similarity between these vectors gives the initial similarity matrix of the news pages. Then the associations among elements in different news documents are used as semi-supervised constraints to correct the initial similarity matrix. The adjusted matrix is clustered with the affinity propagation (AP) algorithm, and news topics are extracted from the resulting document clusters. Finally, comparative experiments show that the method incorporating news element associations performs better than the same method without the constraint information.
(2) A comparable corpus consists of news articles that arise naturally in two different languages over the same period and that are thematically related across the two languages. On this basis, a Chinese-Vietnamese cross-language topic discovery method based on comparable-corpus word similarity is proposed. First, bilingual word vectors are trained on the Chinese-Vietnamese comparable corpus. Using these vectors, the similarity between Chinese query words and Vietnamese words is computed, and Vietnamese candidate expansion words are selected according to the similarity values. Then, based on these similarities, the Chinese news topic is translated into an expanded Vietnamese query. The expanded Vietnamese terms are used to search the Vietnamese corpus and return the Vietnamese documents relevant to the query, and these documents are clustered with the AP algorithm to obtain the Vietnamese events related to the Chinese text. Comparative experiments show that this comparable-corpus-based query translation method performs better than the traditional bilingual LDA method in cross-language topic analysis.
(3) A prototype system for Chinese-Vietnamese bilingual public-opinion topic discovery is designed and implemented. The system makes it quick and convenient to see how China and Southeast Asian countries report on a given news topic, together with the topic details. It provides an experimental platform for further research on this subject and resources for subsequent study of the evolution of Chinese-Vietnamese bilingual news topics.
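The first steps of the first method described in the abstract (TF-IDF vectors, cosine similarity, then correction of the initial similarity matrix by element associations) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions. The abstract does not say how the element associations adjust the matrix, so the rule in `adjust_with_constraints` (add `alpha` times the Jaccard overlap of named entities when two stories are published within `window` days of each other), the parameters `alpha` and `window`, the smoothed IDF formula, and the toy documents are all hypothetical and are not the thesis's actual formulation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) keeps every weight positive.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def initial_similarity(docs):
    """Initial page-to-page similarity matrix from TF-IDF and cosine."""
    vecs = tfidf_vectors(docs)
    return [[cosine(a, b) for b in vecs] for a in vecs]


def adjust_with_constraints(sim, entities, days, alpha=0.3, window=3):
    """Hypothetical correction rule: boost pairs of stories that share
    entities and were published within `window` days of each other."""
    n = len(sim)
    out = [row[:] for row in sim]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            union = entities[i] | entities[j]
            if not union or abs(days[i] - days[j]) > window:
                continue
            overlap = len(entities[i] & entities[j]) / len(union)
            out[i][j] = min(1.0, sim[i][j] + alpha * overlap)
    return out


# Toy, pre-tokenized stories (illustrative data only).
docs = [
    ["summit", "trade", "hanoi", "agreement"],
    ["trade", "agreement", "signed", "hanoi"],
    ["flood", "rescue", "yunnan", "rain"],
    ["rain", "flood", "warning", "yunnan"],
]
entities = [{"Hanoi"}, {"Hanoi"}, {"Yunnan"}, {"Yunnan"}]
days = [0, 1, 10, 11]  # publication day offsets

sim = initial_similarity(docs)
adjusted = adjust_with_constraints(sim, entities, days)
for row in adjusted:
    print([round(x, 3) for x in row])
```

The adjusted matrix could then be passed to an affinity propagation implementation that accepts a precomputed similarity matrix, for example scikit-learn's `AffinityPropagation(affinity="precomputed")`, to form the document clusters from which topics are extracted.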
[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1