Research on English-Chinese Cross-Language Topic Detection and Tracking Technology

Published: 2018-04-22 02:20

Topics: cross-language topic detection + cross-language topic tracking; Source: Minzu University of China, 2013 doctoral dissertation

[Abstract]: Today's world has gradually entered the age of information and digitization. According to CNNIC's 30th survey report①, by the end of June 2012 China had 538 million Internet users and 2.5 million websites; online news had 392 million users, and 73.0% of Internet users read online news. Because online news is simple and quick to publish, the Internet has become the "fourth medium" of news communication. Ordinary readers want to find the news topics that interest them among massive online resources, and they also want to follow news topics from other countries. Cross-language detection and tracking of online news topics has therefore become an area of growing interest among researchers in China and abroad.
Current research on cross-language topic detection and tracking faces several challenging problems. First, the means available for describing the text of online news reports are limited, and describing topics for news reports in a multilingual environment is harder still. Second, cross-language topic detection and tracking must process news reports in a multilingual environment, and bridging the language gap is one of the first technical problems to solve. Third, how to further develop existing techniques and apply them to topic detection and tracking deserves further study. In response to these problems, this dissertation hopes that its study of English-Chinese cross-language topic detection and tracking will make a modest contribution to the development of language processing technology and offer some reference for processing texts in the languages of China's many ethnic groups.
The research comprises five parts: text analysis of cross-language news reports, methods for building cross-language topic models, corpus construction methods, cross-language topic detection, and cross-language topic tracking.
First, the author begins with the essential factors of news reports, analyzing their core elements from two perspectives: how news is cognitively understood, and the inherent characteristics of news itself. The analysis suggests that lexical processing is one effective way to describe a text, and that news elements can also serve as a means of distinguishing report texts.
Second, starting from the "report-topic-event" relationship, the dissertation sets out the basic approach to building a news report model in CLTDT (cross-language topic detection and tracking) research, analyzes the characteristics and shortcomings of commonly used text representation models, and argues that early text representation models lack an in-depth description of the "report-topic-event" relationship. To uncover the topics latent in news text, the dissertation selects the LSI model and the LDA model for text modeling experiments, and compares and analyzes how well the two models describe news report text.
Building on this theoretical analysis and experimental verification, we propose carrying out cross-language topic detection and tracking on the basis of an English-Chinese comparable corpus. Through corpus collection, metadata processing, news event classification, word segmentation and annotation, and named entity annotation, the dissertation attempts to build an "English-Chinese Cross-Language News Report Comparable Corpus". The English and Chinese news report texts in this corpus serve as the basis for detecting and tracking news topics in a cross-language setting.
Drawing on current cross-language processing techniques and research on the LDA model, and in line with the aims of this study, the author proposes a cross-language joint LDA (CLU-LDA) model. The model can perform retrospective event detection on English and Chinese news reports and can also discover new events. In cross-language topic tracking, a prior topic model is used to infer the topics of sample news reports; with existing prior knowledge and the comparable corpus, we can both chart how the topics of a news event develop over time and effectively track specific news reports.
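As a rough illustration of the tracking step, the sketch below infers the topic mixture of a new story under a fixed, previously trained topic-word table and then scores the story against a tracked topic profile. It is not the CLU-LDA model from the dissertation: the inference is a simple EM fold-in (PLSA-style) rather than LDA's Bayesian inference, and the vocabulary, the two topics, their probabilities, the threshold, and the names `infer_topic_mixture` and `cosine` are all hypothetical.

```python
import math

# Hypothetical topic-word probabilities P(word | topic), standing in for a
# topic model trained beforehand on a comparable corpus. Each row sums to 1.
TOPIC_WORD = [
    {"earthquake": 0.5, "rescue": 0.3, "election": 0.1, "vote": 0.1},  # topic 0
    {"earthquake": 0.1, "rescue": 0.1, "election": 0.4, "vote": 0.4},  # topic 1
]


def infer_topic_mixture(tokens, topic_word, iterations=50):
    """Estimate P(topic | story) by EM with the topic-word table held fixed."""
    num_topics = len(topic_word)
    # Ignore words that the prior model has never seen.
    known = [w for w in tokens if any(w in t for t in topic_word)]
    theta = [1.0 / num_topics] * num_topics
    if not known:
        return theta  # nothing to infer from, keep the uniform mixture
    for _ in range(iterations):
        counts = [0.0] * num_topics
        for word in known:
            # E-step: responsibility of each topic for this word.
            weights = [theta[k] * topic_word[k].get(word, 0.0)
                       for k in range(num_topics)]
            total = sum(weights)
            for k in range(num_topics):
                counts[k] += weights[k] / total
        # M-step: renormalise the expected counts into a new mixture.
        theta = [c / len(known) for c in counts]
    return theta


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


tracked_topic = [1.0, 0.0]  # hypothetical profile of the topic being tracked
story = ["earthquake", "rescue", "rescue", "vote"]
theta = infer_topic_mixture(story, TOPIC_WORD)
score = cosine(theta, tracked_topic)
on_topic = score >= 0.8  # the threshold is an arbitrary illustrative value
```

Holding the topic-word table fixed is what makes the topic model "prior" here: only the story's mixture is estimated, so stories arriving at different times are scored against the same topics and the scores can be compared along a time line.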

[Degree-granting institution]: Minzu University of China
[Degree level]: Doctorate
[Year degree granted]: 2013
[Classification number]: H15;H315;H087

Article ID: 1785169

Article link: http://sikaile.net/wenyilunwen/yuyanxuelw/1785169.html
