Research on English-Chinese Cross-Language Topic Detection and Tracking Technology
Published: 2018-04-22 02:20
Topics: cross-language topic detection + cross-language topic tracking; Source: Minzu University of China (中央民族大學), 2013 doctoral dissertation
[Abstract]: The world has gradually entered an era of informatization and digitization. According to the 30th CNNIC survey report, by the end of June 2012 China had 538 million Internet users and 2.5 million websites; online news had 392 million users, and 73.0% of Internet users read online news. Because online news is simple and fast to publish, the Internet has become the "fourth medium" of news communication. Ordinary readers want to find the news topics that interest them in this mass of online resources, and they also want to follow news topics from other countries. Cross-language detection and tracking of online news topics has therefore gradually become a focus of research interest for scholars in China and abroad.
Current research on cross-language topic detection and tracking faces several challenging problems. First, the means available for describing the text of online news reports are limited, and describing news topics in a multilingual environment is harder still. Second, cross-language topic detection and tracking must process news reports in a multilingual environment, so bridging the language gap is one of the first technical problems to be solved. Third, how to further develop existing techniques and apply them to topic detection and tracking deserves further study. In response to these problems, this dissertation hopes that its research on English-Chinese cross-language topic detection and tracking will make a modest contribution to the development of language processing technologies and provide a useful reference for processing texts in China's many ethnic languages.
The research consists of five parts: text analysis of cross-language news reports, methods for building cross-language topic models, corpus construction methods, cross-language topic detection, and cross-language topic tracking.
First, the author starts from the essential factors of news reports and analyzes their core elements from two perspectives: how news is cognitively understood, and the intrinsic characteristics of news itself. The analysis suggests that lexical processing is one effective way to describe a text, and that news elements can also serve as a means of distinguishing between report texts.
Second, starting from the "report-topic-event" relationship, the dissertation sets out the basic approach to building news report models in cross-language topic detection and tracking (CLTDT) research. It analyzes the characteristics and shortcomings of commonly used text representation models and argues that early models lack an in-depth description and characterization of the relationships among reports, topics, and events. To reveal the topics latent in news texts, the dissertation runs text modeling experiments with the LSI model and the LDA model, and uses the experiments to compare and analyze how well each model describes news report text.
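To make the idea of "latent topics" concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the dissertation's implementation: the function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are assumptions chosen for illustration, and a real experiment would more likely rely on an established topic-modeling library.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists.

    Hypothetical helper for illustration only (not the thesis code).
    Returns (theta, topic_word): per-document topic mixtures and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # n(d, k)
    topic_word = [Counter() for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                        # n(k)
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed document-topic mixtures.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word

# Toy usage: two tiny "reports" with disjoint vocabularies.
docs = [["quake", "rescue", "quake"], ["election", "vote", "vote"]]
theta, topic_word = lda_gibbs(docs, num_topics=2, iterations=50)
```

Each row of `theta` is one report's mixture over topics, which is the kind of description the LSI/LDA comparison is concerned with; LSI would instead obtain a low-rank representation from a singular value decomposition of the term-document matrix.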
Building on the theoretical analysis and experimental validation above, we propose conducting cross-language topic detection and tracking on the basis of an English-Chinese comparable corpus. Through corpus collection, metadata processing, news event classification, word segmentation and annotation, and named entity annotation, the dissertation attempts to build an "English-Chinese Cross-Language News Report Comparable Corpus". The English and Chinese news report texts in this corpus serve as the basis for our research on detecting and tracking news topics in a cross-language environment.
Drawing on current cross-language processing techniques and research on the LDA model, and in line with the aims of this study, the author proposes a cross-language joint LDA (CLU-LDA) model. The model can perform retrospective event detection on English and Chinese news reports and can also discover new events. In cross-language topic tracking, a prior topic model is used to infer the topics of sample news reports; with the help of existing prior knowledge and the comparable corpus, we can not only chart how the topics of a news event develop over time but also track specific news reports effectively.
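The tracking step described above, inferring a new report's topics from a previously trained model, can be sketched as follows. The abstract does not specify how CLU-LDA is defined, so this sketch makes its own assumptions: the prior model `phi` is a plain topic-to-word probability table over a joint English-Chinese vocabulary, inference is a simple EM fold-in with `phi` held fixed, and the names `infer_topic_mixture`, `track`, and the 0.5 threshold are hypothetical.

```python
def infer_topic_mixture(tokens, phi, iterations=50, floor=1e-9):
    """Estimate a report's topic mixture with the prior model phi held fixed.

    phi maps topic -> {word: probability}. Words a topic has never seen
    receive a small floor probability (an assumption of this sketch).
    """
    topics = list(phi)
    theta = {k: 1.0 / len(topics) for k in topics}
    if not tokens:
        return theta
    for _ in range(iterations):
        expected = {k: 0.0 for k in topics}
        for w in tokens:
            # E-step: how responsible is each topic for this token?
            scores = {k: theta[k] * max(phi[k].get(w, 0.0), floor) for k in topics}
            total = sum(scores.values())
            for k in topics:
                expected[k] += scores[k] / total
        # M-step: renormalise expected counts into a mixture.
        theta = {k: expected[k] / len(tokens) for k in topics}
    return theta

def track(reports, phi, target, threshold=0.5):
    """Return (timestamp, weight) pairs, in time order, for reports whose
    inferred weight on the target topic reaches the threshold."""
    hits = []
    for timestamp, tokens in sorted(reports, key=lambda r: r[0]):
        weight = infer_topic_mixture(tokens, phi)[target]
        if weight >= threshold:
            hits.append((timestamp, weight))
    return hits

# Toy prior model over a joint English-Chinese vocabulary (illustrative values).
phi = {
    "quake": {"earthquake": 0.5, "地震": 0.5},
    "vote": {"election": 0.5, "選舉": 0.5},
}
reports = [
    ("2012-06-02", ["election", "選舉"]),
    ("2012-06-01", ["地震", "earthquake"]),
]
timeline = track(reports, phi, target="quake")
```

Sorting by timestamp before scoring gives the time-ordered view of a topic's development that the abstract mentions; the per-report weights could equally be plotted rather than thresholded.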
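[Degree-granting institution]: Minzu University of China (中央民族大學)
[Degree level]: Doctoral
[Year degree conferred]: 2013
[Classification number]: H15;H315;H087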
Article ID: 1785169
Article link: http://sikaile.net/wenyilunwen/yuyanxuelw/1785169.html