CRF-Based Extraction of Traffic Information Events from Chinese Microblogs (Weibo)
Published: 2018-09-04 15:22
【Abstract】: Event extraction and tracking is an important research direction in natural language processing. The key problem in this field is how to accurately and efficiently extract event information of interest from large volumes of noisy, unstructured text.

The texts studied in this thesis come from Sina Weibo, the leading Chinese microblogging service. Weibo ("microblog") is a platform for sharing, spreading, and obtaining information based on user relationships, with millions of posts published every day. As an emerging medium it carries a vast amount of information and is an excellent data source for many kinds of big-data research. Posts related to urban traffic frequently mention accidents, congestion, and road construction, and this information tends to be both accurate and timely. By crawling such posts in a targeted way, filtering out noise, and extracting events, we can obtain a real-time information source that covers an entire urban road network.

Standard natural language processing tools, however, perform poorly on Chinese microblog text. This thesis therefore describes a complete system covering microblog crawling, noise removal, topic filtering, sentence segmentation, part-of-speech tagging, named entity recognition, event extraction, and event presentation. The approach combines a conditional random field (CRF) probabilistic model with rule-based regular expressions, and uses Python as the main development language.

Experiments show that the best-performing configuration extracts event elements from microblog text with 83% accuracy, that the text normalization step effectively improves the accuracy of subsequent event extraction, and that the system can display the extracted information in real time.
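To make the CRF-plus-rules approach concrete, here is a minimal, self-contained sketch in Python. It assumes the sklearn-crfsuite library; the BIO tags, feature templates, road-name pattern, and toy sentences are all illustrative, since the abstract does not specify the thesis's actual CRF toolkit, feature set, or rules.

```python
# Illustrative sketch of CRF sequence labeling plus a regex rule for traffic-event
# elements in Weibo text. Assumes sklearn-crfsuite; tags, features, and data are
# toy examples, not the thesis's actual implementation.
import re
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple contextual features for the i-th segmented token (illustrative)."""
    tok = tokens[i]
    return {
        'token': tok,
        'is_digit': tok.isdigit(),
        'prefix1': tok[0],
        'suffix1': tok[-1],
        'prev': tokens[i - 1] if i > 0 else '<BOS>',
        'next': tokens[i + 1] if i < len(tokens) - 1 else '<EOS>',
    }

def sent_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# Toy training data: segmented Weibo sentences with BIO labels for event
# elements (LOC = location, EVT = event trigger). A real system would train
# on an annotated corpus of traffic-related posts.
train_sents = [
    (['延安路', '高架', '发生', '追尾', '事故'],
     ['B-LOC', 'I-LOC', 'O', 'B-EVT', 'I-EVT']),
    (['南京东路', '道路', '施工', '请', '绕行'],
     ['B-LOC', 'O', 'B-EVT', 'O', 'O']),
]
X_train = [sent_features(toks) for toks, _ in train_sents]
y_train = [labels for _, labels in train_sents]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)

# Rule-based complement: one example regular expression that pulls out
# road-like names (e.g. "××路", "××高架") directly from raw text.
ROAD_RE = re.compile(u'[\u4e00-\u9fa5]{1,6}(?:路|街|大道|高架|隧道|大桥)')

test_tokens = ['中山北路', '严重', '拥堵']
print(crf.predict([sent_features(test_tokens)])[0])   # CRF labels (toy model)
print(ROAD_RE.findall(u'中山北路严重拥堵，请绕行'))      # ['中山北路']
```

In a pipeline like the one described, the statistical labels and the rule matches would be merged downstream into structured event records (location, event type, time) before being displayed.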
【Degree-granting institution】: Shanghai Jiao Tong University
【Degree level】: Master's
【Year conferred】: 2014
【Classification codes】: TP391.1; TP393.092
Article ID: 2222569
Link: http://sikaile.net/guanlilunwen/ydhl/2222569.html