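Design and Implementation of Product Name Entity Recognition and Normalization Algorithms for Chinese Weibo
Published: 2018-01-21 17:55
Keywords: Weibo; product name entity recognition; cascaded conditional random fields; word vectors; entity normalization. Source: Beijing Institute of Technology, 2015 master's thesis. Document type: degree thesis.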
[Abstract]: With the growth of the Internet and the rise of social network platforms such as Weibo, users are no longer only readers of information but also publishers of it, and the Internet has shifted from an information publishing platform to a platform for interactive exchange. The massive volume of posts on Weibo platforms such as Sina and Tencent carries great commercial value. As one of the fastest-spreading social media with the largest user bases, Weibo has become an important source of information. Enterprises are paying increasing attention to online marketing, public opinion monitoring, and business intelligence, and accurately recognizing product name entities in massive Weibo data is the foundation and prerequisite for public opinion monitoring and business intelligence. Current approaches to recognizing product name entities in Weibo still rely on methods designed for traditional media text. They overlook Weibo's sparse context, frequent ellipsis, and non-standard expression, which leads to poor recognition performance and serious entity ambiguity. To address these problems, this thesis studies product name entity recognition for Weibo text. The main work and contributions are as follows:
1) A product name entity recognition method based on a cascaded conditional random field model and a product knowledge base is proposed. By introducing a product entity knowledge base with attribute classification, the method improves recognition performance. Experiments show that precision and recall on structurally complex entities improve by 0.6% and 3.2%, respectively.
2) A word-vector-based feature selection method that incorporates global contextual semantic information is proposed. To compensate for the lack of contextual semantic information in Weibo text, it selects features using two approaches, word vectors and word clustering. The word clustering approach lowers the requirements on the training corpus. Experiments show that word vectors and word clustering raise the overall F1 of product name entity recognition by 3.12% and 3.34%, respectively.
3) A product name entity normalization method based on global and local context information and user interaction relationships is proposed. Experiments show that it improves F1 by 6.92% over a knowledge-base-based method.
4) A prototype system for product name entity recognition and normalization on Weibo text is designed and implemented. The system takes into account the precision and recall of recognition and normalization as well as time and space efficiency, and supports two processing modes: one post at a time and batch.
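As an illustration of contributions 1) and 2), the following is a minimal sketch of how knowledge-base and word-cluster signals might be exposed as per-token features for a CRF tagger. The abstract does not give the thesis's actual feature templates, so everything here is an assumption: the knowledge-base entries, the attribute classes (`BRAND`, `SERIES`, `MODEL`), the cluster IDs, and the function names are all hypothetical. The dict-per-token output is a common input format for CRF toolkits, but no particular toolkit is implied.

```python
from typing import Dict, List

# Hypothetical product knowledge base: surface form -> attribute class.
# Entries and class labels are illustrative assumptions, not from the thesis.
PRODUCT_KB: Dict[str, str] = {
    "apple": "BRAND",
    "iphone": "SERIES",
    "6s": "MODEL",
}

# Hypothetical word -> cluster ID map, e.g. obtained by clustering word
# vectors trained on a large unlabeled corpus.
WORD_CLUSTERS: Dict[str, int] = {
    "iphone": 17,
    "galaxy": 17,
    "apple": 4,
    "bought": 52,
}


def kb_class(word: str) -> str:
    """Return the knowledge-base attribute class of a word, or NONE."""
    return PRODUCT_KB.get(word.lower(), "NONE")


def cluster_id(word: str) -> str:
    """Return the word's cluster ID as a string, or -1 if unknown."""
    return str(WORD_CLUSTERS.get(word.lower(), -1))


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Build features for token i from the token and its neighbors."""
    feats = {
        "word": tokens[i].lower(),
        "kb_class": kb_class(tokens[i]),
        "cluster": cluster_id(tokens[i]),
    }
    if i > 0:
        feats["prev_kb_class"] = kb_class(tokens[i - 1])
        feats["prev_cluster"] = cluster_id(tokens[i - 1])
    else:
        feats["BOS"] = "1"  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_kb_class"] = kb_class(tokens[i + 1])
        feats["next_cluster"] = cluster_id(tokens[i + 1])
    else:
        feats["EOS"] = "1"  # end of sentence
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Build one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in sentence_features(["Bought", "Apple", "iPhone", "6s"]):
        print(feats)
```

In this sketch, "iphone" and "galaxy" share a cluster ID, so a tagger trained on one could generalize to the other without having seen it labeled. That is one plausible reading of why word clustering lowers the requirements on the training corpus.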
[Degree-granting institution]: Beijing Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.1