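Sentiment Orientation Analysis of Entity Retrieval Results
Published: 2018-02-12 13:05
Keywords: information retrieval; sentiment analysis; entity retrieval; sentence domain recognition; sentence sentiment classification. Source: Harbin Institute of Technology, master's thesis, 2012. Type: degree thesis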
[Abstract]: With the rapid growth of forums and other Internet communities, more and more users take part in building the Internet and contribute data to it. A large share of this data consists of comments on people and events and carries users' opinions and attitudes. Browsing it helps users understand what the public thinks about the things they care about. The volume of sentiment information on the Internet is enormous, however, and is hard to collect and organize by hand. Search engines are the main way people obtain information, but they focus on documents that are factually relevant and ignore the sentiment those documents contain. This thesis therefore combines sentiment analysis with search: when the query submitted to the search engine is an entity, it takes the search engine's results as the object of study and analyzes the sentiment orientation that sentences containing the entity express toward it. The results can support tasks such as sentiment retrieval and information filtering and have considerable practical value. The entities studied are digital products, people, institutions, and policies and regulations.
First, the thesis proposes a method for entity-relevant sentence recognition. The method uses an SVM classifier with features such as the dependency-syntax path from the entity to the evaluative word, and selects from the sentences containing the entity those that are genuinely about it, that is, sentences whose evaluation target is the entity. It raises the proportion of relevant sentences from 77.5% (without entity-relevant sentence recognition) to 85.85%.
Second, the thesis proposes a sentence domain recognition method based on context expansion. The sentence containing the entity, together with the two sentences before it and the two after it, is treated as a single unit; this unit represents the sentence and is what gets classified. Expanding the content of the sentence to be classified alleviates the data-sparsity problem to some extent. Compared with classifying the entity-bearing sentence directly, the method significantly improves classification accuracy, although recognition of policies and regulations and of institutions remains poor. Analysis shows that the feature distributions of these two categories are extremely similar, which accounts for their weaker recognition performance.
Finally, the thesis classifies the sentiment of sentences containing the entity into three classes: positive, negative, and objective. It uses an SVM classifier with two kinds of features, evaluative words and unigrams, and applies information gain to select unigram features. Experiments show that using both kinds of features together works better than using either one alone. In addition, an analysis of how the unigram feature dimension affects sentiment classification shows that accuracy saturates quickly as the dimension grows, which indicates that feature selection is essential for sentence-level sentiment classification.
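Two of the steps described in the abstract can be illustrated with a short sketch: representing an entity-bearing sentence by itself plus two sentences on each side, and ranking unigram features by information gain. This is not the thesis's implementation. The function names (`expand_context`, `information_gain`, `select_unigrams`) are hypothetical, the input is assumed to be already split into sentences and tokens, the information-gain formula is the standard one for a binary term-presence feature, and no SVM is trained here.

```python
import math
from collections import Counter


def expand_context(sentences, index, window=2):
    """Join the sentence at `index` with up to `window` neighbours on each side.

    Assumption: `sentences` is an already-segmented list of strings.
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    return " ".join(sentences[start:end])


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'.

    IG(t) = H(C) - [P(t) * H(C | t) + P(not t) * H(C | not t)]
    Assumption: `docs` is a non-empty list of token sets aligned with `labels`.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_unigrams(docs, labels, k):
    """Return the k unigrams with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    sents = ["s0", "s1", "s2", "s3", "s4", "s5"]
    print(expand_context(sents, 3))  # the unit that would be classified

    # Toy, made-up data purely for illustration.
    docs = [{"good", "phone"}, {"good", "camera"},
            {"bad", "phone"}, {"bad", "camera"}]
    labels = ["pos", "pos", "neg", "neg"]
    print(select_unigrams(docs, labels, 2))
```

In a full pipeline the expanded unit would feed the domain classifier, and the selected unigrams, together with evaluative-word features, would form the input vector of the sentiment classifier. Keeping only the top-ranked unigrams reflects the abstract's observation that accuracy saturates quickly as the feature dimension grows.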
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.3
Article ID: 1505701
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1505701.html