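Research on Text Classification and Evaluation Object Extraction Based on Title and Body Text
Thesis topic: topic model  Entry point: text classification  Source: Anhui University, master's thesis, 2017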
[Abstract]: With the development of society, information on the Internet is growing explosively. Observation of the texts submitted by netizens shows that on most websites, especially news and government websites, text is structured and usually consists of a title and a body. The body is usually a detailed description of an event: it carries rich semantic information, but it also covers diverse topics and contains heavy noise. The title is usually a concise summary of the event, with accurate information and clear semantics, so making full use of the title is worthwhile. This thesis exploits these properties of titles and proposes a topic model based on title and body for text classification. Because titles are short and syntactically simple, the evaluation objects in titles can also be extracted effectively using rules and syntactic dependency relations. The main work of this thesis is as follows:
(1) Exploiting the fact that a document has both a title and a body, a topic model based on title and body is proposed. The model obtains the topic distribution of the body and the topic distribution of the title, and uses a tuning parameter to optimize the topic distribution of the whole document. Because the title is concise and has a clear topic, it reduces the effect that the body's semantic complexity and topic diversity have on classification. The resulting document-level topic distribution improves the accuracy of text classification.
(2) Because titles are concise and have a clear topic, syntactic dependency relations are used to obtain the evaluation objects in titles. Candidate evaluation objects are first obtained from the title using rules and part-of-speech tagging. Owing to the particular nature of the title corpus, candidate evaluation objects depend strongly on verbs, so a verb dictionary is constructed. Using the position at which a dictionary verb appears in the syntactic parse tree, the whole tree is traversed to pick the true evaluation object of the title from among the candidates.
(3) The corpus comes from a city's "Government Through Train" website, which handles problems faced by local residents, so the texts contain a large number of named entities specific to that city. To reduce the effect of these words on word segmentation and dependency parsing, a large number of local residential community names, road names, and bus and metro names are added as a user dictionary. Because word segmentation is then fairly accurate, good results are obtained on both the text classification task and the evaluation object extraction task.
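The abstract says only that a tuning parameter is used to combine the title and body topic distributions; it does not give the formula. As a minimal sketch, assuming a simple linear interpolation (an assumption, not necessarily the thesis's method), the combination could look like the following. The function name `combine_topic_distributions` and the parameter `lam` are hypothetical names introduced for illustration.

```python
from typing import List, Sequence


def combine_topic_distributions(
    title_theta: Sequence[float],
    body_theta: Sequence[float],
    lam: float,
) -> List[float]:
    """Mix a title and a body topic distribution into one document distribution.

    ASSUMPTION: linear interpolation with weight `lam` on the title.
    The source only states that a tuning parameter is used.
    """
    if len(title_theta) != len(body_theta):
        raise ValueError("both distributions must cover the same topics")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    # Weighted sum per topic: lam favours the title, (1 - lam) the body.
    mixed = [lam * t + (1.0 - lam) * b for t, b in zip(title_theta, body_theta)]

    # Renormalise so the result is a proper probability distribution
    # even if the inputs carry small rounding errors.
    total = sum(mixed)
    if total <= 0.0:
        raise ValueError("distributions must have positive total mass")
    return [m / total for m in mixed]


if __name__ == "__main__":
    title = [0.8, 0.1, 0.1]  # title is sharply focused on topic 0
    body = [0.3, 0.4, 0.3]   # body is spread over several topics
    doc = combine_topic_distributions(title, body, lam=0.5)
    print(doc)
```

In this toy case the body alone would point to topic 1, while the mixed distribution points to topic 0, which illustrates how a focused title can offset a noisy body. In practice `lam` would be tuned on held-out data, and the mixed vector would be passed to a classifier as the document's feature vector.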
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1