Research on Text Classification and Opinion Target Extraction Methods Based on Title and Body Text

Published: 2018-04-01 13:37

Topic: topic model　Focus: text classification　Source: Anhui University, 2017 master's thesis

[Abstract]: As society develops, the volume of information on the Internet has grown explosively. Inspection of text submitted by Internet users shows that on most websites, especially news and government sites, text is structured and usually consists of a title and a body. The body is usually a detailed description of an event: it is semantically rich, but it also covers diverse topics and contains a great deal of noise. The title is usually a concise summary of the event, accurate in what it conveys and semantically clear, so making full use of title information is worthwhile. Exploiting these properties of titles, this thesis proposes a topic model based on title and body and applies it to text classification. Because titles are short and syntactically simple, the thesis also uses rules and syntactic dependency relations to extract the opinion targets (evaluation objects) in titles. The main work is as follows:

(1) Using the fact that a document has both a title and a body, the thesis proposes a topic model based on title and body. The model obtains a topic distribution for the document body and a topic distribution for the title, and uses a tuning parameter to optimize the topic distribution of the whole document. Because titles are concise and clearly focused, using them reduces the effect of the body's semantic complexity and topic diversity on text classification. This yields an optimal topic distribution for the whole document, which in turn improves classification accuracy.

(2) Because titles are concise and clearly focused, syntactic dependency relations are used to obtain the opinion targets in titles. Candidate opinion targets are first obtained from the title using rules and part-of-speech tagging. Owing to the nature of the title corpus, candidate targets depend strongly on verbs, so the thesis builds a verb dictionary. Starting from the position of a dictionary verb in the syntactic parse tree, the whole tree is traversed to find the true opinion targets among the candidates.

(3) The corpus comes from a city's government "through-train" website, which handles problems faced by local residents, so the text contains many locally specific named entities. To reduce the effect of these terms on word segmentation and dependency parsing, the thesis adds a large number of local residential community names, road names, and bus and subway names as a user dictionary. Because segmentation is then more accurate, good results are obtained on both text classification and opinion target extraction.
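The abstract says only that a tuning parameter balances the title and body topic distributions; it does not give the formula. One simple reading is a linear interpolation of the two distributions, sketched below. The function name `mix_topic_distributions`, the parameter `lam`, and the example numbers are illustrative assumptions, and the thesis may instead build the parameter into the generative model itself.

```python
from typing import List, Sequence


def mix_topic_distributions(
    title_theta: Sequence[float],
    body_theta: Sequence[float],
    lam: float,
) -> List[float]:
    """Interpolate title and body topic distributions (assumed form).

    lam is the weight given to the title; (1 - lam) goes to the body.
    """
    if len(title_theta) != len(body_theta):
        raise ValueError("both distributions must cover the same topics")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = [lam * t + (1.0 - lam) * b for t, b in zip(title_theta, body_theta)]
    total = sum(mixed)
    # Renormalise so small rounding errors in the inputs do not accumulate.
    return [m / total for m in mixed]


# Hypothetical 3-topic example: the title is sharply focused on topic 0,
# while the body spreads its mass over all three topics.
title_theta = [0.8, 0.1, 0.1]
body_theta = [0.3, 0.4, 0.3]
doc_theta = mix_topic_distributions(title_theta, body_theta, lam=0.6)
dominant_topic = max(range(len(doc_theta)), key=doc_theta.__getitem__)
```

In this toy case the body alone would favour topic 1, while the mixed distribution favours topic 0, which illustrates how a focused title can offset a noisy body. The resulting vector could then serve as the feature vector for any standard classifier.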
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1
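The opinion target extraction step described in the abstract (candidate nouns from part-of-speech tags, then filtering with a verb dictionary over the dependency tree) can be sketched as follows. This is a simplified illustration, not the thesis's actual rules: the `Token` structure, the tag names `"n"` and `"v"`, the rule "keep a noun whose head is a dictionary verb", and the English toy title are all assumptions. In practice the parse would come from a Chinese dependency parser.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Token:
    idx: int    # 1-based position in the title
    word: str
    pos: str    # coarse POS tag: "n" noun, "v" verb, "p" preposition, ...
    head: int   # idx of the syntactic head, 0 for the root


def extract_targets(tokens: Iterable[Token], verb_dict: Set[str]) -> List[str]:
    """Keep candidate nouns that depend directly on a dictionary verb."""
    tokens = list(tokens)
    by_idx = {t.idx: t for t in tokens}
    targets = []
    for tok in tokens:
        if tok.pos != "n":          # step 1: candidates are nouns
            continue
        head = by_idx.get(tok.head)  # step 2: look at the governing word
        if head is not None and head.pos == "v" and head.word in verb_dict:
            targets.append(tok.word)
    return targets


# Hypothetical parsed title: "repair streetlight near school"
title = [
    Token(1, "repair", "v", 0),
    Token(2, "streetlight", "n", 1),
    Token(3, "near", "p", 2),
    Token(4, "school", "n", 3),
]
verb_dict = {"repair", "fix"}   # stand-in for the thesis's verb dictionary
targets = extract_targets(title, verb_dict)
```

Both "streetlight" and "school" are candidates by part of speech, but only "streetlight" is governed by a dictionary verb, so it alone is kept. A fuller traversal could also follow longer head chains or specific dependency labels.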

Article ID: 1695848

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1695848.html
