Research on Topic Analysis and Text Generation Techniques for College Entrance Examination (Gaokao) Essays
Published: 2018-03-14 19:44
Topic: text tag recommendation | Angle: deep neural networks | Source: Harbin Institute of Technology, 2017 master's thesis | Document type: degree thesis
[Abstract]: With the rapid development of artificial intelligence in recent years, there is growing interest in testing what level of intelligence machines can reach. To this end, China launched the "Gaokao answering robot" research program in 2015, and automatically answering Gaokao (college entrance examination) essay questions is one of its key research topics. Within this program, we study two problems in depth: essay-prompt topic analysis and text generation.

Essay-prompt topic analysis takes an essay prompt and distills from it a set of topic words that pin down what the essay should be about. Rule matching and keyword extraction handle roughly 40% of essay prompts; we cast the analysis of the remaining prompts as a special text tag recommendation task, which is the main focus of the topic-analysis part of this work. Given the particularities of the task, we propose a hierarchical deep neural network model: a GRU or CNN first learns a vector representation for each sentence, a sentence-level GRU then composes these sentence vectors into a document vector, and the document vector is fed to a logistic regression layer that predicts a confidence score for every candidate tag word. Experiments show that, given sufficient training data, the hierarchical model outperforms the competing methods, raising F1 by up to 8 percentage points. The model does, however, require a fairly large training corpus, and large corpora are time-consuming and laborious to obtain. We therefore further combine deep neural networks with transfer learning: the network is first trained on a source domain and then trained again on the target domain, so that knowledge learned in the source domain aids learning in the target domain. Experiments on two datasets show that the transfer learning approach significantly outperforms purely supervised learning, improving F1 by up to 7 percentage points on the Douban dataset and P@3 by up to 31.4 percentage points on the essay-prompt dataset.

For text generation, we focus on paragraph-level generation under multiple topics: the model is controlled by several topic words and must generate a passage that expresses the semantics of all of them. We propose a Coverage-based LSTM model that maintains a multi-topic coverage vector; the vector learns a weight for each topic word, is continually updated during generation, and is fed into the attention network to guide generation. We also automatically construct two paragraph-level Chinese corpora, containing 305,000 essay paragraphs and 56,621 Zhihu texts. Experiments show that our model outperforms the baseline models on BLEU, and human evaluation shows that the Coverage-based LSTM can generate text that is coherent and relevant to the input topic words.
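The abstract does not detail the rule-matching step, but many Gaokao prompts state their topic explicitly (e.g. 以“诚信”为话题, "take integrity as the topic"), so a handful of regular expressions can recover the topic word directly. The sketch below is purely illustrative; the patterns and helper are hypothetical, not the thesis's actual rule set.

```python
# Purely illustrative rule matcher: everything here is a hypothetical example,
# since the thesis's actual rule set is not given in the abstract.
import re

PATTERNS = [
    re.compile(r'以["“「]?(?P<topic>[^"”」，。]+?)["”」]?为(?:话题|题|标题)'),
    re.compile(r'围绕["“「]?(?P<topic>[^"”」，。]+?)["”」]?(?:写|展开)'),
]

def match_topic(prompt):
    """Return an explicitly stated topic word, or None if no rule fires."""
    for pattern in PATTERNS:
        m = pattern.search(prompt)
        if m:
            return m.group("topic")
    return None

print(match_topic('请以"诚信"为话题，写一篇不少于800字的文章。'))  # -> 诚信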
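A minimal sketch of the hierarchical tagging model as described: a word-level GRU encodes each sentence, a sentence-level GRU composes the sentence vectors into a document vector, and a logistic-regression head scores every candidate tag. Written in PyTorch; class names, dimensions, and the single-layer GRUs are assumptions, since the abstract specifies only the overall structure.

```python
import torch
import torch.nn as nn

class HierarchicalTagger(nn.Module):
    """Word-level GRU -> sentence vectors -> sentence-level GRU -> tag scores."""
    def __init__(self, vocab_size, emb_dim, hid_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.sent_gru = nn.GRU(hid_dim, hid_dim, batch_first=True)
        # "Logistic regression" head: an independent confidence per candidate tag.
        self.classifier = nn.Linear(hid_dim, num_tags)

    def forward(self, docs):
        # docs: (batch, num_sents, num_words) token ids.
        b, s, w = docs.shape
        words = self.embed(docs.view(b * s, w))        # (b*s, w, emb_dim)
        _, sent = self.word_gru(words)                 # final state: (1, b*s, hid)
        sents = sent.squeeze(0).view(b, s, -1)         # (b, s, hid)
        _, doc = self.sent_gru(sents)                  # (1, b, hid)
        return torch.sigmoid(self.classifier(doc.squeeze(0)))  # (b, num_tags)

model = HierarchicalTagger(vocab_size=50000, emb_dim=128, hid_dim=256, num_tags=100)
scores = model(torch.randint(0, 50000, (2, 10, 20)))   # 2 docs, 10 sents, 20 words
```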
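The transfer scheme described above (train on the source domain, then train again on the target domain) can be read as pretraining followed by fine-tuning, sketched below with the HierarchicalTagger from the previous snippet. The Adam optimizer, BCE loss, learning rates, and the choice to swap only the output layer are all assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def fit(model, batches, lr, epochs):
    """Multi-label training loop; batches yield (docs, multi-hot tag targets)."""
    opt = optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(epochs):
        for docs, tags in batches:
            opt.zero_grad()
            bce(model(docs), tags).backward()
            opt.step()

# Random stand-ins for the real source/target corpora (shapes only).
src = [(torch.randint(0, 50000, (4, 10, 20)), torch.rand(4, 100).round())]
tgt = [(torch.randint(0, 50000, (4, 10, 20)), torch.rand(4, 30).round())]

model = HierarchicalTagger(50000, 128, 256, num_tags=100)
fit(model, src, lr=1e-3, epochs=5)       # 1) train on the large source domain
model.classifier = nn.Linear(256, 30)    # 2) new head for the target tag set
fit(model, tgt, lr=1e-4, epochs=5)       # 3) train again on the target domain
```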
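One plausible reading of the Coverage-based LSTM's attention component: keep a coverage vector with one entry per topic word, add each step's attention weights into it, and feed it back into the attention scorer so generation is steered toward under-covered topics. This is a hedged reconstruction; the exact scoring function and update rule in the thesis may differ.

```python
import torch
import torch.nn as nn

class CoverageAttention(nn.Module):
    """Attention over K topic-word embeddings, conditioned on a coverage
    vector that accumulates the attention already spent on each topic."""
    def __init__(self, hid_dim, emb_dim):
        super().__init__()
        self.score = nn.Linear(hid_dim + emb_dim + 1, 1)

    def forward(self, dec_state, topics, coverage):
        # dec_state: (b, hid), topics: (b, K, emb), coverage: (b, K)
        K = topics.size(1)
        h = dec_state.unsqueeze(1).expand(-1, K, -1)             # (b, K, hid)
        feats = torch.cat([h, topics, coverage.unsqueeze(-1)], dim=-1)
        attn = torch.softmax(self.score(feats).squeeze(-1), -1)  # (b, K)
        coverage = coverage + attn    # update: record attention spent per topic
        context = (attn.unsqueeze(-1) * topics).sum(dim=1)       # (b, emb)
        return context, coverage

# One decoding step: the context vector would be concatenated with the word
# embedding and fed to the decoder LSTM; coverage is carried to the next step.
att = CoverageAttention(hid_dim=256, emb_dim=128)
cov = torch.zeros(2, 5)               # 5 topic words, nothing covered yet
ctx, cov = att(torch.rand(2, 256), torch.rand(2, 5, 128), cov)
```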
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2017
[CLC classification]: TP391.1
Article ID: 1612635
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1612635.html