Multi-Label Text Classification Based on Long Short-Term Memory Networks
Published: 2018-09-19 16:48
[Abstract]: Classification has long been a core problem in artificial intelligence. As text content grows richer, its semantics become multi-faceted and multi-label, and multi-label text classification has become important for automatically indexing and managing this content. Although text classification has been studied extensively, the complexity of the multi-label problem grows exponentially with the number of labels, so traditional techniques no longer meet the need. This thesis therefore studies multi-label text classification. The main work is as follows: (1) It analyzes the shortcomings of traditional algorithms and proposes a hierarchical long short-term memory (LSTM) network model based on word vectors, which models text at the sentence level and at the document level to obtain a vector representation of the whole document. (2) Building on this model, it proposes two strategies for multi-label classification. The first ranks labels with multinomial logistic regression and then applies dynamic threshold adjustment to obtain the predictions. The second uses the structure among labels to build a label tree, trains multiple classifiers to make joint predictions on the tree, and proposes several joint prediction criteria. (3) On a New York Times news dataset, it designs comparative experiments that compare the algorithm with baseline models on several metrics, along with further experiments on how different prediction criteria affect model performance when predicting jointly on the label tree.
The main contributions of this thesis are: (1) Combining word-vector features with text structure features, a hierarchical LSTM network is proposed to learn vector representations of documents, and multinomial logistic regression together with least-squares-based dynamic threshold adjustment is used to rank and predict labels. Experiments show that this strategy greatly improves classification performance over the baseline model (subset accuracy up 38%, F1 score up 23%). (2) The structure among labels is used to build a label tree, with a classifier trained for every internal node. The outputs of the internal-node classifiers define different schemes for weighting the edges of the tree, and A* search for the shortest path on the weighted label tree implements the different joint prediction criteria. Experiments show that this strategy yields a further significant improvement over the previous model (subset accuracy up 12%, F1 score up 2.5%).
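The first strategy in the abstract (rank the labels, then cut the ranking with a dynamically adjusted threshold fitted by least squares) can be illustrated with a minimal sketch. The abstract does not give the details, so everything specific here is an assumption: the per-document target threshold is taken to be the cut point that misclassifies the fewest labels, the threshold is modelled as a linear function of the label scores, and the function names `target_thresholds`, `fit_threshold_model`, and `predict_labels` are hypothetical.

```python
import numpy as np


def target_thresholds(scores, labels):
    """Assumed target: for each document, the cut point with the fewest label errors."""
    targets = np.empty(scores.shape[0])
    for i in range(scores.shape[0]):
        s = np.sort(scores[i])
        # Candidate cuts: below all scores, between neighbours, above all scores.
        candidates = np.concatenate(
            ([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0])
        )
        truth = labels[i].astype(bool)
        errors = [np.sum((scores[i] > t) != truth) for t in candidates]
        targets[i] = candidates[int(np.argmin(errors))]
    return targets


def fit_threshold_model(scores, labels):
    """Fit threshold = [scores, 1] @ w by ordinary least squares."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X, target_thresholds(scores, labels), rcond=None)
    return w


def predict_labels(scores, w):
    """Predict every label whose score exceeds the document's fitted threshold."""
    X = np.hstack([scores, np.ones((scores.shape[0], 1))])
    thresholds = X @ w
    return (scores > thresholds[:, None]).astype(int)


# Toy label scores, standing in for the output of the ranking model.
scores = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.8, 0.7],
                   [0.6, 0.5, 0.1]])
labels = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0]])

w = fit_threshold_model(scores, labels)
predicted = predict_labels(scores, w)
```

A per-document threshold lets the number of predicted labels vary from one document to the next, which a single global cut-off cannot do.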
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1
Article ID: 2250729
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2250729.html