
Semantics-Based Trend Analysis of Internet Buzzwords

Published: 2018-11-03 16:56
[Abstract]: In natural language processing, building computable semantic features for words and texts underlies most tasks. This thesis proposes a method for computing the semantic similarity of words that draws on prior knowledge from outside the text to improve model accuracy when features are sparse. It also combines word semantic similarity with LDA (Latent Dirichlet Allocation) to define a semantic distance between texts, and uses K-Means clustering to extract the events in a corpus. Both methods use external knowledge to improve how words and texts are vectorized, and so improve vector-based similarity computation. The two main parts of the thesis are as follows.

Improved word semantic similarity computation: vectorization is what makes the semantics of a word computable. This thesis proposes an improved method for computing word semantic vectors that incorporates word relations. Following the idea of Word2Vec, the model predicts the context words from the current word and, at the same time, predicts the word's adjacent positions in the word relations. A word is passed through an encoding matrix to obtain its semantic vector, which is then passed through a decoding matrix to obtain predictions of sparse features such as context words and word relations. The model is adjusted iteratively using the gradient of the error with respect to the model parameters, which finally yields the mapping from words to semantic vectors. By adding an extra word relation network, the method alleviates the feature sparsity of the text itself and improves the accuracy of word semantic similarity computation.

Improved event discovery based on LDA: LDA-based event discovery obtains the topic-word vector of each text from an LDA model and forms text clusters by clustering on the cosine distance between those topic-word vectors.
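The word-vector model described in the abstract (an encoding matrix, a decoding step, and gradient-based updates, with relation neighbours predicted alongside context words) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the thesis's implementation: the function names, the use of two separate decoding matrices, the full softmax in place of whatever output approximation the thesis uses, and the toy index pairs.

```python
import math
import random


def train_word_vectors(context_pairs, relation_pairs, vocab_size,
                       dim=8, epochs=200, lr=0.1, seed=0):
    """Learn word vectors from (word, context) and (word, related word) index pairs.

    Returns the encoding matrix and the summed loss of each epoch.
    """
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                for _ in range(vocab_size)]

    enc = rand_matrix()      # encoding matrix: row i is the vector of word i
    dec_ctx = rand_matrix()  # decoding matrix for context-word prediction
    dec_rel = rand_matrix()  # decoding matrix for relation-neighbour prediction

    def sgd_step(word, target, dec):
        """One softmax cross-entropy update; returns the loss before the update."""
        h = enc[word]
        scores = [sum(a * b for a, b in zip(h, row)) for row in dec]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        grad_h = [0.0] * dim
        for j, row in enumerate(dec):
            err = probs[j] - (1.0 if j == target else 0.0)
            for k in range(dim):
                grad_h[k] += err * row[k]   # read the decoder weight before updating it
                row[k] -= lr * err * h[k]
        for k in range(dim):
            h[k] -= lr * grad_h[k]
        return -math.log(probs[target])

    history = []
    for _ in range(epochs):
        loss = 0.0
        for word, ctx in context_pairs:      # predict context words
            loss += sgd_step(word, ctx, dec_ctx)
        for word, rel in relation_pairs:     # predict neighbours in the relation network
            loss += sgd_step(word, rel, dec_rel)
        history.append(loss)
    return enc, history


def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data (hypothetical): four word indices, a few co-occurrences and one relation edge.
context_pairs = [(0, 2), (2, 0), (1, 3), (3, 1)]
relation_pairs = [(0, 1), (1, 0)]
vectors, losses = train_word_vectors(context_pairs, relation_pairs, vocab_size=4)
```

Sharing one encoding matrix across both prediction tasks is what lets the relation edges shape the word vectors even when the text itself offers few co-occurrences. `cosine(vectors[i], vectors[j])` then gives a similarity score for any two words.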
This thesis proposes a text semantic distance that fuses word semantic similarity with the frequency-domain features of words, and uses it to improve the LDA-based event discovery algorithm. First, the texts are split by time window and LDA is run on each window to obtain topic-word vectors; K-Means clustering under the distance that incorporates word semantic similarity then yields events at time-window granularity. Next, these window-level events are merged according to the word-frequency features of their topic words, which gives the final events. By incorporating word semantic similarity information from additional text, the method improves the accuracy of event discovery on short texts.

In controlled experiments against the comparison methods, the proposed methods show some improvement in accuracy. Because the model places no special requirements on the format or quantity of the relation data, it is also fairly general and extensible. The contributions of the thesis are:
1) Multiple kinds of relations between words and other elements are expressed through a matrix representation of vectors and local dot products, and the vector representation of words is learned by gradient descent.
2) The distance between topic vectors is redefined by combining word semantic similarity with word-frequency information, which improves event clustering.
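The clustering step, in which K-Means runs under a distance that takes word semantic similarity into account, can be sketched as follows. The abstract does not give the thesis's distance formula, so this sketch swaps in the soft cosine measure as a stand-in. The function names, the farthest-first initialisation, and the toy similarity matrix are all assumptions. The time-window split, the LDA step, and the frequency-based merging of window-level events are left out.

```python
import math


def soft_cosine_distance(a, b, sim):
    """1 - soft cosine similarity of topic-word vectors a and b.

    sim[i][j] is the semantic similarity of words i and j (assumed symmetric).
    """
    def bilinear(x, y):
        return sum(x[i] * sim[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))

    norm_sq = bilinear(a, a) * bilinear(b, b)
    if norm_sq <= 0.0:
        return 1.0  # treat degenerate vectors as maximally distant
    return 1.0 - bilinear(a, b) / math.sqrt(norm_sq)


def kmeans(vectors, k, sim, max_iter=50):
    """K-Means-style clustering with soft cosine distance; returns one label per vector."""
    # Farthest-first initialisation keeps the sketch deterministic.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(
            soft_cosine_distance(v, c, sim) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: soft_cosine_distance(
            v, centroids[c], sim)) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data (hypothetical): words 0/1 are near-synonyms, and so are words 2/3.
word_sim = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
# Each short text puts all of its topic weight on a single word.
topic_vectors = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
labels = kmeans(topic_vectors, 2, word_sim)
print(labels)  # [0, 1, 0, 1]
```

Under plain cosine distance these four vectors are mutually orthogonal, so no pair is closer than any other. The similarity matrix is what pulls texts that use near-synonymous words into the same cluster, which is the short-text benefit the abstract describes.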
[Degree-granting institution]: North China University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1


Article ID: 2308382



Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2308382.html

