

Research on Semi-Supervised Text Clustering Algorithms for Personalized Topics

Published: 2018-07-25 10:25
[Abstract]: As Internet access spreads worldwide and the number of users keeps rising, the data accumulated online is growing exponentially, and a considerable portion of it is text. How to analyze this text effectively and mine valuable information from it has become an active research question. Semi-supervised text clustering, one of the important text-analysis techniques in data mining, uses a small amount of supervision information to improve clustering performance and has therefore attracted wide attention. Most existing semi-supervised text clustering algorithms, however, ignore the individual user's intent or make poor use of it, so they cannot produce a personalized partition of the texts. In other cases the supervision information takes a form that is hard for users to supply, which sharply limits where the algorithm can be applied. In addition, in practice the supervision information a user can provide is very sparse relative to the size of the text collection, so its influence on the clustering process is also limited.
Based on an analysis of the research background of semi-supervised text clustering and of the problems in existing semi-supervised clustering algorithms, the contributions of this thesis are as follows. (1) It proposes a new format for supervision information: keywords marked as "interested" or "not interested". This format is easy for users to provide, and to some extent it addresses both how user preferences are expressed and what form the supervision information takes. (2) Starting from the limited supervision information supplied by the user, and using the distribution of words in the texts and in the latent topics, it learns and expands the supervision information to address its scarcity.
LDA performs well on clustering problems and can uncover the latent topics shared among texts. The thesis therefore introduces LDA into semi-supervised text clustering and uses an urn model to simulate the clustering process combined with the new form of supervision information. Building on the proposed supervision format and expanding it through the word distribution, the thesis proposes an extensible semi-supervised text clustering algorithm based on user preference (extended LDA, ex LDA). To verify the algorithm, five groups of real data are selected from different perspectives of the 20-newsgroups news data set. The experiments first analyze the reasonableness and effectiveness of the supervision information with respect to its form, and then examine how expanding the supervision information affects the clustering results. On these real data sets, the proposed ex LDA algorithm achieves higher accuracy than both traditional and recent semi-supervised text clustering algorithms, while also producing a text partition that reflects the user's preferences.
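The abstract does not spell out how ex LDA expands the supervision information, so the following is only an illustrative sketch of the supervision format it describes: a few "interested" and "not interested" keywords that are expanded with related words and then used to score documents. The co-occurrence heuristic here is a simple stand-in for the thesis's LDA-based expansion, not the author's method, and every name in it (`expand_keywords`, `preference_score`, the toy documents) is hypothetical.

```python
from collections import Counter


def expand_keywords(docs, seeds, exclude=frozenset(), top_n=1):
    """Return seeds plus the top_n words that most often co-occur with a seed.

    Assumption: co-occurrence is counted at document level; this replaces
    the topic-word distributions that an LDA-based method would use.
    """
    counts = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        if words & seeds:
            # Count each other word once per document that contains a seed.
            counts.update(words - seeds - exclude)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return set(seeds) | {word for word, _ in ranked[:top_n]}


def preference_score(doc, interested, not_interested):
    """Count interested-keyword hits minus not-interested-keyword hits."""
    words = doc.lower().split()
    hits = sum(w in interested for w in words)
    misses = sum(w in not_interested for w in words)
    return hits - misses


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "nasa launches rocket into orbit",
    "rocket engine test orbit",
    "hockey team wins game",
    "hockey game tonight",
    "orbit satellite launch",
]

# The user supplies one keyword of each kind.
interested = expand_keywords(docs, {"rocket"})
not_interested = expand_keywords(docs, {"hockey"}, exclude=interested)

for doc in docs:
    print(preference_score(doc, interested, not_interested), doc)
```

In this toy run the last document never mentions the user's original keyword, yet it receives a positive score because the expanded keyword set picks up a word that co-occurs with the seed. That is the intuition behind expanding sparse supervision before clustering.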
[Degree-granting institution]: Guizhou University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1


Article ID: 2143531


Link to this article: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2143531.html

