
Research on Automatic Summary Generation for Private Weibo

Published: 2018-06-15 16:37

  Topics: private Weibo + automatic summarization; Source: Inner Mongolia University of Science and Technology, 2014 master's thesis


[Abstract]: Since 2007, microblogging (Weibo) has become popular worldwide as a form of communication. Its low barrier to entry, timely interaction, and convenient publishing have helped it spread and develop globally. In recent years Weibo has grown strongly and has become an indispensable part of people's lives. In China the number of Weibo users has surged, and as many as hundreds of millions of posts are published every day, producing a large volume of Weibo data. Most Weibo content is casual, heavily commented, and highly colloquial. Finding posts that match personal interests and carry useful information within this vast and varied body of data has become a major problem accompanying Weibo's growth.

This thesis uses Sina Weibo as its data source and takes as its unit of study all posts published by a personal Weibo account within a historical time period. Drawing on a study of automatic summarization techniques and the characteristics of Weibo data, together with a discussion of text representation and clustering algorithms, it designs and implements a complete system that runs from data acquisition through data processing to the final automatic summary. The process consists of the following steps: data acquisition, data preprocessing, text representation, feature selection, an improved similarity calculation, an improved clustering algorithm and its implementation, and generation of the combined automatic summary. The main work of the thesis is as follows.

First, raw Weibo data is obtained through the Sina Weibo open platform.

Second, the Weibo data is analyzed and, in line with the characteristics of private Weibo text, the Weibo data and comment content are merged into pseudo-documents, which then go through word segmentation and a series of other preprocessing steps. Next, the segmented text is converted into a data format. The text model transforms the data from textual form into a mathematical representation that reflects the relationships among the data.
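
The pseudo-document, text-representation, and similarity steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not name a weighting scheme or a similarity measure, so TF-IDF and cosine similarity are assumed here; tokens are assumed to be already word-segmented; and every function name is hypothetical.

```python
import math
from collections import Counter


def build_pseudo_document(post_tokens, comment_token_lists):
    """Merge a post's tokens with the tokens of its comments (assumed pre-segmented)."""
    merged = list(post_tokens)
    for tokens in comment_token_lists:
        merged.extend(tokens)
    return merged


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per tokenized document.

    Assumption: smoothed TF-IDF; the source does not specify the weighting.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        vector = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: one post with a comment, plus two other posts.
docs = [
    build_pseudo_document(["weibo", "summary"], [["summary", "cluster"]]),
    ["weibo", "summary", "cluster"],
    ["football", "match"],
]
vecs = tfidf_vectors(docs)
```

Merging comments into the post before vectorizing gives a short post more terms to work with, which is presumably why the thesis builds pseudo-documents; in this toy data the first two documents score as similar and the third shares no terms with them.
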
On this basis, a text similarity calculation method is applied.

Then, K-means is adopted as the clustering algorithm. Specifying the value of K has always been the biggest problem with K-means and usually has to be judged from experience. Choosing the center points is another significant problem: center points should ideally be representative, and choosing different center positions has a considerable effect on the accuracy of the result. The thesis improves on both points, so that the improved algorithm adaptively obtains the value of K and selects the center points.

Finally, the weight of each post in a cluster is determined from the timeliness and popularity of its content. A summary is first obtained for each cluster, and the cluster summaries are then combined into the final summary for the private Weibo account. The thesis closes with experiments that analyze and verify the proposed clustering improvement; compared with the original algorithm, accuracy and applicability are improved. Development of the full system realizes summary generation for private Weibo.
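
The final weighting step can be illustrated with a short sketch. The abstract only says that weights depend on timeliness and popularity; the exponential time decay, the log-scaled popularity, the way the two are combined, and the field names (`created_at`, `reposts`, `comments`, `text`) are all assumptions made for illustration, not the thesis's formula.

```python
import math
from datetime import datetime


def post_weight(post, now, half_life_days=30.0):
    """Score a post by timeliness and popularity (both formulas are assumed)."""
    age_days = max((now - post["created_at"]).total_seconds() / 86400.0, 0.0)
    timeliness = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    popularity = math.log1p(post["reposts"] + post["comments"])
    return timeliness * (1.0 + popularity)


def summarize(clusters, now, per_cluster=1):
    """Pick the highest-weighted posts of each cluster and concatenate them."""
    summary = []
    for cluster in clusters:
        ranked = sorted(cluster, key=lambda p: post_weight(p, now), reverse=True)
        summary.extend(p["text"] for p in ranked[:per_cluster])
    return summary


# Hypothetical clusters, as they might come out of the clustering step.
now = datetime(2014, 6, 1)
clusters = [
    [
        {"text": "old popular", "created_at": datetime(2013, 6, 1),
         "reposts": 50, "comments": 50},
        {"text": "recent", "created_at": datetime(2014, 5, 30),
         "reposts": 5, "comments": 5},
    ],
    [
        {"text": "only", "created_at": datetime(2014, 5, 1),
         "reposts": 0, "comments": 0},
    ],
]
result = summarize(clusters, now)
```

With a 30-day half-life, a year-old post is discounted heavily even when it is popular, so the recent post represents the first cluster. A different half-life or combination rule would change that trade-off.
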
[Degree-granting institution]: Inner Mongolia University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092


Article ID: 2022699


Article link: http://sikaile.net/guanlilunwen/ydhl/2022699.html

