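Research on Social Network Data Mining Based on Natural Language Processing
Topic of this thesis: Weibo + word segmentation; Source: master's thesis, North China Electric Power University, 2017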
[Abstract]: Weibo is currently one of the most popular social platforms. Users share and exchange information in real time through short texts and multimedia posts. Although each post is short, the data accumulated over time carries rich information about users, including their individual characteristics. This user data has considerable value as social information, so mining Weibo user data matters both for the development of social networks and for the analysis of social information. The main task of social network data mining is to analyze the massive volume of short texts that users post on Weibo in order to obtain such user characteristics. The first step is to collect a large amount of Weibo data from the network and store it in a specific format. The collected text is then segmented into words and converted into a feature representation. Finally, data mining methods are applied for user identification and user type analysis.
Using web crawler technology, this thesis designs a user data crawling system based on simulated login, which provides a way to obtain large amounts of users' Weibo data from the network. Given the structure of the user data, a NoSQL database based on the JSON format is used for storage.
To address the difficulty that current word segmentation methods have in discovering new words, the thesis proposes a Chinese word segmentation method that combines dictionary matching with statistical tagging. The method builds on dictionary matching, incorporates a CRF tagging algorithm, and retrains iteratively during segmentation so that the algorithm can learn on its own. By combining the matching and tagging methods and selecting segmentation results according to regularities of Chinese semantics, it effectively improves segmentation accuracy and the discovery of out-of-vocabulary words. Experiments on the test corpus show that the proposed method raises the F-value by 9.6% compared with the forward maximum matching algorithm and by 2.9% compared with the CRF tagging algorithm, and so better meets the needs of practical applications.
Current Weibo data mining mainly uses the one-hot representation for features, whose drawback is that it cannot express contextual semantics. This thesis adopts a word2vec-based user feature representation, which adds context information to the user features and reduces the dimensionality of the user information vectors, improving the computational efficiency of the subsequent data mining algorithms.
Analysis of the Weibo user data shows that some users are spam users who introduce noise into data mining. The thesis designs an SVM-based spam user identification model, which reaches an F-value of 0.94 on the test set. Users are then divided into communities with the K-means clustering algorithm according to the content they follow. Because the community division is uncertain, the optimal number of cluster centers is computed with the DB-index algorithm, which improves the between-cluster distinguishability and within-cluster similarity of the clustering results.
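
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of forward maximum matching and of an F-value computed over word spans. It is not the thesis's hybrid dictionary-plus-CRF method, and the thesis's exact scoring procedure is not given here. The function names, the `max_len` parameter, the toy dictionary, and the sample sentence are all assumptions introduced for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_f_value(predicted, gold):
    """Return precision, recall and F-value, counting a word as correct if its span matches."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: the greedy match grabs the three-character word first
# and so mis-segments the rest of the sentence.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
predicted = forward_max_match("研究生命的起源", toy_dictionary, max_len=3)
gold = ["研究", "生命", "的", "起源"]
print(predicted)                               # ['研究生', '命', '的', '起源']
print(segmentation_f_value(predicted, gold))   # (0.5, 0.5, 0.5)
```

The toy sentence shows the kind of ambiguity error that pure dictionary matching makes, which is the sort of case a statistical tagger such as a CRF can help resolve.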
[Degree-granting institution]: North China Electric Power University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1
Article ID: 2067949
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2067949.html