Research on Social Network Data Mining Based on Natural Language Processing
Topic: Weibo + word segmentation; Source: North China Electric Power University, 2017 master's thesis
【Abstract】: Weibo (microblogging) is currently a highly popular social platform on which users share and exchange information in real time through short texts and multimedia posts. Although each post is short, the data accumulated over time contains rich information, such as users' personalized characteristics, and carries substantial social-information value; mining Weibo user data is therefore important for social network development and social-information analysis. The core task of social network data mining is to obtain users' personalized features by analyzing the massive volumes of short texts that users post. The first step is to collect large amounts of Weibo data from the web and store it in a suitable format; the collected posts are then processed with word segmentation and feature representation; finally, data mining methods are applied for user identification and user-type analysis.

This thesis designs a user-data crawling system based on simulated login using web crawler technology, providing a way to obtain large volumes of users' Weibo data from the network. Based on the structural characteristics of the user data, a JSON-based NoSQL database is used for storage. To address the difficulty of new-word discovery in existing word segmentation methods, a Chinese word segmentation method is proposed that fuses dictionary matching with statistical tagging: it takes dictionary matching as its basis, incorporates a CRF tagging algorithm, and achieves self-learning through iterative training during segmentation. By fusing the matching and tagging methods and selecting segmentation results according to Chinese semantic regularities, the method effectively improves segmentation accuracy and out-of-vocabulary word discovery. Experiments on the test corpus show that the proposed method improves the F-score by 9.6% over the maximum forward matching algorithm and by 2.9% over the CRF tagging algorithm alone, better meeting the needs of practical applications.

Current Weibo data mining mainly uses one-hot feature representations, which cannot express contextual semantics. This thesis instead adopts a word2vec-based user feature representation, which incorporates contextual information into the user features, reduces the dimensionality of the user vectors, and improves the computational efficiency of subsequent mining algorithms. Analysis of the Weibo user data revealed that some spam users introduce noise into the mining process, so an SVM-based spam-user identification model was designed, achieving an F-score of 0.94 on the test set. User communities were then partitioned with K-means clustering according to the content users follow. Because the number of communities is not known in advance, the optimal number of cluster centers was determined with the Davies-Bouldin index (DB-index), improving the inter-cluster separability and intra-cluster similarity of the clustering results.
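The maximum forward matching algorithm that the thesis uses as its segmentation baseline can be sketched as follows. The toy dictionary and input sentence are illustrative only, not from the thesis; real systems use large dictionaries and, as the abstract notes, this greedy method fragments any word not in the dictionary, which is exactly the new-word problem the proposed fusion method targets.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy maximum forward matching: at each position, take the longest
    dictionary word starting there, falling back to a single character when
    nothing matches (this is how unknown words end up fragmented)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens

# Toy dictionary for illustration.
toy_dict = {"數(shù)據(jù)", "挖掘", "數(shù)據(jù)挖掘", "社交", "網(wǎng)絡(luò)", "社交網(wǎng)絡(luò)"}
print(forward_max_match("社交網(wǎng)絡(luò)數(shù)據(jù)挖掘", toy_dict))  # → ['社交網(wǎng)絡(luò)', '數(shù)據(jù)挖掘']
```

Note how a word absent from the dictionary, e.g. a new Weibo coinage, would be emitted character by character; the CRF tagging stage in the thesis's fused method is what recovers such out-of-vocabulary words.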
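The word2vec-based user representation described in the abstract can be approximated by averaging the word vectors of a user's segmented posts, giving a dense fixed-dimension vector instead of a sparse one-hot encoding. A minimal sketch with hypothetical 3-dimensional vectors (in practice the vectors would come from a word2vec model trained on the segmented Weibo corpus, with a dimension in the hundreds):

```python
import numpy as np

# Hypothetical pretrained word vectors; real ones would be learned by word2vec.
embeddings = {
    "movie":  np.array([0.9, 0.1, 0.0]),
    "cinema": np.array([0.8, 0.2, 0.1]),
    "stock":  np.array([0.0, 0.1, 0.9]),
}

def user_vector(tokens, embeddings, dim=3):
    """Represent a user as the mean of the word vectors of the tokens in
    their posts; tokens without a vector are skipped. The result is a dense,
    low-dimensional feature usable by downstream SVM / K-means steps."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

print(user_vector(["movie", "cinema", "unseen_word"], embeddings))
```

Averaging is only one simple way to pool word vectors into a user vector; it already captures the "similar interests are nearby in vector space" property that one-hot features lack.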
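The SVM spam-user identification step can be sketched with scikit-learn. The two features used here (posting rate and fraction of posts containing URLs) and all values are invented for illustration; the thesis's actual feature set is not reproduced:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-user features: [posts_per_day, url_ratio].
# Label 1 = spam user, 0 = normal user. Values are invented for the sketch.
X = np.array([[50.0, 0.90], [60.0, 0.80], [45.0, 0.95],   # spam-like users
              [ 3.0, 0.10], [ 5.0, 0.00], [ 2.0, 0.05]])  # normal users
y = np.array([1, 1, 1, 0, 0, 0])

# A linear SVM cleanly separates the two groups in this toy setting.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[55.0, 0.85], [4.0, 0.05]]))  # → [1 0]
```

In the thesis, filtering out users flagged this way (F-score 0.94 on the test set) removes noise before the community-clustering stage.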
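The final step, choosing the number of K-means communities with the Davies-Bouldin index, can be sketched as follows; the synthetic 2-D "user feature" blobs stand in for the real word2vec user vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
# Three well-separated synthetic "user communities" in feature space.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [10, 0], [0, 10])])

# Try several cluster counts and keep the one minimising the
# Davies-Bouldin index (lower = tight clusters that are far apart).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)
print(best_k)  # → 3
```

This mirrors the thesis's rationale: since the number of user communities is not known in advance, the DB-index provides an internal criterion for selecting it without labeled data.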
【Degree-granting institution】: North China Electric Power University
【Degree level】: Master
【Year conferred】: 2017
【CLC number】: TP391.1
Article ID: 2067949
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2067949.html