Research and Analysis of Web-Based User Attribute Inference for Chinese Social Networking Sites
Published: 2019-03-19 10:32
[Abstract]: With the development of the Internet, social networking sites have become increasingly popular. Their users generate massive amounts of data every day, and this data holds great potential value. Because user data often involves personal privacy, users frequently hide their personal information by leaving fields blank or entering false information, which makes some valuable user attributes difficult to obtain directly. How to infer these attributes has become an active research topic. This thesis takes Sina Weibo users as its research subject and infers three user attributes: gender, age distribution, and education level distribution. The main work is as follows.

1) For gender inference of Chinese users, the thesis proposes four text-based algorithms: a nickname-based gender inference algorithm (GIABON), a tag-based gender inference algorithm (GIABOL), a Weibo-text-based gender inference algorithm (GIABOWT), and a mean-based gender inference algorithm (GIABOM). The first three each consider the effect of only a single attribute on gender inference, which is a limitation, whereas GIABOM takes the various types of text into account together. Experiments show that GIABOM reaches an accuracy of 85.55%, far higher than the other three algorithms. This indicates that considering several attributes together is more reasonable when inferring a user's gender.

2) For age distribution inference of Chinese users, the thesis proposes an algorithm that uses a genetic algorithm to jointly optimize the parameters and the feature attributes of a support vector machine (SVM). Three SVM configurations are compared: a linear kernel, a radial basis function (RBF) kernel, and an RBF kernel whose parameters are optimized by the genetic algorithm. Experiments show that the SVM with a linear kernel reaches an accuracy of 75.38% and the SVM with an RBF kernel reaches 86.14%, while the proposed algorithm with genetic-algorithm optimization of SVM parameters and features reaches 89.11%. These results support the effectiveness of the proposed optimization of SVM parameters and features.

3) For education level distribution inference of Chinese users, the thesis proposes an algorithm that likewise uses a genetic algorithm to jointly optimize the SVM parameters and feature attributes, following the same approach as the age distribution algorithm. Experiments show that the SVM with a linear kernel reaches an accuracy of 81.38%, the SVM with an RBF kernel reaches 92.14%, and the proposed algorithm reaches 93.03%. This indicates that the algorithm also performs well when inferring users' education level.
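The abstract does not give the formula behind the mean-based algorithm (GIABOM), so the following is only a minimal sketch of one plausible reading: each text source (nickname, tags, Weibo text) yields its own probability that the user is male, and the combined prediction averages whichever of those scores are available. The function name `combine_by_mean`, the 0.5 threshold, the equal weighting, and the example scores are all assumptions made for illustration, not details taken from the thesis.

```python
from statistics import mean


def combine_by_mean(scores, threshold=0.5):
    """Average per-source P(male) estimates and threshold the result.

    `scores` maps a source name (e.g. "nickname") to that source's
    estimated probability that the user is male, or None when the
    source is unavailable for this user. Equal weighting and the 0.5
    threshold are assumptions, not values from the thesis.
    """
    available = [p for p in scores.values() if p is not None]
    if not available:
        return None, None  # no evidence from any text source
    avg = mean(available)
    label = "male" if avg >= threshold else "female"
    return label, avg


# Hypothetical per-source scores for one user (illustrative values only).
user_scores = {"nickname": 0.75, "tags": 0.75, "weibo_text": 0.25}
label, avg = combine_by_mean(user_scores)
print(label, round(avg, 3))
```

Skipping missing sources rather than treating them as 0.5 keeps a user with only a nickname from being pulled toward an undecided score; whether the thesis handles missing attributes this way is not stated.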
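[Degree-granting institution]: Nanjing University of Aeronautics and Astronautics
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP18; TP393.09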
Article ID: 2443437
Article link: http://sikaile.net/guanlilunwen/ydhl/2443437.html