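Research on User Information Mining and Its Applications for Online Communities
Published: 2018-06-05 22:04
Topics: online communities + user information; Source: Harbin Institute of Technology, 2014 doctoral dissertation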
[Abstract]: In recent years, with the growth of online communities, the Web has accumulated a massive amount of user information, including account information (such as usernames), demographic information (such as gender and age), social relationships (such as friendship and reply relationships), and user-generated content. This information can help companies better understand and target their customers, support better personalized information systems for users, and help sociologists better understand human behavior. Mining user information in online communities is therefore key to building new social applications and to understanding human behavior.
However, mining user information in online communities faces several challenges: the unstructured challenge, the cross-community challenge, and the unquantified challenge. The unstructured challenge is that user information appears in unstructured form across many different types of web pages, and the diversity and dynamic nature of page layouts make automatic extraction difficult. The cross-community challenge is that a user's information is fragmented across different communities, which makes it hard to build a complete picture of that user. The unquantified challenge is that user attributes (such as influence and expertise) lack explicit, direct measures, which makes them hard to use directly. This thesis addresses these three challenges and also explores applications of user information. Its main contributions are summarized as follows.
(1) To address the unstructured challenge, this thesis studies username extraction from user-generated-content web pages and proposes a weakly supervised learning method. The method uses a small number of usernames made up of statistically rare strings to automatically collect and label a large amount of training data, which removes the need for the manually labeled training data that existing supervised methods require. The method also relies only on features extracted from a single page, overcoming the dependence of existing methods on multi-page features. Experimental results show that the method significantly outperforms supervised methods that use only single-page features and performs comparably to supervised methods that use multi-page features.
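The automatic labeling step in (1) can be pictured with a small sketch. It is an illustration under stated assumptions, not the thesis's implementation: the seed set, the `auto_label` helper, and the tag-path feature are hypothetical, and Python's standard-library `html.parser` stands in for whatever page representation the thesis actually uses.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node on one page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Simplification: void elements such as <br> are pushed too.
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def auto_label(html, seed_usernames):
    """Label each text node True if it exactly matches a seed username (hypothetical rule)."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return [(path, text, text in seed_usernames) for path, text in collector.nodes]


page = '<div><a class="user">zq_x9vk</a><p>Hello world</p></div>'
labeled = auto_label(page, {"zq_x9vk"})
```

Because a rare string such as `zq_x9vk` is unlikely to occur on a page as anything other than a username, exact matches can serve as positive examples without manual annotation. All features here come from a single page, in line with the single-page constraint described above.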
(2) To address the cross-community challenge, this thesis studies cross-community user linking. The problem is divided into two steps: (a) same-username disambiguation, that is, deciding whether accounts that use the same username belong to the same natural person; and (b) different-username resolution, that is, collecting all the different usernames used by one natural person. The thesis focuses on the same-username disambiguation task. First, a user survey and an analysis of About.me data quantify the importance of this task; this is the first work to study people's username habits quantitatively. The thesis then proposes a method that automatically obtains training data according to the language-model probability of a username, and experiments on a Yahoo! Answers dataset verify that the assumption underlying the method is reasonable. The method removes the need for the manually labeled data that existing supervised methods require. Experimental results show that the classifier learned from the automatically labeled training set is effective.
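The idea of harvesting training data from a username's language-model probability can be sketched as follows. The abstract does not give the model or the labeling rule, so everything specific here is an assumption made for illustration: the character-bigram model with add-one smoothing, the toy corpus, the thresholds, and the rule that a very improbable username shared across communities is labeled as the same person while a very probable one is labeled as different people.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character-bigram language model over usernames, with add-one smoothing."""

    START, END = "^", "$"

    def __init__(self, usernames):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {self.END}
        for name in usernames:
            chars = [self.START] + list(name.lower()) + [self.END]
            self.vocab.update(chars[1:])
            for prev, cur in zip(chars, chars[1:]):
                self.bigram_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, name):
        """Natural-log probability of the whole username under the model."""
        chars = [self.START] + list(name.lower()) + [self.END]
        v = len(self.vocab) + 1  # one extra slot for characters never seen in training
        return sum(
            math.log((self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + v))
            for prev, cur in zip(chars, chars[1:])
        )


def label_by_rarity(name, lm, rare_below, common_above):
    """Hypothetical labeling rule for a username that appears in two communities."""
    score = lm.log_prob(name)
    if score < rare_below:
        return "same_person"       # too improbable to be chosen twice independently
    if score > common_above:
        return "different_person"  # common enough to be chosen by many people
    return None                    # ambiguous: leave out of the training set


# Toy corpus standing in for a large list of observed usernames.
corpus = ["john", "johnny", "mary", "mike", "anna", "james", "maria", "joan"]
lm = CharBigramLM(corpus)
```

Usernames scoring between the two thresholds are left unlabeled, which trades training-set size for label quality.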
(3) To address the unquantified challenge, this thesis studies the measurement of user information, taking user expertise estimation as an example. Specifically, it studies how to estimate user expertise in question-answering communities and proposes an estimation method based on a competition model. The method recasts expertise estimation as the problem of estimating players' skill levels from the outcomes of a series of two-player contests. It overcomes the inability of link-analysis-based methods to model heterogeneous information, such as question-answer relationships and answer quality, in a unified way. By modeling the difficulty of each contest, it also overcomes the limitation of answer-quality-based methods, which treat every question equally. Experimental results show that, compared with link-analysis-based and answer-quality-based methods, the proposed competition model performs significantly better when estimating the expertise of active users.
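The competition view in (3) can be illustrated with a short sketch. Two things here are assumptions and not taken from the abstract: the way contests are derived from a thread (the best answerer is taken to beat the asker and every other answerer), and the use of the Elo update rule as a simple stand-in for whatever two-player skill model the thesis uses. The sketch also does not model per-contest difficulty.

```python
from collections import defaultdict


def contests_from_thread(asker, best_answerer, other_answerers):
    """Return (winner, loser) pairs for one question thread (assumed scheme)."""
    pairs = [(best_answerer, asker)]
    pairs += [(best_answerer, user) for user in other_answerers]
    return pairs


def elo_ratings(contests, k=32.0, base=1500.0):
    """Run standard Elo updates over (winner, loser) pairs, in order."""
    ratings = defaultdict(lambda: base)
    for winner, loser in contests:
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical threads: (asker, best answerer, other answerers).
threads = [
    ("alice", "bob", ["carol"]),
    ("carol", "bob", ["dave"]),
    ("dave", "carol", []),
]
contests = [pair for thread in threads for pair in contests_from_thread(*thread)]
ratings = elo_ratings(contests)
```

Question-answer relationships (who answered whom) and answer quality (who gave the best answer) both enter through the same list of contest outcomes, which is the sense in which a competition model treats heterogeneous signals uniformly.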
(4) From an application perspective, building on user information that has been structured, quantified, and linked across communities, this thesis studies crowdsourcing task difficulty estimation based on user information, taking question difficulty estimation in question-answering communities as an example. Using the measured expertise of users, it proposes a user-competition-based model for estimating question difficulty. The expertise measurements guide the difficulty estimate and resolve the inability of previous methods to handle observations that take the form of partial-order relations. Experimental results verify the effectiveness of the proposed model. Finally, the thesis uses cross-community user linking information to study cross-community question difficulty estimation.
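One way to picture how expertise scores can guide difficulty estimation from partial-order observations is sketched below. It is an assumption-laden illustration, not the thesis's model: the two kinds of observation (a question is harder than the skill of users it "beat", such as the asker, and easier than the skill of users who "beat" it, such as the best answerer), the logistic (Bradley-Terry-style) likelihood, the L2 pull toward the mean skill, and plain gradient ascent are all choices made here for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def estimate_difficulty(failed_skills, solved_skills, steps=2000, lr=0.05, l2=0.01):
    """Estimate a question's difficulty d on the same scale as user skills.

    failed_skills: skills of users assumed to be beaten by the question.
    solved_skills: skills of users assumed to have beaten the question.
    Maximizes  sum log sigmoid(d - s_failed) + sum log sigmoid(s_solved - d)
               - (l2 / 2) * (d - mean_skill) ** 2
    by gradient ascent, starting from the mean skill.
    """
    skills = list(failed_skills) + list(solved_skills)
    mean_skill = sum(skills) / len(skills)
    d = mean_skill
    for _ in range(steps):
        grad = sum(sigmoid(s - d) for s in failed_skills)    # pushes d upward
        grad -= sum(sigmoid(d - s) for s in solved_skills)   # pushes d downward
        grad -= l2 * (d - mean_skill)                        # keeps d finite
        d += lr * grad
    return d


# Hypothetical skill scores on an arbitrary scale.
hard = estimate_difficulty(failed_skills=[1.0, 2.0], solved_skills=[5.0])
easy = estimate_difficulty(failed_skills=[1.0], solved_skills=[2.0, 5.0])
```

Each observation only says that the difficulty lies above or below a known skill, never what the difficulty is. The likelihood combines these one-sided constraints into a single point estimate.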
In summary, this thesis addresses the unstructured, cross-community, and unquantified challenges in user information mining and, from an application perspective, applies structured, quantified, and cross-community-linked user information to crowdsourcing task difficulty estimation. The research has produced some preliminary results, which we hope will be a useful reference for other researchers in the field. As user information mining techniques continue to improve, they are expected to bring greater benefit to social applications and to research in social computing.
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: TP393.092