Information Extraction and Visual Analysis Based on Personal Profiles

Published: 2018-07-13 12:34

[Abstract]: With the spread of the Internet, the amount of information online has grown explosively. Beyond sheer volume, the types of information have also become increasingly diverse. Among these data types, one class can be called "personal profiles": for example résumés, personal home pages, and biography pages in online encyclopedias. Such data makes it possible to infer social relationships between people. For example, if two people studied at the same university during overlapping periods, they are likely to have been classmates. The social network obtained through this kind of analysis is highly valuable and can be applied to many problems common in social network analysis, such as identifying the most influential people and community detection. This thesis presents a system for information extraction and visual analysis of personal profile data and describes in detail the main algorithms involved. The system has two main functions: extracting information from personal profiles to build an entity-based association network, which is then used to predict social relationships between people; and, based on this network, performing a shallow analysis of each person's importance or influence by computing PageRank. We divide the construction of the network into two steps. The first step builds an association network composed of multiple types of entities, which can be regarded as a heterogeneous information network for a specific domain. This step involves structuring the personal profile data, including entity recognition and event extraction. Given the characteristics of the data, we implement event extraction by clustering on syntactic parse tree similarity combined with rule-based extraction. The second step establishes relationships between person-name nodes through path analysis over the constructed association network. Before doing so, we need to add relationships between the other types of nodes in order to obtain more complete path information.
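The classmate example above can be made concrete with a small sketch. It assumes the extraction step has already produced structured enrollment records; the `Enrollment` record, its fields, and the function names are hypothetical and are not taken from the thesis, whose actual pipeline (entity recognition, event extraction, path analysis) is considerably richer.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Enrollment:
    """Hypothetical structured event: a person studied at a university."""
    person: str
    university: str
    start_year: int
    end_year: int  # inclusive


def periods_overlap(a: Enrollment, b: Enrollment) -> bool:
    """True if both records name the same university and the years intersect."""
    return (
        a.university == b.university
        and a.start_year <= b.end_year
        and b.start_year <= a.end_year
    )


def candidate_classmates(enrollments):
    """Return unordered pairs of people who may have been classmates."""
    pairs = set()
    for a, b in combinations(enrollments, 2):
        if a.person != b.person and periods_overlap(a, b):
            pairs.add(tuple(sorted((a.person, b.person))))
    return pairs


if __name__ == "__main__":
    records = [
        Enrollment("A", "University X", 2001, 2005),
        Enrollment("B", "University X", 2004, 2008),
        Enrollment("C", "University X", 2009, 2012),
        Enrollment("D", "University Y", 2004, 2006),
    ]
    print(candidate_classmates(records))
```

The pairwise comparison is quadratic in the number of records; grouping records by university first would be the natural refinement for larger data.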
Considering the characteristics of heterogeneous networks, we use different methods to build the relationships between different types of nodes. The visual analysis of this information network consists mainly of ranking people by importance, or influence, through computing PageRank. In a visualization setting, given the limits of human cognition and of display precision, we argue that the rank order of the nodes matters more than the actual PageRank values. The PageRank computation should therefore stop early, as soon as the relative order of the nodes has essentially stopped changing. Existing research on improving PageRank falls into two branches. One seeks to accelerate the convergence of the traditional power method from a mathematical standpoint; the other uses Monte Carlo methods to approximate the PageRank result. Neither, however, is well suited to approximating the rank order of nodes. The first aims to speed up convergence while maintaining accuracy; the second is highly efficient, but it is better at identifying the high-ranking nodes than at approximating the order among them. The second part of the thesis therefore proposes the Early-stop algorithm. The algorithm consists of two steps: Grouping and Parallel Updating. Grouping determines the approximate range of each node's rank by simulating random walks; Parallel Updating then adjusts the order of nodes with nearby ranks, within a small range, through parallel updates. Experimental results show that the Early-stop algorithm effectively improves the accuracy of the approximated order of high-ranking nodes.
The main contributions of this thesis are as follows. It proposes a system for data extraction and analysis based on personal profiles that covers the whole process from information extraction to visual analysis. It observes that visual analysis lowers the precision required of the computed results, and on that basis proposes the Early-stop algorithm for fast approximation of PageRank. Extensive experiments show that the Early-stop algorithm approximates node rankings more accurately than the latest random-simulation algorithm.
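To illustrate the two ideas the abstract contrasts, the following sketch shows (a) a power iteration that stops once the rank order has stayed unchanged for a few consecutive iterations, and (b) a generic Monte Carlo estimate based on random walks. This is not the thesis's Early-stop algorithm: Grouping and Parallel Updating are not reproduced here, and the abstract does not specify them in enough detail to do so. The function names, the `patience` and `walks_per_node` parameters, the uniform handling of dangling nodes, and the toy graph are all assumptions made for illustration.

```python
import random
from collections import Counter


def rank_order(scores):
    """Nodes sorted by descending score; ties broken by node name."""
    return sorted(scores, key=lambda v: (-scores[v], v))


def power_pagerank_until_order_stable(graph, damping=0.85, patience=3, max_iter=100):
    """Power iteration that stops when the rank order stops changing.

    `graph` maps every node to a list of its successors. Returns the
    scores and the number of iterations actually performed.
    """
    nodes = sorted(graph)
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    previous_order, unchanged = None, 0
    for iteration in range(1, max_iter + 1):
        updated = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v] or nodes  # dangling node: spread rank uniformly
            share = damping * scores[v] / len(targets)
            for t in targets:
                updated[t] += share
        scores = updated
        order = rank_order(scores)
        unchanged = unchanged + 1 if order == previous_order else 0
        previous_order = order
        if unchanged >= patience:  # order held for `patience` iterations in a row
            break
    return scores, iteration


def monte_carlo_pagerank(graph, walks_per_node=500, damping=0.85, seed=0):
    """Estimate PageRank as the fraction of visits made by short random walks."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    visits = Counter()
    for start in nodes:
        for _ in range(walks_per_node):
            current = start
            while True:
                visits[current] += 1
                if rng.random() >= damping:  # walk ends with probability 1 - damping
                    break
                current = rng.choice(graph[current] or nodes)
    total = sum(visits.values())
    return {v: visits[v] / total for v in nodes}


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    scores, iterations = power_pagerank_until_order_stable(toy_graph)
    print("power method order:", rank_order(scores), "after", iterations, "iterations")
    print("Monte Carlo order: ", rank_order(monte_carlo_pagerank(toy_graph)))
```

Stopping on order stability is a heuristic: a ranking that has held for a few iterations can still change later when two scores are very close, which is the kind of difficulty among neighboring high-ranking nodes that the abstract describes.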
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

[Citation]: 呂朝陽; Information Extraction and Visual Analysis Based on Personal Profiles [D]; Shandong University; 2017

Article No.: 2119385


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2119385.html
