
Information Extraction Based on Conditional Random Fields and Intelligence Information Visualization

Published: 2018-05-21 15:55

  Topics: CRFs + CRFsuite; Source: North China University of Technology, 2017 master's thesis


[Abstract]: In recent years networks have developed rapidly and network security threats have grown steadily. The rapid growth in the volume, velocity, and variety of network data raises the problem of how to fuse, store, and manage massive heterogeneous data. Amid the explosive growth of Internet information, information about people is also growing geometrically, yet the situation remains data-rich and information-poor. Text is still the main source from which people obtain information, so how to extract person information effectively from massive amounts of text has become a question of wide interest. The traditional approach extracts and analyzes this text by manual tallying. Its accuracy is fairly high, but it consumes a great deal of human effort, so extraction is very inefficient. This approach can no longer meet the demand for efficient information acquisition, and information extraction technology arose in response. Based on a study of network data and information extraction models, the main contributions of this thesis are as follows.

1. An extraction rule for person information is proposed. The rule is built from a study of the format and characteristics of network data and has three parts: the feature leading words of the person information, the position where it appears, and the method. The position is one of three types: Body, Cookies, or Url. The method indicates whether the current session uses GET or POST. The feature leading words are the three keywords that precede the location of the person-information value, and they are separated out by word segmentation and filtering. Extraction with this rule yields person information accurately.

2. A person-attribute-oriented information extraction method based on CRFsuite is proposed. CRFsuite is an implementation of the conditional random fields (CRFs) algorithm for labeling sequence data, and the model trains quickly and achieves high accuracy. By learning from existing domains, the feature leading words, positions, and methods of person information in network data are extracted, and person information extraction rules are built from them. CRFsuite trains these into a model, and the model is applied to network data to match person information and build a structured person-information base. The final result is intelligence data in structured form.

3. A visual analysis system is designed and implemented. The system displays the relationships among the structured "intelligence" produced by information extraction in graphical form and links virtual identity information to real-world person information. This turns "information" into "intelligence" and ultimately converts an advantage in information resources into an advantage in decision-making.
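
The three-part rule in contribution 1 (leading words, position, method) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the thesis: the `ExtractionRule` class, the `tokenize`, `apply_rule`, and `token_features` helpers, the dictionary layout of a request, and the sample data. In particular, the regular-expression tokenizer only stands in for the word segmentation and filtering step, which the abstract does not specify, and `token_features` merely shows the kind of per-token features a linear-chain CRF toolkit such as CRFsuite could be trained on.

```python
import re
from dataclasses import dataclass

# Hypothetical tokenizer: a stand-in for the thesis's unspecified
# word segmentation and filtering step.
TOKEN = re.compile(r"[A-Za-z0-9_.@-]+")


def tokenize(text):
    """Split a request section into keyword-like tokens."""
    return TOKEN.findall(text)


@dataclass(frozen=True)
class ExtractionRule:
    """One rule: leading words, position, and method (names are illustrative)."""
    attribute: str        # hypothetical attribute label, e.g. "email"
    leading_words: tuple  # the keywords that precede the value
    position: str         # "Body", "Cookies", or "Url"
    method: str           # "GET" or "POST"


def apply_rule(rule, request):
    """Return every token that directly follows the rule's leading words."""
    if request["method"] != rule.method:
        return []
    tokens = tokenize(request.get(rule.position, ""))
    n = len(rule.leading_words)
    return [
        tokens[i]
        for i in range(n, len(tokens))
        if tuple(tokens[i - n:i]) == rule.leading_words
    ]


def token_features(tokens, i, position, method):
    """Describe token i by its position, method, and three preceding tokens."""
    feats = {"position": position, "method": method}
    for k in (1, 2, 3):
        feats[f"prev{k}"] = tokens[i - k].lower() if i - k >= 0 else "<BOS>"
    return feats


# Made-up sample request, for illustration only.
request = {
    "method": "POST",
    "Url": "http://example.com/login",
    "Cookies": "sid=abc123",
    "Body": '{"profile": {"contact": {"email": "alice@example.com"}}}',
}
rule = ExtractionRule("email", ("profile", "contact", "email"), "Body", "POST")

print(apply_rule(rule, request))  # ['alice@example.com']
```

A hand-written rule like this only matches exact leading-word sequences. The thesis instead learns the rules from labeled data with CRFsuite, which is why the sketch also exposes the same three signals as features rather than as a fixed pattern.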
[Degree-granting institution]: North China University of Technology
[Degree]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.08
