Research and Implementation of an Industry-Oriented Information Fusion Prototype System

Published: 2018-09-05 07:08

[Abstract]: With the rapid growth of the information industry, the volume of data on the web increases at a striking rate every day. User queries increasingly contain entity information, such as the names of people and organizations. Users try to build meaningful queries around entities and to find information related to those entities at the semantic level, rather than searching by keywords alone. General-purpose search engines that index at the document level, such as Google, Baidu, and Yahoo, all perform document retrieval based on keyword matching. To some extent they no longer meet the search needs of Internet users, who expect entity-centric search systems to emerge.
This thesis surveys the shortcomings of these search engines and the search habits of users, and proposes an information fusion method based on an entity association model. Using machine learning, it builds an industry-oriented web information fusion prototype system that integrates information around entities, so that ordinary Internet users can carry out entity-centric search more easily and effectively.
The main research work is as follows. First, an entity dictionary for the IT industry domain was obtained from Baidu Encyclopedia by extracting, classifying, and organizing entries. Second, IT news texts from major portal sites and well-known IT industry blogs were collected, and web page extraction techniques were used to build an industry-oriented Chinese news corpus. Third, the industry-oriented web information fusion prototype system was built with machine learning methods: a graph-based ranking algorithm computes the relevance between a text and an entity, which gives the weight of each entity in a text on the basis of semantic understanding, and the degree of association between entities is then computed from the weights of the entities in the texts where they appear. Finally, building on this work, an entity-centric search system prototype was completed.
In the experiments, the Chinese news corpus constructed above was used as the test set for the prototype system, and the computed values were compared with manually annotated ones. For most cases, both the text-entity relevance and the inter-entity association produced by the entity model deviated from the manual annotations by less than 0.1. The computed results are largely consistent with human judgment, which indicates relatively high accuracy.
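The graph-based ranking step described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual graph construction or formulas, so everything here is an assumption made for illustration: entities that co-occur in a sentence are linked, a PageRank-style iteration assigns each entity a weight within one text, and the association between two entities is taken as the cosine similarity of their weight vectors across texts. The function names `entity_weights` and `entity_association` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def entity_weights(sentences, damping=0.85, iterations=50):
    """Weight the entities of one text with a PageRank-style iteration.

    `sentences` is a list of lists of entity names, one inner list per
    sentence. Assumption: entities that share a sentence are joined by an
    undirected edge whose weight counts the co-occurrences.
    Returns a dict mapping entity -> weight; the weights sum to 1.
    """
    graph = defaultdict(lambda: defaultdict(float))
    nodes = set()
    for sentence in sentences:
        unique = sorted(set(sentence))
        nodes.update(unique)
        for a, b in combinations(unique, 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    if not nodes:
        return {}

    n = len(nodes)
    out_weight = {e: sum(graph[e].values()) for e in nodes}
    score = {e: 1.0 / n for e in nodes}
    for _ in range(iterations):
        # Entities with no edges spread their score uniformly.
        dangling = sum(score[e] for e in nodes if out_weight[e] == 0.0)
        updated = {}
        for e in nodes:
            incoming = sum(
                score[nb] * w / out_weight[nb] for nb, w in graph[e].items()
            )
            updated[e] = (1.0 - damping) / n + damping * (incoming + dangling / n)
        score = updated
    return score


def entity_association(doc_weights):
    """Association between entity pairs from per-text entity weights.

    `doc_weights` is a list of dicts as returned by `entity_weights`.
    Assumption: association is the cosine similarity of the two entities'
    weight vectors over all texts, so values lie in [0, 1].
    Pairs that never share a text are omitted.
    """
    vectors = defaultdict(dict)
    for doc_id, weights in enumerate(doc_weights):
        for entity, weight in weights.items():
            vectors[entity][doc_id] = weight
    norms = {
        e: math.sqrt(sum(w * w for w in v.values())) for e, v in vectors.items()
    }
    association = {}
    for a, b in combinations(sorted(vectors), 2):
        dot = sum(w * vectors[b].get(d, 0.0) for d, w in vectors[a].items())
        if dot > 0.0:
            association[(a, b)] = dot / (norms[a] * norms[b])
    return association


if __name__ == "__main__":
    doc1 = entity_weights([["Apple", "iPhone"], ["Apple", "iOS"]])
    doc2 = entity_weights([["Google", "Android"], ["Google", "Apple"]])
    for pair, value in sorted(entity_association([doc1, doc2]).items()):
        print(pair, round(value, 3))
```

Scores of this kind fall in the range 0 to 1, which makes them directly comparable to manually annotated values in the way the abstract's deviation threshold of 0.1 suggests. A different similarity measure or edge definition would change the numbers but not the overall pipeline.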
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2223551.html
