Research and Implementation of an Industry-Oriented Information Fusion Prototype System
[Abstract]: With the rapid development of the information industry, the volume of data on the network grows at an alarming rate every day. More and more user queries contain entity information, such as person names and organization names. Users try to construct meaningful query conditions around these entities and to discover the semantic relationships between them. General-purpose search engines built on document-level indexing, such as Google, Baidu, and Yahoo, all rely on keyword matching. To a certain extent they no longer meet the search needs of Internet users, who now expect entity-centric search systems to emerge.
This paper examines the shortcomings of these search engines and users' search habits, and proposes an information fusion method based on an entity association model. Using machine learning, it constructs an industry-oriented web information fusion prototype system that integrates information around entities. The aim is to use the concept of an entity to aggregate information with the entity at the center, so that ordinary Internet users can conduct entity-centric searches more conveniently and effectively.
The main research work of this paper is as follows. First, an entity dictionary for the IT industry domain is built from Baidu Encyclopedia by extracting, classifying, and organizing its entries. Second, IT news texts and well-known IT industry blogs are collected from major portals, and web page extraction technology is used to compile an industry-oriented Chinese news corpus. Third, an industry-oriented web information fusion prototype system is constructed by means of machine learning: the relevance between text and entity is calculated with a graph-based ranking algorithm, and the weight of each entity in a text is obtained on the basis of semantic understanding. Finally, building on this research, a prototype of an entity-centered search system is completed.
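The abstract does not say which graph-based ranking algorithm the thesis uses. As a rough illustration only, the sketch below assumes a PageRank-style power iteration over a sentence-level co-occurrence graph of dictionary entities; the function names, the damping factor, and the toy sentences are all hypothetical and not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences, entity_dict):
    """Link two dictionary entities whenever they occur in the same sentence.

    `sentences` is a list of token lists (already word-segmented);
    `entity_dict` is a set of known entity names. Both are assumptions.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        present = sorted({t for t in tokens if t in entity_dict})
        for a, b in combinations(present, 2):
            graph[a][b] += 1.0  # edge weight = number of shared sentences
            graph[b][a] += 1.0
    return graph


def rank_entities(graph, damping=0.85, iterations=50):
    """PageRank-style power iteration on a weighted undirected graph.

    Entities that never co-occur with another entity are not in the graph
    and therefore receive no score in this simplified sketch.
    """
    nodes = list(graph)
    if not nodes:
        return {}
    score = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            incoming = sum(
                score[m] * graph[m][n] / out_weight[m] for m in graph[n]
            )
            new_score[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score


if __name__ == "__main__":
    # Toy, hand-written input purely for illustration.
    entity_dict = {"Google", "Android", "Motorola", "Baidu"}
    sentences = [
        ["Google", "released", "Android"],
        ["Google", "acquired", "Motorola"],
        ["Google", "competes", "with", "Baidu"],
    ]
    weights = rank_entities(build_cooccurrence_graph(sentences, entity_dict))
    for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {w:.3f}")
```

The resulting scores can be read as relative entity weights within one text. A production system would also need Chinese word segmentation and entity disambiguation, which this sketch leaves out.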
A corpus from the Chinese news domain is used as the test set for the industry-oriented information fusion prototype system. The experimental results show that the text-entity relevance produced by the entity model constructed in this paper compares well with manual annotation, and that the deviation between the computed entity-entity relevance and the manually labeled results is mostly less than 0.1. The calculated results are basically consistent with human judgment and have high accuracy.
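One way to check a claim of the form "deviation mostly less than 0.1" is sketched below. It assumes that both the system and the annotators assign each entity pair a relevance score on the same scale; the function name, the dictionary layout, and the sample numbers are hypothetical placeholders, not results from the thesis.

```python
def deviation_report(computed, manual, threshold=0.1):
    """Compare computed and manually labeled relevance for shared entity pairs.

    Both arguments map an (entity, entity) tuple to a relevance score.
    Returns the number of shared pairs, the share of pairs whose absolute
    deviation is below `threshold`, and the largest deviation observed.
    """
    shared = sorted(set(computed) & set(manual))
    if not shared:
        raise ValueError("no entity pairs in common")
    deviations = {p: abs(computed[p] - manual[p]) for p in shared}
    within = sum(1 for d in deviations.values() if d < threshold)
    return {
        "pairs": len(shared),
        "share_within": within / len(shared),
        "max_deviation": max(deviations.values()),
    }


if __name__ == "__main__":
    # Made-up scores for illustration only.
    computed = {
        ("Google", "Android"): 0.82,
        ("Google", "Motorola"): 0.40,
        ("Baidu", "Android"): 0.15,
    }
    manual = {
        ("Google", "Android"): 0.80,
        ("Google", "Motorola"): 0.55,
        ("Baidu", "Android"): 0.10,
    }
    print(deviation_report(computed, manual))
```

Reporting the share of pairs under the threshold together with the maximum deviation makes the word "mostly" in such a claim concrete.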
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
Article ID: 2223551
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2223551.html