Research on Event Web Page Information Retrieval Methods Driven by Spatio-temporal Elements

Published: 2018-01-21 23:33

Keywords: web page text; event; spatio-temporal elements; retrieval; "time-space-topic" index. Source: Nanjing Normal University, master's thesis, 2013. Document type: degree thesis.

[Abstract]: Supported by the national "863" project "Research on Ubiquitous Spatial Information Association and Updating and Topic-Oriented Spatio-temporal Information Mining", this thesis explores methods for acquiring and retrieving event-oriented web page text. The aim is to provide data support for the structured representation of multi-source web information, the reconstruction of spatio-temporal event sequences, visualization, and mining analysis. The work follows the technical thread of "data acquisition, organization and management, retrieval service" for event web page text. Based on an analysis of how event information is described and organized in Chinese web page text, and taking natural disaster events as an example, the thesis studies the key technologies of an event web page information retrieval engine driven by spatio-temporal elements. The main research content and conclusions are as follows:

(1) Event web page acquisition driven by spatio-temporal elements: by analyzing the content and features of web page text that describes events, an event expression template is constructed with time, spatial location, and event topic as its basic elements. A web crawler is then customized according to the template to fetch web page text describing events. Experiments show that, compared with a traditional crawler, the event topic crawler built on the event expression template filters web pages well and the pages it retrieves have higher precision. However, the additional computation introduced into the topic crawler reduces its performance to some extent.

(2) Distributed "time-space-topic" indexing and storage of event web pages: rule-based models and a conditional random field (CRF) model are used to extract event-related time, spatial location, and topic information from web page text, and an event classification method for web page text based on a support vector machine (SVM) model is proposed. A distributed index based on "time-space-topic" is built to address low retrieval efficiency. Distributed storage of massive web page text is implemented on the HBase database and the HDFS file system.

(3) "Text-map" interactive event web page information retrieval service: by summarizing the descriptive characteristics of user queries, automatic parsing of event information queries is implemented. Drawing on the way the Tongyici Cilin synonym thesaurus organizes vocabulary, a lexical knowledge base for the natural disaster event domain and a similarity retrieval model are constructed, which enable similarity computation and ranking between candidate web page text and the query conditions.

(4) Prototype system design and implementation: based on the event web page acquisition method, the distributed indexing and storage method, and the retrieval service method proposed in the thesis, a prototype system is designed using the Google Map API. The architecture and main functional modules of the prototype system are discussed.
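The abstract does not give the layout of the "time-space-topic" index, so the following is only a minimal single-machine sketch of the general idea, not the thesis's HBase-based design. The names `EventPage`, `index_key`, and `TimeSpaceTopicIndex`, the `|` separator, and the choice to put the date first in the key are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventPage:
    """One crawled page with the three extracted elements (hypothetical schema)."""
    url: str
    day: date    # event time extracted from the page text
    place: str   # normalized place name; assumed not to contain "|"
    topic: str   # event class label, e.g. "earthquake"


def index_key(day: date, place: str, topic: str) -> str:
    """Compose a key of the form YYYYMMDD|place|topic (assumed layout)."""
    return f"{day:%Y%m%d}|{place}|{topic}"


class TimeSpaceTopicIndex:
    """In-memory stand-in for a composite time-space-topic index."""

    def __init__(self) -> None:
        self._postings = defaultdict(list)  # key -> list of page URLs

    def add(self, page: EventPage) -> None:
        key = index_key(page.day, page.place, page.topic)
        self._postings[key].append(page.url)

    def search(self, start: date, end: date, place: str, topic: str) -> list:
        """Return URLs whose event date lies in [start, end] for a place and topic."""
        lo, hi = f"{start:%Y%m%d}", f"{end:%Y%m%d}"
        hits = []
        # Keys are visited in sorted order, so results come back in date order.
        for key in sorted(self._postings):
            day, key_place, key_topic = key.split("|")
            if lo <= day <= hi and key_place == place and key_topic == topic:
                hits.extend(self._postings[key])
        return hits
```

Putting the date first keeps keys for nearby dates adjacent when sorted, which is the property a range scan in a sorted key-value store would rely on. A real deployment would scan only the matching key range rather than every key as this sketch does.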
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1; P208



Article link: http://sikaile.net/kejilunwen/dizhicehuilunwen/1452880.html
