Research on Spatio-Temporal Element-Driven Information Retrieval Methods for Event Web Pages
Published: 2018-01-21 23:33
Keywords: web page text; event; spatio-temporal elements; retrieval; "time-space-topic" index
Source: Nanjing Normal University, master's thesis, 2013
Document type: degree thesis
[Abstract]: Supported by the national "863" project "Research on Ubiquitous Spatial Information Association and Updating and Topic-Oriented Spatio-Temporal Information Mining", this thesis explores methods for acquiring and retrieving event-oriented web page text. The aim is to provide data support for the structured representation of multi-source web information, the reconstruction of event spatio-temporal sequences, visualization, and mining analysis. Following the technical thread of "data acquisition, organization and management, retrieval service" for event web page text, the thesis analyzes how event information is described and organized in Chinese web page text and, taking natural disaster events as an example, studies the key technologies of an event web page retrieval engine driven by spatio-temporal elements. The main research contents and conclusions are as follows:

(1) Spatio-temporal element-driven event web page acquisition: By analyzing the content and features of web page text that describes events, an event expression template is built with time, spatial location, and event topic as its basic elements. A web crawler is then customized according to the template to fetch web page text describing events. Experiments show that, compared with a traditional crawler, the event topic crawler built on the event expression template filters pages well and retrieves pages with higher precision. However, the large amount of computation introduced into the topic crawler lowers its performance somewhat.

(2) Distributed "time-space-topic" indexing and storage of event web pages: Rule-based models and conditional random field (CRF) models are used to extract event-related time, spatial location, and topic information from web page text, and a method for classifying web page text by event based on a support vector machine (SVM) model is proposed. A distributed index based on "time-space-topic" is built to address low retrieval efficiency. Distributed storage of massive web page text is implemented on the HBase database and the HDFS file system.

(3) "Text-map" interactive event web page retrieval service: By summarizing how users phrase retrieval queries, automatic parsing of event retrieval queries is implemented. Borrowing the vocabulary organization of Tongyici Cilin (a Chinese synonym thesaurus), a vocabulary knowledge base for the natural disaster event domain and a similarity retrieval model are built, which enable similarity computation and ranking between candidate web page text and the retrieval conditions.

(4) Prototype system design and implementation: Based on the event web page acquisition method, the distributed indexing and storage method, and the retrieval service method proposed in the thesis, a prototype system is designed using the Google Map API. The architecture and main function modules of the prototype system are discussed.
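The abstract does not spell out how the "time-space-topic" index is laid out. The following is only a minimal in-memory sketch of one plausible design, assuming a composite key made of a month bucket, a fixed-size longitude/latitude grid cell, and an event class label. The names `EventPage`, `index_key`, `TimeSpaceTopicIndex`, the 1-degree cell size, and the example URLs are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class EventPage:
    """A crawled page whose event elements are already extracted (hypothetical schema)."""
    url: str
    day: date    # event date
    lon: float   # event longitude in degrees
    lat: float   # event latitude in degrees
    topic: str   # event class label, e.g. "earthquake"


def index_key(day: date, lon: float, lat: float, topic: str,
              cell_deg: float = 1.0) -> str:
    """Build a composite key: month bucket | grid cell | topic (assumed layout)."""
    col = int((lon + 180.0) // cell_deg)  # grid column counted from 180 degrees west
    row = int((lat + 90.0) // cell_deg)   # grid row counted from 90 degrees south
    return f"{day:%Y%m}|{col}_{row}|{topic}"


class TimeSpaceTopicIndex:
    """In-memory stand-in for a keyed store: composite key -> list of page URLs."""

    def __init__(self, cell_deg: float = 1.0) -> None:
        self.cell_deg = cell_deg
        self._postings: Dict[str, List[str]] = defaultdict(list)

    def add(self, page: EventPage) -> None:
        """Register a page under the key derived from its time, place, and topic."""
        key = index_key(page.day, page.lon, page.lat, page.topic, self.cell_deg)
        self._postings[key].append(page.url)

    def query(self, day: date, lon: float, lat: float, topic: str) -> List[str]:
        """Return pages in the same month, grid cell, and topic as the query."""
        key = index_key(day, lon, lat, topic, self.cell_deg)
        return list(self._postings.get(key, []))


# Illustrative data only: the URLs and coordinates are made up for the example.
index = TimeSpaceTopicIndex()
index.add(EventPage("http://example.com/a", date(2008, 5, 12), 103.4, 31.0, "earthquake"))
index.add(EventPage("http://example.com/b", date(2008, 5, 20), 103.9, 31.5, "earthquake"))
index.add(EventPage("http://example.com/c", date(2008, 5, 12), 103.4, 31.0, "flood"))
hits = index.query(date(2008, 5, 1), 103.0, 31.2, "earthquake")
```

In a store such as HBase, where rows are sorted by row key, a key with this kind of prefix structure would let a query for one month and one grid cell be answered with a prefix scan rather than a full table scan. Whether the thesis uses this ordering of time, space, and topic is not stated in the abstract.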
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1;P208
Article ID: 1452880
Article link: http://sikaile.net/kejilunwen/dizhicehuilunwen/1452880.html