Research on a Topic-Specific Information Collection Method Based on Ontology Evolution
Published: 2019-02-08 18:56
[Abstract]: The Internet has given people a new channel for obtaining information. With an information source that is growing explosively, however, comes the problem of retrieving information on a specific topic quickly and accurately. General-purpose search engines are currently the most widely used retrieval tools, but because they serve the general public they struggle to deliver topic-specific information promptly and accurately. Topic-oriented information collection has therefore become an active research area.
This thesis first briefly surveys domestic and international research on topic-specific information collection and ontology evolution, introduces the basic principles, architecture, and main development directions of web information collection, and reviews text-similarity computation and ontology theory. It then designs collection strategies for several kinds of Internet information sources: full-site traversal of target websites, targeted tracking of specific site sections, and scheduled incremental updates from RSS feeds. Next, it designs a topic-ontology evolution scheme covering web page content extraction, extraction of feature words from the body text, construction of the initial topic ontology, and evolution of that ontology. Finally, an experimental system is designed and implemented; a sample topic is selected, an initial topic ontology is built, and the proposed method is validated experimentally.
The main contributions are: (1) collection strategies tailored to different information sources, so that the collector can cope with the complex collection environment of the Internet and, guided by the topic ontology, gather topic-relevant information from multiple kinds of sources; (2) a semi-automatic method for topic-ontology evolution that draws on the collected web pages and user behavior logs, combines them with feature-word extraction, and evolves the ontology under user guidance, with experiments verifying the effectiveness of the scheme.
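The abstract names scheduled incremental updates from RSS feeds as one collection strategy but gives no implementation details. The following is only a minimal sketch of what "incremental" could mean, assuming RSS 2.0 feeds whose items carry a `guid` or `link`; the function name `new_rss_items`, the sample feeds, and the in-memory `seen` set are hypothetical and not taken from the thesis.

```python
import xml.etree.ElementTree as ET


def new_rss_items(rss_xml, seen_ids):
    """Return feed items not collected before and record them in seen_ids."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        # Prefer <guid>; fall back to <link> when a feed omits or empties it.
        key = item.findtext("guid") or item.findtext("link")
        if not key or key in seen_ids:
            continue  # unidentifiable item, or already collected on an earlier poll
        seen_ids.add(key)
        fresh.append({"id": key, "title": item.findtext("title", default="")})
    return fresh


# Two hypothetical snapshots of the same feed, taken at successive polls.
FEED_V1 = """<rss version="2.0"><channel>
<item><guid>a1</guid><title>Ontology evolution survey</title></item>
</channel></rss>"""

FEED_V2 = """<rss version="2.0"><channel>
<item><guid>a2</guid><title>Focused crawling notes</title></item>
<item><guid>a1</guid><title>Ontology evolution survey</title></item>
</channel></rss>"""

seen = set()
first_poll = new_rss_items(FEED_V1, seen)
second_poll = new_rss_items(FEED_V2, seen)
print([entry["id"] for entry in first_poll])
print([entry["id"] for entry in second_poll])
```

In a real collector a scheduler would call such a function at a fixed interval and persist the set of seen identifiers between runs; the topic-relevance filtering guided by the ontology would be applied to the returned items.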
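[Degree-granting institution]: Nanjing University of Aeronautics and Astronautics
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1
Article ID: 2418683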
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2418683.html