Research on Ontology-Based Topic Relevance Algorithms

Published: 2019-07-01 15:38

[Abstract]: Specialized search engines provide valuable information and services for a particular domain, user group, or need, and are one direction in which web information search is developing. Because web resources are vast and growing rapidly, the first problem a specialized search engine must solve is how to acquire, efficiently and accurately, the web information for a specific domain or topic, that is, the target web resources, which comprise web pages and links. The core of this problem is computing the topic relevance of those resources: evaluating the topic relevance of a web page and predicting the topic relevance of a link. Existing topic relevance algorithms mostly operate at the character level and have limited ability to handle concepts or semantics, so their relevance judgments are inaccurate and the accuracy of topic information acquisition is low. Because ontologies express semantics well, this study uses an ontology to represent the topic and to conceptualize web pages. After comparing and analyzing classical topic relevance algorithms, it selects algorithms with higher accuracy and efficiency, namely a web page topic relevance evaluation algorithm and a link topic relevance prediction algorithm. It then designs and implements an ontology-based focused crawler system with higher efficiency and semantic processing ability, and finally verifies the effectiveness of the algorithms experimentally.
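As a rough illustration of what scoring a page at the concept level rather than the character level might look like, the sketch below represents the topic as a tiny weighted concept set and scores a page by the concepts it mentions. The abstract does not give the thesis's actual algorithm, so everything here is an assumption made for illustration: the contents and weights of `TOY_ONTOLOGY`, and the functions `extract_concepts` and `page_relevance`.

```python
import re

# Hypothetical toy "ontology": concept -> (weight, surface terms).
# Concepts, weights, and terms are invented for illustration only.
TOY_ONTOLOGY = {
    "apple": (1.0, {"apple", "apples"}),
    "cultivar": (0.8, {"cultivar", "fuji", "gala"}),
    "orchard": (0.6, {"orchard", "orchards"}),
    "pest": (0.4, {"pest", "pests", "aphid"}),
}


def extract_concepts(text, ontology):
    """Map the words of a page onto ontology concepts (concept -> match count)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = {}
    for concept, (_weight, terms) in ontology.items():
        n = sum(1 for w in words if w in terms)
        if n:
            counts[concept] = n
    return counts


def page_relevance(text, ontology):
    """Weighted share of the ontology's concepts found in the page, in [0, 1]."""
    found = extract_concepts(text, ontology)
    total = sum(weight for weight, _terms in ontology.values())
    hit = sum(ontology[concept][0] for concept in found)
    return hit / total
```

With this toy data, a page reading "A Gala orchard" still receives a non-zero score through the `cultivar` and `orchard` concepts even though the string "apple" never appears, which is the kind of case a purely string-based match on the topic word would miss.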
Based on a review of the related literature, and to address the low accuracy and efficiency of topic information acquisition, suitable topic relevance algorithms are selected using harvest rate and time efficiency as the criteria. To improve accuracy, the KNN classification algorithm, the concept space vector model (CSVM) algorithm, and an ontology-based topic relevance evaluation algorithm are compared, and the ontology-based algorithm is selected; it maps the concepts in a web page onto the ontology to compute the page's topic relevance. To improve efficiency, the topic-sensitive PageRank algorithm, an algorithm based on link text content, and an ontology-based link topic relevance prediction algorithm are compared, and the ontology-based algorithm is selected. It combines Q-learning with a naive Bayes classifier to predict the long-term value of each link and chooses the links to crawl by comparing those values; the Q-learner receives its feedback from the page topic relevance computed by the ontology-based page evaluation algorithm. The selected algorithms are then used to design an ontology-based focused crawler system. A small apple ontology is constructed, and the apple topic serves as a worked example of the crawler's operating flow. Finally, the system is implemented and compared, by harvest rate, with a crawler guided by the breadth-first algorithm and a crawler guided by the Best-First algorithm. The experimental results show that the focused crawler guided by the ontology-based topic relevance algorithms achieves a higher harvest rate and has greater potential for crawling topic-relevant web resources.
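The link-selection idea and the harvest-rate metric mentioned in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: it uses the standard tabular Q-learning update with the page relevance score as the reward, takes harvest rate to be the fraction of crawled pages whose relevance meets a threshold, and omits the naive Bayes classifier that the abstract says is combined with Q-learning. The names `q_update`, `harvest_rate`, and `Frontier`, and the values of `alpha`, `gamma`, and `threshold`, are hypothetical.

```python
import heapq


def q_update(q_old, reward, best_next_q, alpha=0.5, gamma=0.8):
    """One standard Q-learning step: move q_old toward reward + gamma * best_next_q.

    Here `reward` stands in for the relevance of the page a link led to
    (an assumption about how the feedback is used).
    """
    return q_old + alpha * (reward + gamma * best_next_q - q_old)


def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance is at least `threshold`."""
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for score in relevance_scores if score >= threshold)
    return relevant / len(relevance_scores)


class Frontier:
    """Priority queue of links; the highest predicted long-term value pops first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: equal values pop in insertion order

    def push(self, url, value):
        # heapq is a min-heap, so store the negated value.
        heapq.heappush(self._heap, (-value, self._count, url))
        self._count += 1

    def pop(self):
        neg_value, _count, url = heapq.heappop(self._heap)
        return url, -neg_value

    def __len__(self):
        return len(self._heap)
```

For example, `q_update(0.0, 1.0, 0.5)` moves the estimate from 0.0 to about 0.7, and `harvest_rate([0.9, 0.2, 0.6, 0.1])` gives 0.5 because two of the four crawled pages meet the default threshold. A breadth-first crawler corresponds to replacing `Frontier` with a plain FIFO queue, which is the baseline the abstract compares against.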
[Degree-granting institution]: Chinese Academy of Agricultural Sciences
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2508606.html
