
Research on a Topic Crawling Strategy Based on Domain Ontology and the Similar Concept Background Graph

Published: 2018-05-24 13:12

  Topics: topic crawler + formal concept analysis; Source: Xihua University, master's thesis, 2012


[Abstract]: In recent years, information on the Internet has grown exponentially, which makes it increasingly difficult for people to find what is useful to them. An efficient and accurate search engine for organizing and retrieving useful information has therefore become more and more necessary. The crawler is an important component of a search engine and is mainly used to collect documents from the Web. Because the crawlers used by general-purpose search engines consume large amounts of disk space and network bandwidth, and the accuracy of their search results is relatively low, topic-specific search engines, which are intelligent, personalized, domain-oriented, and specialized, have quickly become a research hotspot in both academia and industry.
A topic crawler aims to collect web pages relevant to a predefined topic rather than traversing the entire Web. It relies on the observation that a page relevant to a topic tends to link to other pages on the same topic. A main problem a topic crawler must solve is how to assign an appropriate priority score to each unvisited URL during the crawl so as to maintain a relatively high harvest rate. To address this problem, this thesis proposes a topic crawling strategy based on domain ontology and formal concept analysis. The strategy first builds a core similarity graph using WordNet and concept relatedness, then builds a similar concept background graph using concept lattice knowledge, and finally combines link analysis with the relevance between each URL's anchor text and the topic to compute priority scores for the URLs to be crawled, which determine the order in which they are visited.
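The idea of visiting unvisited URLs in order of their priority score can be illustrated with a best-first frontier. The following is a minimal sketch, not the thesis's implementation: the class name `Frontier` and its methods are hypothetical, and the scores are assumed to come from some external relevance estimate.

```python
import heapq
from itertools import count


class Frontier:
    """Priority queue of unvisited URLs; the highest score is popped first."""

    def __init__(self):
        self._heap = []        # entries are (-score, tie_breaker, url)
        self._best = {}        # url -> best score seen so far
        self._visited = set()
        self._tie = count()    # preserves insertion order for equal scores

    def push(self, url, score):
        # Ignore visited URLs and pushes that do not improve the known score.
        if url in self._visited or self._best.get(url, float("-inf")) >= score:
            return
        self._best[url] = score
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        # Skip stale heap entries left behind by later, better pushes.
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            if url not in self._visited and self._best.get(url) == -neg_score:
                self._visited.add(url)
                return url, -neg_score
        return None  # frontier is empty
```

A crawl loop would pop the best URL, fetch the page, score its outgoing links, and push them back. Re-pushing a URL with a higher score raises its priority, and its older, lower-scored entry is discarded lazily when it reaches the top of the heap.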
The main research contents of the thesis are as follows:
1. A method for measuring semantic relatedness is proposed. Semantic relatedness is a concept used to measure the semantic relevance between documents or words; it reflects the degree of association between two objects. Drawing on the rich semantics contained in the WordNet domain ontology and on several existing measures of semantic relatedness, the thesis derives the measure it uses.
2. A method for constructing the similar concept background graph is proposed. The base pages that represent the crawl topic, and the current pages that the base pages link to, are collected and analyzed to obtain a base concept lattice, a current concept lattice, and a set of feature words that describe the crawl topic. The feature word set is first expanded with synonyms from the WordNet lexicon to produce an extended feature word set. The semantic relatedness measure is then used to build the core similarity graph. Finally, the algorithm proposed in the thesis uses the core similarity graph, the base concept lattice, and the current concept lattice to build the similar concept background graph.
3. A strategy for predicting URL priority scores based on semantic link analysis and the similar concept background graph is proposed. Anchor text is usually a brief summary of a page's topic written from the perspective of whoever links to it, so it reflects the page's topic particularly well. The thesis proposes a method for computing the relevance between anchor text and the topic and, combined with the similar concept background graph built above, a method for computing URL priority scores; topic crawling is then guided by the magnitude of these scores.
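The abstract does not give the actual scoring formulas, so the following is only a rough sketch of how anchor-text relevance might feed a URL priority score. Everything in it is an assumption made for illustration: the toy relatedness table stands in for a WordNet-based measure, the averaging rule in `anchor_relevance` is one simple choice among many, and the weight `alpha` is not a value from the thesis.

```python
# Hypothetical stand-in for a WordNet-based relatedness measure:
# symmetric scores in [0, 1] for a few word pairs; unknown pairs score 0.
TOY_RELATEDNESS = {
    frozenset(["crawler", "spider"]): 0.9,
    frozenset(["crawler", "search"]): 0.6,
    frozenset(["ontology", "concept"]): 0.7,
}


def relatedness(a, b):
    """Look up the relatedness of two words; identical words score 1."""
    if a == b:
        return 1.0
    return TOY_RELATEDNESS.get(frozenset([a, b]), 0.0)


def anchor_relevance(anchor_text, topic_words):
    """Mean, over anchor words, of the best relatedness to any topic word."""
    words = anchor_text.lower().split()
    if not words or not topic_words:
        return 0.0
    best = [max(relatedness(w, t) for t in topic_words) for w in words]
    return sum(best) / len(best)


def url_priority(anchor_text, parent_relevance, topic_words, alpha=0.6):
    """Blend anchor-text relevance with the relevance of the linking page.

    alpha is an assumed weight chosen for illustration only.
    """
    anchor = anchor_relevance(anchor_text, topic_words)
    return alpha * anchor + (1 - alpha) * parent_relevance
```

In this sketch a link whose anchor text is close to the topic's feature words gets a high score even when the linking page is only moderately relevant, which mirrors the abstract's point that anchor text is a strong indicator of the target page's topic.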
Finally, the thesis uses three metrics, recall, recall-precision, and F-Measure, to compare the proposed topic crawling strategy with a breadth-first crawling strategy, a background-graph-based strategy, a relevant-background-graph-based strategy, and a concept-background-graph-based strategy. The experiments show that, under the same conditions, the proposed strategy has certain advantages, which also supports the effectiveness and feasibility of the method.
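The abstract names the evaluation metrics but not their exact definitions. As a point of reference, the standard set-based definitions of precision, recall, and F-Measure can be sketched as follows; the function name and the `beta` parameter are illustrative, and the thesis may define its measures differently.

```python
def precision_recall_f(crawled, relevant, beta=1.0):
    """Return (precision, recall, F-measure) for a set of crawled pages.

    crawled:  pages the crawler actually downloaded
    relevant: pages judged relevant to the topic
    beta:     weight of recall relative to precision (1.0 gives F1)
    """
    crawled, relevant = set(crawled), set(relevant)
    hits = len(crawled & relevant)
    precision = hits / len(crawled) if crawled else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

For example, if a crawler downloads four pages of which two are among three relevant pages, precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.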
[Degree-granting institution]: Xihua University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3




Article ID: 1929182


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1929182.html

