Research on Web-Based Spatial Data Crawling and Measurement
Topics: spatially sensitive crawler + spatial data crawling; Source: Wuhan University, 2013 doctoral dissertation
[Abstract]: The rapid development of Web technology has given people a wealth of information, but it has also brought a great deal of information redundancy. How to quickly locate what a user needs is one of the common problems in Web retrieval today. This is especially true in the spatial information field: spatial data involves both geometric and attribute information, and because of this distinctive nature it can only be represented on the Web through two separate channels, textual description and geometric graphics. At present, spatial information retrieval focuses mainly on matching textual descriptions, and comparatively little research addresses the retrieval of spatial geometric information.
Building on an analysis of the problems in spatial information retrieval in the current network environment, this thesis reviews the main research fields involved in solving spatial information retrieval and the progress made in them in China and abroad. Starting from Web crawling, it discusses the main characteristics and classification system of spatial information in a networked environment, examines methods for parsing and recognizing different types of spatial data, and sets out basic methods for measuring data confidence for different data types and their corresponding pages. It also extends the spatial data classification system and proposes the idea of a classification label system for crawled spatial data, on which the storage, management, and later application of the spatial data are based. Finally, an example model is used to verify certain stages of the spatial data crawling process, with a corresponding quality evaluation and analysis.
For different spatial data types, the thesis examines in depth the basic principles and methods of crawling data with a spatially sensitive crawler. First, it introduces the concept of the spatially sensitive crawler, describes its workflow and its similarities to and differences from traditional crawlers, and presents a similarity measure between spatial query terms and the spatial information in spatially sensitive pages and Web links. Second, it focuses on discovery mechanisms for different types of spatial data, namely spatial data services, raster data, vector data, and other data. For each type it discusses how the data appears in Web pages and the basic parsing process, and gives the necessary explanation of the main algorithms and models involved.
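As a rough illustration of the two ideas in this paragraph, the sketch below scores a page or link text against spatial query terms and sorts candidate links into service, raster, vector, or other. It is not the thesis's method: the abstract does not give the actual similarity measure or discovery rules, so cosine similarity over term counts, the extension lists, and the names `cosine_similarity` and `classify_link` are all assumptions made for this example.

```python
import math
import posixpath
import re
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Hypothetical rule sets; the thesis's real discovery rules are not in the abstract.
RASTER_EXTS = {".tif", ".tiff", ".img", ".jp2"}
VECTOR_EXTS = {".shp", ".kml", ".gml", ".geojson"}
OGC_SERVICES = {"WMS", "WFS", "WCS"}


def cosine_similarity(text, query_terms):
    """Cosine similarity between a text's term counts and the query terms.

    Tokenization is a plain \\w+ split, which is an assumption; Chinese text
    would need a word segmenter first.
    """
    tokens = Counter(re.findall(r"\w+", text.lower()))
    query = Counter(term.lower() for term in query_terms)
    dot = sum(tokens[term] * count for term, count in query.items())
    norm = math.sqrt(sum(v * v for v in tokens.values())) * math.sqrt(
        sum(v * v for v in query.values())
    )
    return dot / norm if norm else 0.0


def classify_link(url):
    """Guess the kind of spatial resource a link points to (assumed rules)."""
    parsed = urlparse(url)
    # OGC services are usually requested with a SERVICE=... query parameter.
    params = {k.lower(): v[0].upper() for k, v in parse_qs(parsed.query).items()}
    if params.get("service") in OGC_SERVICES:
        return "service"
    ext = posixpath.splitext(parsed.path.lower())[1]
    if ext in RASTER_EXTS:
        return "raster"
    if ext in VECTOR_EXTS:
        return "vector"
    return "other"
```

A crawler built along these lines might follow only links whose anchor text scores above some threshold, then dispatch each link to a type-specific parser based on the `classify_link` result.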
The thesis proposes a confidence measure for Web spatial data. Because Web spatial data lacks descriptive information, its quality is hard to measure accurately, which makes later retrieval and application relatively difficult. Drawing on some basic methods from spatial data quality, and considering both the textual description of the data and the information in the data itself, the thesis proposes a method for qualitatively measuring vector and raster data. It then analyzes and compares the confidence of different spatial data types; where several resources link to the same spatially sensitive page, the larger confidence is taken as the best match for the whole page.
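The page-level selection step described at the end of this paragraph can be sketched as follows. This is only a minimal illustration under assumed inputs: the tuple layout `(page_url, resource_url, confidence)`, the function name `best_resource_per_page`, and the sample scores are hypothetical, and how each confidence value is computed is left out because the abstract does not specify it.

```python
def best_resource_per_page(resources):
    """Keep, for each page, the linked resource with the largest confidence.

    `resources` is an iterable of (page_url, resource_url, confidence) tuples;
    this layout is an assumption made for the example.
    """
    best = {}
    for page, resource, confidence in resources:
        # Replace the stored entry only when a strictly larger score appears.
        if page not in best or confidence > best[page][1]:
            best[page] = (resource, confidence)
    return best


# Made-up scores for two pages and three linked resources.
sample = [
    ("http://example.com/a", "http://example.com/a/roads.shp", 0.6),
    ("http://example.com/a", "http://example.com/a/dem.tif", 0.8),
    ("http://example.com/b", "http://example.com/b/rivers.kml", 0.4),
]
page_scores = best_resource_per_page(sample)
```

With the sample above, page `a` is represented by `dem.tif` and page `b` by `rivers.kml`.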
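Combining the metadata model with current spatial data classification systems, the thesis proposes the idea of classification labels for Web spatial data. Because spatial data on the Web differs in scale of representation, extent, features, and so on, it is hard to partition with a traditional classification system, and a new way of recording its descriptive information is needed. With the help of the metadata model and application-related classification systems, a classification label system model is proposed. On this basis, the thesis briefly describes the storage and management of the data after it is acquired from the Web, and its later retrieval and application.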
Using an example model, the thesis verifies the whole spatially sensitive crawler, from page filtering through information extraction to basic quality evaluation. It analyzes and summarizes the inconsistencies between the relevant theory and practice, which shows the complexity of crawling spatial data from the Web and lays a theoretical and practical foundation for follow-up research.
Finally, the thesis summarizes each stage of the overall spatial information crawling process and proposes several directions for further research.
[Degree-granting institution]: Wuhan University
[Degree level]: Doctoral
[Year degree awarded]: 2013
[Classification number]: P208
Article ID: 1813791
Article link: http://sikaile.net/kejilunwen/dizhicehuilunwen/1813791.html