Research on Query Planning Strategies for the Deep Web

Published: 2018-01-21 08:18

Keywords: Web database; query capability; executable query plan. Source: Harbin Engineering University, master's thesis, 2012. Document type: degree thesis

[Abstract]: Online data sources (also called Web databases) are increasingly common. They hide their data behind query forms and together make up the so-called deep Web. In the surface Web, HTML pages are static and the data is stored in the documents themselves; in the deep Web, the data is stored in back-end databases, and a dynamic HTML page is generated only after a user submits a query through a form. According to statistics from BrightPlanet, the deep Web holds about 500 times as much information as the surface Web and is still growing rapidly every year, so research on the deep Web is both necessary and significant. Web databases are large-scale, autonomous, heterogeneous, and dynamic, and different data sources have different, limited query capabilities. These properties make query processing in deep Web data integration more challenging than query processing in a traditional distributed environment.
To address the autonomy and heterogeneity of data sources, this thesis proposes a method for describing data sources. To measure the size of the attribute vocabulary in each domain, a survey was carried out: using search engines (for example, Google and Bing) and Web directories (for example, invisibleweb.com), 200 data sources were collected across four domains (movies, book sales, car sales, and music), 50 per domain. The survey shows that as the number of data sources grows, their combined attribute vocabulary converges to a relatively small range. Motivated by this result, an inverted index is built for each attribute term.
The thesis also proposes a modular approach for generating executable query plans for a target query. Five modules work together on this task: query expansion, preprocessing, query rewriting, relevant data source discovery, and plan generation. The thesis further designs an algorithm that efficiently generates logical plans from the inverted index, and an algorithm that finds an executable order for a logical plan.
Because data sources have access restrictions, sources that do not appear in the logical plan may supply useful binding attributes and so help produce an executable query plan. The thesis identifies the cases in which such off-query accesses are unnecessary, where the sources in the logical plan alone suffice to generate an executable query plan. It also identifies the cases in which they are necessary, and for those it proposes an algorithm that finds the data sources relevant to the logical plan. Finally, experiments show that the proposed algorithms have good efficiency, accuracy, and scalability.
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP311.13
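
The abstract names two mechanisms without showing them: an inverted index from attribute terms to data sources, and an ordering step that makes a logical plan executable under access restrictions. The following is a minimal sketch of both ideas, not the thesis's actual algorithms. The source names, attribute names, the `required` field, and the greedy ordering strategy are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical source descriptions (not from the thesis): each source
# exposes some attributes and requires some of them to be bound, i.e.
# filled in on its query form, before it can be called.
SOURCES = {
    "movie_by_title": {"attrs": {"title", "director", "year"}, "required": {"title"}},
    "films_by_director": {"attrs": {"director", "title"}, "required": {"director"}},
    "reviews": {"attrs": {"title", "rating"}, "required": {"title"}},
}


def build_inverted_index(sources):
    """Map each attribute term to the set of sources that expose it."""
    index = defaultdict(set)
    for name, desc in sources.items():
        for attr in desc["attrs"]:
            index[attr].add(name)
    return index


def candidate_sources(index, query_attrs):
    """Return every source that exposes at least one query attribute."""
    found = set()
    for attr in query_attrs:
        found |= index.get(attr, set())
    return found


def executable_order(sources, plan, bound):
    """Greedily order the sources in `plan` so that each source's required
    attributes are already bound when it is called.

    `bound` holds the attributes fixed by the user's query. Calling a
    source binds all attributes it exposes. Returns None if no order works.
    """
    bound = set(bound)
    remaining = sorted(plan)  # sorted only to make the result deterministic
    order = []
    while remaining:
        ready = next(
            (s for s in remaining if sources[s]["required"] <= bound), None
        )
        if ready is None:
            return None  # stuck: some required attribute can never be bound
        order.append(ready)
        remaining.remove(ready)
        bound |= sources[ready]["attrs"]
    return order


index = build_inverted_index(SOURCES)
print(candidate_sources(index, {"rating"}))
print(executable_order(SOURCES, {"reviews"}, {"director"}))
print(executable_order(SOURCES, {"reviews", "films_by_director"}, {"director"}))
```

In this toy setup, a plan containing only `reviews` cannot run when the user supplies just a director, because `reviews` needs a bound `title`. Adding `films_by_director`, a source outside the original plan, supplies that binding. This is the kind of situation the abstract calls an off-query access. Because the set of bound attributes only grows, the greedy loop finds an executable order whenever one exists under these assumptions.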

