Research on Key Technologies of Dataspace Integration and Query
Topics: dataspace + data model; Source: Zhu Guanwen (祝官文), doctoral dissertation, Harbin Engineering University, 2016
[Abstract]: Over the past decade, the Internet, cloud computing, big data, and the mobile Internet have developed rapidly. As a result, data today is huge in volume, diverse in type, dynamically evolving, and only loosely related. Traditional database management technology cannot manage such data, so new data management techniques are needed. Dataspace technology emerged in response and has attracted wide attention from the database community and industry. However, many problems in dataspace integration and querying remain unsolved or only partly solved. Examples include the lack of a data model that represents heterogeneous data and complex semantic relations, the lack of entity partitioning techniques for dynamically evolving environments, the lack of multidimensional indexing techniques for large-scale heterogeneous data with highly skewed distributions, and the lack of expressive approximate query techniques that search heterogeneous data seamlessly. This dissertation studies dataspace integration and querying. It aims to manage structured, semi-structured, and unstructured data in a unified way and to search such heterogeneous data efficiently and seamlessly, thereby providing a basic guarantee for "Pay-as-you-go" data integration and, in turn, a "Best-effort" dataspace query service. The work covers the following four aspects.

First, because heterogeneous data in a dataspace is context dependent and its semantic relations are complex, the dissertation studies the dataspace representation model. A case study is used to analyze the shortcomings of traditional dataspace models (such as the interpreted object model), and a context-aware complex semantic association network model (COSAN) is proposed. Specifically, (1) building on the traditional interpreted object model and taking the context dependence of heterogeneous data into account, a context-aware representation of heterogeneous data is formally defined. It encapsulates context information together with the structured, semi-structured, and unstructured information of a data source into a context-aware interpreted object, and so expresses context-aware heterogeneous information; (2) to overcome the limitation that traditional data models can represent only simple binary semantic relations, the traditional binary semantic relation is extended with a set of constraint components (such as context constraints, order constraints, and aggregation constraints), so that complex semantic relations can be represented formally; (3) extensive experiments on the public DBLP dataset verify the effectiveness and feasibility of the model.

Second, because dataspace entities are information rich, their category information lags behind, and they evolve dynamically, the dissertation studies entity partitioning for dataspaces and proposes an entity partitioning method based on evolutionary K-Means. Specifically, (1) an evolutionary K-Means clustering framework based on the silhouette value and KL divergence is proposed. The framework considers not only the quality of the current clusters (the snapshot cost) but also the temporal smoothness with respect to several typical historical cluster structures (the history cost); (2) a similarity measure for dataspace entities is designed that combines the rich information of the entities themselves with the historical co-occurrence patterns between entities, so that similarity between entities is measured more accurately; (3) following heuristic rules, an evolutionary K-Means clustering algorithm based on similarity density is proposed, which addresses both the choice of initial points and the partitioning of dataspace entities in an evolving environment; (4) the evolutionary K-Means framework is extended to handle cases where the number of clusters changes over time and where entities join or leave snapshots over time; (5) extensive experiments on the public DBLP dataset show that the method outperforms existing methods: it captures the current entity clustering with high quality and also robustly reflects the historical clustering.

Third, because traditional dataspace indexing methods do not suit large-scale data with highly skewed distributions, the dissertation studies multidimensional dataspace indexing from the perspectives of load balancing and partitioning. It proposes a multidimensional indexing method based on load balancing and query logs, which aims to keep the load balanced across index nodes, reduce query communication overhead, and improve dataspace query processing performance. Specifically, (1) in vertical partitioning, tokens that occur frequently in the query log and in entities are grouped together to reduce the cost of aggregating and merging the inverted lists that a query touches. On this basis, using hypergraph theory and the access patterns between user queries and inverted lists, the vertical partitioning problem is reduced to a hypergraph partitioning problem, which keeps the vertical partitions load balanced; (2) in horizontal partitioning, using hypergraph theory and the access patterns between user queries and entities, the horizontal partitioning problem is likewise reduced to a hypergraph partitioning problem, which keeps the horizontal partitions load balanced; (3) the vertical and horizontal partitioning strategies are combined to build a two-dimensional hybrid index. With query throughput and fault tolerance in mind, an index replication strategy then extends it to a three-dimensional index; (4) extensive experiments on the public DBLP dataset show that the method outperforms existing methods in throughput, query response time, and scalability.

Finally, because traditional dataspace queries have relatively simple semantics and structure, the dissertation studies top-k approximate subgraph querying for dataspaces and proposes a top-k approximate subgraph query method based on neighborhood structure. Specifically, (1) the top-k approximate subgraph query problem in a dataspace is formally defined and, building on graph management theory, a new dataspace query language, GQL, is proposed; (2) a graph similarity function based on neighborhood structure is designed, which combines vertex distance proximity with the distribution of edge labels; (3) based on indexing techniques and neighborhood structure features, a neighborhood-based pruning algorithm for matching vertices is proposed, which prunes a large number of unpromising candidate vertices; (4) taking the vertex pruning strategy and the vertex matching order into account, a top-k approximate subgraph search algorithm for dataspaces is proposed; (5) extensive experiments on the real-world DBLP dataset show that the method clearly outperforms existing methods in query effectiveness, query efficiency, and scalability.
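The trade-off between snapshot cost and history cost in the evolutionary K-Means framework can be illustrated with a minimal sketch. It is not the dissertation's formulation, which the abstract does not give. The sketch assumes that the snapshot cost is one minus the mean silhouette value, that the history cost is the mean KL divergence between the current and historical cluster-size distributions (with cluster labels aligned across snapshots), and that the two are combined by a hypothetical weight `alpha`. All function names are illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def silhouette(points, labels):
    """Mean silhouette value with Euclidean distance (needs >= 2 clusters)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: silhouette is taken as 0
            scores.append(0.0)
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def cluster_distribution(labels, k):
    """Fraction of entities assigned to each of the k clusters."""
    counts = [0] * k
    for lab in labels:
        counts[lab] += 1
    return [c / len(labels) for c in counts]


def evolutionary_cost(points, labels, history, k, alpha=0.7):
    """Weighted sum of snapshot cost and history cost (lower is better).

    Assumption: `history` is a list of label lists from earlier snapshots,
    and `alpha` is an illustrative weight, not a value from the source.
    """
    snapshot_cost = 1.0 - silhouette(points, labels)
    current = cluster_distribution(labels, k)
    if history:
        history_cost = sum(kl_divergence(current, cluster_distribution(h, k))
                           for h in history) / len(history)
    else:
        history_cost = 0.0
    return alpha * snapshot_cost + (1 - alpha) * history_cost


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = evolutionary_cost(points, [0, 0, 1, 1], [[0, 0, 1, 1]], k=2)
bad = evolutionary_cost(points, [0, 1, 0, 1], [[0, 0, 1, 1]], k=2)
drift = evolutionary_cost(points, [0, 0, 1, 1], [[0, 0, 0, 1]], k=2)
```

Under these assumptions, a clustering that separates the two groups of points scores a lower cost than one that mixes them, and the same clustering costs more when it deviates from the historical cluster structure.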
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP311.13
Article ID: 1752371
Article link: http://sikaile.net/shoufeilunwen/xxkjbs/1752371.html