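Design and Implementation of Massive Data Storage and Query Based on a Knowledge Base and Cloud Platform
Published: 2018-07-03 12:51
Topic: RDF + storage. Source: Beijing University of Posts and Telecommunications, 2017 master's thesis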
[Abstract]: With the rapid development of the Internet, data volumes are growing quickly, and much of this data is heterogeneous, originating from different sources. The successful application of knowledge graphs in information search has stimulated research on the fusion, storage, and querying of heterogeneous data. An ontology labels resources on the Internet with unique identifiers and allows both intrinsic attributes and inter-resource relationships to be attached to them, which makes it flexible and extensible. With the rise of the Semantic Web and after decades of development, ontologies have been widely used to represent heterogeneous data and are recognized as an effective solution. In recent years, much research in computer science has addressed ontology-based data management and applications. Traditional storage methods keep information of different categories in separate tables, which leads to narrow search results that fail to meet user needs. As network scale and the volume of multi-source data grow, traditional database storage schemes and single-machine environments struggle to support the storage and querying of massive data. Consequently, more and more cloud-platform and distributed-system solutions are being applied to data storage and querying. Although research based on distributed systems is not yet mature, it is meaningful and promising. Based on the cloud platform Hadoop and the non-relational database HBase, this thesis studies the fusion, storage, and querying of massive heterogeneous data. The main work is as follows:
1. First, as the foundation for the subsequent distributed storage and querying, multi-source heterogeneous data is fused. Ontology construction and fusion are parallelized with the MapReduce framework. In the construction phase, the data from each source is built into a single-category ontology. In the fusion phase, data from different sources is merged to produce an ontology that is rich in categories and semantics.
2. With the explosive growth of data, the bottlenecks of traditional storage methods in import performance and in single-machine storage hardware requirements have become increasingly prominent. Drawing on recent distributed RDF storage schemes, and weighing both storage space and subsequent query response speed, an HBase-based storage model is designed.
3. On top of the HBase storage model, query strategies are designed for triple pattern queries, basic graph pattern queries, and keyword queries. Triple pattern queries underlie basic graph pattern queries, and their response speed depends on two factors: the database table design and the indexing performance of the database itself. In addition, by analyzing the structural regularities of complex basic graph pattern queries, an optimization method based on join operations is proposed. Keyword queries matter because they make the query engine easier to use; the keyword search method proposed here builds on the results for basic graph pattern queries and achieves good performance.
Experiments on the LUBM dataset verify the effectiveness and efficiency of the strategies.
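The abstract does not give the actual HBase table layout or the join optimization, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes three permuted index tables (SPO, POS, OSP) whose row keys are kept sorted, which is emulated here with in-memory sorted lists instead of HBase. A triple pattern is answered by a prefix scan on the index whose leading key components are bound, and a basic graph pattern is answered by joining triple patterns on shared variables, evaluating the patterns with the most constants first. The class and function names, the key separator, and the sample data are all hypothetical.

```python
from bisect import bisect_left

SEP = "\x00"  # assumed key separator; terms are assumed not to contain it


class TripleStore:
    """In-memory stand-in for three sorted index tables (assumed layout)."""

    # Each index stores the triple components in a different order.
    ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}

    def __init__(self, triples):
        unique = set(triples)
        self.tables = {
            name: sorted("".join(t[i] + SEP for i in order) for t in unique)
            for name, order in self.ORDERS.items()
        }

    def match(self, s=None, p=None, o=None):
        """Return triples matching a pattern; None acts as a wildcard."""
        pattern = (s, p, o)
        # Choose the index with the longest run of bound leading components.
        best_name, best_len = "spo", -1
        for name, order in self.ORDERS.items():
            n = 0
            for i in order:
                if pattern[i] is None:
                    break
                n += 1
            if n > best_len:
                best_name, best_len = name, n
        order = self.ORDERS[best_name]
        prefix = "".join(pattern[i] + SEP for i in order[:best_len])
        table = self.tables[best_name]
        results = []
        # Prefix scan over the sorted keys, similar in spirit to a row-key scan.
        for key in table[bisect_left(table, prefix):]:
            if not key.startswith(prefix):
                break
            parts = key.split(SEP)[:3]
            triple = [None, None, None]
            for pos, i in enumerate(order):
                triple[i] = parts[pos]
            results.append(tuple(triple))
        return results


def is_var(term):
    return term.startswith("?")


def evaluate_bgp(store, patterns):
    """Join triple patterns on shared variables (simple heuristic ordering)."""
    # Patterns with fewer variables are evaluated first to shrink intermediates.
    ordered = sorted(patterns, key=lambda tp: sum(is_var(t) for t in tp))
    solutions = [{}]
    for tp in ordered:
        next_solutions = []
        for binding in solutions:
            # Substitute variables that are already bound; others stay wildcards.
            bound = [binding.get(t) if is_var(t) else t for t in tp]
            for triple in store.match(*bound):
                new = dict(binding)
                ok = True
                for term, value in zip(tp, triple):
                    if is_var(term) and new.setdefault(term, value) != value:
                        ok = False
                        break
                if ok:
                    next_solutions.append(new)
        solutions = next_solutions
    return solutions


# Hypothetical sample data, loosely styled after university-domain triples.
store = TripleStore([
    ("student1", "rdf:type", "ub:GraduateStudent"),
    ("student1", "ub:advisor", "prof1"),
    ("student2", "rdf:type", "ub:GraduateStudent"),
    ("student2", "ub:advisor", "prof2"),
    ("prof1", "ub:worksFor", "dept0"),
    ("prof2", "ub:worksFor", "dept1"),
])
query = [
    ("?x", "rdf:type", "ub:GraduateStudent"),
    ("?x", "ub:advisor", "?y"),
    ("?y", "ub:worksFor", "dept0"),
]
print(evaluate_bgp(store, query))
```

With these three orderings, every combination of bound subject, predicate, and object is covered by some key prefix, so no residual filtering is needed after the scan. The price is that each triple is stored three times, which is the kind of trade-off between storage space and query response speed that the abstract mentions.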
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13; TP393.09
Article No.: 2093709
Link to this article: http://sikaile.net/guanlilunwen/ydhl/2093709.html