Research on Privacy Protection Techniques for Crowdsourcing Databases
Published: 2018-07-20 17:59
[Abstract]: A crowdsourcing database is a new type of database that uses a crowdsourcing platform to combine human intelligence with machine computation, in order to handle query tasks that traditional relational databases struggle with. Its core idea is to publish queries and their corresponding data sets to the Internet as crowdsourcing tasks and ultimately hand them to the general online public, who solve them using human intelligence. However, if a data set containing private information is sent to the public without any processing, that information may be leaked. Privacy has been studied for many years in the traditional database field, and data anonymization techniques have proven effective in practical applications such as data publishing. Existing anonymization techniques, however, cannot simply be applied to crowdsourcing databases. First, crowdsourcing databases are usually large and stored in a distributed fashion across different nodes, and existing algorithms cannot process such large-scale, distributed data efficiently. Second, existing algorithms lose too much task-related information, which lowers the quality of task completion. To improve the completion quality of crowdsourcing tasks, the Two-Phase Partition anonymization algorithm, based on space partitioning, uses sampling to retain more task-related information and thereby improve the utility of the anonymized data. The first phase, Pre-Partition, takes the sample coordinates as candidate split points, performs a global partition of the space, designs an evaluation function based on the true values, and selects the optimal set of split points. The second phase, Further-Partition, takes the output of the first phase as candidate split points, performs a kd-tree-based local partition of the space, and then replaces the data according to the boundaries of the resulting subspaces, completing the anonymization.
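The second phase can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives neither the evaluation function nor the stopping rule, so the sketch assumes a k-anonymity-style minimum group size `k`, a hypothetical `candidates` mapping that stands in for the split points produced by Pre-Partition, and the tight bounding box of each group as a stand-in for the subspace boundary.

```python
from statistics import median


def kd_partition(points, candidates, k, depth=0):
    """Recursively split `points` (tuples of numbers) into groups of size >= k.

    `candidates[d]` lists the allowed split values on dimension d; it is a
    hypothetical stand-in for the split points chosen by a first phase.
    """
    dims = len(points[0])
    # Try each dimension, starting with the one a kd-tree would cycle to.
    for offset in range(dims):
        d = (depth + offset) % dims
        mid = median(p[d] for p in points)
        # Prefer the candidate closest to the median to keep the split balanced.
        for split in sorted(candidates[d], key=lambda c: abs(c - mid)):
            left = [p for p in points if p[d] < split]
            right = [p for p in points if p[d] >= split]
            if len(left) >= k and len(right) >= k:
                return (kd_partition(left, candidates, k, depth + 1)
                        + kd_partition(right, candidates, k, depth + 1))
    return [points]  # no allowed split keeps both sides at size >= k


def generalize(group):
    """Replace every point in a group with the group's bounding box."""
    dims = len(group[0])
    box = tuple((min(p[d] for p in group), max(p[d] for p in group))
                for d in range(dims))
    return [box for _ in group]


points = [(1, 1), (2, 5), (3, 2), (4, 6), (6, 1), (7, 5), (8, 2), (9, 6)]
candidates = {0: [5], 1: [4]}  # assumed output of a first phase
groups = kd_partition(points, candidates, k=2)
anonymized = [row for g in groups for row in generalize(g)]
```

With these inputs the eight points fall into four groups of two, and each released row is a pair of per-dimension ranges shared by at least `k` records.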
To process large-scale, distributed crowdsourcing databases efficiently, a MapReduce-based parallel anonymization framework is used to parallelize the Two-Phase Partition algorithm. The framework uses hashing to repartition the original data set into multiple sub-data sets, anonymizes each one separately, and then merges the results into a complete anonymized data set. Experiments show that, compared with existing algorithms, the single-machine Two-Phase Partition algorithm improves query accuracy by more than 20%, and that query accuracy rises as the sample ratio increases. After the algorithm is parallelized with the parallel anonymization framework, query accuracy is slightly lower than that of the single-machine version, but the decrease is within 5%, and execution efficiency scales linearly with the size of the data set. The parallel anonymization scheme is therefore well suited to addressing privacy in large-scale, distributed crowdsourcing databases.
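The hash, anonymize, and merge flow described above can be sketched in plain Python, with ordinary functions simulating the map, shuffle, and reduce steps. This is an illustration under stated assumptions rather than the thesis's framework: the function names are hypothetical, the hash is CRC32 of a record ID, and the per-partition anonymizer is a placeholder that releases the bucket's per-dimension range instead of running Two-Phase Partition.

```python
import zlib
from collections import defaultdict


def map_phase(records, num_partitions):
    """Map: emit (partition_id, record), keyed by a hash of the record ID."""
    for rec_id, values in records:
        key = zlib.crc32(str(rec_id).encode()) % num_partitions
        yield key, (rec_id, values)


def shuffle(pairs):
    """Group the mapped pairs by partition ID, as a framework's shuffle would."""
    buckets = defaultdict(list)
    for key, rec in pairs:
        buckets[key].append(rec)
    return buckets


def reduce_phase(bucket):
    """Reduce: anonymize one sub-data set independently of the others.

    Placeholder anonymizer (an assumption): every record is replaced by the
    per-dimension (min, max) range of its bucket.
    """
    dims = len(bucket[0][1])
    box = tuple((min(v[d] for _, v in bucket), max(v[d] for _, v in bucket))
                for d in range(dims))
    return [(rec_id, box) for rec_id, _ in bucket]


def parallel_anonymize(records, num_partitions):
    """Repartition by hash, anonymize each bucket, then merge the results."""
    buckets = shuffle(map_phase(records, num_partitions))
    merged = []
    for key in sorted(buckets):
        merged.extend(reduce_phase(buckets[key]))
    return merged


records = [(i, (i % 7, (3 * i) % 11)) for i in range(20)]
result = parallel_anonymize(records, num_partitions=4)
```

Because each bucket is anonymized without seeing the others, the merged output can be coarser than a single-machine run over the whole data set, which is consistent with the small accuracy loss the abstract reports for the parallel version.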
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP309
Article ID: 2134311
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2134311.html