Research and Design of a MapReduce-Based kNN-join Algorithm
Published: 2018-06-06 10:22
Topic: MapReduce + kNN join operation; Source: Heilongjiang University, master's thesis, 2016
[Abstract]: The continued growth of the Internet industry has brought a massive volume of data, and extracting valuable knowledge from that data has become a focus of attention. Among data mining algorithms, kNN can be used to classify data. As kNN came into wide use, the kNN-join algorithm was proposed, and it is now widely applied across the stages of data mining, including data preprocessing and the mining stage itself. As data volumes keep growing and expectations for efficiency rise, traditional methods can no longer meet the demand, which has led to kNN-join operations based on MapReduce. This thesis studies each stage of the MapReduce-based kNN-join operation. First, the data is preprocessed, and the existing data partitioning algorithm is improved so that the data is partitioned evenly. Second, to reduce overhead during the join, a seed set is found for each data partition so that the k nearest neighbors of every element in the partition lie within a single set. Finally, to balance resource utilization against algorithm accuracy, the data partitions are divided into groups. Experiments on a combination of real and synthetic data verify the effectiveness of the algorithm, and the results show that the proposed algorithm outperforms existing algorithms.
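To make the operation concrete: a kNN join of two point sets R and S returns, for every point in R, its k nearest points in S. The sketch below simulates a generic two-round, block-based map/reduce scheme in a single process. It is an illustrative assumption, not the partitioning, seed-set, or grouping method of the thesis; the function name `knn_join`, the parameter `n_blocks`, and the round-robin block assignment are all hypothetical.

```python
import heapq
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def knn_join(R, S, k, n_blocks=2):
    """Return {r_index: [(distance, s_index), ...]} holding the k nearest
    points of S for every point of R (hypothetical helper for illustration)."""
    # Map: assign points to blocks round-robin and emit one bucket per
    # (R block, S block) pair, so every R point meets every S point once.
    buckets = defaultdict(lambda: ([], []))
    for i, r in enumerate(R):
        for j in range(n_blocks):
            buckets[(i % n_blocks, j)][0].append((i, r))
    for i, s in enumerate(S):
        for j in range(n_blocks):
            buckets[(j, i % n_blocks)][1].append((i, s))

    # Reduce, round 1: local k nearest neighbours inside each bucket.
    partial = defaultdict(list)
    for r_block, s_block in buckets.values():
        for ri, r in r_block:
            local = heapq.nsmallest(k, ((dist(r, s), si) for si, s in s_block))
            partial[ri].extend(local)

    # Reduce, round 2: merge the partial lists and keep the global top k.
    return {ri: heapq.nsmallest(k, cands) for ri, cands in partial.items()}


if __name__ == "__main__":
    R = [(0.0, 0.0), (10.0, 10.0)]
    S = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (20.0, 20.0)]
    for ri, neighbours in sorted(knn_join(R, S, k=2).items()):
        print(ri, neighbours)
```

In this naive scheme every R block is paired with every S block, so each point is replicated `n_blocks` times. That replication is the kind of join overhead the abstract says the thesis aims to reduce by confining each partition's candidate neighbors to a single set.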
[Degree-granting institution]: Heilongjiang University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13
Article ID: 1986197
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1986197.html