
TextGen: A Realistic Text Dataset Generation Method for Benchmarking Modern Storage Systems (in English)

Published: 2019-04-15 19:11
[Abstract]: Modern storage systems use built-in data compression to improve performance and save storage space, so the content of the data significantly affects storage benchmark results. Real datasets are too large to copy to the target test system, and most cannot be shared for privacy reasons, so benchmark programs must generate test datasets synthetically. For the results to be accurate, the data must be generated according to the characteristics of real datasets that affect storage system performance. The existing method, SDGen, analyzes the content distribution of real datasets at the byte level and generates datasets accordingly, which keeps results accurate for storage systems with built-in byte-level compression algorithms. However, SDGen does not analyze the word-level content distribution of real datasets, so it cannot guarantee accurate results for storage systems with built-in word-level compression algorithms. This paper proposes TextGen, a text dataset generation method based on the lognormal probability distribution model. The method builds a corpus from the word segmentation of a real dataset, analyzes the distribution of words in the corpus, estimates the parameters of the lognormal model of the word distribution by maximum likelihood estimation, and generates data content from the model with the Monte Carlo method. The time needed to generate a dataset depends only on the size of the generated dataset, giving linear time complexity O(n). Four datasets were collected to validate the method, and tests were run with a typical word-level compression algorithm, ETDC (End-Tagged Dense Code). The experimental results show that, compared with SDGen, TextGen generates text datasets with higher performance, and that under compression testing the generated datasets are closer to the real datasets in both compression speed and compression ratio.
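The pipeline described in the abstract (segment words, fit a lognormal model by maximum likelihood, generate by Monte Carlo sampling) can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the abstract does not say which quantity the lognormal model is fitted to, so the sketch assumes it is the per-word occurrence count. Whitespace splitting stands in for real word segmentation, and the function names `fit_lognormal`, `build_model`, and `generate` are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import accumulate


def fit_lognormal(counts):
    """Closed-form MLE of (mu, sigma) for positive samples assumed lognormal."""
    logs = [math.log(c) for c in counts]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def build_model(text):
    """Build the vocabulary and fit the model to word occurrence counts.

    Assumption: whitespace splitting replaces real word segmentation.
    """
    counts = Counter(text.split())
    vocab = sorted(counts)
    mu, sigma = fit_lognormal(counts.values())
    return vocab, mu, sigma


def generate(vocab, mu, sigma, n_words, seed=0):
    """Monte Carlo generation from the fitted model.

    Each vocabulary word gets a weight drawn from the lognormal model,
    then n_words words are sampled in proportion to those weights.
    """
    rng = random.Random(seed)
    weights = [rng.lognormvariate(mu, sigma) for _ in vocab]
    cum_weights = list(accumulate(weights))
    return rng.choices(vocab, cum_weights=cum_weights, k=n_words)


if __name__ == "__main__":
    sample = "a a a a b b c d d d d d d d d"
    vocab, mu, sigma = build_model(sample)
    print(" ".join(generate(vocab, mu, sigma, n_words=20)))
```

For a fixed vocabulary, the number of sampling steps grows in proportion to the number of words requested, which matches the abstract's point that generation time depends only on the size of the generated dataset.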
[Author affiliation]: School
[Funding]: Project supported by the National Natural Science Foundation of China (Nos. 61572394 and 61272098), the Shenzhen Fundamental Research Plan (Nos. JCYJ20120615101127404 and JSGG20140519141854753), and the National Key Technologies R&D Program of China (No. 2011BAH04B03)
[Classification number]: TP333


Article ID: 2458407


Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2458407.html

