Sparse Matrix Imputation and Its Application in Large-Scale Questionnaire Surveys

Published: 2018-08-02 17:07

[Abstract]: Since 2012, the term "big data" has appeared more and more often in daily life, work, and study. According to a study by IBM, 90% of all the data in human history was generated in the past two years, and after 2020 the total volume of data is expected to reach 44 times the current volume. As data are generated and accumulated on this scale, incomplete data are unavoidable, and the missing values they contain often seriously limit how the data can be used. The rating systems of online shopping platforms are a major source of such incomplete data. If every consumer rated the items they purchased, the platform's rating system could collect all the ratings into a matrix that contains a large number of missing values, which we call a "sparse matrix". If some consumers purchase items without rating them, the missing rate of the sparse matrix rises further.

Taking as a starting point the data structures produced by online shopping rating systems and by the film rating system of Netflix, the US online film rental company, and considering how survey data have grown with the development of big data, this thesis argues that simple small-scale sample surveys can no longer meet today's demands on survey practice, so new approaches are needed in both questionnaire length and sample size. For questionnaires with a large number of questions, the usual practice has been to offer respondents a reward or some form of compensation in exchange for their cooperation. This requires manpower, materials, and funding, and still does not guarantee the quality of the questionnaire data. This thesis uses the split-questionnaire method: the large questionnaire is divided into several small questionnaires according to the number of questions and the correlations among them, and each respondent answers a specified number of small questionnaires drawn at random. With the sample size maintained, collecting and organizing the responses yields a sparse matrix with a large number of missing values. Missing-value imputation is then applied to the sparse matrix to obtain a complete data set. After comparing general imputation methods, imputation methods for sparse matrices, and imputation methods for missing data in large questionnaires, the thesis adopts two methods, random-number imputation and multinomial logistic model imputation, and draws its conclusions from a comparative analysis of their results.

Because of limits on manpower and time, the data in this thesis are simulated with R-Studio. First, R-Studio is used to generate simulated data. Since each respondent's answers are organized in "units", missingness is introduced block by block, that is, as unit missingness, so that in the final sparse matrix every respondent has answered the questions of a specified number of units. Second, questions answered in common by different respondents serve as anchor items: the association between respondents' answers to the same questions is computed and then used to impute the unanswered data. Finally, the imputed data are compared with the original data to verify the feasibility and accuracy of the split-questionnaire method and of the imputation methods used here. Because the data are simulated, the study rests on some idealized assumptions. The number of units each respondent answers can be controlled during a survey, but the answers within each answered unit must be assumed complete; that is, the data matrix contains only unit missingness and no item-level missingness.

The thesis has five chapters. Chapter 1 introduces the basic content of the thesis, including the background and purpose of the research, the literature review, the research methods, and the contributions. Chapter 2 gives a brief introduction to methods for handling missing data, describing the methods scholars have used in recent years for imputing missing data and their basic concepts. Chapter 3, the core of the thesis, proceeds from simple to complex, from data generation to missingness to imputation, presenting the split-questionnaire method and the imputation methods in detail and comparing the imputed data with the original data. Chapter 4 applies the material of Chapter 3 to a large sparse matrix generated with R-Studio for further analysis, verifying the feasibility and accuracy of the theory and methods. Chapter 5 summarizes the thesis, discusses the outlook for this line of research, and proposes improvements for the shortcomings of the work.
[Degree-granting institution]: Hebei University of Economics and Business
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: O151.21
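
The workflow summarized in the abstract (simulate complete answers, delete whole units, impute, compare with the original) can be illustrated with a minimal sketch. This is not the thesis's code: the thesis works in R-Studio and uses multinomial logistic model imputation, whereas the sketch below uses Python's standard library and swaps in a much simpler donor rule based on agreement over commonly answered items. The data model, every function name, and every parameter value here are assumptions chosen only for illustration.

```python
import random

def simulate(n_resp=150, n_blocks=6, block_size=5, answered=3, levels=5, seed=1):
    """Return (full, sparse): complete answers and their block-missing version."""
    rng = random.Random(seed)
    full, sparse = [], []
    for _ in range(n_resp):
        # Assumed data model: a latent tendency shifts all of a respondent's answers.
        tendency = rng.randint(1, levels)
        row = [min(levels, max(1, tendency + rng.choice((-1, 0, 0, 1))))
               for _ in range(n_blocks * block_size)]
        # Unit missingness: only `answered` randomly drawn blocks are kept.
        kept = set(rng.sample(range(n_blocks), answered))
        full.append(row)
        sparse.append([v if j // block_size in kept else None
                       for j, v in enumerate(row)])
    return full, sparse

def impute_random(sparse, seed=2):
    """Fill each missing cell with a random draw from that item's observed answers."""
    rng = random.Random(seed)
    pools = [[v for v in col if v is not None] for col in zip(*sparse)]
    return [[v if v is not None else rng.choice(pools[j])
             for j, v in enumerate(row)] for row in sparse]

def agreement(a, b):
    """Share of commonly answered items on which two respondents agree (-1 if none)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(x == y for x, y in shared) / len(shared) if shared else -1.0

def impute_by_anchor(sparse):
    """Copy each missing answer from the most similar respondent who answered it."""
    n = len(sparse)
    sim = [[agreement(sparse[i], sparse[k]) for k in range(n)] for i in range(n)]
    out = [row[:] for row in sparse]
    for i, row in enumerate(sparse):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = [k for k in range(n) if k != i and sparse[k][j] is not None]
            if donors:
                best = max(donors, key=lambda k: sim[i][k])
                out[i][j] = sparse[best][j]
    return out

def accuracy(full, sparse, imputed):
    """Share of originally missing cells whose imputed value equals the true value."""
    cells = [(i, j) for i, row in enumerate(sparse)
             for j, v in enumerate(row) if v is None]
    return sum(imputed[i][j] == full[i][j] for i, j in cells) / len(cells)

full, sparse = simulate()
acc_random = accuracy(full, sparse, impute_random(sparse))
acc_anchor = accuracy(full, sparse, impute_by_anchor(sparse))
print(f"random draw: {acc_random:.3f}, anchor donor: {acc_anchor:.3f}")
```

With these assumed settings each respondent answers three of six units, so exactly half of every row is missing, and none of the missing cells are scattered inside an answered unit. The exact-match rate printed at the end is only one possible way to compare imputed and original data; the abstract does not say which comparison measure the thesis uses.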


Article link: http://sikaile.net/kejilunwen/yysx/2160108.html
