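Testing and Application of Partial Models in Stock Market Data Mining
Published: 2018-01-15 20:06
Keywords: Testing and Application of Partial Models in Stock Market Data Mining. Source: Southwestern University of Finance and Economics, master's thesis, 2014. Document type: degree thesis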
[Abstract]: China's stock market has been through 24 years of ups and downs. Although it started late, it has kept exploring its own path and returning to rationality through repeated self-correction. Over these 24 years, faced with a market that has not yet reached weak-form efficiency, experts and scholars have done extensive research on the market's characteristics and on stock market prediction. Current research falls into two main schools: fundamental analysis and technical analysis. Fundamental analysis holds that a stock's price reflects the company's intrinsic value and focuses on the choice of analysis variables. Technical analysis treats historical opening, closing, highest, and lowest prices as the raw material for predicting future prices and focuses on the choice of data processing and modeling methods. The two schools rest on different frameworks, and each has its merits. In any case, the thesis takes it as an undisputed fact that China's stock market has not reached weak-form efficiency: stock price series are correlated with their own history, so technical analysis has a footing. This thesis belongs to technical analysis.
Existing technical analysis methods broadly include time series analysis, fuzzy mathematics, chaos theory, and data mining. Data mining is a technique for handling large volumes of data that has emerged in recent years as data volumes have grown geometrically. It does not prescribe in advance the form of the information to be discovered; instead it lets the data speak for themselves. Popular data mining techniques include decision trees, neural networks, support vector machines, and cluster analysis, and each can be implemented by several algorithms. Faced with large and complex stock data, data mining is a natural approach. Scholars who apply data mining to the stock market currently work mainly on four fronts: designing and improving the mining algorithms themselves, selecting and processing stock market variables, changing how stock data are used, and combining different mining models. This thesis also takes data mining as its starting point, but explores and improves the technique from a new angle by proposing the concept of a data mining partial model.
The concept of a data mining partial model grew out of reflection on the distinctive structure of classification trees. A classification tree's output is a tree with many leaves; each leaf represents one statement of knowledge, so there are as many statements as leaves. In practice these statements differ in value: the knowledge in some leaves holds up time after time and predicts with high accuracy, while the knowledge in others has almost no practical value and predicts with very low accuracy. If each leaf is treated as a sub-model, prediction accuracy can be computed per sub-model rather than for the model as a whole. Building a partial model means finding the sub-models with high accuracy and discarding those with low accuracy. Indeed, buy and sell points worth acting on are limited in the stock market. Successful investors are not those who trade every day, but those who wait for the right moment and act only when the signal is clearest and most certain.
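The leaf-by-leaf scoring described above can be sketched in a few lines. This is a minimal illustration, not the thesis's R code: the record layout `(leaf_id, predicted, actual)`, the function names, and the 0.8 cutoff are assumptions made for the example.

```python
from collections import defaultdict


def leaf_accuracies(records):
    """Accuracy per leaf; records are (leaf_id, predicted, actual) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for leaf_id, predicted, actual in records:
        totals[leaf_id] += 1
        hits[leaf_id] += int(predicted == actual)
    return {leaf: hits[leaf] / totals[leaf] for leaf in totals}


def select_leaves(records, min_accuracy=0.8):
    """Keep only the leaves (sub-models) whose accuracy reaches the cutoff."""
    scores = leaf_accuracies(records)
    return {leaf for leaf, acc in scores.items() if acc >= min_accuracy}


# Hypothetical records: leaf 3 is right 2 times out of 3, leaf 7 is always
# right, leaf 9 is always wrong.
example = [
    (3, "up", "up"), (3, "up", "up"), (3, "up", "down"),
    (7, "down", "down"), (7, "down", "down"),
    (9, "up", "down"),
]
kept = select_leaves(example, min_accuracy=0.8)
```

In this toy data only leaf 7 survives the cutoff. A leaf that covers very few records can score 100% by chance, so a minimum record count per leaf would be a reasonable extra filter.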
This thesis builds a decision tree partial model from basic data on the Shanghai Composite Index. Because candlestick (K-line) trading theory is relatively mature, and to make it easy to compare model output with existing theory, the four basic daily indicators (opening, closing, highest, and lowest price) are converted into four candlestick indicators: upper shadow length, lower shadow length, body (box) length, and body (box) color. These four indicators are the input variables, and whether the price has risen or fallen 10 days later is the output variable. A decision tree is built in R (version 3.0.2) and then filtered. Gathering the 7 leaves with the highest fitting accuracy shows the following: if a harami pattern and a double-needle bottom occur together, the price rises; if only a double-needle bottom occurs, the price also rises when the bottoming needle is long (at least 9.65), and the outlook is unclear when the needle is not pronounced; if only a harami pattern occurs, the outlook cannot be determined from the basic data alone. The harami pattern and the double-needle bottom are established rules of thumb about the meaning of candlestick shapes, so this first exploration of the classification tree partial model broadly agrees with accumulated experience.
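As a rough sketch of the conversion from daily prices to candlestick indicators, the code below uses the conventional candlestick definitions. The abstract does not give the thesis's exact formulas, so the formulas, the function names, the "flat" color for an unchanged day, and the treatment of an unchanged price as "down" in the label are all assumptions.

```python
def candle_features(open_, high, low, close):
    """Four candlestick indicators from one day's prices (conventional definitions)."""
    body_top = max(open_, close)
    body_bottom = min(open_, close)
    if close > open_:
        color = "up"
    elif close < open_:
        color = "down"
    else:
        color = "flat"  # assumption: the source does not say how ties are coded
    return {
        "upper_shadow": high - body_top,
        "lower_shadow": body_bottom - low,
        "body_length": body_top - body_bottom,
        "body_color": color,
    }


def label_after(closes, horizon=10):
    """'up' if the close `horizon` days later is higher than today's, else 'down'.

    Days without a close `horizon` days ahead get None.
    """
    labels = []
    for i, today in enumerate(closes):
        if i + horizon < len(closes):
            labels.append("up" if closes[i + horizon] > today else "down")
        else:
            labels.append(None)
    return labels


# Hypothetical day: opens at 10, trades between 9 and 12, closes at 11.
features = candle_features(10.0, 12.0, 9.0, 11.0)
```

The last `horizon` days of any series have no label and would have to be dropped before training.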
The decision tree partial model is a partial model from the perspective of model output: in essence, it accepts only part of the model's results rather than all of them. Building on it, the thesis extends the partial model concept. Because buy and sell points worth acting on are limited, prediction is only worthwhile when the price signal is clear, whether upward or downward. Following this idea, the support vector machine (SVM) partial model aims to find the data environment in which an SVM predicts best. This is a partial model from the perspective of model input. Specifically, building an SVM on the training data indiscriminately and predicting with it gives poor results. The SVM partial model instead proceeds as follows: build model M1 on training set A; select the records that M1 fits correctly and call them set B; build model M2 on set B; use a classification tree to find and summarize what the records in B have in common, denoted K; and use M2 to predict only those validation records that have characteristic K. In other words, only records with characteristic K qualify as input to M2.
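The data flow A → M1 → B → M2 → "predict only records with K" can be sketched as follows. To stay self-contained, the sketch swaps in a nearest-centroid classifier where the thesis uses an SVM, and it takes characteristic K as a hand-written predicate where the thesis derives K with a classification tree. All names and the toy data are hypothetical.

```python
from statistics import mean


def fit_centroid(rows, labels):
    """Stand-in learner (NOT an SVM): classify by nearest class centroid."""
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    centroids = {lab: [mean(col) for col in zip(*rs)] for lab, rs in by_label.items()}

    def predict(row):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
        )

    return predict


def build_partial_model(train_x, train_y, fit=fit_centroid):
    """Fit M1 on set A, keep the correctly fitted records as set B, fit M2 on B."""
    m1 = fit(train_x, train_y)
    kept = [(x, y) for x, y in zip(train_x, train_y) if m1(x) == y]
    b_x = [x for x, _ in kept]
    b_y = [y for _, y in kept]
    m2 = fit(b_x, b_y)
    return m2, b_x, b_y


def predict_partial(m2, has_k, rows):
    """Predict with M2 only for rows that have characteristic K; others get None."""
    return [m2(row) if has_k(row) else None for row in rows]


# Toy one-feature data; the last record is misfit by M1 and so dropped from B.
train_x = [[-3.0], [-2.0], [2.0], [3.0], [0.8]]
train_y = ["down", "down", "up", "up", "down"]
m2, b_x, b_y = build_partial_model(train_x, train_y)
has_k = lambda row: abs(row[0]) >= 1.5  # hypothetical stand-in for K
predictions = predict_partial(m2, has_k, [[-2.5], [0.2], [4.0]])
```

Returning `None` for records without K makes the abstention explicit, so accuracy can later be computed only over the records the model actually chose to predict.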
Before building the SVM partial model, the thesis uses analysis of variance to show that SVM models built from different data inputs do differ significantly in goodness of fit. The 735 records from January 20, 2011 to February 18, 2014 are divided into groups of 50 records, giving 14 groups, and three comparison experiments are run on these 14 groups. In the first experiment, every record in each group is used for modeling; in the second, only the first 30 records of each group; in the third, only the first 20 records of each group. Under the null hypothesis that the three input schemes yield models with no significant difference in fit, the p-value is approximately 0, so the null hypothesis is rejected: different data inputs from the same time period can indeed lead to markedly different goodness of fit.
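A sketch of this experimental design and of the test statistic follows. The abstract says only "analysis of variance", so the one-way design (one fit score per group and input scheme) is an assumption, as are the function names. Dropping the 35 records left over after 14 full groups is inferred from the stated group count.

```python
def full_groups(records, size=50):
    """Split records into consecutive full groups, dropping any remainder."""
    n_groups = len(records) // size
    return [records[i * size:(i + 1) * size] for i in range(n_groups)]


def input_schemes(group):
    """The three modeling inputs compared for each group."""
    return {"all": group, "first_30": group[:30], "first_20": group[:20]}


def one_way_f(samples):
    """F statistic of a one-way ANOVA; samples is a list of lists of scores."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


groups = full_groups(list(range(735)))  # record indices stand in for real data
schemes = [input_schemes(g) for g in groups]
```

With real data, each scheme would contribute 14 fit scores (one per group), and `one_way_f` would be called on the three lists of scores. Turning F into a p-value needs the F distribution, which the standard library does not provide.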
After this preliminary check of the practicality of the decision tree partial model and the soundness of the SVM partial model, the thesis uses both to look for investment rules in the stock market. Chapter 5 applies the decision tree partial model with "yesterday's body length, yesterday's body color, yesterday's lower shadow length, today's body length, today's body color, today's lower shadow length, DIF, DEA, DIF-DEA" as input variables and "rise or fall of the stock 10 days later" as the output variable. It finds 9 leaves with fitting accuracy of 80% or more and applies their rules to the validation data; rules No. 32, No. 11, No. 132, and No. 266 all reach 100% prediction accuracy. Once sorted and consolidated, these rules amount to: if DIF-DEA < -1.85, the price is predicted to fall; if DIF-DEA > 11.05, the price is predicted to rise; if -1.85 < DIF-DEA < 11.05, the future trend is unclear. The technical analysis literature already contains the summary that, whether DIF and DEA are both above zero or both below zero, the price will rise when DIF > DEA and fall when DIF < DEA. The decision tree partial model's conclusion thus refines this summary with more precise numerical intervals. The thesis suggests that the stricter intervals (with -1.85 and 11.05 rather than 0 as the dividing lines) may stem from investor psychology: when the market rebounds slightly, most investors stay on the sidelines and do not act readily, which leaves the outlook unclear. Only when the rebound reaches a certain strength do investors believe that spring has arrived and buy, so the price rises. The reverse also holds.
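The consolidated rule can be written as a three-way signal on DIF-DEA. In the sketch below the thresholds -1.85 and 11.05 come from the abstract; everything else is an assumption, including the conventional 12/26/9 exponential moving average parameters for DIF and DEA (the abstract does not state the ones used), the seeding of each average with the first value, and the treatment of values exactly on a threshold as "unclear".

```python
def ema(values, span):
    """Exponential moving average seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(out[-1] + alpha * (v - out[-1]))
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """DIF = fast EMA minus slow EMA of the close; DEA = EMA of DIF."""
    dif = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    dea = ema(dif, signal)
    return dif, dea


LOWER, UPPER = -1.85, 11.05  # thresholds reported in the abstract


def partial_signal(dif, dea):
    """'down', 'up', or 'unclear' according to the gap DIF - DEA."""
    gap = dif - dea
    if gap < LOWER:
        return "down"
    if gap > UPPER:
        return "up"
    return "unclear"


# Hypothetical values: a gap of -2.0 is below the lower threshold.
example_signal = partial_signal(0.0, 2.0)
```

These thresholds are in index points of the Shanghai Composite Index over the sample period, so they should not be expected to carry over unchanged to other instruments or price scales.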
To build the SVM partial model, the training data are modeled first; the correctly fitted records are then gathered and modeled again, and their common regularities are identified and denoted G1, G2, G3, and so on. The validation records that satisfy G1, G2, G3, and so on are then selected and predicted with the rebuilt model, and the prediction accuracy is computed. Following this approach, 4 common regularities were found among the correctly fitted records; they mostly share certain features in three indicators: the previous day's lower shadow length, DIF, and DIF-DEA. Selecting the validation records that satisfy each of these 4 regularities and predicting them gives accuracies of 57.1%, 46.1%, 72.7%, and 75%. Their average is clearly higher than the 55.5% accuracy obtained by modeling on the unprocessed training data and checking against the validation data. This further shows that there are data environments suited to prediction with an SVM, and that predicting only when such an environment arises works much better than predicting blindly regardless of timing.
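The comparison with the 55.5% baseline is simple arithmetic, worked through below. The abstract does not say how many validation records fall under each regularity, so this assumes a simple unweighted mean; the variable names are illustrative.

```python
from statistics import mean

rule_accuracy = [57.1, 46.1, 72.7, 75.0]  # percent, as reported for the 4 regularities
baseline = 55.5                           # percent, unfiltered SVM on validation data

average = mean(rule_accuracy)             # about 62.7 percent
gain = average - baseline                 # about 7.2 percentage points
above_baseline = [a for a in rule_accuracy if a > baseline]
```

Three of the four regularities beat the baseline individually; the second (46.1%) falls below it, so the improvement holds on average rather than for every regularity.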
Classical statistics always starts by specifying a set of variables consistent with economic theory, fixes the relationships among them in advance, and then runs regression analyses within that pre-built framework: a "theory first, data second" way of thinking. Data mining breaks with this convention. It imposes no prior theory of "what should be" and hands the floor entirely to the data; it is a "data first, theory second" way of thinking. For this reason, the thesis ventures to discuss the partial model concept without detailed mathematical derivation. It not only proposes the concept but also extends it: when data are processed with data mining techniques, if any stage of model building (data input, data processing, or result output) is not adopted in its entirety, the resulting model is called a data mining partial model. The classification tree partial model is a partial model from the perspective of output results, and the SVM partial model is a partial model at the data input stage. Partial models with further meanings and from further angles may emerge in the future, and the author believes that more and more scholars will join the discussion.
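[Degree-granting institution]: Southwestern University of Finance and Economics
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: F832.51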
Article ID: 1429847
Article link: http://sikaile.net/jingjilunwen/touziyanjiulunwen/1429847.html