TUTORIAL.TXT
---------------------

Until I get around to constructing some more decent documentation (probably in the form of on-line help), this tutorial file will have to do. I'm basically going to just write down whatever I think is useful to know, using actual examples.

I have included two data files as examples: 1994.xls is an Excel 4.0 worksheet and 1994.txt is a tab-delimited text file. These are the only file types TimeStat can handle for now. The Formula One VBX can also handle .VTS files, a proprietary format supplied by Visual Tools, Inc., but I do not consider it useful. The two example files contain identical numerical data: 1994 daily closing prices of DJIA, DJTA, DJUA, DJBA, S&P and some other information. This data is supplied by NeuroVe$t Journal and is available on selected BBS's.

There is an additional file, BDSQUANT.XLS, which contains the small-sample quantile tables for the BDS test. This file is for reference when conducting BDS analysis.

1. Open file

Select File/Open or click the file open button on the tool bar (second from left). Locate and open either 1994.xls or 1994.txt. You can also select File/New (leftmost button on the tool bar) to start with a blank worksheet.

2(a). Quick charting

Use the arrow keys or mouse to place the active cell box anywhere in the second column (labeled DJIA at top), then click the chart button on the toolbar (rightmost) or select Window/New Chart from the menu. Everything in the Chart Setup dialog is self-explanatory. The 'Data Arrangement' list box at right shows only one choice, since there is only one way to handle one column of data. Make your choices and click OK. A chart appears. You can resize the chart window and the chart will resize with it. You can double-click on the chart to bring up the Chart Setup dialog again to change any option except Data Arrangement.

To plot only part of the data in a column, click and drag the mouse across the range you want, or choose Sheet/Select Range from the menu and enter a range in standard spreadsheet format. For example, enter B2:B101 to select the first 100 days of 1994 DJIA. Clicking the chart button at this point plots the first 100 values from column B.

2(b). Printing

The print, print setup, and page setup items under the File menu all work with the current spreadsheet pretty much as one expects. 'Print selections' prints only the selected cells. You might have to experiment a little to get the output you want. My HP 660C DeskJet works fine.

To print a chart, double-click on the chart to bring up the Chart Setup dialog. In the lower right corner there are check boxes for printing to printer and printing to file. Checking one or both of these and clicking the OK button initiates printing. The graphics server sends the chart to the current default Windows printer, and I honestly do not know how to set options. The color output from my printer is quite good. Your best bet for customization is to print to file. Checking the print to file box brings up a file save dialog for saving to a .WMF file. I apologize for the quirky chart printing interface, but I did not want to spend too much time on this.

3. Selecting ranges

Go to the rightmost data columns. The 3 rightmost columns contain the Hi, Lo, and Close of the S&P 500 index. Click and hold down the mouse on the header of column G (SP500H), drag it across to column I, and release. Clicking the header of a column or row selects (highlights) the whole column or row; clicking and dragging selects consecutive columns (rows); clicking the upper-leftmost cell in the spreadsheet selects the whole sheet. With columns G, H, and I selected, click the chart button. Notice the Data Arrangement options list now gives 3 choices.
You can designate all columns as y-data, the first column as the x data and the rest as y curves, or arrange the chosen columns as (x1,y1), (x2,y2),... etc. With this last arrangement, if an odd number of columns is chosen, the last column is ignored. For SP500 Hi, Lo, and Close we want the first choice (all y-data). Click OK to see three colored curves on the same plot.

To select two non-adjacent columns or non-contiguous ranges: first select a column/range as usual, then hold the Ctrl key down and select all subsequent ranges. For example: click the header of column G to select SP500H, then while holding down the Ctrl key, click the header of column I to select SP500C. Now you can plot these two disjoint columns as before.

4. Calculations

All calculations (under the Calculations and Analysis menu headings) expect to work on entire columns. More precisely, the program scans the column starting from the top, starts reading data at the first row with a valid number, and stops reading data as soon as a row without a valid number is encountered. Thus data within a column can have an arbitrary number of blank and text-filled cells above and below it. But if there are two or more disjoint data groups separated by blank or text cells, only the first (topmost) group will be read in.

To perform calculations on only part of a column, you will need to copy the data of interest to a new column. To do this, first select the cells of interest with the mouse, choose Edit/Copy from the menu, create a new empty column somewhere (using Sheet/Insert Column from the menu), and use Edit/Paste or Edit/Paste Values in the new column.

5. Single column calculations

These are operations that always work on the column containing the active cell. The results are placed in new column(s) to the immediate right of the source column, along with helpful labels. These labels are entered as formulas; therefore, they recalculate themselves if the source column's header letter changes because of column deletions/additions to the left.
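
The column-reading rule described in section 4 can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not TimeStat's actual code; the helper names and the test for what counts as a "valid number" (anything float() accepts) are assumptions.

```python
def is_number(cell):
    """Assumed rule: a cell is a valid number if float() accepts it."""
    if cell is None or isinstance(cell, bool):
        return False
    try:
        float(cell)
        return True
    except (TypeError, ValueError):
        return False


def read_column(cells):
    """Return the first (topmost) contiguous run of numeric cells."""
    data = []
    for cell in cells:
        if is_number(cell):
            data.append(float(cell))
        elif data:
            # First non-number after data has started: stop reading.
            break
    return data


# Illustrative values only: a header, a blank cell, two numbers,
# a text cell, and a second group that is never read.
column = ["DJIA", "", 1.0, 2.0, "holiday", 3.0]
first_group = read_column(column)
```

Here first_group holds only the two numbers above the text cell, which is why a partial or second data group has to be copied to its own column before running a calculation on it.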

Example: under the Calculations menu, 'Statistics' evaluates the data set distribution in terms of mean, deviation, skew, and other parameters. 'Delta' calculates the daily changes, '% Delta' gives the percent change, 'Delta Log' gives log10() of the (today/yesterday) ratios, and 'z-Score' subtracts the mean and divides by the standard deviation to give a data set with a mean of zero and a standard deviation of 1.

The 'Fractional Delta' (fractional differencing) is calculated from a binomial series expansion which converges very slowly. I implemented this for ARFIMA-type analysis. One can think of the slow convergence as a reflection of the long memory of the series involved. The value at any one point is calculated from all previous points in the series in a recursive manner. It may take 100 points or more to get some semblance of convergence. Try this: for any column of data, do a fractional delta with d=0.25, then do d=-0.25 on the result, subtract the final result from the original series, and plot the difference; the plot clearly shows that the error in the series calculation only falls off as 1/n. The dialog box will allow a fraction from -0.5 to 0.5; to get some other fraction you can always use integer differencing (using Calculations/Delta) to transform the series first.

6. Calculations/Correlation menu item

Calculates the correlation between two series (the first two columns, if more than two are chosen) for a range of time lags up to half the time extent of the shorter of the two series. Positive time lags represent a lag of column 1 relative to column 2 (data set 1 shifted to the right on the x axis). Negative lags mean the opposite. If only one column is chosen, the autocorrelation is calculated, and only zero and positive lags need be presented. I pad the series with enough zeroes to prevent spurious aliasing.

Example 1: Select the DJIA column. Select the 'Calculations/Correlation' menu item. Two adjacent columns are produced to the right.
These represent the autocorrelation coefficients for various time lags. The value at zero time lag is always 1 (obviously!). To see a plot of this, select the Lag and autocorrelation columns and click the chart button. Be sure to choose the appropriate Data Arrangement list item; in this case both the second (X,Y1,Y2...) and third (X1,Y1,X2,Y2...) items work the same. Note that the autocorrelation drops off rapidly to near zero at a lag of 20 to 30 days.

Example 2: Select the DJIA and then the SP500C columns. Select 'Calculations/Correlation' from the menu and plot the correlation. As might be expected, the correlation is near 1 for lag=0 and drops off rapidly. The curve is symmetric about lag=0, again to be expected. Doing the same for DJBA and DJUA gives similar results; this is also logical since utilities tend to move closely with bonds. How about stocks and bonds? Try DJIA and DJBA. Note that the correlation is less than 0.4 at lag=0, but has a larger magnitude of -0.47 at lag=60. The positive lag and negative correlation might imply that DJIA (stocks) tends to move in the opposite direction to DJBA (bonds) with a lag of about 60 trading sessions (roughly 3 months). Keep in mind that we are looking at 1994 only; these relationships may or may not be the same for other years.

The newly implemented partial autocorrelation (PACF) is actually a by-product of a Burg algorithm linear predictive routine hidden in the code. The routine also produces additional useful information: AR coefficients. I did not present these because I have not decided on an appropriate user interface for them (maybe next version). PACF is useful for ARMA-type model identification, and many books on time series analysis discuss its use.

7(a). The fast Fourier transform (FFT) works on a single column of real data. Therefore, we only need to consider zero and positive frequencies.
The output is arranged in two columns, the real and imaginary parts at frequencies 0, 1f, 2f,..., Nf/2, where N is the original number of real data points, f = 1/(Nd), and d is the sampling period (1 day in the example below). The 'Periodogram' is the power spectrum, which is simply the sum of squares of the real and imaginary parts of the FFT output. For financial data, the power spectrum rarely seems to provide any useful information. I do not pad the series with zeroes for the FFT; the series length need not be an integer power of 2 but does need to be even. If the series length is odd, I add a zero at the end.

If we select the two output columns generated by an FFT and then choose Calculations/Inverse FFT from the menu, a column of data is produced that is basically identical to the original data used as input to the FFT (applying FFT then inverse FFT does nothing overall). This means that the Inverse FFT function expects to work on the FFT output of a real function.

I use the FFT and its inverse to perform filtering. For example, perform an FFT on a data set, set all rows of the output columns above a chosen frequency to zero, and perform an inverse FFT; the result is a low-pass filter on the input data. This technique allows one to filter out fast changes (daily fluctuations) or longer term weekly or monthly trends in the frequency domain.

Example 1: Select the DJIA column. Select the 'Calculations/Z-Score' menu item. This shifts the DJIA series to mean=0 and scales it to variance=1. Select the newly created z-score column (it should be to the immediate right of DJIA). Select 'Calculations/FFT'. Two new columns, the Real and Imaginary parts, are created to the immediate right. The top row is frequency=0. The bottom row is the largest frequency, which is 1/2 the sampling rate (known as the Nyquist critical frequency, in this case 1/(2 days)). If you select these two columns and then select 'Calculations/Inverse FFT', you will get back the original DJIA z-score data.
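
For readers who want to see the output layout outside the spreadsheet, here is a minimal Python sketch of the round trip described above. It uses a plain O(N^2) discrete Fourier transform rather than a fast algorithm, and the sign and normalization conventions (negative exponent forward, 1/N on the inverse) are assumptions; TimeStat's own conventions may differ. The function names are made up for this illustration.

```python
import cmath
import math


def real_dft(series):
    """Return (real, imag) lists for frequencies 0, f, 2f, ..., (N/2)f."""
    x = list(series)
    if len(x) % 2:
        x.append(0.0)  # odd length: add one zero at the end, as in the text
    n = len(x)
    real, imag = [], []
    for k in range(n // 2 + 1):
        total = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        real.append(total.real)
        imag.append(total.imag)
    return real, imag


def inverse_real_dft(real, imag):
    """Rebuild the even-length real series from its half spectrum."""
    half = len(real) - 1
    n = 2 * half
    out = []
    for t in range(n):
        # Frequency 0 and the Nyquist row appear once; every other row
        # stands for a +/- frequency pair, hence the factor of 2.
        total = real[0] + real[half] * math.cos(math.pi * t)
        for k in range(1, half):
            angle = 2 * math.pi * k * t / n
            total += 2 * (real[k] * math.cos(angle) - imag[k] * math.sin(angle))
        out.append(total / n)
    return out


def periodogram(real, imag):
    """Power spectrum: sum of squares of the real and imaginary parts."""
    return [r * r + i * i for r, i in zip(real, imag)]


data = [1.0, 3.0, 2.0, 5.0, 4.0]          # 5 points, padded to 6
re_part, im_part = real_dft(data)          # 6/2 + 1 = 4 rows
restored = inverse_real_dft(re_part, im_part)
power = periodogram(re_part, im_part)
```

The restored list matches the input (plus the padding zero) to rounding error, which is the "does nothing overall" behaviour described above.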

Example 2: We can perform low-pass or high-pass filtering.

Low-pass filter (remove the high frequency components): Set all cells from row 21 to the last row (128) in both the real and imaginary columns to 0. Select 'Calculations/Inverse FFT'. Now select and plot the DJIA z-score column and the newly created inverse FFT column; note that the new curve closely matches the original curve except that it is much smoother.

High-pass filter (remove the low frequency components): Set rows 2 to 20 to zero. Perform the inverse FFT and compare a plot with the original data. Note that high frequency daily to weekly changes are still preserved but longer trends are eliminated.

An easy way to set a large range of cells to 0 is:
1) set the active cell to any cell in the range by clicking on it, and use the edit bar to set its content to 0;
2) select 'Edit/Copy' (or use the toolbar button);
3) select the 'Sheet/Select Range' menu item and enter the desired range (example: c21:d128);
4) select 'Edit/Paste' or use the toolbar button.
Note that the 'Edit/Clear Range' menu item clears the cells of any content; it does not set them to 0.

7(b). The fast Hartley transform (FHT) is similar to the FFT in that it is a transform from the time domain to the frequency domain. This well-known transform from digital signal processing (DSP) has the advantage that it works entirely in the real domain (it maps R^n -> R^n). As with the FFT, the forward and inverse transforms are nearly identical; for the FHT they are exactly the same except for a normalization factor. For an input series of length N (N a power of 2), the FHT maps it to N frequencies. In TimeStat, the first row is frequency 0 (the constant offset of the series), and the next rows are 1f, 2f, and so on, just as described for the FFT above, up to Nf/2 (the Nyquist critical frequency). The values from (N/2 + 1)f to (N-1)f are actually negative frequencies, running from -(N/2 - 1)f to -f. For example, to filter out frequency 0 and the 3 lowest frequencies, set rows 1, 2, 3, 4, (N-2), (N-1), and N to zero and perform the inverse transform.
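
The row bookkeeping in the Hartley example can be checked with a short Python sketch. It uses the direct O(N^2) definition of the discrete Hartley transform (so it is not a fast algorithm and N need not be a power of 2), puts the 1/N normalization on the inverse, and uses made-up function names; none of this is taken from TimeStat's code.

```python
import math


def hartley(x):
    """Discrete Hartley transform: H[k] = sum of x[t] * (cos + sin)(2*pi*k*t/N)."""
    n = len(x)
    return [sum(x[t] * (math.cos(2 * math.pi * k * t / n)
                        + math.sin(2 * math.pi * k * t / n))
                for t in range(n))
            for k in range(n)]


def inverse_hartley(h):
    """Same transform again, divided by N."""
    n = len(h)
    return [value / n for value in hartley(h)]


def remove_lowest(x, count):
    """Zero frequency 0 plus the `count` lowest positive and negative frequencies."""
    h = hartley(x)
    n = len(h)
    for k in range(n):
        # Index k and index n - k carry the same frequency magnitude.
        if min(k, n - k) <= count:
            h[k] = 0.0
    return inverse_hartley(h)


N = 16
slow = [5.0 + 2.0 * math.sin(2 * math.pi * t / N) for t in range(N)]
fast = [math.sin(2 * math.pi * 6 * t / N) for t in range(N)]
mixed = [s + f for s, f in zip(slow, fast)]
filtered = remove_lowest(mixed, 3)   # zeroes indices 0-3 and 13-15
```

With 0-based indices 0 to 3 and N-3 to N-1 zeroed (spreadsheet rows 1 to 4 and N-2 to N), the constant offset and the slow sine disappear and only the fast component is left.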
If you are lost at this point, I would not worry too much. The FHT is really not that useful for time series analysis anyway... in my humble opinion. Those who really need it will not even need my rambling here.

8. The edit bar works just like an Excel edit bar. Many common functions are supported (sin, cos, sum, ln, log10, stdev, etc). I'll see about making up a list of these functions from Formula One's manual without violating their copyrights.

9. Distribution statistics - the 'Analysis/Cumulative Percentile' menu item is a single column operation. It sorts the column's data and lists the sorted values as a function of cumulative percentile in two new adjacent columns. The distribution profiles of two different series can be compared using their z-scores.

Example 1: Select the DJIA column. Select 'Calculations/Z-Score'. Select the newly created DJIA z-score column. Select 'Analysis/Cumulative Percentile'. Two new columns are created with the percentile and sorted DJIA z-score values. Repeat all of the above using DJBA. Now select the 4 columns of percentile and sorted z-score values for DJIA and DJBA and plot them using the 'X1,Y1,X2,Y2,...' Data Arrangement list option. Note that the two cumulative percentile plots look quite different.

Example 2: Continue with the example above. We can quantify the difference between DJIA and DJBA. Select the DJIA column. Select the 'Analysis/Frequency Histogram' menu item. Three new columns are produced: 'Sigma' is the deviation from the mean in units of standard deviation, 'Gaussian Expect.' is the expected value for a particular bin for a normal (Gaussian) distribution, and 'Freq. Hist.' is the frequency histogram for the values in the series, sorted into bins according to how far each is from the mean. Select the three columns and plot them using the first column ('Sigma') as the X value. The DJIA distribution certainly looks pretty close to normal. How close? Select the 'Gaussian Expect.' and 'Freq. Hist.' columns.
Select 'Analysis/Chi-Square'. When asked if the first column is the expectation, answer 'yes'. The chi-square is 35.7, with a probability of 30% that the DJIA distribution is normal. Doing the same for DJBA produces a chi-square of 242.6 with vanishing probability. Therefore, DJBA is definitely not normal, which is somewhat obvious from comparing a plot of the DJBA distribution with a plot of the normal distribution. We can also do a chi-square analysis using the DJIA z-score and DJBA z-score directly; in this case, answer 'no' to the question about whether the first column is the expectation value. The result is a chi-square > 700 and a probability of 0.

Example 3: Quite often, just one or two data points far from the mean can drastically alter the chi-square value. Select the NYVD(000) column (NYSE daily volume). Select the 'Analysis/Frequency Histogram' menu item. Select the newly created 'Gaussian Expect.' and 'Freq. Hist.' columns. Select 'Analysis/Chi-Square'. When asked if the first column is the expectation, answer 'yes'. The chi-square is 146 with vanishing probability. But note that there is one data point at -4 sigma. Remove that point from the frequency histogram (set the cell to the right of -4 sigma to 0), and recalculate the chi-square. Lo and behold, the chi-square is down to 29.4 with a probability of 60%. Looking at the statistics of NYVD(000), we can see that the -4 sigma point occurred on 11/25/94, the Friday after Thanksgiving. So the light volume is not surprising.

In this case there is a logical explanation for the data point; for a less artificial example, select the SP500C column and choose the menu item 'Calculations/Delta Log'. Now repeat the 'Frequency Histogram' and 'Chi-Square' analyses on the Delta Log of SP500C. Note that the presence or absence of a data point at -3.75 sigma again makes a large difference to the chi-square probability. The data point is in row 26 (2/4/94), the date of the first Fed rate hike.
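
The histogram and chi-square steps in the examples above can be sketched as follows. This is a rough illustration, not TimeStat's routine: the bin width of 0.25 sigma, the +/-4 sigma range, folding the tails into the end bins, and the function names are all assumptions. It returns only the chi-square value; turning that into a probability needs the chi-square distribution, which is not shown.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def histogram_vs_gaussian(data, width=0.25, limit=4.0):
    """Bin z-scores and return (bin edges, Gaussian expectation, observed counts)."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))
    bins = int(round(2 * limit / width))
    edges = [-limit + i * width for i in range(bins + 1)]
    observed = [0] * bins
    for v in data:
        z = (v - mean) / std
        index = int((z + limit) // width)
        observed[max(0, min(bins - 1, index))] += 1  # outliers go to end bins
    expected = []
    for i in range(bins):
        low = normal_cdf(edges[i]) if i > 0 else 0.0          # lower tail
        high = normal_cdf(edges[i + 1]) if i < bins - 1 else 1.0  # upper tail
        expected.append(n * (high - low))
    return edges, expected, observed


def chi_square(expected, observed):
    """Sum of (observed - expected)^2 / expected over all bins."""
    return sum((o - e) ** 2 / e for e, o in zip(expected, observed) if e > 0)


# One stray count in a bin where almost nothing is expected dominates the sum.
with_outlier = chi_square([10.0, 0.1], [10, 1])
without_outlier = chi_square([10.0, 0.1], [10, 0])
```

In the toy two-bin case the single count in the low-expectation bin contributes almost the entire chi-square, which is the same effect as the lone far-out point in the NYVD(000) example.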

In general time series analysis, determining the significance and subsequent treatment of such outliers can be a difficult and subtle problem. Deviations from normality and linearity lie at the heart of modern applications of non-linear dynamics and chaos theory to financial time series.

10. Principal components analysis (PCA) - this is a powerful and well-known technique from multivariate statistics. Users should consult any book from that field for proper understanding, usage, and interpretation. This technique 'rotates' the sample data sets in multi-dimensional space to new axes of maximum variance. It is useful for reducing a large number of variables to a smaller number of linear combinations (the principal components) which contain nearly the same variance (which may be interpreted as 'information') as the original variables. With fewer input variables, neural networks would be more efficient and might train faster with better results. As always, this is a tool, not black magic; YMMV (your mileage may vary).

Example 1: Open file '1994.xls'; select columns B to G inclusive (DJIA to SP500H). Select menu Analysis/Principal Components Analysis. In the pop-up dialog, note that you can perform PCA using the covariance or the correlation matrix. The default is the correlation matrix because it normalizes the input variables. Analysis using the covariance matrix will give heavier weights to variables of higher variance. This may or may not be desirable; stick with the default correlation matrix unless you know what you are doing. Clicking the 'Calculate' button causes the program to calculate the matrix, diagonalize it, and list the eigenvalues. The eigenvalues, in descending order, are the resulting variances of the principal components. Note that the first three of the six eigenvalues already capture 97% of all the variance (or variation) in the 6 original inputs.

The check box group in the lower left part of the dialog box lists the output options; the default is to output the correlation (or covariance) matrix and its eigenvalues and eigenvectors. The other two boxes give options to output the selected principal components series and one or more of the original variables as reconstructed from the selected principal components. In the 'Principal Components Values' list box in the upper right, select the third item (0.93777 (15.6%)), indicating we want to keep and work with the first 3 principal components (containing 97% of the total variance). In the 'Reconstruction Series Selection' list box in the lower right, select B and G, indicating we want to reconstruct these columns (DJIA and SP500H) from the first 3 principal components (make the selection by clicking on B then on G, no need to hold down the Ctrl key; click on a selection again to un-select it). Make sure that all output option boxes in the lower left are checked and then click the 'OK' button to see the result.

Looking at the output symmetric matrix, we can see that 3 components are able to capture most of the variability in the 6 input series. The correlation coefficients are very high (> 0.9) for (DJTA, DJBA, and DJUA), and also for (DJIA and SP500H), implying that 2 series from the first group and 1 from the second may be providing similar information. None of this is surprising except perhaps the correlation between DJTA and DJBA (or DJUA). To the right of the matrix are the selected principal components series (unnormalized) and the selected reconstructed series. Select column B (DJIA) and then scroll right and select column R ("B2:B253 PCA Reconst.") while holding the Ctrl key. Click the chart button to compare the two series. As you can see, the reconstruction is not too bad over the entire range. Try the same for columns G and S (SP500H and its reconstruction). PCA does not always give clear-cut results and interpretations; as with any other tool, employ it with a healthy dose of common sense.
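
The steps the dialog performs (normalize, build the correlation matrix, diagonalize, form the components, reconstruct an input from the components kept) can be sketched in plain Python. This is only an illustration under stated assumptions: the eigen-solver here is a simple cyclic Jacobi rotation, which may not be what TimeStat uses, the sample standard deviation uses n-1, and all function names are hypothetical.

```python
import math


def standardize(columns):
    """z-score each column; also return (mean, std) pairs for rescaling later."""
    zs, stats = [], []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        zs.append([(v - mean) / std for v in col])
        stats.append((mean, std))
    return zs, stats


def correlation_matrix(zs):
    n = len(zs[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in zs]
            for zi in zs]


def jacobi_eigen(matrix, sweeps=50):
    """Eigenvalues (descending) and eigenvectors of a symmetric matrix."""
    p = len(matrix)
    a = [row[:] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < 1e-22:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-300:
                    continue
                # Rotation angle that zeroes the (i, j) off-diagonal element.
                theta = 0.5 * math.atan2(2 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate columns i and j
                    aki, akj = a[k][i], a[k][j]
                    a[k][i], a[k][j] = c * aki + s * akj, -s * aki + c * akj
                for k in range(p):  # rotate rows i and j
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k], a[j][k] = c * aik + s * ajk, -s * aik + c * ajk
                for k in range(p):  # accumulate eigenvectors as columns of v
                    vki, vkj = v[k][i], v[k][j]
                    v[k][i], v[k][j] = c * vki + s * vkj, -s * vki + c * vkj
    order = sorted(range(p), key=lambda i: -a[i][i])
    return [a[i][i] for i in order], [[v[k][i] for k in range(p)] for i in order]


def principal_components(zs, vectors, keep):
    """PCA_n[t] = sum over inputs k of vectors[n][k] * zs[k][t], for n < keep."""
    rows = len(zs[0])
    return [[sum(vec[k] * zs[k][t] for k in range(len(zs))) for t in range(rows)]
            for vec in vectors[:keep]]


def reconstruct(k, pcs, vectors, stats):
    """Rebuild input k from the kept components, then undo the z-score."""
    mean, std = stats[k]
    rows = len(pcs[0])
    return [mean + std * sum(vectors[n][k] * pcs[n][t] for n in range(len(pcs)))
            for t in range(rows)]
```

Keeping every component makes reconstruct() return the input series exactly (to rounding error); keeping fewer gives the kind of approximate reconstruction charted in the example.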

Some remarks about the PCA output: for the analysis using the correlation matrix as above, the input data series are normalized to a mean of 0 and a variance of 1. The output PCAn (n=1,2,...) series are linear combinations of the normalized inputs (with the coefficients given by the corresponding eigenvector); they must have mean = 0 by definition, but the variance is not normalized to 1. In fact, the variance of the PCAn series is just the nth eigenvalue of the correlation matrix.

The reconstructions of the input series are linear combinations of the selected PCA series, with coefficients given by the appropriate components of the eigenvectors. In our example above, the column B reconstruction (the 1st input variable) is a linear combination of columns PCA1, PCA2, and PCA3 with coefficients given by the first components of the first 3 correlation matrix eigenvectors (in this case, located in cells I18, J18, and K18). The resulting series is then multiplied by the standard deviation of the original input series, and the mean of the original series is added, for direct comparison. In a similar manner, the G column reconstruction uses the 6th components of the first 3 eigenvectors (cells I23, J23, and K23). Note that if we had chosen to use all principal components, generating PCA1 through PCA6, the reconstruction would be perfect: the reconstructed series would be exact copies of the original input series, since the operations performed amount to multiplying the input vector by a 6x6 matrix and then by its inverse, resulting in an identity operation.

All of the above linear combinations for the PCA and reconstruction series are conveniently expressed as proper spreadsheet formulae in TimeStat. Move the active cell to a number under a PCAn column or a reconstructed series column and check the edit bar above the spreadsheet to see the formula. This is useful for quickly calculating the PCAn series for fresh data in the original input series.
Simply place the new data at the bottom of each input series column, then copy down the formulae for each PCAn column into the new rows. The spreadsheet will take care of the rest.

11. Discrete wavelet transform (DWT) - the wavelet transform has been a hot topic in math, science, engineering, and more recently, economics and finance. General references are abundant and easy to find; references on applications in economics and finance are few right now but increasing in number. I cannot possibly do justice to its richness and complexity here or in the program. My implementation does allow you to play with the 1-dimensional DWT and hopefully gain some understanding through actual examples. The algorithm itself is actually much simpler than the mathematical concepts, in my opinion. The actual DWT code in C is probably no more than 100 lines. I adapted this version from Bob Lewis' Imager Wavelet Library (see README.TXT), so some of the bases implemented are probably more suitable for image processing.

Basically, wavelets allow simultaneous decomposition of a time series into components (bases) which are localized in both time and frequency. This is unlike the FFT, where the component sine and cosine waves are localized in frequency but totally unlocalized in time.

Example: With file '1994.xls' loaded, select column B (DJIA), then select the menu item 'Calculations/Z-Score' to normalize the series to zero mean and variance one (this is not necessary for the DWT but it makes the charting below easier). Select the newly created column C (the z-score); select the menu item 'Analysis/Discrete Wavelet Transform'. The DWT dialog allows you to choose from a selection of bases and perform thresholding on the coefficients. For now leave the threshold slide bar at 0% (no thresholding), choose the Daubechies 4 basis, check the forward transform box (default), and check the 'Freq. Band Decomposition' box. Click 'OK'.
On the spreadsheet you now have a new column of Daub4 DWT coefficients arranged in the standard manner: rows 130 to 257 represent the upper frequency band detail (roughly Fc/2 to Fc, where Fc is the Nyquist critical frequency, or 1/2 the sampling rate of the input series; in this case, the sampling rate is 1/trading day, so Fc is 1/(2 trading days)). Rows 66 to 129 represent the next band (Fc/4 to Fc/2), and so on until the first coefficient, which represents the scaling function. This arrangement comes from the multiresolution analysis (MRA) of the wavelet transform; look up any introduction to wavelets for a discussion.

The next 7 columns represent the spectral components of the input series. Summing the 7 columns row by row would reproduce the original input series. As the column titles indicate, each column represents the 'information content' of one of the frequency bands described above. At this point one can throw away the high (or the low) frequency bands, similar to filtering using FFTs. One recently proposed technique involves using separate neural networks to forecast each individual decomposed series and recombining the results to obtain a forecast for the original series.

Select column D (Daub4 DWT) and select the menu item Analysis/Discrete Wavelet Transform. Uncheck the Forward/Reverse check box (to get the reverse transform). Clicking 'OK' at this point would just give us back the original input series (the z-score of DJIA). Move the 'Quantile threshold' slide bar to 75%, make sure the basis option is set to 'Daubechies 4', and click 'OK'. The program now sets to zero those coefficients that represent the smallest 75% in magnitude, and then performs the inverse transform. Select columns C and E and chart them. Note that, with only 25% of the wavelet coefficients, we get a surprisingly good reproduction of the original series.
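
The transform, threshold, inverse-transform sequence just described can be tried outside the spreadsheet with the sketch below. It is a textbook periodic Daubechies-4 pyramid written from the standard filter coefficients, not the Imager Wavelet Library code TimeStat adapts; it assumes a series length that is a power of 2 (at least 4) and stops with two smooth coefficients at the coarsest level, so the coefficient layout may differ in detail from TimeStat's output. The function names are invented for this example.

```python
import math

SQ3, SQ2 = math.sqrt(3.0), math.sqrt(2.0)
# Daubechies-4 filter coefficients.
C = [(1 + SQ3) / (4 * SQ2), (3 + SQ3) / (4 * SQ2),
     (3 - SQ3) / (4 * SQ2), (1 - SQ3) / (4 * SQ2)]


def _step(x):
    """One level: smooth half first, detail half second (periodic wrap)."""
    n, half = len(x), len(x) // 2
    smooth, detail = [], []
    for i in range(half):
        a, b, c, d = (x[(2 * i + j) % n] for j in range(4))
        smooth.append(C[0] * a + C[1] * b + C[2] * c + C[3] * d)
        detail.append(C[3] * a - C[2] * b + C[1] * c - C[0] * d)
    return smooth + detail


def _unstep(y):
    """Inverse of _step (the transpose, since the step is orthogonal)."""
    n, half = len(y), len(y) // 2
    x = [0.0] * n
    for i in range(half):
        s, d = y[i], y[half + i]
        x[(2 * i) % n] += C[0] * s + C[3] * d
        x[(2 * i + 1) % n] += C[1] * s - C[2] * d
        x[(2 * i + 2) % n] += C[2] * s + C[1] * d
        x[(2 * i + 3) % n] += C[3] * s - C[0] * d
    return x


def dwt(series):
    """Forward transform; finest detail ends up in the upper half."""
    out = list(series)
    n = len(out)
    while n >= 4:
        out[:n] = _step(out[:n])
        n //= 2
    return out


def idwt(coeffs):
    """Inverse transform, rebuilding from the coarsest level up."""
    out = list(coeffs)
    n = 4
    while n <= len(out):
        out[:n] = _unstep(out[:n])
        n *= 2
    return out


def threshold(coeffs, fraction):
    """Zero the smallest `fraction` of the coefficients by magnitude."""
    drop = int(len(coeffs) * fraction)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]))
    kept = list(coeffs)
    for i in order[:drop]:
        kept[i] = 0.0
    return kept


series = [math.sin(0.3 * t) + 0.1 * t for t in range(16)]
coeffs = dwt(series)
approx = idwt(threshold(coeffs, 0.75))   # rebuilt from the largest 25%
```

Calling idwt(coeffs) without thresholding returns the input series to rounding error; approx is the counterpart of the 75% threshold reconstruction charted in the example.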
Note also that this 'approximation' is different from the usual smoothing or averaging in that small rapid oscillations are eliminated but all sharp turns of significant magnitude are faithfully captured with great accuracy. This is one of the advantages of wavelets over the Fourier transform; you would need a large number of FFT coefficients to accurately reproduce large, rapid, and isolated variations, while it takes only a few compact wavelets to do the same job.

Some final comments about the DWT. To actually see what the wavelet bases look like, inverse transform a series with only one non-zero coefficient. For example, fill an empty column with zeroes from row 1 to row 256, set row 12 to 1, inverse transform this column with a basis of your choice, and then chart the output series to see what the wavelet looks like.

What is the downside to using the DWT? Like the FFT, the DWT suffers from aliasing effects at the ends. The algorithm assumes the series is periodic, just as the FFT does. Padding with zeroes alleviates but does not solve the problem. For forecasting and financial time series analysis, the ending points of the series are precisely where the most important information may lie. The solution to this problem is to use orthonormal bases which 'live' in finite intervals. I hope to implement such wavelets on the interval sometime.

Which basis should you use? There is no 'wrong' basis; any basis will transform any data series. You can choose a 'best' basis based on the data pattern. For example, I used the Daub4 basis above because these wavelets have sharp corners (mathematically, discontinuities in the first derivative) which happen to suit the stock data well. If I had used smoother wavelets (Daub8 or Daub16 for example), the inverse transform with 75% thresholding would show rounded corners. There are more systematic ways to understand and choose bases, but that is beyond our scope here.

12. Moving Averages (MA) - under the 'Calculations' menu, there are 3 types of moving averages: simple (SMA), exponential (XMA), and adaptive (AMA). Choosing any of these produces a dialog box prompting you to enter an MA window length, n; the positive integer entered should be >1 and <=500. SMA is just that, an average of the past n days. XMA gives more weight to more recent days so that sharp features will not have excessive influence far into the future. AMA attempts to reduce the 'lag' inherent in all moving averages (see the NeuroVe$t July '95 issue). AMA involves subtracting a multiple of the previous XMA from the current XMA and smoothing the result with XMA. I note here that in the June '95 issue of Technical Analysis of Stocks and Commodities, the interview with Perry Kaufman also presented an 'Adaptive Moving Average'. It is different from (and more complicated than) what we have here. Those interested can of course look it up, enter the relevant Excel formulae, and compare.

13. BDS test - First let me cover my behind: I do not pretend to even begin to have a grasp of the vast field of non-linear dynamics, chaos, and their applications in time series. The BDS test is a widely used technique for detecting non-linearity (or more precisely: deterministic structure) in time series. I have basically taken the C code provided by Prof. Blake LeBaron of the University of Wisconsin, hacked and compiled it into TimeStat, adding only a bare user interface for the input parameters. It is impossible to use BDS statistics without some knowledge of the theory and the background. Unfortunately both are way beyond the scope of these notes.

BDS tests the null hypothesis that the input series is independent. The analysis is NOT going to spit out a 'yes' or 'no' answer (more like 'maybe yes', 'perhaps no', 'good chance that...'). To generate the BDS statistics, select a column and choose the Analysis/BDS menu item.
A dialog box appears for the user to enter two parameters: a real parameter, epsilon, expressed as a % of the standard deviation of the series, and an integer parameter, m. Epsilon is the nearest neighbor cutoff for the correlation integral calculation (two points in space are neighbors if they are within this distance of each other). m is the maximum embedding dimension to be considered. A time series {x1, x2, x3, ..., xn} is embedded into spatial dimension m by forming the m-tuples {(x1,x2,...,xm), (x2,x3,...,x(m+1)),...}. So embedding in m=2 would consist of forming ordered pairs {(x1,x2), (x2,x3), (x3,x4),...} in 2d space. The correlation integral measures the number of pairs of embedded points within distance epsilon of each other. The correlation integral of a truly independent series has a definite asymptotic behavior, giving us the null hypothesis for BDS analysis. For a given m entered, the program will calculate BDS statistics for embedding dimensions 2, 3,..., m.

Clicking the button initiates the calculation, which can be lengthy depending on the size of the series and on the embedding dimension; on my 486/66 under Windows NT, a 1000-point series took about 40 seconds for m=5 and 2.5 minutes for m=10. In Linux and 32-bit NT, and using the faster algorithm, the times are down to seconds. The results are presented in a new column to the right of the series. The new column first lists the series' standard deviation and then the actual value of epsilon used (if 100% was entered, then epsilon = std. dev.), followed by the integer m and finally (m-1) real numbers representing the BDS statistics calculated for embedding dimensions 2, 3,..., m in order.

That is it for the easy part; the much harder part is the proper interpretation of the statistics. The statistics generated for each embedding dimension should be asymptotically (meaning for a large number of data points) Gaussian normal with mean 0 and standard deviation 1 for a truly independent series (for example, the increments of a random walk).
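
The embedding and correlation integral defined above are easy to write down directly. The sketch below is a brute-force illustration, not LeBaron's algorithm or TimeStat's code: it measures distance with the maximum norm (the usual convention for BDS, assumed here), and it stops at the raw correlation integral. The full BDS statistic also needs a variance normalization, which is not shown. Function names are hypothetical.

```python
import math
from itertools import combinations


def embed(series, m):
    """Form the m-tuples (x1..xm), (x2..x(m+1)), ... from the series."""
    return [tuple(series[i:i + m]) for i in range(len(series) - m + 1)]


def epsilon_from_percent(series, percent):
    """Epsilon as a percentage of the series' sample standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / (n - 1))
    return std * percent / 100.0


def correlation_integral(series, m, eps):
    """Fraction of pairs of embedded points closer than eps (max norm)."""
    points = embed(series, m)
    pairs = close = 0
    for p, q in combinations(points, 2):
        pairs += 1
        if max(abs(a - b) for a, b in zip(p, q)) < eps:
            close += 1
    return close / pairs


pairs_2d = embed([1, 2, 3, 4], 2)   # [(1, 2), (2, 3), (3, 4)]
```

For an independent series the m-dimensional correlation integral approaches the 1-dimensional one raised to the power m as the series gets long; the BDS statistic is built on the (suitably scaled) difference between the two.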
Example: in any column generate 1000 random numbers using the built-in rand() function (type "=rand()" in a cell and copy to others). Select the column and select Analysis/BDS Statistics menu item. Accept the default values in the pop-up dialog. The hourglass cursor will be around for 20 to 50 seconds and then the results will be presented in the column to the right. The results for all embedding dimensions should be small, mostly within -1 to +1 and rarely beyond about -2 to +2, so the null hypothesis (independent series) is not rejected. Copy the entire column and do an Edit/Paste Values to copy the random numbers (but not the formula) to a new column. Sort the new column by Sheet/Sort (accept default settings). Run BDS again on the new column. You should get some large numbers far from zero, indicating rejection of the null hypothesis (series not independent).

The deceptively simple procedure above hides the complicated background, theory, calculations, and interpretations involved in BDS analysis. To begin with, non-linear/chaos studies and analysis usually need a large amount of data just to get started. The BDS test will spit out some result for data series as short as 100 points, but the interpretation is much trickier and most likely suspect. The authors of the reference below conducted extensive Monte Carlo tests to study the small sample behavior of BDS. Results indicate that for small samples the BDS statistics are typically not Gaussian normal. They provided some tables as guidelines for interpreting small samples. I reproduced the relevant tables in the included spreadsheet BDSQUANT.XLS. The tables are provided for 100, 250, and 500 data points for embedding dimensions 2, 3, 4, 5, and sometimes 10. Each table also has a column of standard normal quantiles for comparison. Generally, for N data points and maximum embedding dimension m, one can use the Gaussian distribution if N/m > 200.

Example: Open 1994.XLS, select the DJIA column (column B) and run BDS.
The series is 253 points long so we will use the 250-point table. The calculated BDS statistics are very large numbers, in fact well off the tabulated values. The implication is that the DJIA series is not independent. No real big surprise; we could have reached the same conclusion by just looking at the chart. Difference the series once (Delta) and do BDS again. This time the BDS statistics are very small numbers (.57, .51, 1.52, 2.35 for m=2,3,4,5). Looking at the second part of Table 2 in BDSQUANT.XLS, we can see that these statistics are within the 10-90% quantile range, with the exception of m=5, which is not too far off. The interpretation here might then be that the once-differenced DJIA series has very little forecastable structure (recall how many traders got whipsawed badly that year?). Try the same thing for once-differenced DJBA. This time the statistics are all beyond the 97.5% quantile, allowing us to reject with some confidence the null hypothesis of independence (you will not catch me saying something like 'tradeable').

The examples I gave above almost certainly oversimplified and trivialized a complex subject and a powerful tool. Beyond the caveats, here are some final remarks:

- for series of lengths < 500, do not use m>7. The calculations will likely run into underflow problems because there are simply too few non-overlapping points in m-dimensional space.
- for those studs in numerical/statistical methods, proper use of bootstrapping can get better results for short series
- BDS is also well-suited for testing for remaining structure in the residuals of time series forecasting models (ARMA, ARCH, GARCH, NN, and so on).
- anyone serious about using this should read the reference given below and look at the source code

Reference: Nonlinear Dynamics, Chaos, and Instability by Brock, Hsieh, and LeBaron, MIT Press 1991.
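Two of the simpler steps used in the DJIA example, once-differencing a series and the N/m > 200 rule of thumb, can be sketched in Python as follows. The function names are hypothetical and this is not TimeStat's own code; I am also assuming that 'Delta' means each value minus the one before it.

```python
def first_difference(series):
    """Once-differenced series: x[t] - x[t-1] for t = 1..n-1
    (assumed to match what the Delta operation does)."""
    return [b - a for a, b in zip(series, series[1:])]


def gaussian_quantiles_ok(n_points, max_m):
    """Rule of thumb from the text: the standard normal quantiles are
    usable if N/m > 200; otherwise consult the small-sample tables."""
    return n_points / max_m > 200


if __name__ == "__main__":
    print(first_difference([1, 4, 9, 16]))  # [3, 5, 7]
    print(gaussian_quantiles_ok(253, 5))    # False
```

By this rule the 253-point 1994 series is far too short for the Gaussian quantiles at m=5, which is why the examples above fall back on the 250-point table in BDSQUANT.XLS.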