Autocorrelation, time series and spatial analysis
As we saw in Table 1‑3, if we have a sample set {xi,yi} of n pairs of data values, the correlation between them is given by the ratio of the covariance (the way they vary jointly) to the product of the standard deviations of the two variables (the square root of each variance). This is effectively a way of standardizing the covariance by the spread of each variable, which ensures that the correlation coefficient, r, falls in the range [‑1,1]. The formula used for this ratio is:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
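The correlation coefficient described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `pearson_r` and the sample values are assumptions made for the example, not part of the original text.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r for paired samples {x_i, y_i}."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations (n times the covariance).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (n times each variance).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

# Illustrative values only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0]))  # perfectly linear, increasing
print(pearson_r(xs, [5.0, 4.0, 3.0, 2.0, 1.0]))   # perfectly linear, decreasing
print(pearson_r(xs, [2.0, 1.0, 4.0, 3.0, 5.0]))   # positive but imperfect
```

Because the common factor 1/n cancels between numerator and denominator, the sketch works directly with sums of deviations rather than with the covariance and variances themselves.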
Now suppose that instead of a set of data pairs {xi,yi} we have a set of n values, {xt}, which represent measurements taken at different time periods, t=1,2,3,4,…n. Examples include daily levels of rainfall at a particular location, or the closing daily price of a stock or commodity. Figure 5‑24 shows a typical stock price time series: the blue (thin, jagged) line is the closing stock price on each trading day; the red/gray and black looped lines highlight the time series for 7‑day and 14‑day intervals or ‘lags’, i.e. the sets {xt,xt+7,xt+14,xt+21,...} and {xt,xt+14,xt+28,xt+42,...}.
Figure 5‑24 Time series of stock price and volume data
The pattern of values recorded and graphed might show that rainfall, or commodity prices, exhibits some regularity over time. For example, it might show that days of high rainfall are commonly followed by another day of high rainfall, and days of low rainfall are also often followed by days of low rainfall. In this case there would be a strong positive correlation between the rainfall on successive days, i.e. on days that are one step or lag apart. We could regard the set of “day 1” values as one series, {xt,1} t=1,2,3…n‑1, and the set of “day 2” values as a second series, {xt,2} t=2,3…n, and compute the correlation coefficient for these two series in the same manner as for the r expression above. Each series has a mean value, which is simply:

$$\bar{x}_{(1)} = \frac{1}{n-1}\sum_{t=1}^{n-1} x_t$$

and

$$\bar{x}_{(2)} = \frac{1}{n-1}\sum_{t=2}^{n} x_t$$
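Using these two mean values (written here as x̄(1) for the first series and x̄(2) for the second) we can then construct a correlation coefficient between our two series. This is essentially the same formula as for r:

$$r_1 = \frac{\sum_{t=1}^{n-1}(x_t-\bar{x}_{(1)})(x_{t+1}-\bar{x}_{(2)})}{\sqrt{\sum_{t=1}^{n-1}(x_t-\bar{x}_{(1)})^2\,\sum_{t=2}^{n}(x_t-\bar{x}_{(2)})^2}}$$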
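If n is reasonably large then the value 1/(n‑1) will be very close to 1/n, and the values of the two means and standard deviations will be almost the same, so both means can be replaced by the overall mean of the series, x̄, and the above expression can be simplified under these circumstances to:

$$r_1 = \frac{\sum_{t=1}^{n-1}(x_t-\bar{x})(x_{t+1}-\bar{x})}{\sum_{t=1}^{n}(x_t-\bar{x})^2}$$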
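The effect of this simplification can be checked numerically. The sketch below compares the full lag‑1 correlation (two separate means and standard deviations) with the simplified form (one overall mean) on a very short series and on a long one. The function names, the simulated series and its parameters are assumptions chosen purely for illustration.

```python
import math
import random

def lag1_exact(x):
    """Correlation between x_1..x_{n-1} and x_2..x_n, each with its own mean."""
    first, second = x[:-1], x[1:]
    m1 = sum(first) / len(first)
    m2 = sum(second) / len(second)
    num = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    den = math.sqrt(sum((a - m1) ** 2 for a in first) *
                    sum((b - m2) ** 2 for b in second))
    return num / den

def lag1_simplified(x):
    """Simplified lag-1 coefficient using the overall mean of the series."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den

def simulated_series(n, phi, seed):
    """Hypothetical test data: each value is phi times the previous value plus noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x

short = [1.0, 2.0, 3.0, 4.0, 5.0]
long_series = simulated_series(n=2000, phi=0.7, seed=1)

print("n=5:    exact", lag1_exact(short), "simplified", lag1_simplified(short))
print("n=2000: exact", round(lag1_exact(long_series), 4),
      "simplified", round(lag1_simplified(long_series), 4))
```

For the five-value series the two forms disagree sharply (1.0 against 0.4), whereas for the long series they should agree closely, which is the sense in which the simplification requires n to be reasonably large.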
This expression is known as the serial correlation coefficient for a lag of 1 time period. It may be generalized for lags of 2, 3, …,k steps as follows:

$$r_k = \frac{\sum_{t=1}^{n-k}(x_t-\bar{x})(x_{t+k}-\bar{x})}{\sum_{t=1}^{n}(x_t-\bar{x})^2}$$
The term autocorrelation coefficient has been used since the 1950s to describe this expression, rather than serial correlation coefficient. The top part of this expression is like the covariance, but at a lag of k, and the bottom is like the covariance at a lag of 0 (i.e. the variance). These two components are sometimes known as the autocovariance at lags k and 0. In time series analysis it is usual for the time spacing, or “distance”, to be in equal steps. The set of values {r_k} can then be plotted against the lag, k, to see how the pattern of correlation varies with lag. This plot is known as a correlogram, and provides a valuable insight into the behavior of the time series at different lags or “distances”.
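As a rough sketch, the autocorrelation coefficient at lag k and the values needed for a correlogram can be computed as follows. The function names `autocorrelation` and `correlogram`, the alternating test series and the text-based bar display are all illustrative assumptions.

```python
def autocorrelation(x, k):
    """Autocorrelation at lag k: autocovariance at lag k over autocovariance at lag 0."""
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag must satisfy 0 <= k < n")
    mean = sum(x) / n
    # Denominator: sum of squared deviations (lag 0).
    c0 = sum((v - mean) ** 2 for v in x)
    if c0 == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Numerator: sum of cross-products of deviations k steps apart.
    ck = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    return ck / c0

def correlogram(x, max_lag):
    """Autocorrelation values for lags 0..max_lag."""
    return [autocorrelation(x, k) for k in range(max_lag + 1)]

# A series that alternates between low and high values (illustrative only).
series = [1, 3, 1, 3, 1, 3, 1, 3]
for k, r in enumerate(correlogram(series, 4)):
    bar = "#" * round(abs(r) * 20)  # crude text rendering of the correlogram
    print(f"lag {k}: {r:+.3f} {bar}")
```

For this alternating series the coefficients switch sign from one lag to the next, and their magnitude shrinks as k grows because fewer terms contribute to the numerator while the denominator stays fixed.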
For a random series the values of r_k will all be approximately 0; in fact, for large n they are approximately distributed as N(0,1/n). If there is short-term correlation, as in our rainfall example, the r_k will start high (close to +1) and decrease to roughly 0 once the lag exceeds the length (or range) of this correlation.

It is possible, of course, that the overall pattern of rainfall shows a steady increase over time, in which case the correlogram will not tend to zero in the manner expected. In this case the series is described as non-stationary (see further, Section 6.7.1), and an attempt should be made to remove the trend component before carrying out such analysis. Typically this involves fitting a trend curve (e.g. a best fit straight line) to the original data points and subtracting the values of this curve from the original data at each time period, t=1,2,3…, before computing the correlogram. The original data may also contain outliers, which if left in may distort the analysis, so inspection of the source data and outlier adjustment or removal (e.g. of data errors) may be advisable.

Having identified these factors, adjusted the data if necessary, and computed the correlogram, the next step is to examine the results and attempt to interpret the observed patterns. Unfortunately, as is the case with many patterns observed in time or space, more than one process can generate an identical pattern. However, modeling an observed pattern may provide an effective means of estimating missing or sparse data, or predicting values beyond the observed range, even though the process that generated the pattern may not be unique.
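The detrending step described above can be sketched as follows, assuming a straight-line trend fitted by ordinary least squares. The simulated data (random noise plus a linear trend), the function names and the band of roughly ±2/√n are illustrative assumptions. The band is simply two standard deviations under the N(0,1/n) approximation for a random series.

```python
import math
import random

def autocorrelation(x, k):
    """Autocorrelation coefficient at lag k (simplified form, overall mean)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    ck = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    return ck / c0

def detrend_linear(x):
    """Subtract a least-squares straight line (fitted against t = 0..n-1) from x."""
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = sum(x) / n
    sxt = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x))
    stt = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxt / stt
    intercept = x_mean - slope * t_mean
    return [v - (intercept + slope * t) for t, v in enumerate(x)]

# Hypothetical data: independent random noise with a steady upward trend added.
rng = random.Random(42)
n = 400
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
trended = [0.05 * t + e for t, e in enumerate(noise)]
residuals = detrend_linear(trended)

band = 2 / math.sqrt(n)  # about two standard deviations for a random series
print(f"approximate band for a random series: +/-{band:.3f}")
for k in (1, 5, 10):
    print(f"lag {k}: with trend {autocorrelation(trended, k):+.3f}, "
          f"detrended {autocorrelation(residuals, k):+.3f}")
```

With the trend left in, the coefficients stay close to +1 at every lag shown. After the fitted line is subtracted, they should fall to values near zero, as expected for a random series.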
These comments apply to series which follow a clear sequence of steps in a single dimension, time. At first sight such methods do not translate easily to spatial problems, since there is no obvious single direction to follow. Of course one could select a single transect and take measurements at fixed intervals to produce a well-ordered series, which could then be analyzed in exactly the manner described. But more general procedures are needed if a wide range of practical spatial problems are to be subjected to such analysis. These procedures need to model space in a manner that results in well-ordered data series, ideally in evenly spaced steps, using the general notion of proximity bands (see further, Section 5.5.2, Global spatial autocorrelation).