Simple regression and trend surface modeling

Simple regression techniques have been widely applied in spatial analysis for many decades. The general form of model employed can be represented by the expression:

y(x1,x2) = f{w} + ε

where: y is the (possibly transformed) observed value of some continuous dependent variable at location (x1,x2); f() is a general function to be defined, but often a linear expression in the coefficients; w is a vector of attributes measured at (x1,x2); and ε is a random error term. For example, if f() is simply modeled as a linear function of the coordinates we can write:

y = β0 + β1x1 + β2x2 + ε

Model performance can be assessed using a number of statistics that are discussed in standard statistical textbooks. One such statistic, which will be referred to in our subsequent discussion, is the squared correlation coefficient, or R2 statistic. This statistic records the proportion of variation in the data that is explained by the model. A modification of this statistic, the adjusted R2, takes into account the complexity of the model in terms of the number of variables that are specified.

With a set of n observations made at different locations in the study area, yi(x1i,x2i), a series of n equations can be set up and solved for the p=3 unknown parameters using ordinary least squares (OLS). This is simply a linear trend, or best fit linear surface, through the point set, {yi}. A set of differences (residuals) between the observed values and the best fit surface can then be computed:

ri = yi − ŷi,  i = 1, 2, …, n
where the y hat values (the estimated values) are obtained by evaluating the trend surface at each point i. Assuming the value of n is reasonably large and provides a representative coverage of the complete set of locations in the study area, these residuals may be treated as samples from a continuous surface and mapped using simple contouring or filled contouring procedures. If, on the other hand, the sample points represent centroids of zones (pre-defined regions or generated zones, e.g. Voronoi polygons) the visualization of residuals is generally applied to these zones as a choropleth map. Plotting the residuals using a combination of the fitted surface and stick/point display in 3D may also be helpful but is less common within GIS packages.
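As an illustration of the fitting and residual computation described above, the following Python sketch fits a first-order trend surface by OLS and evaluates the residuals at each sample point. It assumes NumPy is available and uses synthetic sample points and values purely for demonstration; the variable names are illustrative only.

```python
import numpy as np

# Illustrative sample data: n observation points with coordinates (x1, x2)
# and an observed value y at each point (synthetic values for demonstration)
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(0, 1.0, n)

# Design matrix for the first-order (linear) trend surface:
# y = b0 + b1*x1 + b2*x2 + error  (p = 3 unknown parameters)
A = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares solution for the p = 3 coefficients
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Fitted (y hat) values and residuals at each sample point
y_hat = A @ beta
residuals = y - y_hat

print("Fitted coefficients (b0, b1, b2):", beta)
print("Largest absolute residual:", np.abs(residuals).max())
```

The residuals array could then be passed to a contouring routine or joined to zone geometries for the kind of exploratory mapping described above.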

Whichever approach is adopted, the fitting of a trend surface and mapping of residuals is a form of Exploratory Spatial Data Analysis (ESDA) and is supported in many GIS and related packages. Within GIS packages such residual maps may be overlaid on topographic or thematic maps with a pre-specified transparency (e.g. 50%), enabling large positive or negative residuals to be visually compared with potential explanatory factors.

With higher order polynomials or alternative functions (e.g. trigonometric series) the fit of the trend surface to the sample data may be improved (the sum of the squared residuals reduced), but the overall utility of the procedure may well become questionable. This may occur for many reasons, including the increased difficulty of interpreting the fitted surface (e.g. in terms of process), and the increased volatility of the surface close to and immediately beyond the convex hull of the sample points and/or in sparsely sampled regions. In practice polynomial trend surfaces are usually limited to second order equations, although in some applications third order or higher have been shown to be useful; for example, Lichstein et al. (2002) fit a full third order polynomial trend equation to the spatial coordinates of bird species sightings. They then proceed to analyze the distribution of bird species by modeling habitat + trend (large scale spatial pattern) + spatial autocorrelation (small scale spatial pattern) in order to identify separate factors that might be used to explain observed distribution patterns by species.

Note that any analysis of trends in this way will result in a fitted surface whose form is heavily dependent upon the specific set of locations at which measured values of y are available and the scale factors operating for the problem being examined. Frequently data locations are not the result of selections made by the analyst but are pre-determined survey locations that may reflect convenience for data gathering (e.g. proximity to a forest path, access to quarry sites, location of wells) rather than an ideal sampling frame for analytic purposes.
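A higher order fit follows exactly the same OLS pattern, only with an expanded design matrix. The sketch below (again using NumPy with synthetic coordinates purely for illustration; the helper name cubic_trend_design is hypothetical) constructs the full third order polynomial design matrix in the two coordinates, which has 10 terms.

```python
import numpy as np

def cubic_trend_design(x1, x2):
    """Full 3rd order polynomial design matrix in (x1, x2): 10 columns,
    from the constant term up to x1^3, x1^2*x2, x1*x2^2 and x2^3."""
    return np.column_stack([
        np.ones_like(x1),                       # constant term
        x1, x2,                                 # first order terms
        x1**2, x1 * x2, x2**2,                  # second order terms
        x1**3, x1**2 * x2, x1 * x2**2, x2**3,   # third order terms
    ])

# Synthetic example data (for illustration only)
rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = np.sin(x1 / 3.0) + 0.1 * x2 + rng.normal(0, 0.2, n)

A = cubic_trend_design(x1, x2)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta
print("Sum of squared residuals:", float(residuals @ residuals))
```

In practice it is usually wise to center (and possibly rescale) the coordinates before forming the higher order terms, as raw map coordinates can make the normal equations badly conditioned.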

The most common measure of the quality of fit of a trend surface or similar regression is the coefficient of determination, which is the squared coefficient of (multiple) correlation, R2. This is simply a function of the squared residuals, with standardization being achieved using the sum of squared deviations of the observations from their overall mean:

R2 = 1 − Σi(yi − ŷi)2 / Σi(yi − ȳ)2
Under appropriate conditions, e.g. ε~N(0,σ2I), the significance of this coefficient can be evaluated in a meaningful way, with values close to 1 generally being highly significant. Frequently, however, the statistical significance of the measure cannot be determined because the distributional conditions are not met, and the value obtained should then be seen as no more than an indicator of goodness of fit.
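A minimal sketch of these goodness-of-fit measures is given below, computed from the residuals of an OLS trend surface fit. It is self-contained Python/NumPy, with synthetic data and the helper names r_squared and adjusted_r_squared assumed purely for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R2 for a model with p fitted parameters (including the
    intercept) and n observations; penalizes additional variables."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

# Example: first-order trend surface (p = 3 parameters) on synthetic data
rng = np.random.default_rng(2)
n = 100
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
y = 1.0 + 0.4 * x1 - 0.2 * x2 + rng.normal(0, 1.0, n)
A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

print("R2:", r_squared(y, y_hat))
print("Adjusted R2:", adjusted_r_squared(y, y_hat, p=3))
```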

Trend surface and residuals analysis is essentially a form of non-statistical ESDA — it is not necessary to state how significant the fit of the model is or what confidence limits might be placed around the values of the estimated coefficients and the fitted surface. For inferential analysis the standard pre-conditions that apply are extremely unlikely to be met in the context of spatial analysis. Conditions that may apply for statistical purposes include:

• the set {yi} comprises independent (uncorrelated) observable random variables. These do not have to be drawn from a Normal distribution, but it is often helpful or necessary that they are at least asymptotically Normal (possibly after transformation)

• the set {εi} comprises independently and identically distributed (iid) unobservable random variables with mean 0 and constant variance, σ2, where σ2 is not a function of β or x

• the set {εi} is Normally distributed; this condition is not a pre-requisite but is a requirement for inference on finite sets (as previously noted, it is directly connected to the first bullet point above)

• the model applied is appropriate, complete and global. This assumption incorporates the assumption that the independent variables, x, are themselves uncorrelated, and that the parameters β are global constants

Tests for the Normality of the observations and the residuals are commonplace (the first and third conditions above), and a variety of tests for heteroskedasticity (failure to meet the second condition) have been implemented in several GIS-related packages. Note that heteroskedasticity may or may not be associated with positive spatial autocorrelation. As discussed in Section 5.5.2, Global spatial autocorrelation, testing for (global) spatial autocorrelation using the Moran I statistic is a useful starting point for such analyses. A spatial weights matrix, W, is required in order to generate the Moran I statistic. As shown earlier this statistic has the form:

I = [n / (ΣiΣj wij)] · [ΣiΣj wij(yi − ȳ)(yj − ȳ)] / [Σi(yi − ȳ)2]
This measure is the spatial equivalent of the standard product moment correlation coefficient. The weighting terms, wij, are defined either from patterns of zone adjacency or using some form of distance-decay function. A value of Moran's I=1 indicates perfect positive spatial autocorrelation, whilst a value of I very close to 0 indicates no spatial autocorrelation. A value such as I=0.2 indicates some positive spatial autocorrelation, although this may or may not be statistically significant; with large spatial datasets even small positive values of I are often identified as significant, although such significance may be of limited practical value.
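The following self-contained Python sketch computes Moran's I for the residuals of a first-order trend surface using the formula above. The data are synthetic, and an inverse-distance weights matrix is assumed as one of many possible choices of W; the helper name morans_i is illustrative only.

```python
import numpy as np

def morans_i(z, W):
    """Moran's I for values z using spatial weights matrix W:
    I = (n / sum(W)) * (z'Wz) / (z'z), where z are deviations from the mean."""
    z = z - z.mean()
    n = len(z)
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

# Synthetic sample points and a first-order trend surface fit (illustration only)
rng = np.random.default_rng(3)
n = 100
coords = rng.uniform(0, 10, (n, 2))
x1, x2 = coords[:, 0], coords[:, 1]
y = 1.0 + 0.4 * x1 - 0.2 * x2 + rng.normal(0, 1.0, n)
A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# Inverse-distance spatial weights with a zero diagonal (one of many choices)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
W = np.zeros_like(d)
nonzero = d > 0
W[nonzero] = 1.0 / d[nonzero]

print("Moran's I of residuals:", morans_i(resid, W))
```

In practice the significance of I would be assessed against a permutation or analytical null distribution, for example using tools in the PySAL family of packages.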

Assuming that significant spatial autocorrelation is identified, several options exist for continued analysis using regression methods:

• proceed with the analysis ignoring the observed spatial autocorrelation (particularly if it is small), accepting that confidence intervals and significance levels either cannot be determined or, if computed, may be misleading

• drop the requirement that the parameters β are global constants and allow them to vary with location. Effectively this involves fitting a series of regression surfaces to the dataset in such a way as to ensure a continuous prediction surface is created. One procedure for achieving this objective is known as Geographically Weighted Regression (GWR), the core idea of which is sketched after this list; or

• include additional elements in the regression model that take explicit account of the observed pattern of spatial autocorrelation, as described in Section 5.6.4, Spatial autoregressive and Bayesian modeling.
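As a rough indication of the second option, the sketch below implements the core idea behind GWR in plain NumPy: at each sample location a weighted least squares fit is performed, with weights supplied by a Gaussian kernel of an assumed fixed bandwidth. This is only an illustration of locally varying coefficients under those assumptions, not a full GWR implementation (no bandwidth selection or diagnostics); the data, bandwidth value and the helper name local_coefficients are hypothetical.

```python
import numpy as np

def local_coefficients(coords, y, X, bandwidth):
    """Fit a weighted least squares regression at each sample location,
    with Gaussian kernel weights based on distance to that location.
    Returns an (n, p) array of locally varying coefficient estimates."""
    n, p = X.shape
    betas = np.empty((n, p))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)   # Gaussian kernel weights
        Xw = X * w[:, None]                       # weight each observation
        # Solve the weighted normal equations (X'WX) beta = X'Wy
        betas[i] = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return betas

# Synthetic data with a coefficient that drifts across the study area
rng = np.random.default_rng(4)
n = 200
coords = rng.uniform(0, 10, (n, 2))
x1, x2 = coords[:, 0], coords[:, 1]
slope = 0.2 + 0.05 * x1                  # spatially varying "true" effect
y = 1.0 + slope * x2 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x2])    # intercept + one predictor
betas = local_coefficients(coords, y, X, bandwidth=2.0)
print("Range of local slope estimates:", betas[:, 1].min(), betas[:, 1].max())
```

The resulting local coefficient estimates can themselves be mapped, which is one of the principal attractions of the approach.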