Pairwise distances

<< Click to Display Table of Contents >>

Navigation:  Data Exploration and Spatial Statistics > Point Sets and Distance Statistics >

Pairwise distances

A similar and in many ways more powerful approach, known as Ripley’s K (or L) function, is supported within several packages, including Crimestat, splancs and spatstat. Ripley’s procedure utilizes a larger set of event-event distances, not just nearest neighbor distances.

Figure 5‑18 Ripley’s K function computation


It operates as follows (see Figure 5‑18):

Construct a circle, radius d, around each point (event), i (d rather than r is generally used for the circle radius in this technique)

Count the number of other events, labeled j, that fall inside this circle

Repeat these first two stages for all points i, and then sum the results

These steps equate to computing the sum:


where I(dij)=1 if the distance, dij, from i to j is less then d otherwise I(dij)=0

Increment d by a small fixed amount (e.g. R/100, where R is the radius of a circle of area equal to the study area, A)

Repeat the computation, giving values of K(d) for a set of distances, d

The function K(d) can then be plotted against distance, d, in a manner similar to that described in Section 6.7 for correlograms and variograms. The graph of the (transformed) K(d) function provides an excellent insight into localized clustering, with characteristic graphs reflecting particular observed patterns. Computation of (the transformed) K(d) function for a mapped dataset is supported in both splancs, spatstat, and Crimestat and in all cases Monte Carlo simulation of confidence envelopes is provided.

As noted earlier, under CSR the expected number of events within a circle of radius r or d is simply the event density (sometimes referred to as the intensity), m, times the circle area. With m estimated as the overall number of events divided by the study area, N/A, we have

The observed K(d) function minus the expected value above has an expected value of L(d)=0 for a given d. L(d) measures the difference between the observed pattern and that expected under CSR. It is generally computed and plotted with d subtracted from the expectation:

If the component d is not subtracted (as per Ripley’s original formulation of this method) then the L(d) plot against d for a truly random distribution will be a line at 45 degrees (i.e. L(d)=d) whereas with d subtracted it will be approximately 0 for all d. This transformed expression is the form in which the K-statistic is analyzed within Crimestat. The sampling distribution of L(d) is unknown, but may be approximated by Monte Carlo simulation, as described earlier. Crimestat produces its simulations based on a rectangular region of area A and shape matching the MBR, whilst Splancs utilizes a user-specified polygon, which is more satisfactory in most cases.

Figure 5‑19 shows the L(d) plot for the lung cancer point set shown in Figure 5‑21 (998 observations). The envelope shown provides the maximum and minimum L(d) curves generated from 300 simulation runs. The observed plot falls outside this envelope over shorter distances, confirming the visually observable short-range clustering patterns. As d is incremented the behavior of the L(d) statistic becomes increasingly subject to border effects. Ideally the number of points utilized should be large (well in excess of 100 if possible) and where necessary edge correction procedures should be adopted. Simple edge correction can be implemented by weighting circular regions that lie partially outside the study boundary by the proportion lying within the study region.

The ideas behind the production of the standard K(d) statistic and associated Monte Carlo simulations can be used to provide additional measures that are often more useful. For example, instead of comparing the observed pattern to CSR, alternative models could be examined. splancs and spatstat, for example, support comparison with Poisson Cluster Processes (PCP) rather than a simple Poisson (CSR) process. PCP is similar to a simple Poisson process in that it starts with a random point set. These are regarded as “parents”. Each parent then produces a random number of offspring, and these are located at random around the parent according to a bivariate distribution function, g(x,y), for example a circular Normal with variance (spread), σ2:

The parents are then removed from the set and the PCP consists of the set of all offspring (Figure 5‑20).

Figure 5‑19 Ripley K function, shown as transformed L function plot


Splancs implements this kind of model with three parameters: RHO: intensity (density parameter, ρ) of the parent process; M: the average number of offspring per parent; and S2: the variance of location of offspring relative to their parent (the σ2 value in the expression above). By generating many PCP simulations for a given study region, an alternative probability envelope can be computed and plotted against the observed K- or L-function. There are many possible Poisson Cluster Processes; spatstat for example, implements the PCP process described above, which it refers to as a Thomas process. Other PCP processes supported in spatstat include: the Matérn PCP, in which g(x,y) is defined as a uniform disc of radius r around each parent; the Gauss-Poisson PCP in which each cluster is a single point or pair of points; and a very general PCP known as the Neyman-Scott process, of which the Thomas process is a specific instance.

Figure 5‑20 Thomas Poisson Cluster Process (20 clusters, SD=0.03, mean=5)


Source: Simulation generated using spatstat

A second important variant of the K(d) function involves treatment of two or more point (event) patterns within the same region. For example, one set of points might represent disease cases and a second set might be matched cases (controls) selected at random from the background population. Similar procedures may be used for spatio-temporal analysis, for example comparing crime patterns or disease incidence over time. If both datasets exhibit similar clustering we would expect the plots for both to be similar, or the difference between the computed functions, D=K1(d)‑K2(d), to be approximately zero for all d. Where background data are only available at an aggregated level (e.g. census Output Areas), it may be acceptable to assign the second set of points to the OA centroids together with a weighting factor reflecting the zone population. Crimestat supports such weighting as an option in order to assist in studies with background variation.

Another variant of the K(d) function, supported within splancs and in SANET for a network-based variant, is the bivariate model (or Cross-K function). With the Cross-K function the relationship between two separate point patterns is being examined. For example, in an ecological study this might involve examining collaborative or competitive behavior between two species, or an analysis of infected trees and the “at risk” population. In this instance the K(d) function measures the number of events in set B that lie within distance d of events A. The splancs function K12Hat supports this model together with supporting simulation functions to provide an estimated probability envelope for the function. A network-based equivalent, supported within the SANET software, might be used to examine the relationship between reported crime incidents and the location of railway stations (see Okabe et al., 2006a, 2006b, and Okabe and Satoh, 2008).

Readers may have noticed the similarity between the procedures involved in generating Ripley’s K, and the generation of kernel density maps (Section 4.3.4, Density, kernels and occupancy). The latter involves assigning a proportion of each event or point to a circular neighborhood, and then summing and normalizing the weighted surface at grid intersection points. This provides a continuous density surface rather than a linear model, and can be used to highlight local clustering. However, with case/control data or case/background variation data, separate surfaces may be computed for each dataset and then these may be combined, e.g. by taking the difference or the ratio of the case surface/control surface (for densities greater than some minimum value) to provide a comparison surface. Values of 1 on the ratio surface represent matching densities, whilst values greater than 1 represent locations where the cases occur in higher densities than the controls or background data.

As noted earlier, the majority of these distance-based statistics are computed using the Euclidean or similar analytical metrics. Where the observations lie on a network, for example an urban street network, it is often preferable to perform computations using a metric determined from the network, typically using shortest path calculations (see also, Figure 5‑5). Network-based statistical techniques implemented within the SANET package (see Okabe et al. 2006a, 2006b for more details) include: nearest neighbor analysis; conditional nearest neighbor analysis (in which the location of a related point set is incorporated into the model, see Okabe and Miki, 1984); Ripley’s K-function; and Ripley’s Cross-K function. If the observed points are not georeferenced to network nodes they can be assigned by SANET as additional nodes placed on the closest network link and then network variants of the statistical methods may be run. In these models distances are computed using shortest paths (which may be regarded as a reasonable behavioral assumption or approximation) and probability distributions (mean values and probability envelopes) are computed by Monte Carlo simulation.

It is worth concluding this subsection with some general observations (cautionary notes) regarding point pattern analysis:

The classical statistical model of CSR is inappropriate for many spatial datasets, but does provide a starting point for analysis and simulations in many instances

A close examination of the source data is always advisable, checking questions such as: how the data were collected; how locations and attributes were recorded; when data were recorded; what data (e.g. true or surrogate) information is represented; what overall measurement error is associated with the data; etc. Examining the underlying dataset for duplicates or near-duplicates is often an importance exercise

Multiple analytical approaches, including visualization techniques (1D, 2D, 3D) are advisable since initial impressions and single approaches may be misleading. Density (intensity) modeling and mapping are often useful exploratory techniques that aid understanding of the patterns observed

The variables selectable in analyzing point patterns (e.g. sample region, weights, order neighbor, maximum distance, clustering model, kernel function and bandwidth etc.) can yield very different results depending on the values and models chosen

Borders, area definitions, metrics, background variation, temporal variation and non-spatial data issues are of major significance in describing and modeling point patterns

Analysis of rare events presents particular problems — the small numbers involved may result in very low densities of events in large parts of a study region, if only by chance. Calculations based on zone counts are particularly subject to extremes and problems computing ratios

Analysis should be seen as exploratory, and a part of an overall in-depth analysis process

Analyses of this kind will not provide meaningful cause-effect or process-realization models ― in general pattern analysis tells us very little about process, although it may enable us to exclude certain processes from consideration. Many processes can result in the same or similar observed patterns, and different patterns can generate the same or very similar statistical measures, such as K(d) functions

Model fitting is often a challenging process, although recent toolsets make this process a great deal easier than in the past. Examination of the goodness of fit and sophisticated residuals analysis now form an important part of this type of spatial analysis