
# Spatial sampling

Principles and methods of spatial sampling have been described briefly in Section 2.2.8, Spatial sampling, and some of the general concepts are explored in more detail in Dixon and Leach (1977, CATMOG 17) and more recently in Delmelle (2008). In this subsection we focus on 2D sampling, but similar concepts apply to 1D (transect) and 3D (volumetric) sampling. When spatial samples are subsequently analyzed, many factors have to be taken into consideration, such as: sample size; how representative the sample is; whether the sample might be biased in any way; whether temporal factors are important; to what extent edge effects might influence the sample and the subsequent analysis; whether sampled data have been aggregated; how the data measurements were conducted (procedures and equipment); whether sampling order or arrangement is important; and to what extent the measured data samples can be regarded as being drawn from a population. In short, the full range of classical sampling issues must be considered, coupled with some specifically spatial factors.

Amongst the most commonly applied sampling schemes are those based on point sampling within a regular grid framework. Figure 5‑1 illustrates a number of the simplest schemes based on 100 sample points. The first (Figure 5‑1A) shows a set of regularly spaced sample points in a square sample region. Systematic sampling of this type, and variants such as that shown in Figure 5‑1D in which the start points of each sequence are selected with a random offset, suffer from two major problems: (i) the sampling interval may coincide with some periodicity in the data being studied (or a related variable), resulting in substantial bias in the data; and (ii) the set of distances sampled is effectively fixed, hence important distance-related effects (such as dispersal, contagion etc.) may be missed entirely. Purely random sampling, as illustrated in Figure 5‑1B, has attractions from a statistical perspective, but as can be seen, marked clustering occurs whilst some areas are left without any samples at all. A number of solutions to these problems are used, often based on combining the coverage benefits of regular sampling with the randomness of true random selection. Figure 5‑1C illustrates this class of sampling schemes, with each sample point being selected as a random offset from the regularly spaced (x,y) coordinates shown by the + symbols. The degree of offset determines how regular or how random the sampled points will be. Note that some clustering of samples may still occur with this approach.

In each of these examples the point selection is carried out without any prior knowledge of the objects to be sampled or the environment from which the samples are to be drawn. If ancillary information is available, it may alter the design selected. For example, if samples are sought that represent certain landscape classes (e.g. grassland, deciduous woodland, coniferous woodland, etc.), then it is generally preferable to stratify the samples by regions that have been separately identified as forming these various classes. Likewise, if 100 samples are to be taken, and it is known that certain parts of the landscape are much more varied than others (in respect of the data to be studied), then it makes sense to take more samples in the most variable regions.
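The four point-based schemes of Figure 5‑1 can be sketched in a few lines of Python. This is a minimal illustration on the unit square rather than code from any GIS package. The function names, the `max_offset` parameter and the reading of scheme D (each column of points starts at its own random y offset) are assumptions made for the example.

```python
import random

def regular_grid(n_side):
    """Scheme A: cell-centre points of an n_side x n_side grid on the unit square."""
    step = 1.0 / n_side
    return [((i + 0.5) * step, (j + 0.5) * step)
            for j in range(n_side) for i in range(n_side)]

def simple_random(n, rng):
    """Scheme B: n independent uniform random points."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def random_offset_from_regular(n_side, max_offset, rng):
    """Scheme C: shift each regular point by up to +/- max_offset cell widths."""
    step = 1.0 / n_side
    pts = []
    for x, y in regular_grid(n_side):
        dx = rng.uniform(-max_offset, max_offset) * step
        dy = rng.uniform(-max_offset, max_offset) * step
        # Clamp so that shifted points stay inside the study square.
        pts.append((min(max(x + dx, 0.0), 1.0), min(max(y + dy, 0.0), 1.0)))
    return pts

def regular_with_random_start(n_side, rng):
    """Scheme D (assumed reading): regular spacing, random y start per column."""
    step = 1.0 / n_side
    pts = []
    for i in range(n_side):
        y0 = rng.random() * step  # random start of this column's sequence
        pts.extend(((i + 0.5) * step, y0 + j * step) for j in range(n_side))
    return pts

rng = random.Random(42)  # fixed seed so the example is repeatable
schemes = {
    "A regular": regular_grid(10),
    "B random": simple_random(100, rng),
    "C random offset": random_offset_from_regular(10, 0.5, rng),
    "D random start": regular_with_random_start(10, rng),
}
for name, pts in schemes.items():
    print(name, len(pts), "points")
```

With `max_offset` close to 0 scheme C reproduces the regular grid, whereas a value of 0.5 lets each point fall anywhere within its own grid cell.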

Figure 5‑1 Point-based sampling schemes

A. Regular B. Random C. Random offset from regular (random clustered) D. Regular with random start of sequence (y offset)

In addition to fixed sampling schemes, adaptive schemes can be applied which may offer improvements in terms of estimating mean values and reducing uncertainty (providing lower variances). Typically an adaptive scheme will involve four steps: apply a coarse resolution fixed scheme (e.g. as per Figure 5‑1C) to the study area; record data at each sampled location; compute decision criteria for continued sampling; and extend sampling in the neighborhood of locations that meet the decision criteria. For example, if recorded values of cadmium in soil samples at certain locations exceed a pre-defined threshold, then additional samples might be taken radially around the initial threshold locations. Alternatively, the initial values at each location might be used to compute an experimental variogram (see Section 6.7.1, Core concepts in Geostatistics), from which estimated values and variances of these values can be computed using Kriging methods (see Section 6.7.2, Kriging interpolation). Locations with high Kriging variance (i.e. poorly represented) could then be identified and additional sampling designed to reduce this uncertainty.
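The four-step adaptive procedure with a threshold criterion might be sketched as follows. Everything specific here is an assumption for illustration: `measure` is a hypothetical stand-in for a field measurement (a synthetic surface with one local peak), and the threshold, radius and number of radial samples are arbitrary. For simplicity the coarse scheme is a plain regular grid.

```python
import math

def measure(x, y):
    """Hypothetical measurement: a synthetic surface peaking at (0.7, 0.3)."""
    return 5.0 * math.exp(-((x - 0.7) ** 2 + (y - 0.3) ** 2) / 0.02)

def adaptive_sample(n_side, threshold, radius, n_radial=6):
    step = 1.0 / n_side
    # Steps 1 and 2: coarse fixed scheme, recording a value at each location.
    initial = [((i + 0.5) * step, (j + 0.5) * step)
               for i in range(n_side) for j in range(n_side)]
    records = {p: measure(*p) for p in initial}
    # Step 3: decision criterion (here, simple exceedance of a threshold).
    hot = [p for p, value in records.items() if value > threshold]
    # Step 4: extend sampling radially around locations meeting the criterion.
    for x, y in hot:
        for k in range(n_radial):
            angle = 2.0 * math.pi * k / n_radial
            q = (x + radius * math.cos(angle), y + radius * math.sin(angle))
            inside = 0.0 <= q[0] <= 1.0 and 0.0 <= q[1] <= 1.0
            if inside and q not in records:
                records[q] = measure(*q)
    return records, hot

records, hot = adaptive_sample(n_side=5, threshold=1.0, radius=0.05)
print(len(hot), "locations exceeded the threshold;", len(records), "samples in total")
```

A variogram-based variant would replace the threshold test in step 3 with a test on estimated Kriging variance, but that requires a geostatistical library and is not shown.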

## Sampling frameworks

GIS toolsets and related software incorporate few facilities that directly address issues of sampling and sample design. Most commonly the terms sampling and resampling in GIS are used to refer to the frequency with which an existing dataset (raster image or in some cases, vector object) is sampled for simple display or processing purposes (e.g. when overlaying multiple data layers, or computing surface transects). These operations are not directly related to questions of statistical sampling. Two aspects of statistical sampling are explicitly supported within several GIS packages. These are: (i) the selection (sampling) of specific point or grid cell locations within an existing dataset; and (ii) the removal of spatial bias from collected datasets, using a procedure known as declustering. We describe each of these in the subsections below.

A number of GIS software packages, such as TNTMips, ENVI, Idrisi and GRASS, provide tools to assist in the selection of sample points, grid cells or regions of interest (ROI) from input datasets. Often these datasets are remote-sensing images, which may or may not have been subjected to some form of initial classification procedure. Examples of the facilities provided are listed below:

ENVI takes a raster image file as input and provides three types of sampling, which it describes as:

Stratified random sampling, which may be proportionate or disproportionate. In the former case random samples are drawn from each class or ROI in proportion to the class or region size. Disproportionate sampling essentially requires users to specify the sample size, although the elements will still be randomly selected from each class or ROI.

Equalized random sampling, which selects an equal number of observations at random from each class or ROI

Random sampling, which ignores classes or ROIs and simply selects a predefined number of cells or points at random

The selected points or cells (which may be output as a separate georeferenced list or table) are then used in post-classification analysis, in each case comparing classifications with ground truth obtained from field survey or other independent data sources.
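The three sampling types listed above can be illustrated on a small classified grid. This is a generic sketch and not ENVI's interface: the function names, the `fraction` and `n_per_class` parameters and the toy 10 x 10 classification are all assumptions.

```python
import random
from collections import defaultdict

def cells_by_class(class_grid):
    """Group (row, col) cell addresses by their class label."""
    groups = defaultdict(list)
    for r, row in enumerate(class_grid):
        for c, label in enumerate(row):
            groups[label].append((r, c))
    return groups

def proportionate_stratified(class_grid, fraction, rng):
    """Draw the same fraction of cells from every class (at least one each)."""
    return {label: rng.sample(cells, max(1, round(fraction * len(cells))))
            for label, cells in cells_by_class(class_grid).items()}

def equalized_random(class_grid, n_per_class, rng):
    """Draw an equal number of cells from every class."""
    return {label: rng.sample(cells, min(n_per_class, len(cells)))
            for label, cells in cells_by_class(class_grid).items()}

def simple_random_cells(class_grid, n, rng):
    """Ignore classes and draw n cells from the whole grid."""
    all_cells = [(r, c) for r, row in enumerate(class_grid) for c in range(len(row))]
    return rng.sample(all_cells, n)

# Toy classification: 60 grassland, 30 deciduous and 10 coniferous cells.
class_grid = ([["grass"] * 10 for _ in range(6)]
              + [["deciduous"] * 10 for _ in range(3)]
              + [["conifer"] * 10])

rng = random.Random(1)
proportionate = proportionate_stratified(class_grid, 0.2, rng)
equalized = equalized_random(class_grid, 5, rng)
unstratified = simple_random_cells(class_grid, 20, rng)
print({label: len(cells) for label, cells in proportionate.items()})
```

Disproportionate sampling would follow the same pattern as `equalized_random`, with a user-specified sample size per class in place of the single `n_per_class` value.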

Idrisi offers similar facilities to ENVI via its SAMPLE function, providing random, systematic or stratified random point (cell) sampling from an input image (grid). The selection process for stratified random simply involves regarding the input image as being constructed from rectangular blocks of cells, and then sampling random cells within these larger blocks.
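The block-based form of stratified random selection described above amounts to picking one random cell inside each rectangular block. The sketch below shows the idea only; it is not Idrisi's SAMPLE implementation, and the function name and block dimensions are assumptions.

```python
import random

def block_stratified_cells(n_rows, n_cols, block_rows, block_cols, rng):
    """Pick one random (row, col) cell inside each rectangular block of the grid."""
    picks = []
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            # Blocks on the lower and right edges may be truncated by the grid extent.
            r = rng.randrange(r0, min(r0 + block_rows, n_rows))
            c = rng.randrange(c0, min(c0 + block_cols, n_cols))
            picks.append((r, c))
    return picks

rng = random.Random(7)
picks = block_stratified_cells(n_rows=20, n_cols=30, block_rows=5, block_cols=10, rng=rng)
print(len(picks), "cells selected, one per block")
```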

GRASS provides simple random sampling which may be combined with masking to create forms of stratified random samples. This facility may be somewhat cumbersome to implement. GRASS also provides a facility to generate random sets of cells that are at least D units apart, where D is a user-specified buffer distance. This can result in a more stratified than random sample and it is suggested that D should be derived with reference to observed levels of spatial autocorrelation (cf. Sections 5.5, Spatial Autocorrelation, and 6.7, Geostatistical Interpolation Methods).
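Random locations that are at least D units apart can be produced by simple sequential rejection ("dart throwing"), as sketched below. This illustrates the concept on the unit square and is not the GRASS implementation; the function name, the `max_tries` safeguard and the value of D are assumptions.

```python
import math
import random

def random_points_min_distance(n, d, rng, max_tries=10000):
    """Draw up to n uniform random points, rejecting any closer than d to an accepted point."""
    pts = []
    tries = 0
    while len(pts) < n and tries < max_tries:
        tries += 1
        candidate = (rng.random(), rng.random())
        if all(math.dist(candidate, p) >= d for p in pts):
            pts.append(candidate)
    return pts  # may hold fewer than n points if d is too large for the region

rng = random.Random(3)
pts_apart = random_points_min_distance(n=50, d=0.05, rng=rng)
print(len(pts_apart), "points accepted")
```

As D grows relative to the size of the region the accepted points become increasingly evenly spread, which is why the result can resemble a stratified rather than a random sample.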

TNTMips supports a range of point sampling facilities to be used within vector polygons (e.g. field boundaries, Figure 5‑2). These provide the more familiar form of statistical sampling frameworks that would precede field studies, and have application for research into areas such as soil composition (e.g. for precision farming), groundwater analysis, geological studies, or ecological research. However, they could equally well be applied to urban environments, as a precursor to environmental monitoring or even household surveys. The software provides for a two-stage process: (i) the creation of grids within the polygonal regions to be studied; and (ii) the selection of points within these grid structures. Grids are of user-definable size (edge length or area), shape (triangular, hexagonal, square, linear strips or random rectangles), and orientation (angle of rotation). See examples in Figure 5‑2.

Where a regular cell framework has been generated the software then supports creation of sampling points within each cell (single points in this example). The methods supported are: regular (center of cell, Figure 5‑3A); systematic unaligned (Figure 5‑3B), in which the first cell point is selected with random x,y coordinates and subsequent points are selected using the same x or y coordinate as the previous cell, but with one of the two coordinates selected at random, alternating on a column-by-column basis; and random (Figure 5‑3C). In the latter two cases a weighting factor is provided that biases selection towards the center of the cell (100 = no bias, 1 = maximum bias). Selected sample points that nominally fall inside a cell but outside of the polygon boundary are excluded.

With general purpose GIS packages it is straightforward to generate a random, regular or partially randomized point set (within or externally to the GIS), and then to compute the intersection of this set with pre-defined polygons or grid cells. With this approach simple point-sampling schemes may be created, although precise matching to polygon forms, sample numbers or attribute weightings may be difficult. Purpose-built add-ins, such as Hawth’s Tools/GME for ArcGIS, provide a range of tailored sampling facilities. These include: (i) generation of random points, with a range of selection options, including use of raster or polygon reference layers (see Figure 5‑4A, Mississippi, USA); (ii) random selection from an existing feature set (points, lines, polygons), as in Figure 5‑4B, which shows 200 radioactivity monitoring sites in Germany with a random sample of 30 sites below 100 units of radiation (red/large dots) and 30 sites at or above 100 units (crosses); and (iii) conditional point sampling, designed for case-control analysis and similar applications. The latter facilitates a variety of random point generation methods in a region surrounding specified source points.
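The general-purpose approach of generating random points and intersecting them with a polygon can be sketched with a ray-casting point-in-polygon test and rejection sampling. The L-shaped polygon and the function names below are assumptions for illustration; a GIS would normally carry out the intersection step with its own overlay tools.

```python
import random

def point_in_polygon(x, y, poly):
    """Ray-casting test: count crossings of a ray running to the right of (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def random_points_in_polygon(poly, n, rng):
    """Draw uniform points in the bounding rectangle, keeping those inside poly."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    pts = []
    while len(pts) < n:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, poly):
            pts.append((x, y))
    return pts

# Hypothetical L-shaped field boundary.
field = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
rng = random.Random(11)
field_pts = random_points_in_polygon(field, 20, rng)
print(len(field_pts), "points inside the field")
```

Because points are rejected rather than relocated, the number of candidates needed grows as the polygon occupies a smaller share of its bounding rectangle.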

Random points in the plane may be used as sampling points or in connection with modeling, for example as part of a Monte Carlo simulation of a probability distribution. Random points can also be generated on a network. Naturally the distribution of such points will be affected by the distribution of the network links in the plane, and may thus appear clustered with respect to the plane (excluding the network). The corollary of this observation is that a clustered point pattern in the plane that is, in fact, a set of points on a network, may actually be a random uniform distribution when shortest network path distances are used rather than Euclidean planar distances. For example, the set of 100 points in Figure 5‑5A (Tripolis, in Greece) appears to be far from random, but in fact this is a random uniform (Poisson) point set on a network, as shown in Figure 5‑5B. This example was generated using the SANET software, which also provides a wide range of tools for analyzing observed point patterns on or almost on networks such as that illustrated (see further, Section 5.4.1, Basic distance-derived statistics).
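A uniform random point set on a network can be generated by choosing a link with probability proportional to its length and then choosing a uniform position along that link. The sketch below assumes straight-line links and a small hypothetical network; it is not the SANET procedure.

```python
import math
import random

def random_points_on_network(edges, n, rng):
    """Uniform random points on a network of straight links.

    edges: list of ((x1, y1), (x2, y2)) segments.
    """
    lengths = [math.dist(a, b) for a, b in edges]
    # Longer links receive proportionally more points.
    chosen = rng.choices(edges, weights=lengths, k=n)
    pts = []
    for (x1, y1), (x2, y2) in chosen:
        t = rng.random()  # uniform position along the chosen link
        pts.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return pts

# Hypothetical network: a main road, a side road and a spur.
edges = [((0.0, 0.0), (10.0, 0.0)),
         ((5.0, 0.0), (5.0, 3.0)),
         ((5.0, 3.0), (9.0, 3.0))]
rng = random.Random(5)
net_pts = random_points_on_network(edges, 100, rng)
print(len(net_pts), "points generated on the network")
```

Plotted without the links, such a point set appears strongly clustered in the plane even though it is uniform with respect to network length.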

The term quadrat sampling is applied to schemes in which information on all static point data (e.g. trees, birds’ nests etc.) is collected using an overlay of regular form (e.g. a square or hexagonal grid). Collected data are then aggregated to the level of the quadrat, whose size, orientation and internal variability will all affect the resultant figures (e.g. counts). Very small quadrats will ultimately contain 0 or 1 point objects, whilst very large quadrats will contain almost all the observations and hence will be of little value in understanding the variability of the data over space. An alternative to procedures based on lattices of quadrats is to “drop” quadrats onto the study area at random. Such quadrats may be of any size or shape, but circular forms have the advantage of being directionally invariant. A disadvantage with this approach is that some areas may be repeat sampled unless precautions are taken to exclude areas once sampled.
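Aggregation of point data to a lattice of square quadrats, and the effect of quadrat size on the resulting counts, can be illustrated as follows. The point coordinates and the function name are assumptions for the example.

```python
from collections import Counter

def quadrat_counts(points, cell_size, n_cols, n_rows):
    """Count points per square quadrat; returns rows of counts, starting at y = 0."""
    counts = Counter()
    for x, y in points:
        # Points on the upper or right boundary are assigned to the last quadrat.
        col = min(int(x // cell_size), n_cols - 1)
        row = min(int(y // cell_size), n_rows - 1)
        counts[(col, row)] += 1
    return [[counts[(c, r)] for c in range(n_cols)] for r in range(n_rows)]

# Hypothetical tree locations on a unit-square study area.
trees = [(0.1, 0.1), (0.2, 0.15), (0.8, 0.9), (0.55, 0.45)]

medium = quadrat_counts(trees, cell_size=0.5, n_cols=2, n_rows=2)
large = quadrat_counts(trees, cell_size=1.0, n_cols=1, n_rows=1)
print(medium)  # [[2, 1], [0, 1]]
print(large)   # [[4]]
```

The single large quadrat captures every observation and so says nothing about spatial variation, whereas the 2 x 2 lattice already reveals the pair of neighboring trees in the lower-left quadrat.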

Figure 5‑2 Grid generation examples

A. Square grid B. Hexagonal grid C. Random rectangular grid, 60°

Figure 5‑3 Grid sampling examples within hexagonal grid, 1 hectare area

A. Regular (cell center) B. Systematic (random offset) C. Random, no center bias

Figure 5‑4 Random point generation examples (ArcGIS)

A. Random sample points, 5 per county B. Stratified random, 30% of each stratum

Figure 5‑5 Random point samples on a network

A. Point set in the plane B. Random point set on a network

SANET software: Prof A Okabe et al; Network data (Tripolis, Greece): S Sirigos; see also Figure 7‑16

## Declustering

Point samples may be unduly clustered spatially, for a variety of reasons. For example, samples from rivers, boreholes and wells may provide the basis for a chemical analysis of groundwater supplies, and the distribution of these is often clustered. Geological, hydrological and hydrographic surveys frequently involve intensive data collection in localized areas, with sparsely sampled areas elsewhere. Practical constraints, such as access in built-up or industrialized zones, may also dictate sampling schemes that exhibit strong clustering. And of course, there may be clustering as a feature of the sampling design (e.g. stratified sampling, repeat sampling in small areas to obtain a representative measure of selected attributes). The latter may have been designed to ensure that different regions of interest (ROIs) are represented adequately, or that suspected areas of greater local variation are sampled in more detail than areas that are suspected of being more uniform.

Measured attributes in such instances may not be representative of population (whole region) attributes because observations in close proximity to one another may exhibit strong positive spatial autocorrelation — neighboring measurements often have very similar attributes (see further Section 5.5.1). This results in attributes within these regions having undue weight in subsequent calculations. In the extreme, almost all observations may have been taken in a small region with consistently high or low attribute values, whilst very few have been taken from all remaining parts of the study area. Assuming spatial autocorrelation is present, clustering has the effect that measures such as the calculation of mean values, the estimation of regression parameters, or the determination of confidence intervals may be substantially biased.

A partial solution to problems of this kind is known as spatial declustering. Essentially this involves removing or reducing the known or estimated adverse effects of clustering in order to obtain a more representative picture of the underlying population data and/or to ensure that techniques such as feature extraction and surface modeling operate in an acceptable and useful manner. There are several approaches that may be adopted, each of which involves adjusting the sample values prior to further analysis.

One of the simplest procedures for declustering involves defining a regular grid over the sampled points (rather as per the grid generation procedure described in Section 6.5.2, Gridding and interpolation methods). The grid cell size is selected such that it is meaningful for the problem at hand (e.g. feature extraction) and/or ensures that the average number of points falling in a grid cell is typically 1. Cells that contain many sample points may then be regarded as clustered or possibly over-sampled, and a statistic such as the median value of the measured attribute(s) across all sampled points in that cell may then be used as the single assigned cell (center) value. Another commonly provided declustering technique based on this grid-overlay approach is to use the density of points as a weighting function. For example, cells with 0 points have zero weight, cells with 1 point have a weight of 1, and cells with n points have each point weighted 1/n (hence in effect this is a simple averaging procedure).

In reality both of these procedures amount to a kind of stratification of already sampled locations subsequent to their selection. It is important to note that procedures of this kind are no substitute for randomness in the selection of locations to be sampled, and can amount to very dubious practice if the intention is subsequently to build an inferential statistical model using the observations that are retained.
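Both grid-overlay procedures described above, the per-cell median and the 1/n point weighting, can be sketched in a few lines. The sample values, cell size and function names are assumptions chosen so that the arithmetic is easy to follow.

```python
from collections import defaultdict
from statistics import median

def cell_index(x, y, cell_size):
    """Column and row index of the grid cell containing (x, y)."""
    return (int(x // cell_size), int(y // cell_size))

def cell_medians(samples, cell_size):
    """Median attribute value of the samples falling in each occupied cell."""
    cells = defaultdict(list)
    for x, y, value in samples:
        cells[cell_index(x, y, cell_size)].append(value)
    return {cell: median(values) for cell, values in cells.items()}

def cell_weights(samples, cell_size):
    """Weight each sample by 1/n, where n is the number of samples in its cell."""
    counts = defaultdict(int)
    for x, y, _ in samples:
        counts[cell_index(x, y, cell_size)] += 1
    return [1.0 / counts[cell_index(x, y, cell_size)] for x, y, _ in samples]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical samples (x, y, value): four clustered high values and two isolated low ones.
samples = [(0.1, 0.1, 10), (0.2, 0.3, 12), (0.4, 0.2, 11), (0.3, 0.4, 13),
           (1.5, 0.5, 2), (0.5, 1.5, 3)]
values = [v for _, _, v in samples]

naive_mean = sum(values) / len(values)
weights = cell_weights(samples, cell_size=1.0)
declustered_mean = weighted_mean(values, weights)
medians = cell_medians(samples, cell_size=1.0)
print(naive_mean, declustered_mean)  # 8.5 5.5
```

The four clustered observations together carry the same weight as each isolated observation, which pulls the estimated mean from 8.5 down to 5.5. The result depends directly on the chosen cell size and grid origin.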

In a similar vein, and as an alternative to count-based weighting, area-based weighting is provided as an option in several packages. This involves generating a set of Voronoi regions around each sample point, which results in small areas for closely spaced points and large areas for sparsely arranged sample points. The weights applied are then directly related to these areas. This method is simple but needs to have some justification and/or validation in terms of the problem under consideration, and may suffer from serious edge-effect problems, depending on how the Voronoi regions are computed (e.g. to the edge of the mapped region, or to the MBR or convex hull of the sample point set). Hybridized variants of area-based weighting (e.g. by adjusting the weights using known physical boundaries and/or nearest neighbor distances) have been shown to substantially reduce mean absolute error (MAE) and RMSE in some instances, e.g. see Dubois and Saisana (2002). Revised point-weighting schemes of this kind can be generated within GIS packages and then applied to the target attributes prior to further analysis. The scheme proposed by Dubois and Saisana, for example, which they tested on DEM data for Switzerland, defines the weight w_i applied to the ith sample point, i = 1, 2, …, n, as a function of: s_i, the area of the ith Voronoi region; s_m, the average area of all the Voronoi regions (i.e. study area/n); and d_i², the squared distance of the ith sample point to its nearest neighbor. Models of this type do not have universal application, and selection of appropriate declustering procedures requires careful analysis of the sample data, sub-sampling and cross-validating against some form of ground truth where necessary, and then applying adjustments in a manner appropriate to the problem and dataset to hand.
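Plain area-based weighting can be approximated without a computational geometry library by rasterizing the study area and assigning every raster cell to its nearest sample point, so that each point's share of cells approximates its Voronoi area. The sketch below implements only the simple ratio of each region's area to the mean area; it is not the hybrid scheme of Dubois and Saisana. The function name, the `resolution` parameter and the choice to clip regions to a rectangular study area are assumptions.

```python
import math

def voronoi_area_weights(points, bounds, resolution=50):
    """Approximate Voronoi-area weights (area_i / mean area) on a raster.

    bounds: (xmin, ymin, xmax, ymax) of the study area used to clip the regions.
    """
    xmin, ymin, xmax, ymax = bounds
    counts = [0] * len(points)
    for r in range(resolution):
        y = ymin + (r + 0.5) * (ymax - ymin) / resolution
        for c in range(resolution):
            x = xmin + (c + 0.5) * (xmax - xmin) / resolution
            # Assign this raster cell to the nearest sample point.
            nearest = min(range(len(points)),
                          key=lambda i: math.dist((x, y), points[i]))
            counts[nearest] += 1
    mean_count = sum(counts) / len(points)
    return [count / mean_count for count in counts]

# Two closely spaced samples and one isolated sample in a unit square.
pts = [(0.10, 0.5), (0.15, 0.5), (0.80, 0.5)]
area_weights = voronoi_area_weights(pts, bounds=(0.0, 0.0, 1.0, 1.0))
print([round(w, 2) for w in area_weights])
```

The weights sum to the number of samples, and the isolated point receives the largest weight. Clipping to a different boundary (for example the convex hull of the points) would change the weights of the outer points, which is the edge-effect problem noted above.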