Spatial analysis is somewhat unusual in that key datasets are often provided by or acquired from third parties rather than being produced as part of the research. Analysis is often of these pre-existing spatial datasets, so understanding their quality and provenance is extremely important. It also means that in many instances this phase of the PPDAC process involves selecting one or more existing datasets from those available. In practice such datasets will differ in quality, cost, licensing arrangements, availability, completeness, format, timeliness and detail. Compromises have to be made in most instances, with the overriding guideline being fitness for purpose. If the datasets available are unsuitable for addressing the problem in a satisfactory manner, even if these are the only data that one has to work with, then the problem should either not be tackled or must be re-specified in such a way as to ensure it is possible to provide an acceptable process of analysis leading to worthwhile outcomes.
A major issue related to data sourcing is the compatibility of different datasets: in formats and encoding; in temporal, geographic and thematic coverage; and in quality and completeness. In general, datasets from different sources and/or times will not match precisely, so resolving mismatches and data linkage issues can become a major task in the data phase of any project. As part of this process the question of how and where to store the data also arises, and this warrants early consideration, not merely to ensure consistency and retrievability but also for convenient analysis and reporting.
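As an illustration of the linkage problem described above, the following minimal sketch reports which zone identifiers match between two datasets once trivial encoding differences (case, stray whitespace) have been removed. The function names, the normalization rule and the zone codes are hypothetical and chosen purely for illustration; real projects will usually need more elaborate matching rules.

```python
def normalize_key(key):
    """Normalize an identifier so trivially different encodings compare equal."""
    return str(key).strip().upper()


def linkage_report(left_keys, right_keys):
    """Summarize which identifiers link across two datasets and which do not."""
    left = {normalize_key(k) for k in left_keys}
    right = {normalize_key(k) for k in right_keys}
    return {
        "matched": sorted(left & right),
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
    }


# Hypothetical zone codes from two sources with inconsistent encoding
census_zones = ["E01", "E02 ", "e03", "E04"]
health_zones = ["E02", "E03", "E05"]

report = linkage_report(census_zones, health_zones)
for category, keys in report.items():
    print(category, keys)
```

Running such a check before any join makes the extent of the mismatch explicit, so that unmatched zones are dealt with deliberately rather than silently dropped.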
Almost by definition no dataset is perfect. All may contain errors and missing values, have a finite resolution, include distortions as a result of modeling the real world with discrete mathematical forms, incorporate measurement errors and uncertainties, and exhibit deliberate or designed adjustment of positional and/or attribute data (e.g. for privacy reasons, or as part of aggregation procedures). Spatial analysis tools may or may not incorporate facilities for explicitly handling some of the more obvious of these factors. For example, special GIS tools exist for handling issues such as:
•boundary definition and density estimation
•coding schemes that provide for missing data and for masking out invalid regions and/or data items
•modeling procedures that automatically adjust faulty topologies and poorly matched datasets, or datasets of varying resolutions and/or projections
•a wide range of procedures for handling difficulties in classification
•data transformation, weighting, smoothing and normalization facilities for comparing and combining datasets of differing data types and extents
•breaklines and similar methods for explicitly handling lack of continuity in field data
•a range of techniques for modeling data problems and generating error bounds, confidence envelopes and alternative realizations
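Two of the facilities listed above, coding schemes for missing data and normalization to aid comparison, can be sketched in a few lines. This is a minimal illustration rather than the behavior of any particular GIS package: the sentinel value -9999, the function names and the sample values are all assumptions made for the example.

```python
NODATA = -9999.0  # hypothetical sentinel used by the supplier to flag missing cells


def mask_nodata(values, nodata=NODATA):
    """Replace sentinel-coded cells with None so they cannot enter calculations."""
    return [None if v == nodata else v for v in values]


def min_max_normalize(values):
    """Rescale valid cells to the range [0, 1], leaving masked cells as None."""
    valid = [v for v in values if v is not None]
    if not valid:
        return list(values)
    lo, hi = min(valid), max(valid)
    if hi == lo:
        # A constant layer carries no contrast; map every valid cell to 0.0
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Hypothetical row of elevation cells, one of which is missing
elevation = [120.0, -9999.0, 180.0, 150.0]
normalized = min_max_normalize(mask_nodata(elevation))
print(normalized)
```

Masking before normalizing matters here: if the sentinel were left in place it would be treated as the minimum value and would compress all valid cells into a narrow band near 1.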
In other words, many facilities now exist within GIS and related software packages that support analysis even when the datasets available are less than ideal. It is worth emphasizing that here we are referring to the analytical stage, not the data cleansing stage of operations. For many spatial datasets this cleansing has been conducted by the data supplier, and thus is outside the analyst’s direct control. Of course, for data collected or obtained within a project, or supplied in raw form (e.g. original, untouched hyperspectral imagery), data cleansing becomes one element in the overall analytical process.
For many problems, important components of the project involve intangible data, such as perceptions or concerns relating to pollution, risk or noise levels. Quantifying such components in a manner that is accepted by the key participants, and in a form suitable for analysis, is a difficult process. This is an area in which a number of formal methodologies have been established and widely tested (including spatial decision support tools and cost-benefit analysis tools). Such tools are almost always applied separately, and their results are then used to generate or qualify input to GIS software (e.g. weightings).
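One simple way in which externally derived weightings might feed into an analysis is a weighted linear combination of criterion scores. The sketch below assumes that raw importance ratings for noise, pollution and risk have been agreed by participants outside the GIS, and that each site already has scores on a common 0 to 1 scale; the ratings, scores and function names are hypothetical.

```python
def normalize_weights(raw_weights):
    """Scale raw importance ratings so that the weights sum to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: value / total for name, value in raw_weights.items()}


def weighted_score(scores, weights):
    """Combine criterion scores (assumed to lie in [0, 1]) using the weights."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical ratings elicited from participants, outside the GIS
raw_weights = {"noise": 3, "pollution": 5, "risk": 2}
weights = normalize_weights(raw_weights)

# Hypothetical normalized criterion scores for a single candidate site
site_scores = {"noise": 0.2, "pollution": 0.6, "risk": 0.5}
score = weighted_score(site_scores, weights)
print(weights, round(score, 2))
```

Keeping the weights in a separate, explicit structure makes it straightforward to rerun the combination with alternative weightings and so test how sensitive the result is to participants' judgments.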
An additional issue arises when data are generated as part of a research exercise. This may be as a result of applying a particular procedure to one or more pre-supplied datasets, or as a result of a simulation or randomization exercise. In these cases the generated datasets should be subjected to the same critical analysis and inspection as source datasets, and careful consideration should be given to distortions and errors that are generated or magnified by the processing carried out. A brief discussion of such issues is provided in Burrough and McDonnell (1998 Ch.10, and 2015 Ch.4), whilst Zhang and Goodchild (2002) address these questions in considerable detail. For further discussion of the nature of spatial data and some of the key issues raised when undertaking analyses using such datasets, see Haining (2008).
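The way in which processing can carry source errors into derived data may be explored by simulation. The following minimal sketch generates alternative realizations of two point locations, each perturbed by independent Gaussian positional error, and summarizes the resulting spread in the computed distance between them. The coordinates, the 5 m standard deviation, the number of runs and the function names are assumptions made for illustration, not values drawn from the works cited above.

```python
import math
import random


def simulate_distances(a, b, sigma, n_runs, seed=42):
    """Recompute the distance a-b under random positional error in both points."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    distances = []
    for _ in range(n_runs):
        ax = a[0] + rng.gauss(0.0, sigma)
        ay = a[1] + rng.gauss(0.0, sigma)
        bx = b[0] + rng.gauss(0.0, sigma)
        by = b[1] + rng.gauss(0.0, sigma)
        distances.append(math.hypot(bx - ax, by - ay))
    return distances


def empirical_interval(samples, level=0.95):
    """Return approximate lower and upper bounds from the sorted simulated values."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[min(int((1.0 - tail) * len(ordered)), len(ordered) - 1)]
    return lower, upper


# Hypothetical points 100 m apart, each coordinate with 5 m standard error
point_a = (0.0, 0.0)
point_b = (100.0, 0.0)
nominal = math.hypot(point_b[0] - point_a[0], point_b[1] - point_a[1])

runs = simulate_distances(point_a, point_b, sigma=5.0, n_runs=2000)
lower, upper = empirical_interval(runs)
print(f"nominal {nominal:.1f} m, 95% interval {lower:.1f} to {upper:.1f} m")
```

The same pattern extends to more complex derived quantities: replace the distance calculation with the procedure of interest and inspect how the spread of outputs compares with the spread of inputs.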