Ratios, indices, normalization, standardization and rate smoothing

The vegetation index (VI) calculation we described in Section 4.3.2, Simple calculations, was generated in a manner that ensured all output pixels (grid cells) would contain entries that lie in the range [‑1,+1]; an index value of 0 indicates that matching pixel locations have the same value. There are many reasons for changing attribute data values in this manner before mapping and/or as part of the analysis of spatial datasets. One of the principal reasons is that if you create a map showing zones (e.g. counties) for a spatially extensive attribute, such as total population, it is very misleading — obviously a single total value (indicated perhaps by a color) does not apply throughout the zone in question. A more correct or meaningful view would be to divide by the area of the zone and show a spatially intensive attribute, the density of the population across the zone in this case. In this way zones of different size can be analyzed and mapped in a consistent manner.

The field of geodemographics, which involves the analysis of population demographics by where they live, involves the widespread use of index values — for a full discussion of this subject see Harris et al. (2005) and the “Geodemographics Knowledgebase” web site maintained by the UK Market Research Society (www.mrs.org.uk/geodemographics/).

In many instances, especially with data that are organized by area (e.g. census data, data provided by zip/post code, data by local authority or health districts etc.) production of ratios is vital. The aim is to adjust quantitative data that are provided as a simple count (number of people, sheep, cars etc.) or as a continuous variable or weight, by a population value for the zone, in a process known as normalization. This term, rather confusingly, is used in several other spatial contexts including statistical analysis, in mathematical expressions and in connection with certain topological operations. The context should make clear the form of normalization under consideration. Essentially, in the current context, this process involves dividing the count or weight data by a value (often another field in the attribute table) that will convert the information into a measure of intensity for the zone in question — e.g. cars owned per household, or trees per hectare. This process removes the effects associated with data drawn from zones of unequal area or variable “population” size and makes zone-based data comparable.

Three particular types of normalization are widely used: (i) averages; (ii) percentages; and (iii) densities. In the case of averages, an attribute value in each zone (for example the number of children) is adjusted by dividing by, for example, the number of households in each zone, to provide the average number of children per household, which may then be stored in a new field and/or mapped. In the case of percentages, or conversion to a [0,1] range, a selected attribute value is divided by the maximum value or (in the case of count data) the total count in the set from which it has been drawn. For example, the number of people in a zone registered as unemployed divided by the total population of working age (e.g. 18-65). The final example is that of densities, where the divisor is the area of the zone (which may be stored as an intrinsic attribute by some software packages rather than a field) in which the specified attribute is found. The result is a set of values representing population per unit area, e.g. dwellings per square kilometer, mature trees per hectare, tons of wheat per acre etc. — all of which may be described as density measures.
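
To make these three operations concrete, the short Python sketch below computes an average, a percentage and a density for each zone in a small attribute table. The field names used (children, households, unemployed, working_age, area_km2) and the values are purely illustrative, not drawn from any particular dataset.

# Illustrative sketch: the three common forms of normalization for zone data
zones = [
    {"children": 1200, "households": 520, "unemployed": 310, "working_age": 4100, "area_km2": 12.5},
    {"children": 340,  "households": 150, "unemployed": 45,  "working_age": 900,  "area_km2": 3.1},
]

for z in zones:
    z["avg_children_per_household"] = z["children"] / z["households"]   # average
    z["pct_unemployed"] = 100.0 * z["unemployed"] / z["working_age"]    # percentage
    z["households_per_km2"] = z["households"] / z["area_km2"]           # density

print(zones)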

The term standardization is closely related to the term normalization and is also widely used in analysis. It tends to appear in two separate contexts. The first is as a means of ensuring that data from separate sources are comparable for the problem at hand. For example, suppose that a measure of the success rate of medical treatments by health district were being studied. For each health district the success rates as a percentage for each type of operation could be analyzed, mapped and reported. However, variations in the demographic makeup of health districts have not been factored into the equation. If there is a very high proportion of elderly people in some districts the success rates may be lower for certain types of treatment than in other districts. Standardization is the process whereby the rates are adjusted to take account of such district variations, for example in age, sex, social deprivation etc. Direct standardization may be achieved by taking the district proportion in various age groups and comparing this to the regional or national proportions and adjusting the reported rates by using these values to create an age-weighted rate. Indirect standardization involves computing expected values for each district based on the regional or national figures broken down by mix of age/sex/deprivation etc., and comparing these expected values with the actual rates. The latter approach is less susceptible to small sample sizes in the mix of standardizing variables (i.e. when all are combined and coupled with specific treatments in specific districts).
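
As a sketch of the indirect approach, the Python fragment below applies national age-specific success rates to each district's age mix to obtain an expected number of successes, which is then compared with the observed count; the resulting ratio is a simple standardized ratio. The age bands, rates and counts are invented purely for illustration.

# Illustrative sketch: indirect standardization of treatment success rates
national_rate = {"0-17": 0.95, "18-64": 0.90, "65+": 0.75}   # national success rate by age band

districts = {
    "A": {"patients": {"0-17": 120, "18-64": 800, "65+": 400}, "observed": 1150},
    "B": {"patients": {"0-17": 300, "18-64": 1500, "65+": 150}, "observed": 1790},
}

for name, d in districts.items():
    # expected successes if the district performed at the national age-specific rates
    expected = sum(national_rate[band] * n for band, n in d["patients"].items())
    ratio = d["observed"] / expected        # >1: better than expected; <1: worse
    print(f"District {name}: expected {expected:.1f}, standardized ratio {ratio:.2f}")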

The second use of the term standardization involves adjusting the magnitude of a particular dataset to facilitate comparison or combination with other datasets. Two principal forms of such standardization are commonly applied. The first is z-score standardization (see Table 1‑3). This involves computing the mean and standard deviation of each dataset, and altering all values to a z-score in which the mean is subtracted from the value and the result divided by the standard deviation (which must be non-zero). This generates values that have a mean of zero and a standard deviation of 1. This procedure is widely used in the statistical analysis of datasets. The second approach involves range-based standardization. In this method the minimum and maximum of the data values are computed (thus the range = max - min), and each value has the minimum subtracted from it and is then divided by the range. To reduce the effect of extreme values, range standardization may omit a number or percentage of the smallest and largest values (e.g. the lowest and highest 10%), giving a form of trimmed range. Trimmed range-based standardization is the method currently applied within the UK Office of National Statistics (ONS) classification of census areas (wards) using census variables such as demographics, housing, socio-economic data and employment.
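
The Python sketch below illustrates both approaches, together with a simple trimmed-range variant. The sample values and the 10% trim are illustrative only and are not intended to reproduce the ONS procedure exactly.

# Illustrative sketch: z-score, range and trimmed-range standardization
import statistics

values = [3.2, 4.1, 4.4, 5.0, 5.5, 5.9, 6.3, 6.8, 7.8, 8.4, 9.9, 15.2]

# z-score standardization: mean 0, standard deviation 1
mean = statistics.mean(values)
sd = statistics.stdev(values)                      # must be non-zero
z_scores = [(v - mean) / sd for v in values]

# range standardization: values mapped to [0, 1]
lo, hi = min(values), max(values)
range_std = [(v - lo) / (hi - lo) for v in values]

# trimmed-range standardization: discard the lowest and highest 10% of values
# before computing the range, then clamp the results to [0, 1]
srt = sorted(values)
k = int(len(srt) * 0.10)                           # number of values trimmed at each end
t_lo, t_hi = srt[k], srt[-k - 1]
trimmed_std = [min(max((v - t_lo) / (t_hi - t_lo), 0.0), 1.0) for v in values]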

Another way of looking at the process of dividing a count for a zone, e.g. population, by the zone area is to imagine the zone divided into very many small squares. Each square will contain a number of people registered as living in that square. The total for the zone is simply the sum of the counts in all the little squares. But this form of representation is exactly that of raster or grid GIS, and hence raster representations generally do not require area-based normalization, since they are already in a suitable form. Of course, the process of generating a grid representation may have involved computation of values for each cell from vector data, either assuming a uniform model or some other (e.g. volume-preserving) model, as discussed in Section 4.2.10. Normalization is used with raster data for a variety of reasons, for example: (i) to change the set of values recorded in a single map layer to a [0,1] or [‑1,1] range, e.g. by dividing each cell value, z, by the theoretical or observed maximum absolute value for the entire grid:

z* = z/zmax, or using the transformation:

z* = (z - zmin)/(zmax - zmin)

and (ii) to combine two or more grid layers or images into a single, normalized or “indexed” value (sometimes this process is described as indexed overlay). In this latter case the overlay process may be a weighted sum, which is then normalized by dividing the resulting total by the sum of the weights. These various observations highlight a number of issues:

•	the division process needs to have a way of dealing with cases where the divisor is 0 (see the sketch following this list)

•	missing data (as opposed to values that are 0) need to be handled

•	if an attribute to be analyzed and/or mapped has already been normalized (e.g. is a percentage, a density, an average or a median) then it should not be further normalized, as this will generally produce a meaningless result

•	highly variable data can lead to confusing or misleading conclusions. For example, suppose one zone has a population of 10,000 children, and the incidence of childhood leukemia recorded for this zone over a 10-year period is 7 cases. The normalized rate is thus 0.7 per 1000 children. In an adjacent zone with only 500 children one case has been recorded, giving a rate of 2 per 1000 children, apparently far higher. Effects of this type, which are sometimes described as variance instability, are commonplace and cause particular problems when trying to determine whether rare illnesses or events occur with unexpectedly high frequencies in particular areas

•	the result of normalization should be meaningful and defensible in terms of the problem at hand and the input datasets — using an inappropriate divisor, or source data which are of dubious value or relevance, will inevitably lead to unacceptable results

•	obtaining an appropriate divisor may be problematic — for example, if one is looking at the pattern of disease incidence in a particular year, census data (which in the UK are only collected every 10 years) may be an inadequate source of information on the size of the population at risk
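
The NumPy sketch below (an illustration only, with invented layers and weights) covers both operations just described, rescaling a single layer to the [0,1] range and forming a weighted "indexed overlay" of several layers, while handling the first two issues in the list above: a zero divisor and missing data (represented here as NaN).

# Illustrative sketch: single-layer rescaling and weighted indexed overlay
import numpy as np

layer = np.array([[2.0, 4.0, np.nan],
                  [6.0, 8.0, 10.0]])               # NaN marks a missing cell

# (i) rescale the layer to [0,1], guarding against a zero range (divisor of 0)
zmin, zmax = np.nanmin(layer), np.nanmax(layer)
rng = zmax - zmin
scaled = (layer - zmin) / rng if rng != 0 else np.zeros_like(layer)

# (ii) indexed overlay: weighted sum of layers, normalized by the sum of the weights
layers = [scaled, np.full_like(layer, 0.5), np.full_like(layer, 0.25)]
weights = [3.0, 2.0, 1.0]
index = sum(w * lyr for w, lyr in zip(weights, layers)) / sum(weights)

# missing (NaN) cells propagate through the calculation and remain NaN in the
# result, so they can be masked out rather than treated as zeros
print(index)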

GIS packages rarely emphasize the need for normalization. One exception is ArcGIS, where the Properties form for map layers includes a normalization field when quantitative data are to be classified and mapped (Figure 4‑38). In this example we are normalizing the number of people who are categorized as owner occupiers (home owners, OWN_OCC) by a field called TOTPOP (total population). The normalization prompt is provided for graduated colors, graduated symbols and proportional symbols, but not for dot density representation, as the latter already provides a spatially intensive representation of the data. The result of normalization in such cases is a simple ratio. Assuming the field being normalized is a subset of the divisor (as it should be), the ratio will not exceed 1.0. Specialized spatial analysis software such as GeoDa may provide a more extensive and powerful set of pre-defined normalization tools (see further, Section 5.2.1), although GeoDa does not provide built-in user-definable calculation facilities.

Figure 4‑38 Normalization within ArcGIS


Many of GeoDa’s facilities are designed to address problems associated with simple normalization or, in ESDA terminology, the computation of raw rates. The example of OWN_OCC/TOTPOP above would be a raw rate, which could be plotted as a percentile map in order to identify which zones have relatively high or low rates of owner occupancy, unemployment or some other variable of interest. As an illustration of this procedure, Figure 4‑39 shows 100 counties in North Carolina, USA, for which the incidence of sudden infant death syndrome (SIDS, also known as “cot death”) in the period 1st July 1974 to end June 1978 is mapped. This dataset has been normalized to form a raw rate by dividing the number of cases in each zone (SID74) by the births recorded for the same period (BIR74). The resulting raw rates have been sorted, from lowest to highest, and divided into 7 groups (or quantiles), each containing approximately the same number of zones. The uppermost class comprises the zones in which the relative incidence or rate of SIDS is highest.
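
A minimal Python sketch of this procedure is shown below: raw rates are computed as SID74/BIR74 and each zone is then assigned to one of seven quantile classes. The values used are invented for illustration; the actual dataset covers 100 counties.

# Illustrative sketch: raw rates and quantile (7-class) assignment
sid74 = [1, 0, 5, 9, 3, 7, 2, 0, 4, 6, 11, 2, 1, 8]            # SIDS cases per zone
bir74 = [1364, 542, 3616, 4672, 2180, 3350, 1638, 830,
         2730, 3915, 5520, 1801, 1100, 4020]                   # births per zone

raw_rate = [s / b for s, b in zip(sid74, bir74)]

# rank the zones by raw rate and split them into 7 classes of roughly equal size
order = sorted(range(len(raw_rate)), key=lambda i: raw_rate[i])
n_classes = 7
quantile_class = [0] * len(raw_rate)
for rank, i in enumerate(order):
    quantile_class[i] = rank * n_classes // len(raw_rate)       # 0 (lowest) .. 6 (highest)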

Figure 4‑39 Quantile map of normalized SIDS data


However, even raw rates can be misleading, as noted earlier, and various techniques have been devised for analyzing and mapping such data to bring out the real points of interest. One example will suffice to explain the types of procedure available. The ratio SID74/BIR74 is normally computed separately for every zone. However, one could first compute this ratio for the study area as a whole, using the sums of the two fields across all zones. This gives an average ratio or rate for the entire study area, which can be regarded as the expected rate, E, say. The expected number of SIDS cases for each zone, i, is then obtained by multiplying the average rate E by the relevant population of that zone: Ei=E*BIR74i.

Now consider the ratio Ri=SID74i/Ei. This takes non-negative values and highlights those areas with higher than expected or lower than expected rates (Figure 4‑40), i.e. excess risk rates. As can be seen, one zone in the lower center of Figure 4‑40 is picked out as having an unusually high SIDS rate, at least 4 times the average for the region as a whole. This technique is one of a series of procedures, many of which are described as rate smoothing, that seek to focus attention on interesting variations in the data rather than on artifacts of the way the zones are laid out or of their underlying population size. As the GeoDa documentation states: “Using choropleth maps to represent the spatial distribution of rates or proportions represents a number of challenges. These are primarily due to the inherent variance instability (unequal precision) of rates as estimates for an underlying 'risk'.” One approach to this issue is the Empirical Bayes (EB) rate smoothing method described in Anselin et al. (2003) and supported in GeoDa. However, researchers should be aware that such smoothing is simply one form of ‘reasonable’ adjustment of the dataset, and can introduce its own artifacts into subsequent data analyses.
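
The excess risk calculation described above reduces to a few lines of Python (again with invented values; the full Empirical Bayes smoothing available in GeoDa involves an additional shrinkage step that is not reproduced here):

# Illustrative sketch: expected counts and excess risk ratios
sid74 = [1, 0, 5, 9, 3, 7, 2, 0, 4, 6, 11, 2, 1, 8]            # as in the earlier sketch
bir74 = [1364, 542, 3616, 4672, 2180, 3350, 1638, 830,
         2730, 3915, 5520, 1801, 1100, 4020]

E = sum(sid74) / sum(bir74)                        # overall rate for the whole study area
expected = [E * b for b in bir74]                  # Ei = E * BIR74i
excess_risk = [s / e for s, e in zip(sid74, expected)]   # Ri = SID74i / Ei

# Ri > 1 indicates more cases than expected; a value of 4 or more would flag a
# zone such as the one highlighted in Figure 4-40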

Figure 4‑40 Excess risk rate map for SIDS data


When attempting to develop explanatory models of such data (mortality data, disease incidence data, crime data etc.) it is widely accepted practice to use the expected rate, based on overall regional or national statistics, as an essential component of the analysis. This issue is discussed further in Section 5.6.4, Spatial autoregressive and Bayesian modeling, where the data illustrated here are revisited; see also Cressie and Chan (1989) and Berke (2004).

Single ratios provide the standard form of normalization, as described above, but attributes from one or more sets of spatial data may also be combined in more complex ways in order to generate index values. One advantage of such procedures is that index values may be used as a guide for planning policy or as comparative measures, for example as a means of comparing crime rates and clear-up rates between different police forces. As the need to incorporate several factors in the construction of such indices becomes more important, simple weighting and ratio procedures will be found wanting, and more sophisticated methods (such as various forms of cluster analysis) will be required.