Sampling has always been a fundamental component of geographic research. To devise nomothetic (law-based) knowledge, hypotheses need to be built from representative samples. Bias typically arises for one of three reasons: the selection procedure is not random; the sampling frame does not account for all groups in society; and/or some groups are impossible to reach (Moser and Kalton, 1985). Big Data are usually a by-product of other actions, so they are neither random nor scientifically sampled, and some groups may be excluded altogether. Instead of building representations from carefully constructed sampling frames, researchers working with Big Data acquire data and then must attempt to work out who they truly represent.
One very early failure of a big dataset was the Literary Digest’s postal poll on the 1936 presidential election, which received roughly 2.4 million returns. With the aim of achieving as large a sample as possible, the magazine sought datasets that contained the names and addresses of millions of adults. These primarily comprised vehicle registration lists and telephone directories, and in total over 10 million ballots were posted. However, despite receiving an impressive number of responses, the poll incorrectly predicted that Landon would beat Roosevelt. The data sources are now understood to have produced samples biased towards people of higher socio-economic status, because rates of both automobile and telephone ownership were much lower amongst poorer adults.
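The logic of this failure can be illustrated with a small simulation. The sketch below is not a reconstruction of the 1936 poll: the function name `simulate_poll`, the share of adults appearing on the lists, and the support rates in each group are all hypothetical values chosen only to show how a very large sample drawn from a biased frame can be further from the truth than a small random sample.

```python
import random


def simulate_poll(pop_size=200_000, seed=1936):
    """Compare a large sample from a biased frame with a small random sample.

    All parameters are hypothetical illustrations, not historical estimates.
    """
    rng = random.Random(seed)
    p_listed = 0.30            # assumed share of adults on car/telephone lists
    p_support_listed = 0.40    # assumed support for the winner among listed adults
    p_support_unlisted = 0.70  # assumed support among everyone else

    # Each person is a pair: (appears on a list?, supports the eventual winner?)
    population = []
    for _ in range(pop_size):
        listed = rng.random() < p_listed
        p = p_support_listed if listed else p_support_unlisted
        population.append((listed, rng.random() < p))

    true_share = sum(vote for _, vote in population) / pop_size

    # Biased frame: only people who appear on the lists can be sampled.
    frame = [vote for listed, vote in population if listed]
    big_biased = rng.sample(frame, 50_000)

    # Simple random sample of the whole population, fifty times smaller.
    small_random = rng.sample(population, 1_000)

    return {
        "true": true_share,
        "big_biased": sum(big_biased) / len(big_biased),
        "small_random": sum(vote for _, vote in small_random) / len(small_random),
    }


if __name__ == "__main__":
    for label, share in simulate_poll().items():
        print(f"{label:>12}: {share:.3f}")
```

Under these assumed parameters the large sample predicts the wrong winner, while the much smaller random sample lands close to the true share. Increasing the size of the biased sample does not help, because the error comes from the frame rather than from sampling variability.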
Generally, in both academia and the commercial sector, large sample sizes are valued because they represent a greater proportion of the "population" of interest. Indeed, Poisson’s law of large numbers entails that, with larger samples, the observed mean is more likely to lie close to the theoretical mean of a given phenomenon, provided that the sample is drawn at random. Despite the promise that Big Data represent complete populations (“N = All”), the majority of datasets fail to capture every citizen. Data may represent all users of a given service, but does everyone in the population use that service? It is not unusual for a minority of the population to produce the majority of the data (Crampton et al., 2013). Big Data lack the robust sampling frames that are meticulously designed to reduce sampling error and bias. In addition, the primary objective of most big datasets is not to acquire complete coverage of the population at large, so the data are prone to representing the particular subsets of the population that engage with the activities that generate them. Moreover, it is very difficult to filter out incorrect information; government and commercial bodies often rely on individuals correcting their own records.
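One simple way to check whether a minority of users dominates a dataset is to compute the share of records contributed by the most active users. The following is a minimal sketch; the function name `top_share` and the toy records are assumptions made for illustration, and real datasets would need a stable user identifier for this to be meaningful.

```python
from collections import Counter


def top_share(user_ids, top_fraction=0.1):
    """Return the share of all records produced by the most active users.

    user_ids: one entry per record, giving the (hypothetical) ID of its author.
    top_fraction: the fraction of users to treat as the most active group.
    """
    counts = sorted(Counter(user_ids).values(), reverse=True)
    if not counts:
        raise ValueError("no records supplied")
    k = max(1, round(len(counts) * top_fraction))  # at least one user
    return sum(counts[:k]) / sum(counts)


if __name__ == "__main__":
    # Toy data: ten users, one of whom produces 82 of the 100 records.
    records = ["u0"] * 82
    for i in range(1, 10):
        records += [f"u{i}"] * 2
    print(f"Top 10% of users produce {top_share(records, 0.1):.0%} of records")
```

A value far above `top_fraction` indicates that summary statistics computed per record will mostly describe the behaviour of a few heavy users rather than the user base as a whole.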
Whilst most representations are necessarily partial, incompleteness has previously been considered to be both systematic and known. A disadvantage of Big Data is that they are often of unknown provenance. It is not uncommon for big datasets to contain no demographic variables at all, making it inherently difficult to estimate how representative they are of the population at large. This is usually because asking participants to volunteer additional information may discourage them from using the service. We are, therefore, still reliant on censuses for good-quality primary geodemographic data at fine geographic scales, despite the fact that they are collected very infrequently.
Data linkage is often required to ascribe demographic characteristics to Big Data, bolstering their utility and reducing bias. Linkage techniques in geodemographics generally fall into one of three categories: (a) linkage to individual-level data, (b) linkage to small area population statistics, or (c) inferences based on other personal variables in the data. Individual data linkage is the only means of acquiring accurate demographic variables without succumbing to the ecological fallacy. Unfortunately, individual-level demographic data are very rarely available to researchers because of confidentiality restrictions. However, such linkage has been made feasible in some countries where individual-level data are routinely collected at vast scales, for example, for medical research in Sweden.
Another means of estimating the demographic bias of geographic datasets is through data linkage to small area official statistics. However, in these instances, validity is limited by the quality of the official population statistics and is hampered by issues associated with spatial aggregation, for example the scale and zoning effects (Openshaw and Taylor, 1979) and the ecological fallacy (Openshaw, 1984). While some have taken advantage of microsimulation procedures in order to retain the focus of analysis at the individual level, the outputs remain haphazard estimates based on assumptions.
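Small area linkage can be sketched as follows. Each record carries an area code, each area has an official statistic (here, an assumed proportion of residents aged 65 and over), and the dataset's expected profile is the average of the area values across its records. The function name `expected_share_from_areas`, the area codes, and all figures are hypothetical. The result describes the areas in which records fall, not the individuals who generated them, so it remains subject to the ecological fallacy noted above.

```python
def expected_share_from_areas(record_areas, area_share):
    """Estimate a dataset's demographic profile from small area statistics.

    record_areas: the area code attached to each record (hypothetical codes).
    area_share: dict mapping area code to the official proportion of residents
        with some characteristic, for example aged 65 and over.
    Returns the mean area-level proportion across matched records, together
    with the match rate, since unmatched records are a further source of bias.
    """
    if not record_areas:
        raise ValueError("no records supplied")
    matched = [area_share[a] for a in record_areas if a in area_share]
    if not matched:
        raise ValueError("no records matched a known area")
    return {
        "expected_share": sum(matched) / len(matched),
        "match_rate": len(matched) / len(record_areas),
    }


if __name__ == "__main__":
    # Hypothetical official statistics: proportion aged 65+ in each area.
    official = {"A": 0.10, "B": 0.30}
    # Hypothetical records: most fall in the younger area; "Z" has no match.
    records = ["A", "A", "A", "B", "Z"]
    result = expected_share_from_areas(records, official)
    print(f"Expected share aged 65+: {result['expected_share']:.2f}")
    print(f"Match rate: {result['match_rate']:.0%}")
```

Comparing `expected_share` with the same statistic for the whole study region gives a rough indication of whether the data are concentrated in atypical neighbourhoods.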
An additional, but often overlooked, means of ascribing demographic characteristics to Big Data is linkage to reference datasets that record statistics on personal attributes commonly found in large databases. A prominent example of such attributes is names. Names appear in many different big datasets that require users to register accounts (such as social media or store loyalty programmes). Both surnames and forenames are likely to be distributed unevenly across a wide range of geodemographic domains. For example, most European forenames are sex-specific. Consequently, there is value in devising demographic databases on forenames in order to provide typical statistics on age and gender that can be ascribed to other data. Figure 9-7 demonstrates this by comparing UK Census data to demographics estimated from the names detected in a 2011 population register that has near-complete coverage of the adult population. Of course, not every individual will share the modal characteristics of all bearers of the same name.
Figure 9-7 Population pyramid calculated by inferring demographics from the full names of residents recorded in the 2011 UK Consumer Register (colored) versus the equivalent demographics from the 2011 Census of Population (grey). Source: Leak, 2018
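The name-based approach can be sketched as a lookup against a forename reference table. The table `NAME_STATS`, its values, and the function `infer_demographics` below are hypothetical and are not the method or data behind Figure 9-7. The sketch sums probabilities rather than assigning each person the modal sex for their name, which reflects the caveat that individuals need not share the typical characteristics of other bearers of their name.

```python
# Hypothetical reference table: forename -> probability that the bearer is
# female and the median age of bearers. The values are invented.
NAME_STATS = {
    "margaret": {"p_female": 0.99, "median_age": 68},
    "jack": {"p_female": 0.01, "median_age": 24},
    "alex": {"p_female": 0.30, "median_age": 35},
}


def infer_demographics(full_names, name_stats=NAME_STATS):
    """Estimate aggregate sex and age statistics from a list of full names.

    Only the first token of each name is used as the forename. Names missing
    from the reference table are counted as unmatched rather than guessed.
    """
    expected_female = 0.0
    ages = []
    unmatched = 0
    for full_name in full_names:
        parts = full_name.strip().lower().split()
        stats = name_stats.get(parts[0]) if parts else None
        if stats is None:
            unmatched += 1
            continue
        expected_female += stats["p_female"]  # expected count, not a hard label
        ages.append(stats["median_age"])
    matched = len(ages)
    return {
        "matched": matched,
        "unmatched": unmatched,
        "expected_female": expected_female,
        "expected_male": matched - expected_female,
        "mean_of_median_ages": sum(ages) / matched if matched else None,
    }


if __name__ == "__main__":
    names = ["Margaret Smith", "Jack Jones", "Alex Taylor", "Zyx Unknown"]
    print(infer_demographics(names))
```

Reporting the unmatched count matters because names absent from the reference table are unlikely to be missing at random, which can introduce a bias of its own.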