Data Linkage

Although rich in detail, most big datasets offer only partial representations and typically pertain to a very limited number of variables. As discussed earlier, data linkage is therefore required to improve the coverage of data and enhance their predictive capabilities (Harris et al., 2005). Combining Big Data sources maximizes their value and enables researchers to ask questions about distinct phenomena that may be associated with each other. Having recognized the value of linked data, some countries have introduced data infrastructures that enable the linkage of individual records to facilitate research and public service delivery. However, it is not feasible to link most data from commercial organizations, given that few efforts have been made to standardize how they collect information on individuals.

Yet even where consumer data are more widely available, challenges to effective concatenation and conflation remain, not least because of the different ways in which data are structured by the organizations that collect them. In short, data linkage is plagued by uncertainty and the potential for propagating errors. There have been many efforts to link big datasets, often using probabilistic linkage techniques. Such techniques cannot match every case and may often match records incorrectly (Goerge and Lee, 2001). However, without individual-level linkage, data either remain isolated or trends can only be observed through aggregations and associations with small data.
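To make the idea of probabilistic linkage more concrete, the following is a minimal sketch of one common formulation, in which each compared field contributes a log-likelihood ratio to an overall match weight (often associated with the Fellegi-Sunter model). It is not the method used in the studies cited above. The field names, the m and u probabilities and the two thresholds are all assumptions chosen purely for illustration; in practice they would be estimated from the data.

```python
import math

# Hypothetical probabilities, for illustration only:
#   m = P(field agrees | the two records refer to the same person)
#   u = P(field agrees | the two records refer to different people)
FIELD_PROBS = {
    "surname": {"m": 0.95, "u": 0.01},
    "postcode": {"m": 0.90, "u": 0.05},
    "birth_year": {"m": 0.98, "u": 0.02},
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum the log2 likelihood ratios over every field present in both records."""
    total = 0.0
    for field, p in field_probs.items():
        value_a, value_b = rec_a.get(field), rec_b.get(field)
        if value_a is None or value_b is None:
            continue  # a missing value contributes no evidence either way
        if value_a == value_b:
            total += math.log2(p["m"] / p["u"])  # agreement adds weight
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement subtracts
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Two assumed thresholds give three outcomes, one of which is 'uncertain'."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


a = {"surname": "smith", "postcode": "AB1 2CD", "birth_year": 1980}
b = {"surname": "smith", "postcode": "AB1 2CE", "birth_year": 1980}
print(classify(match_weight(a, b)))
```

The middle band between the two thresholds is where the uncertainty described above lives: pairs that fall into it are neither confidently linked nor confidently rejected, and moving either threshold trades missed matches against false ones.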

Unique identifiers are often particular to each dataset, so linkage is reliant on other distinguishing elements such as addresses and names. However, both of these can be input and structured inconsistently. Whilst postcode and zip code systems were designed to assist the identification of addresses, they do not identify individual properties: in the UK, for example, a single postcode covers 15 properties on average, and often considerably more in urban areas. In addition, postcodes may change or be recorded incorrectly. Previous research has attempted to build a longitudinal database of the UK adult population from 20 separate annual population registers. It was not possible to link roughly half of the addresses using a simple string match because of inconsistencies in formatting and spelling, so a bespoke matching algorithm had to be devised. The same research also found that personal names could be recorded inconsistently, thus hampering linkage. However, using specially devised heuristics, over 100,000 women each year were identified as having probably changed their name following marriage, for instance. With an understanding of how identifiers can be recorded differently, it is therefore possible to tailor linkage algorithms to maximize match rates. The research was even able to estimate the origin and destination of over 750,000 domestic migrants by looking at subsets of households that joined and left addresses with unique combinations of names (Figure 9-9).
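As an illustration of why a simple string match fails on inconsistently formatted addresses, the sketch below normalizes address lines before comparing them and falls back to a similarity score for small spelling differences. It is not the bespoke algorithm referred to above, whose details are not given here. The abbreviation table, the record layout (`line` and `postcode` keys) and the 0.9 threshold are hypothetical, and expanding "st" to "street" is a deliberate simplification.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a realistic one would be much longer.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue", "apt": "flat"}


def normalise_address(raw):
    """Lower-case, replace punctuation with spaces, expand abbreviations."""
    text = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())


def normalise_postcode(raw):
    """Drop whitespace and upper-case, so 'ab1 2cd' and 'AB12CD' compare equal."""
    return re.sub(r"\s+", "", raw).upper()


def house_numbers(normalised):
    """Tokens starting with a digit (house or flat numbers such as '12' or '2a')."""
    return [tok for tok in normalised.split() if tok[0].isdigit()]


def same_address(a, b, threshold=0.9):
    """Postcodes and house numbers must agree exactly; the rest may be fuzzy."""
    if normalise_postcode(a["postcode"]) != normalise_postcode(b["postcode"]):
        return False
    left, right = normalise_address(a["line"]), normalise_address(b["line"])
    # A postcode is shared by many properties, so numbers are never fuzzy-matched:
    # '12 high street' and '13 high street' are similar strings but different homes.
    if house_numbers(left) != house_numbers(right):
        return False
    if left == right:
        return True
    return SequenceMatcher(None, left, right).ratio() >= threshold


x = {"line": "Flat 2, 12 High St.", "postcode": "ab1 2cd"}
y = {"line": "flat 2 12 High Street", "postcode": "AB1 2CD"}
print(x["line"] == y["line"], same_address(x, y))
```

The two example records describe the same property but differ in case, punctuation, abbreviation and postcode spacing, so they fail an exact comparison while passing the normalized one. Requiring the numeric tokens to agree exactly is a design choice that trades a few missed matches for fewer false links between neighbouring properties.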


Figure 9-9 The estimated migration flows between local authorities in Great Britain, derived from novel data linkage techniques applied to the 2013 and 2014 Consumer Registers. Only flows of more than 40 people are shown. Source: Lansley and Li, 2018