Data Quality

Data quality issues are a central concern for big datasets and are one of the major criticisms that have deterred many academics from using them. Errors can arise at all stages of the data generation and amalgamation processes, from measurement error to adjustment errors (see the total error paradigm in Groves et al., 2011). These errors can then contaminate other data when records are linked, and they are inherently difficult to identify. Furthermore, the volume and velocity of the data make it impractical to validate records efficiently.

Small datasets are typically created using procedures that are transparent and robust. However, most forms of Big Data are not subject to standard quality controls, and they come from numerous different sources of varying quality and reliability. Moreover, most of these datasets never enter the public realm, so little is known about how rigorous the collection and data-cleaning procedures are. Within the commercial sector, companies may prioritize quantity over quality, with the emphasis placed on efficiency in order to maximize profits. It is, therefore, unsurprising that Big Data may contain an unknown proportion of noise generated by errors and inaccuracies.

While most forms of Big Data are generated by machines, which has reduced the element of human error (especially with regard to temporal and georeferencing components), there are cases where machines can be inaccurate. For example, GPS coordinates are known to be less accurate in so-called urban canyons, where tall buildings block or reflect satellite signals. Yet, as seen above, a large share of Big Data is still human-sourced. It is not uncommon for errors, whether accidental or deliberate, to occur in VGI. For example, the quality of OSM data has been found to vary with the skills and interests of its contributors (Haklay, 2010). Furthermore, business and administrative data can also include errors that arise from mistakes during processing. For example, the falsification of administrative records could lead to changes in policy and in the allocation of funding (Connelly et al., 2016).
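As a simple illustration of how such machine-generated positional error might be screened, the sketch below (a minimal example in Python, not a prescribed method) flags GPS fixes whose device-reported horizontal accuracy estimate exceeds a chosen threshold. The field names and the 25 metre threshold are illustrative assumptions only.

# Minimal sketch: flag GPS fixes whose reported horizontal accuracy
# exceeds a threshold (e.g. degraded fixes in urban canyons).
# Field names and the 25 m threshold are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class GPSFix:
    lat: float
    lon: float
    accuracy_m: float  # device-reported horizontal accuracy estimate

def flag_suspect_fixes(fixes: List[GPSFix], max_accuracy_m: float = 25.0) -> List[GPSFix]:
    """Return fixes whose reported accuracy is worse than the threshold."""
    return [f for f in fixes if f.accuracy_m > max_accuracy_m]

fixes = [
    GPSFix(51.5074, -0.1278, 8.0),    # open sky: likely reliable
    GPSFix(51.5079, -0.1283, 65.0),   # urban canyon: large error estimate
]
print(flag_suspect_fixes(fixes))   # only the second fix is flagged

In practice the accuracy field would come from the receiver or positioning service itself, and the threshold would be chosen to suit the application rather than fixed in advance.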

Data quality also often deteriorates over time, as places and events are not static. For example, administrative and consumer databases regularly retain out-of-date residential addresses for individuals who have moved. It is, therefore, important that each record is timestamped to improve the transparency of databases. It is also useful to synthesize data from multiple sources, both to reinforce confidence in observations and measurements and to identify outliers.
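A minimal sketch of this idea is given below: the same attribute (here, a residential address) held in several timestamped sources is reconciled by keeping the most recent record and flagging individuals whose sources disagree. The source names, field names, and values are hypothetical and used purely for illustration.

# Minimal sketch: reconcile an attribute held in several timestamped
# sources, preferring the most recent record and flagging disagreement.
# Source names, fields, and values are hypothetical.
from datetime import date

records = [
    {"person_id": 1, "source": "electoral_roll", "address": "12 High St",  "updated": date(2015, 4, 1)},
    {"person_id": 1, "source": "loyalty_card",   "address": "9 Park Rd",   "updated": date(2016, 8, 9)},
    {"person_id": 2, "source": "electoral_roll", "address": "3 Mill Lane", "updated": date(2016, 2, 2)},
    {"person_id": 2, "source": "loyalty_card",   "address": "3 Mill Lane", "updated": date(2014, 6, 5)},
]

def reconcile(records):
    """Group records by person, keep the latest address, flag conflicts."""
    by_person = {}
    for rec in records:
        by_person.setdefault(rec["person_id"], []).append(rec)
    result = {}
    for pid, recs in by_person.items():
        latest = max(recs, key=lambda r: r["updated"])
        conflict = len({r["address"] for r in recs}) > 1
        result[pid] = {"address": latest["address"], "conflict": conflict}
    return result

print(reconcile(records))
# person 1 is flagged because the sources disagree; the more recent record is kept

Comparing sources in this way does not establish which record is correct, but it highlights where further checking, or a rule for preferring one source over another, is needed.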