Big Data and Geospatial Analysis

Perhaps one of the most hotly debated topics in recent years has been the question of "GIS and Big Data". Much of the discussion has been about the data themselves: huge volumes of 2D and 3D spatial data and spatio-temporal (4D) data are now being collected and stored, so how can they be accessed, and how can we map and interpret such massive datasets effectively? Less attention has been paid to questions regarding the analysis of Big Data, although this has risen up the agenda in recent times. Examples include the use of density analysis to represent map request events, with Esri demonstrating that (given sufficient resources) they can process and analyze very large numbers of point events using kernel density techniques within a very short timeframe (under a minute); data filtering (to extract subsets of data that are of particular interest); and data mining (broader than simple filtering). For real-time data, sequential analysis has also been applied successfully; in this case the data are received as a stream and are used to build up a dynamic map, or to cumulatively generate statistical values that may be mapped and/or used to trigger events or alarms. To this extent the analysis is similar to that conducted on smaller datasets, but with data and processing architectures that are specifically designed to cope with the volumes involved, and with a focus on data exploration as a key mechanism for discovery. Special software architectures have been developed to handle these very large datasets, including facilities for statistical analysis. Amongst these are the Apache Hadoop, Storm and Accumulo projects and the associated Spark, Scala and Breeze software, together with visualization and data discovery tools such as Tableau, Gephi and Qlik.
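
By way of illustration, the short Python sketch below shows the streaming idea described above: each incoming event is folded into a coarse density grid and into running summary statistics, so that a map or an alert can be refreshed as data arrive without re-reading the full dataset. It is an illustrative sketch rather than any specific vendor's implementation; the grid extent, cell size, event attribute and alarm logic are invented for the example.

# A minimal sketch of streaming (sequential) analysis of geolocated events.
# The bounding box, cell size and event attribute are illustrative assumptions.
import math
import random

class StreamingEventGrid:
    def __init__(self, lon_min, lon_max, lat_min, lat_max, cell_deg=0.5):
        self.lon_min, self.lat_min = lon_min, lat_min
        self.cell = cell_deg
        self.ncols = int(math.ceil((lon_max - lon_min) / cell_deg))
        self.nrows = int(math.ceil((lat_max - lat_min) / cell_deg))
        self.counts = [[0] * self.ncols for _ in range(self.nrows)]
        self.n = 0
        # Welford's online algorithm: running mean/variance of an event attribute
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, lon, lat, value):
        """Fold a single event into the density grid and the running statistics."""
        col = min(self.ncols - 1, max(0, int((lon - self.lon_min) / self.cell)))
        row = min(self.nrows - 1, max(0, int((lat - self.lat_min) / self.cell)))
        self.counts[row][col] += 1
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    def hotspot(self):
        """Return the densest cell so far (a crude stand-in for a full KDE surface)."""
        return max((self.counts[r][c], r, c)
                   for r in range(self.nrows) for c in range(self.ncols))

# Simulated stream of 100,000 events over a small bounding box
grid = StreamingEventGrid(-1.0, 1.0, 50.0, 52.0, cell_deg=0.1)
for _ in range(100_000):
    grid.update(random.uniform(-1, 1), random.uniform(50, 52), random.gauss(10, 2))

count, row, col = grid.hotspot()
print(f"events={grid.n}, mean={grid.mean:.2f}, var={grid.variance:.2f}, "
      f"busiest cell=({row},{col}) with {count} events")

A production system would typically replace the simple cell counts with a proper kernel density surface and distribute the updates across a cluster (for example using Spark), but the cumulative logic sketched here is essentially the same.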

Miller and Goodchild (2014) have argued that considerable care is required when working with Big Data — significant issues arise in terms of the data (the four Vs): the sheer Volume of data; the Velocity of data arrival and the associated timestamps of the data; the Variety of data available and the way in which these are selected (e.g. self-selection); and the Validity of such data. A presentation by Prof Mike Goodchild covering some of the key elements of the Big Data debate is provided in the Resources page of the website and can be accessed directly at: www.spatialanalysisonline.com/PPTS/BigData.pdf — this presentation should be viewed alongside the article by Miller and Goodchild, as the latter provides a fuller explanation of the main ideas covered. Chapter 17 of Longley et al. (2015) is also recommended reading, as it addresses the broader questions regarding information gathering and decision-making.

In an article entitled "Big Data: Are we making a big mistake?" in the Financial Times, March 2014, Tim Harford addresses these issues and more, highlighting some of the less obvious problems posed by Big Data. Perhaps primary amongst these is the bias that is found in many such datasets. Such biases may be subtle, difficult to identify and, in some cases, impossible to manage. For example, almost all Internet-related Big Data is intrinsically biased in favor of those who have access to and utilize the Internet most, with associated demographic and geographic bias built-in. The same applies to specific services, such as Google, Twitter, Facebook, mobile phone networks, opt-in online surveys and opt-in emails — the examples are many and varied, but the problems are much the same as those familiar to statisticians for over a century. Big Data does not imply good data or unbiased data; moreover, Big Data presents other problems — it is all too easy to focus on data exploration and pattern discovery, identifying correlations that may well be spurious as a result of the sheer volume of data and the number of events and variables measured. With enough data and enough comparisons, statistically significant findings are inevitable, but that does not necessarily provide any real insight, understanding, or identification of causal relationships. Of course, there are many important and interesting datasets where collection and storage are far more systematic and less subject to bias, recording variables directly, with 'complete' and 'clean' records. Such data are well stored and managed, and tend to be those collected by agencies that supplement the data with metadata and quality assurance information.
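
The multiple-comparisons point can be made concrete with a small simulation (the sample size, number of variables and significance threshold below are illustrative assumptions): fifty entirely independent random variables yield 1,225 pairwise correlation tests, of which roughly 5%, around 60, will appear 'significant' at the 5% level purely by chance.

# A minimal simulation of spurious 'significant' correlations in pure noise.
# The sample size, number of variables and alpha are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_vars, alpha = 1_000, 50, 0.05

data = rng.standard_normal((n_obs, n_vars))     # pure noise, no real structure
corr = np.corrcoef(data, rowvar=False)          # all pairwise correlations

# Approximate two-sided 5% critical value for r under the null hypothesis
critical_r = 1.96 / np.sqrt(n_obs - 1)

upper = np.triu_indices(n_vars, k=1)            # each pair counted once
n_tests = len(upper[0])
n_significant = int(np.sum(np.abs(corr[upper]) > critical_r))

print(f"{n_tests} pairwise tests, {n_significant} 'significant' at alpha={alpha} "
      f"(expected about {alpha * n_tests:.0f}), all of them spurious")

This is one reason why patterns discovered by exploring very large datasets need to be confirmed against independent data before they are treated as genuine relationships.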

An important part of the geospatial analysis research agenda is to devise methods of triangulating conventional ‘framework’ data sources — such as censuses, topographic databases or national address lists — with Big Data sources in order that the source and operation of bias might be identified prior to analysis and, better still, accommodated if at all possible. Yet such procedures to identify the veracity of Big Data are often confounded by their 24/7 velocity. For instance, social media data are created throughout the day at workplaces, residences and leisure destinations, and so it is usually implausible to seek to reconcile them with the size and composition of night-time residential populations as recorded in censuses. As Harford concludes: "Big Data has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers — without making the same old statistical mistakes on a grander scale than ever."


“The recommended citation of this Chapter is: Lansley G, de Smith M J, Goodchild M F and Longley P A (2018) Big Data and Geospatial Analysis, Chapter 9 in de Smith M J, Goodchild M F, and Longley P A (2018) Geospatial Analysis: A comprehensive guide to principles, techniques and software tools, 6th edition, The Winchelsea Press, UK”