Big Data and Research


The academic community has been enticed by the potential to harness Big Data to predict real-world phenomena. No longer is research restricted by an inability to efficiently collect and process large quantities of data. In many settings Big Data offer statistics comparable to those traditionally collected by governments and released as official datasets, whilst in other settings they offer insights into phenomena that were not previously recorded. Efforts to gain access to new forms of data for research purposes are also motivated by an increasingly acknowledged need to extend the breadth of data sources available for research, since traditional forms of data are costly to collect and in many cases suffer from decreasing response rates (Pearson, 2015). In many contexts, working with Big Data is a necessity. For population data, in particular, the traditional gold-standard sources of data are in decline around the world. Most developed countries are cutting back on their long-form censuses or replacing them altogether with administrative data (see Shearmur, 2010). In very important respects, Big Data can offer far greater spatial granularity, attribute detail and frequency of refresh. But these compelling advantages of greater content accrue without any guarantee that the coverage of the entire population of interest is known; in many Big Data applications reported in the literature, the population of interest may not even be defined! The analyst community may need to become increasingly reconciled to the use of new data sources, and may certainly benefit from their richness; however, data analysts must also undertake increasing amounts of due diligence in order to establish the provenance of such sources.

There is growing interest in harnessing Big Data beyond their primary applications, since they offer the potential to provide fresh insights into a wide range of phenomena across space and time. Furthermore, they offer large volumes of data and may have good coverage of particular populations or spaces. In many settings, the insights that Big Data may be able to offer could be generated from traditional data sources only at very high cost. See, for example, Canzian and Musolesi’s work on mental health using mobility data generated from mobile phones (2015) or Shelton et al.’s study of religion using the geoweb (2012). Big Data can present an opportunity to break away from a reliance on data of low temporal frequency and move to new forms of data that are generated continuously with precise temporal and spatial attributes.

Many datasets are generated in real time by citizens using handheld technologies. Resultant data sources include mobile phone call records, WiFi usage and georeferenced social media posts. The velocity of their data generation allows very large volumes of geospatial information to be uploaded every day. This is a crucial step away from the slow production of geographic datasets and the delay between data collection and dissemination. Furthermore, the unrestricted spatial and temporal nature of data collection allows social research to be unshackled from an exclusive fixation on residential-level data. For instance, Figure 9-1 shows the spatial distribution of Tweeters who submitted data between 8 am and 9 am on weekdays in 2013 in Central London. It can be observed that a large proportion of these users were on transport routes. Many of them could have been commuting to work, so the pattern could be indicative, to a certain extent, of the spatiotemporal distribution of the population at large. Such information could not be acquired for large numbers of persons from traditional data sources.

Representation is a fundamental part of scientific knowledge discovery and is crucial for advancing our understanding of the world. All data should be thought of as a partial and selective representation of the totality of real-world phenomena, and the basis of selection should be understood. Data on places, people or activities are usually compressed into a set of digitally recorded variables in order to be stored efficiently. Therefore, representation is constrained by the scale and scope of each dataset, and by the extent to which it depicts the totality of all real-world phenomena of interest. Most of what we know about the population and their activities has historically been dependent on generalizations and descriptions derived from very limited data. The skill of the scientist has been in devising sample designs that build representations upon very sparse samples yet nevertheless facilitate robust and defensible generalizations.


Figure 9-1 The spatial distribution of Tweets in Central London sent between 08:00 and 09:00 on Tuesdays, Wednesdays and Thursdays in 2013
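The kind of time-window selection behind a map such as Figure 9-1 can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used to produce the figure: the records, their coordinates and the function name `in_morning_peak` are hypothetical, timestamps are assumed to be already in local time, and the default weekday set (Tuesday to Thursday) simply follows the figure caption.

```python
from datetime import datetime

# Hypothetical georeferenced posts: (local ISO timestamp, longitude, latitude).
# Values are invented for illustration only.
posts = [
    ("2013-03-05T08:15:00", -0.1246, 51.5007),  # Tuesday, inside the window
    ("2013-03-05T09:00:00", -0.0877, 51.5079),  # Tuesday, just outside the window
    ("2013-03-09T08:30:00", -0.1410, 51.5010),  # Saturday, excluded by weekday
    ("2013-03-07T08:59:59", -0.0760, 51.5080),  # Thursday, inside the window
]


def in_morning_peak(timestamp, weekdays=(1, 2, 3), start_hour=8, end_hour=9):
    """Return True if the timestamp falls on an allowed weekday and within
    the half-open hour range [start_hour, end_hour).

    Weekdays follow datetime.weekday(): Monday is 0, so (1, 2, 3) means
    Tuesday, Wednesday and Thursday.
    """
    t = datetime.fromisoformat(timestamp)
    return t.weekday() in weekdays and start_hour <= t.hour < end_hour


# Keep only the posts sent during the morning peak on the chosen days.
peak_posts = [p for p in posts if in_morning_peak(p[0])]

for timestamp, lon, lat in peak_posts:
    print(timestamp, lon, lat)
```

Passing `weekdays=(0, 1, 2, 3, 4)` would select all working days instead. Real social media timestamps are often stored in UTC, so a time zone conversion would normally be needed before applying a filter of this kind.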

In some fields of geography we are used to being data-rich. For instance, the Landsat programme was producing remotely sensed images in the early 1970s at such a velocity that the data were extremely challenging to handle. With improvements in technology, the supply of data on the physical environment has increased and improved in precision. However, data on people and their activities have historically been scarce. Now most big datasets are generated from human actions, directly or indirectly, and have countless applications for understanding the real world. Yet new forms of Big Data are fundamentally distinct from traditional datasets. Often, in human-focused applications, their primary purpose is to record events and characteristics to assist administrative functions, and there is little consideration of coverage as this is extremely costly. While on the one hand this means data are captured about what people actually do, rather than what they say they do in surveys, these actions will influence both who is represented in the data and how trends are manifested. Academics and others have little or no influence over the data collection process, and many of the administrative processes behind datasets remain hidden.

The diffusion of data collection into everyday activities has drastically enhanced the scope of data analysis. Firstly, in many cases, Big Data may represent a near-total coverage of specific populations, activities and places. Secondly, Big Data have enabled large-scale spatial analysis to shift from aggregate data to individuals. For example, London’s travel card data represent the vast majority of journeys that occur on public transport across the city, and each journey is recorded individually with its origin, destination and times. Moreover, some datasets also record the precise locations of individual records through GPS technologies. Individual-level data free analysis from issues associated with small area aggregations (e.g. the ecological fallacy and the modifiable areal unit problem discussed in section 4.2). Furthermore, individual records can often be linked to unique identifiers (such as accounts, addresses, etc.), enabling the conflation of large volumes of longitudinal data. It has also been possible to harness new technologies to track anonymized individuals across time and space in order to improve transport provision. For example, mobile phone app data can be used to estimate where and when a customer boards and alights from buses in order to improve understanding of public transport use (Figure 9-2).


Figure 9-2. A network graph to illustrate the bus stop to bus stop flows in Norwich, UK. Source: Stockdale et al., 2015
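To illustrate how individually recorded journeys can be turned into stop-to-stop flows of the kind drawn in a network graph, the following Python sketch counts journeys per origin and destination pair. It is a minimal example under stated assumptions: the journey records and stop names are invented, and this is not the method used by Stockdale et al.

```python
from collections import Counter

# Hypothetical individual journey records: (origin stop, destination stop).
# Real records would also carry timestamps and an anonymized identifier.
journeys = [
    ("stop_A", "stop_B"),
    ("stop_A", "stop_B"),
    ("stop_B", "stop_C"),
    ("stop_A", "stop_C"),
    ("stop_B", "stop_C"),
    ("stop_A", "stop_B"),
]

# Aggregate individual journeys into directed flows. Each key is an
# (origin, destination) pair and each value is the number of journeys,
# which could serve as the edge weight in a network graph.
flows = Counter(journeys)

for (origin, destination), count in flows.most_common():
    print(f"{origin} -> {destination}: {count}")
```

Because the aggregation starts from individual records, the same data could equally be grouped by time of day or by any other attribute, which is not possible when only pre-aggregated counts are available.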

These characteristics have led Big Data to change how we discover information from data. Previously, much geographic research and analysis was grounded in theory, partly due to the poor availability of data (by today’s standards). Indeed, many theories of spatial analysis are generalizations built on assumptions in order to estimate trends for situations where there are no or limited data. For example, spatial interaction models have been widely used to estimate movements in a broad range of applications. In the context of retail geography, their use is primarily to estimate the likelihood of persons from each spatial unit visiting each of their local retail destinations. Essentially, the choice to visit a particular destination is based on convenience (inverse distance) and attractiveness (sometimes simply the retail floor space). However, store loyalty data can tell us the addresses of very large samples of patrons, as well as the times and dates when they visit stores. While one of the main criticisms of positivistic research is that models assume all individuals act rationally, with discrete data on populations researchers can also represent those who seemingly act irrationally. Indeed, many actions will not conform to the generalizations on which models are built. Thus, large volumes of data have enabled us to mine trends and create predictive models with high levels of success.
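The retail spatial interaction logic described above can be sketched with a Huff-type formulation, which is one common form of such models and is used here as an assumption rather than as the specific model the text has in mind. The probability of a visit from an origin to store j is taken to be proportional to the store's attractiveness divided by distance raised to a decay exponent. The function name, store names, floor space values, distances and the exponent `beta` are all hypothetical.

```python
def visit_probabilities(floor_space, distance, beta=2.0):
    """Huff-type probabilities of visiting each store from one origin.

    P_j = (A_j / d_j**beta) / sum_k (A_k / d_k**beta)

    floor_space: attractiveness A_j per store (here, retail floor space)
    distance:    distance d_j from the origin to each store (must be > 0)
    beta:        distance decay exponent (an assumed value, normally
                 calibrated against observed behaviour)
    """
    weights = {
        store: floor_space[store] / distance[store] ** beta
        for store in floor_space
    }
    total = sum(weights.values())
    return {store: weight / total for store, weight in weights.items()}


# Invented example: a large store further away and a small store nearby.
floor_space = {"store_X": 2000.0, "store_Y": 1000.0}  # square metres
distance = {"store_X": 2.0, "store_Y": 1.0}           # kilometres

probs = visit_probabilities(floor_space, distance)
for store, p in probs.items():
    print(f"{store}: {p:.3f}")
```

With these invented inputs the nearer, smaller store attracts two thirds of the expected visits, because the squared distance penalty outweighs the larger floor space of the other store. Observed visit records, such as those from loyalty cards, could be compared against probabilities of this kind to see where individual behaviour departs from the model.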

As will be described later in this section, Big Data are not devoid of uncertainty, especially when they are used to represent the real world. Therefore, empiricist approaches may succumb to fallacies when the data collection procedures are not thoroughly understood. By letting data speak for themselves, researchers neglect key influential geographic concepts pioneered by Geographic Information Science. Instead, data-driven approaches are limited to the identification and description of events and processes, rather than understanding why they exist. Theory is still needed to support critical research, as Big Data remain only partially representative and research may therefore be founded upon data that are rich and detailed but nevertheless not fit for purpose. It is important to think of most Big Data sources as a by-product, or ‘exhaust’, from a process that does not have re-use of data for research purposes at its heart. For example, a store loyalty card profile is a by-product of one or more transactions entered into by a consumer: it records the effectiveness of a company in delivering a service and how it makes a profit. It does not, however, record the entirety of that individual’s consumption, and provides only indicators of a broader lifestyle that requires supplementary data and/or validation from other sources for more broad-based research. As such, there should be no over-reach of the data beyond the phenomena that the data may reasonably be purported to represent. This was demonstrated by the infamous over-predictions of Google Flu Trends; see, for example, the article by Declan Butler (2013) in Nature News.