Tools and Skills

The volume, velocity and variety of Big Data pose considerable challenges for geospatial data analytics. Whilst there have been advances in computer memory and databases, the majority of conventional analytics packages struggle to handle very large datasets. This is especially true for geographic information systems, which often also need to account for complex issues such as spatial dependence and spatial non-stationarity (Shekhar et al., 2011). Here, more spatial data can equate to much more than proportional (often quadratic or worse) increases in the computation required to undertake spatial operations. The complexity and extensiveness of some spatial operations (and indeed of various multidimensional modeling techniques) mean that it is not always feasible to employ parallel computing to improve their efficiency. Indeed, most spatial statistics techniques were originally devised to work on relatively modest numbers of points.
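
As a simple illustration of how these costs scale, the minimal sketch below (in Python, using NumPy; the point counts are arbitrary) times a naive pairwise distance computation, a building block of many spatial statistics, as the number of points doubles. Each doubling roughly quadruples both memory use and run time.

import time
import numpy as np

def pairwise_distances(points):
    """Naive O(n^2) Euclidean distance matrix between 2D points."""
    diff = points[:, None, :] - points[None, :, :]   # shape (n, n, 2)
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(42)
for n in (1_000, 2_000, 4_000):          # doubling the number of points...
    pts = rng.random((n, 2))
    start = time.perf_counter()
    pairwise_distances(pts)
    elapsed = time.perf_counter() - start
    # ...roughly quadruples the work required
    print(f"n={n:>5}: {elapsed:.2f} s")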

Visualization remains an essential component of quantitative analysis. Carefully selected techniques are required to convey trends in Big Data while suppressing noise: too much detail can render key patterns unobservable. Scale also remains important; zooming in too far can obscure broader trends. Therefore, despite the detail and precision of Big Data, visual outputs must remain reductive in order to support decision making. Whilst developments in static image-based representations have been minimal, there has been a growth in the implementation of web-enabled services (such as “slippy map” platforms) that let users explore the data and focus on key areas of interest.
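
A minimal sketch of such a web-enabled “slippy map” is given below, using the open source folium library (which wraps Leaflet); the coordinates and point observations are placeholders invented for illustration.

import folium

# Centre an interactive map on central London (placeholder coordinates)
m = folium.Map(location=[51.5074, -0.1278], zoom_start=12, tiles="OpenStreetMap")

# A few illustrative point observations: (latitude, longitude, value)
observations = [
    (51.5155, -0.0922, 120),
    (51.5033, -0.1196, 340),
    (51.5081, -0.0759, 95),
]

for lat, lon, value in observations:
    folium.CircleMarker(
        location=[lat, lon],
        radius=max(3, value / 40),      # symbol size scaled to the value
        popup=f"count: {value}",        # detail revealed only on demand
        fill=True,
    ).add_to(m)

m.save("observations_map.html")         # open in a browser to pan and zoom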

Data on people and places are also generated faster than they can be critically explored and understood. Moreover, most techniques in spatial analysis were built to work on pre-existing databases, which are therefore treated as static and timeless entities (Batty, 2017). However, a large share of Big Data are generated at high velocity, and there is considerable interest in analyzing them in real time (for instance, the data generated by smart cities). Handling such data is challenging: for example, indexing techniques designed to make very large datasets accessible can become inefficient once the data grow beyond the capacity they were designed to accommodate. In response, bespoke software packages have been developed to harness real-time geosocial information in order to detect real-world trends, and new spatial data streaming algorithms have emerged, primarily used to estimate weather conditions from continuous flows of Big Data. Often, however, the outputs of such streams are statistics aggregated over time intervals, so the definition of those intervals becomes a fundamental part of the analysis.
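
The sketch below illustrates this interval problem for a stream of timestamped, geotagged events: the same records can produce quite different pictures depending on the window width chosen. The event stream and the one-minute window are invented for illustration.

from collections import Counter
from datetime import datetime, timedelta

# Simulated stream of (timestamp, latitude, longitude) events
events = [
    (datetime(2023, 6, 1, 9, 0, 12), 51.51, -0.09),
    (datetime(2023, 6, 1, 9, 0, 47), 51.52, -0.10),
    (datetime(2023, 6, 1, 9, 1, 5),  51.50, -0.12),
    (datetime(2023, 6, 1, 9, 3, 30), 51.53, -0.08),
]

def tumbling_window_counts(stream, width):
    """Count events per fixed-width time interval (a 'tumbling' window)."""
    counts = Counter()
    width_s = width.total_seconds()
    for timestamp, lat, lon in stream:
        # Snap each timestamp down to the start of its interval
        bucket = datetime.fromtimestamp((timestamp.timestamp() // width_s) * width_s)
        counts[bucket] += 1
    return counts

# The choice of interval width is itself an analytical decision
for interval_start, n in sorted(tumbling_window_counts(events, timedelta(minutes=1)).items()):
    print(interval_start, n)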

Data may also come in a variety of different types and may require extensive reformatting before they can be analyzed. Geospatial data generally fall into three forms: raster, vector and graph, and there are established practices for working with all three. However, as observed in the Unstructured Data section above, many Big Data sources generate data types that are not conventional for quantitative analysis. While the way the spatial elements of the data are standardized has not changed, the attributes appended to them now vary considerably, and include textual data, images and even video feeds. New techniques, primarily from the field of computer science, have therefore been devised to harvest information from these new forms of data.
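
As a simple example of the reformatting that is often required, the sketch below flattens a semi-structured, geotagged social media record (a hypothetical JSON structure, not any specific platform's schema) into the tabular point form that conventional GIS tools expect.

import json

# A hypothetical geotagged record: nested, mixed types, and only
# partly relevant to spatial analysis
raw = '''{
  "id": "12345",
  "text": "Great views from the park this morning!",
  "created_at": "2023-06-01T08:45:00Z",
  "geo": {"type": "Point", "coordinates": [-0.1527, 51.5313]},
  "media": [{"type": "photo", "url": "https://example.org/img.jpg"}]
}'''

def to_point_record(raw_json):
    """Extract a flat (id, lon, lat, timestamp, text) row from one record."""
    record = json.loads(raw_json)
    lon, lat = record["geo"]["coordinates"]   # GeoJSON order: lon, lat
    return {
        "id": record["id"],
        "lon": lon,
        "lat": lat,
        "timestamp": record["created_at"],
        "text": record["text"],               # unstructured attribute
    }

print(to_point_record(raw))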

In response to the technical challenges of working with Big Data, researchers have had to become more comfortable with coding and scripting languages in order to employ techniques such as parallel computing. Indeed, access to many big datasets (such as Twitter) is granted through APIs, so they are primarily accessible only to those with programming skills. New techniques from computer science and statistics, such as machine learning and artificial intelligence, have allowed researchers to generate previously unobtainable insights from data that are otherwise very difficult to analyze. For instance, a plethora of text-mining tools has been developed to harvest information from large quantities of textual records. By creating a pipeline that blends such methods with GIS, it is possible to identify previously unexplored geospatial phenomena. Such techniques have also supported the spread of geographic data analysis into wider disciplines.
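
A minimal sketch of such a pipeline is given below: a basic text-mining step (keyword extraction with a stop-word filter) feeds aggregate counts per coarse grid cell that could then be joined to a polygon layer and mapped in a GIS. The messages, stop words and grid resolution are all invented for illustration.

import re
from collections import Counter, defaultdict

# Geotagged messages: (latitude, longitude, text) -- illustrative only
messages = [
    (51.507, -0.128, "Flooding near the station again, awful commute"),
    (51.509, -0.130, "Station flooding has closed the main road"),
    (51.452, -0.970, "Lovely sunny morning by the river"),
]

STOPWORDS = {"the", "a", "an", "has", "by", "near", "again"}

def keywords(text):
    """Very basic text mining: lowercase tokens minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

# Aggregate keyword frequencies into ~0.1 degree grid cells
cell_terms = defaultdict(Counter)
for lat, lon, text in messages:
    cell = (round(lat, 1), round(lon, 1))
    cell_terms[cell].update(keywords(text))

# Each cell's top terms could now be mapped as a choropleth or label layer
for cell, terms in cell_terms.items():
    print(cell, terms.most_common(3))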

There are dangers in simply recycling these methods for social research, as they are inherently reductionist and lack sociological theoretical grounding. In addition, we still need to undertake the most basic spatial operations, but with larger data. In response, we are now witnessing a resurgence in the popularity of geocomputation, driven by the requirement to analyze large and complex datasets across space and time whilst retaining the basic principles of GIScience. Traditional techniques have been extended to enable their implementation on very large spatial datasets (Harris et al., 2010). Open source statistical programming languages that also extend to advanced spatial analysis include R and Python. In addition, Structured Query Language (SQL) is used to efficiently store and access large quantities of data in relational database management systems; to handle geospatial Big Data, many have adopted software such as PostGIS, a spatial extension to the PostgreSQL database that integrates geographic objects and spatial queries into SQL. In many respects there are parallels with the period when GIS and computing first emerged and practitioners were required to code rather than click buttons. It is important that such knowledge and experience are integrated into academic programmes to ensure that the skills of graduates are aligned with working in a data-driven world, in all its many forms.
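
The sketch below shows the kind of spatial SQL this involves, issued from Python with the psycopg2 driver. It assumes a PostGIS-enabled PostgreSQL database; the connection string, table and column names are hypothetical.

import psycopg2

# Hypothetical connection details and table: points_of_interest with a
# geometry column 'geom' stored in WGS84 (SRID 4326)
conn = psycopg2.connect("dbname=bigdata user=analyst")

query = """
    SELECT name, ST_AsText(geom)
    FROM points_of_interest
    WHERE ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
        %s                                   -- search radius in metres
    );
"""

with conn, conn.cursor() as cur:
    # Find everything within 500 m of a given longitude/latitude
    cur.execute(query, (-0.1278, 51.5074, 500))
    for name, wkt in cur.fetchall():
        print(name, wkt)

conn.close()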