Unstructured Data

Geographers have long attempted to make sense of intangible concepts. Non-quantifiable concepts are well suited to ethnographic research, in which researchers collect very detailed observations and derive conclusions based on shared experiences. Representing intangible concepts from large quantities of data is a more novel challenge, because quantitative data typically represent phenomena as categorical, ordinal or interval variables so that they can be analyzed numerically and efficiently.

One of the challenges with many types of Big Data is that they not only represent intangible concepts but are also not numerically structured, which makes quantitative analysis difficult. Techniques are therefore required to manipulate the data. In the case of textual social media posts, research has often focused on isolating a series of keywords and then using filtered searches to generate relative frequencies across time and space. This has enabled researchers to use social media posts to track real-world events such as flu epidemics (Lampos et al., 2010) or earthquakes (Sakaki et al., 2010). In these approaches, each case is a binary datum: a relevant term is either present (1) or absent (0).
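The keyword approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the cited studies: the keyword list, the sample posts and the function names `has_keyword` and `relative_frequency_by_hour` are all hypothetical, and real work would need far more careful tokenization and a spatial dimension as well as a temporal one.

```python
from collections import defaultdict

# Hypothetical keyword list and sample posts, invented purely for illustration.
KEYWORDS = {"flu", "fever", "cough"}

posts = [
    {"hour": 8, "text": "Woke up with a fever and a cough"},
    {"hour": 8, "text": "Coffee before the morning commute"},
    {"hour": 13, "text": "Lunch by the river"},
    {"hour": 13, "text": "I think I have the flu"},
    {"hour": 13, "text": "Flu season is coming, get vaccinated"},
    {"hour": 20, "text": "Watching the match tonight"},
]


def has_keyword(text, keywords=KEYWORDS):
    """Return 1 if any keyword appears as a whole word in the text, else 0."""
    tokens = {token.strip(".,!?;:").lower() for token in text.split()}
    return int(bool(tokens & keywords))


def relative_frequency_by_hour(posts):
    """Share of posts in each hour that contain at least one keyword."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for post in posts:
        totals[post["hour"]] += 1
        matches[post["hour"]] += has_keyword(post["text"])
    return {hour: matches[hour] / totals[hour] for hour in sorted(totals)}


frequencies = relative_frequency_by_hour(posts)
for hour, share in frequencies.items():
    print(f"{hour:02d}:00  {share:.2f}")
```

Each post is reduced to a single 1 or 0, so the two posts at 13:00 that mention flu count equally, even though one reports an illness and the other only expresses awareness of it.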

Another set of techniques attempts to group words or entire social media messages in order to produce categories. By modeling the co-occurrence of particular words across large samples of documents, it is possible to create associations between words and, in turn, between posts. Topic modeling techniques have provided a popular means of analyzing longer documents, such as blog posts, speeches or articles; however, their utility for social media is questionable because posts are short and language use is unstandardized. Despite this, it is feasible to devise topic classifications from short social media posts if appropriate data cleaning steps are applied first. For instance, Figure 9-8 displays the relative frequency of geotagged Tweets from Inner London that were allocated to 20 key topic groups by hour of day. Some of the temporal patterns could be indicative of activities; for example, the Food and Drink group shows peaks at midday (lunch time) and in the evening, as would be expected.
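To make the idea of word co-occurrence concrete, the following sketch counts how often pairs of words appear in the same post and measures the vocabulary overlap between two posts. It is a deliberate simplification and not a topic model: it only shows the kind of signal that such models build on. The sample documents and the function names `cooccurrence_counts` and `jaccard_similarity` are hypothetical, and the posts are assumed to have been cleaned already (lowercased, with stop words removed).

```python
from collections import Counter
from itertools import combinations

# Hypothetical, already-cleaned posts (lowercased, stop words removed).
documents = [
    ["lunch", "pizza", "coffee"],
    ["coffee", "lunch", "sandwich"],
    ["train", "delay", "commute"],
    ["commute", "train", "coffee"],
]


def cooccurrence_counts(documents):
    """Count how many documents contain each unordered pair of words."""
    counts = Counter()
    for tokens in documents:
        # Sort the unique words so each pair is always stored in the same order.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


def jaccard_similarity(tokens_a, tokens_b):
    """Vocabulary overlap between two posts, from 0 (disjoint) to 1 (identical)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


pair_counts = cooccurrence_counts(documents)
print(pair_counts.most_common(3))
print(jaccard_similarity(documents[0], documents[1]))
```

Pairs such as "coffee" and "lunch" that recur across posts suggest a shared theme, while the similarity score gives a crude basis for grouping whole posts. With posts this short, very few words overlap, which illustrates why short documents are difficult to classify.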

Converting the data into more reductive forms entails an inherent loss of information. It is likely that qualitative analysis of a small sample of data would reveal entirely different trends and perspectives. In addition, quantitative analysis of large amounts of data tends to ignore the contextual elements of each datum. While it may be possible to group messages based on textual similarity, it is not possible to express meaning in a numerical form. Users describing the same topic may be tackling it from entirely different viewpoints with very different agendas. In the case of tracking flu from Twitter messages, it is easy to confuse messages reporting symptoms with those that are merely expressing awareness of the illness.


Figure 9-8 A heat map of the temporal frequency of Tweet topic groups across the whole weekday sample by hour of the day. The data from each of the Twitter groups has been standardized as z-scores. Larger numbers (red/darker) therefore indicate over-representation (Source: Lansley and Longley, 2016).
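The z-score standardization mentioned in the caption can be illustrated as follows. This is a sketch under stated assumptions rather than a reproduction of the published analysis: the hourly counts are invented, the function name `z_scores` is hypothetical, and the population standard deviation is assumed because the source does not say which form was used.

```python
from statistics import mean, pstdev

# Hypothetical hourly Tweet counts for two topic groups (four hours only).
hourly_counts = {
    "Food and Drink": [2, 8, 4, 6],
    "Transport": [10, 10, 10, 10],
}


def z_scores(values):
    """Standardize values to mean 0 and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # Assumption: population, not sample, deviation.
    if sigma == 0:
        # A flat series has no over- or under-representation in any hour.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Each group is standardized separately, across its own hours.
standardized = {
    group: z_scores(counts) for group, counts in hourly_counts.items()
}
for group, scores in standardized.items():
    print(group, [round(score, 2) for score in scores])
```

Because each group is standardized against its own mean, a positive value marks an hour in which that group is over-represented relative to its own typical level, regardless of how many Tweets the group contains overall.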