Big Data are often used in research even though they may have been collected for entirely different purposes. This is particularly challenging when attempting to predict human actions. For example, research has found social media data to be of variable utility for predicting election outcomes. Whilst pollsters can ask the all-important question about voting intentions directly, researchers attempting to glean political predictions from social media must acquire a plethora of messages and then attempt to contextualize them. Big Data fundamentally represent phenomena about the population and its behaviors in a way that is distinct from traditional social science datasets. There are advantages in this. For instance, surveys are prone to misreporting (deliberate or accidental), whereas in many instances Big Data are accurate because data collection is automated by machines. However, this is not to say that all such data must, therefore, be valid and free of constraints. Big Data are not neutral and may be prone to misrepresentation in the context of geospatial research.
Whilst it is reasonable to assume that Big Data are reliable proxies for real-world activities or conditions, in making these assumptions it is important to take account of the implications that the means of data collection may have for representation. Activities which generate data may be constrained by time and geography, and they may also be unevenly distributed across the population (as shall be discussed later in this section). In addition, people may engage in the activities differently, and it is not possible to obtain information on their motivations. For instance, social media contains an unregulated and unknown proportion of individuals’ beliefs, opinions and activities, and some of its data are generated by automated programs (bots). Moreover, the proportion of persons who engage with social media, and their level of engagement, are also unknown. It is, therefore, a challenge to link trends on social media to events occurring in the real world.
When generating inferences about the real world from Big Data, it is possible to produce misrepresentative indicators. For instance, mobile phone data can leave traces of users’ mobility with relatively high spatial and temporal coverage. However, inferring travel and user behavior from these footprints, beyond the mere presence of users, can be difficult. It may be reasonable to assume that it is possible to estimate how crowds move from A to B, but for how many journeys would such models be accurate? Unfortunately, it is often not possible to conclusively validate models built from Big Data beyond linking them to aggregate datasets, small sample surveys or alternative big datasets with similar limitations.