The explosion in the production of Big Data, along with the development of new analytical methods, is leading many to argue that a data revolution is underway that has far-reaching consequences for not only how business is conducted and governance enacted but the very nature of how knowledge is produced within society. This is because Big Data analytics enables an entirely new epistemological approach for making sense of the world; rather than testing a theory by analyzing relevant data, new data analytics seeks to gain insights that simply emerge from the data itself without apparent interpretation being imposed upon it.
This idea was expressed in a somewhat provocative way in a 2008 article by Chris Anderson of Wired magazine,1 where he argued that Big Data analytics signals a new era of knowledge production characterized by ‘the end of theory’. He wrote that ‘the data deluge makes the scientific method obsolete’; that the patterns and relationships contained within Big Data inherently produce meaningful and insightful knowledge about complex phenomena. In essence, he argued that Big Data enables a more empirical mode of knowledge creation, as terabytes and petabytes of data allow us to say: ‘Correlation is enough.’ We can simply analyze the data without hypotheses about what it might show. As he writes, “we can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot … Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all. There’s no reason to cling to our old ways.”
Anderson’s article is a flamboyant elaboration on what has come to be called dataism. Dataism may be understood as the general underlying philosophy of Big Data, which holds data to be a primary source of truth in its own right. Big Data offers the possibility of shifting from static snapshots to dynamic flows; from coarse aggregations to high resolutions; from data-scarce to data-rich; from relatively simple models to more complex, sophisticated simulations.2 This all-encompassing, pervasive, fine-grained nature of Big Data takes us into a new kind of paradigm in which we could at last access the world directly, without any kind of mediation, in the language of 1s and 0s. In its capacity to present us with raw facts, data may take us beyond our intuitions, assumptions, biases, prejudices and other distortions. But at the same time, data can be deceptive, hiding behind a veil of objectivity while excluding the relevance of context. Thus, if we really want to push what we can do with data analytics, we need to be aware of where its limitations lie and how this paradigm of Big Data works.
The traditional way of doing science is to formulate a hypothesis and then go to the data to test it. Statistics has usually aimed first at testing pre-existing hypotheses. The very idea of data mining, however, is not to test pre-existing hypotheses but to make hypotheses surface from the data itself. Hypotheses and categories do not pre-exist the collection and processing of data; they are instead the result of that processing, which reverses the traditional, more theoretical approach. What people increasingly want now are tools that find interesting things in the data, an approach called data-driven discovery. The analyst does not even have to bother proposing a hypothesis anymore. The argument is that ‘mining Big Data reveals relationships and patterns that we didn’t even know to look for.’
Eric Siegel, in his 2013 book Predictive Analytics, puts it this way: “we usually don’t know about causation, and we often don’t necessarily care … the objective is more to predict than it is to understand the world … It just needs to work; prediction trumps explanation.”3 Take the case of a retail chain that analyzed 12 years’ worth of purchase transactions for possible unnoticed relationships between products that went into shoppers’ baskets. Discovering correlations between certain items led to new product placements and a 16% increase in revenue per shopping cart in the first month’s trial. There was no prior hypothesis that Product B was often bought with Product Z, which was then tested. The data were simply queried to discover what relationships existed that might previously have gone unnoticed.3 Amazon’s recommendation system likewise suggests other products a user might be interested in without necessarily knowing anything about the products themselves; it simply identifies patterns of purchasing across customer orders. Whilst it might be interesting to explain why these associations exist within the data, such explanation is often seen as largely unnecessary in a world of commerce where outcomes are all that matter.
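To make the basket example concrete, here is a minimal sketch of hypothesis-free pattern discovery over purchase data. It is not the retailer's or Amazon's actual method, which the sources do not describe. The baskets are invented for illustration, the helper names `pair_counts` and `lift` are hypothetical, and lift is simply one common measure of association chosen here as an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented purely for illustration.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"milk", "cereal", "bread"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def lift(pair, baskets, counts):
    """Observed co-occurrence rate divided by the rate expected
    if the two items were bought independently of each other."""
    n = len(baskets)
    a, b = pair
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = counts[pair] / n
    return p_ab / (p_a * p_b)


counts = pair_counts(baskets)

# Rank every pair that occurs at all; no pair is singled out in advance.
ranked = sorted(counts, key=lambda p: lift(p, baskets, counts), reverse=True)
for pair in ranked[:3]:
    print(pair, counts[pair], round(lift(pair, baskets, counts), 2))
```

Nothing in the sketch states beforehand which products should go together; the ranking surfaces whatever pairs co-occur more often than chance would suggest. Equally, nothing in it explains why a pair co-occurs, which is exactly the trade of explanation for prediction that the passage describes.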
There is a comprehensive and attractive set of ideas at work in the data paradigm that runs counter to the deductive approach that is in many ways dominant within modern science. The basic premise is that because Big Data can capture a whole domain, providing a complete high-resolution dataset, there is no need for prior theory, models or hypotheses; through the application of data analytics, the data can speak for themselves, free of human bias or framing. On this view, meaning transcends context and domain-specific knowledge; it is thus neutral and can be interpreted by anyone.
As Yuval Noah Harari writes in his book Homo Deus: A Brief History of Tomorrow,4 “For politicians, business people, and ordinary consumers, Dataism offers groundbreaking technologies and immense new powers. For scholars and intellectuals, it also promises to provide the scientific holy grail that has eluded us for centuries: single overarching theory that unifies all the scientific disciplines from musicology through economics to biology. According to Dataism, Beethoven’s Fifth Symphony, a stock-exchange bubble and the flu virus are just three patterns of dataflow that can be analyzed using the same basic concepts and tools.” We can already see this idea of the universality of data being applied as small groups of mathematicians, physicists, computer scientists and data analysts are brought into more and more domains, from finance and business consulting to energy providers and all kinds of technology companies. The implication is that there is a single language of data that applies equally to all of them.
So what are the limitations of the data paradigm? Dataism can be understood as an extension of the reductionist paradigm into an age of information. Reductionism is the idea that a system, any system, is nothing more than the sum of its parts. It holds that nothing is truly continuous and that everything can be rendered into a discrete, quantified format without any loss of content. This is, of course, what datafication does. All data is discrete in that it takes a section of the universe and sticks a label or value onto it, presenting it as in some way separate from everything else and thus making it possible to move it around and process it into new configurations. Through analysis, we break systems down to isolate their component parts, quantify them and describe the whole as some combination of the parts.
Reductionism has many great achievements, but it also has its limitations. It systematically downplays complex relations, context and continuous processes. It takes no account of emergent phenomena that result in irreducible whole systems and processes. It can tell us about the billions of neurons in the brain but not about consciousness; it can tell us about the molecular makeup of water but not why, when we combine the molecules, they create something that has the property of being wet. Data is objective and discrete; it cannot tell us about what is subjective and continuous. The discrete nature of data is why it is so useful: it means that we can take it, separate it from the world and put it into an algorithm to manipulate and interpret in new ways. But this is also its inherent limitation: it cannot tell us about the synergies between the parts that make them continuous, more than the sum of those parts and irreducible to those discrete units. Datafication gives us a new tool for looking at the world, but the problem is that the tool is incomplete; as convincing as it appears, reductionism is only ever half of the story. By its inherent nature, it lets us see some things and not others. The risk is that we stay looking under the streetlamp, because that is the only place where data sheds light, and forget about everywhere else. Such an incomplete interpretation of the world can only ever lead to incomplete outcomes.
The technology ethnographer Tricia Wang describes this well when she says:5 “that’s why just relying on big data alone increases the chance that we’ll miss something, while giving us this illusion that we already know everything… we have this thing that I call the quantification bias which is the unconscious belief of valuing the measurable over the immeasurable… But the problem is that quantifying is addictive and when we forget that and when we don’t have something to kind of keep that in check, it’s very easy to just throw out data because it can’t be expressed as a numerical value… this is a great moment of danger for any organization, because oftentimes, the future we need to predict, it isn’t in that haystack, but it’s that tornado that’s bearing down on us outside of the barn.” This analogy of the haystack and the barn touches on an important idea: analytics helps us to better understand what is inside the box and how the box works, but it cannot help us to see what is outside the box. For that we need a very different process of reasoning, called synthesis.
1. Anderson, C. (2008). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. [online] WIRED. Available at: https://www.wired.com/2008/06/pb-theory/ [Accessed 9 Feb. 2018].
2. Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. [online] Maynooth University ePrints and eTheses Archive. Available at: http://eprints.maynoothuniversity.ie/5364/ [Accessed 9 Feb. 2018].
3. Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. [online] Maynooth University ePrints and eTheses Archive. Available at: http://eprints.maynoothuniversity.ie/5364/ [Accessed 9 Feb. 2018].
4. Harari, Y. N. (2016). Homo Deus: A Brief History of Tomorrow. [online] Amazon.com. Available at: https://www.amazon.com/Homo-Deus-Brief-History-Tomorrow/dp/1910701882 [Accessed 9 Feb. 2018].
5. Wang, T. (2018). The human insights missing from big data. [online] Ted.com. Available at: https://www.ted.com/talks/tricia_wang_the_human_insights_missing_from_big_data [Accessed 9 Feb. 2018].