Big data is a term used to refer to data sets that are diverse, complex, and massive in scale. Although the term has been in use since the 1990s, it is only with the rise of Web 2.0, mobile computing and the internet of things that organizations have found themselves increasingly faced with a new scale and complexity of data. The term implies an increase in the quantity of data, but that increase also brings a qualitative transformation in how we store and analyze such data – with big data, more really is different. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s, resulting in an extraordinary increase in data storage capacity.[1] Over the same period the amount of information in the world has exploded. Likewise, the digitalization of that information has happened in a historical blink of an eye. In the late 1980s, less than 1% of the world’s stored information was in digital format; today more than 99% of it is.[2]
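To get a feel for what "doubling every 40 months" means, the short sketch below compounds that rate over a few decades. It assumes the doubling period stays constant, which is a simplification; the function name `growth_factor` is purely illustrative.

```python
def growth_factor(years: float, doubling_months: float = 40.0) -> float:
    """Return how many times capacity multiplies over `years`,
    assuming it doubles once every `doubling_months` months."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    # 10 years is 120 months, i.e. 3 doublings; 30 years is 9 doublings.
    for years in (10, 20, 30):
        print(f"{years} years: capacity multiplies by {growth_factor(years):,.0f}")
```

Under this assumption, three decades of steady doubling multiply per-capita storage capacity several hundred times over, which is why the change is better described as qualitative than as merely incremental.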
Equally, the amount of data available through the internet has grown at an extraordinary rate. The world’s effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007, and predictions put the amount of internet traffic at 667 exabytes annually by 2014.[3] From these figures we can see how, a little after the year 2000, the amount of digital information began to explode. At the same time – largely due to the mass adoption of the internet and user-generated systems – the nature of that data changed from being largely structured to being largely unstructured; we might identify this as the tipping point from the world of data to the world of big data.
Industry, government, and academia have, of course, long produced massive data sets in areas such as remote sensing, weather prediction, scientific experiments and financial markets. However, given the costs and difficulties of generating, processing, analyzing and storing such data sets, these data have been produced in tightly controlled ways, using sampling techniques that limit their scope, temporality, and size. For example, to make the compiling of national census data manageable, censuses have been conducted only once every 5 or 10 years, asking just 30 to 40 questions, and their outputs are usually quite coarse in resolution. While the census may aim to be exhaustive, listing all people living in a country, most surveys and other forms of data generation are samples, seeking to be representative of a population but not technically capable of representing all of its features. Big data has a number of key attributes that make it distinct in nature from these more traditional data sets, including its volume, its velocity of data capture, its variety of data sources, its high resolution and its often exhaustive scope.
Firstly, as the name implies, big data is truly huge in volume, consisting of terabytes or petabytes of data. Take for example the Chinese ride-sharing platform DiDi, which serves some 450 million users across more than 400 cities in China. Every day, DiDi’s platform generates over 70 TB of data, processes more than 20 billion routing requests, and produces over 15 billion location points. Similarly, a typical 20-bed intensive care unit generates an estimated 260,000 data points a second. A military fighter-jet drone may have 20,000 sensors in a single wing to enable it to fly by itself. On a single flight, an A850 airplane can produce 250 gigabytes of data.
Secondly, these data sources can be high velocity: data is created in or near real time, producing massive, dynamic flows of fine-grained data. For example, Facebook reported that in 2012 it was processing 2.5 billion pieces of content, 2.7 billion ‘Like’ actions and 300 million photo uploads per day. Similarly, in 2012 Wal-Mart was generating more than 2.5 petabytes of data relating to more than 1 million customer transactions every hour.[4]
The variety of the data and data sources is a key aspect of big data that differentiates it from more traditional forms of structured data. Photos, videos, text documents, audio recordings, books, email messages, presentations, geolocations and tweets are all data, but they are generally unstructured and incredibly varied. As an article in the MIT Sloan Management Review entitled “Variety, Not Volume, Is Driving Big Data Initiatives” notes: “The past several years have been a period of exploration, experimentation, and trial and error in Big Data among Fortune 1,000 companies… For these firms, it is not the ability to process and manage large data volumes that is driving successful Big Data outcomes. Rather, it is the ability to integrate more sources of data than ever before – new data, old data, big data, small data, structured data, unstructured data, social media data, behavioral data, and legacy data.”
While this variety may be the key source of complexity in big data, it may also be the key source of insight: by cross-referencing different sources we can begin to build up the context of an event or outcome instead of relying on unidimensional interpretations. Take fraud detection on a debit card. An ATM may swallow your card simply because you are using it in a different country from your usual one; analysis based upon a single data point gives very crude outcomes. But with a variety of data sources, such as social media, purchase history and geolocation, a much more nuanced picture can be built up to better judge whether it is really you standing in front of the ATM.
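The difference between a single-signal rule and a multi-source judgment can be sketched in a few lines. Everything here is hypothetical: the field names, the three signals and the thresholds are invented for illustration and do not describe how any real bank scores transactions.

```python
from dataclasses import dataclass


@dataclass
class WithdrawalContext:
    """Hypothetical signals gathered about one ATM withdrawal."""
    foreign_country: bool          # card used outside the usual country
    travel_purchase_found: bool    # e.g. a flight in the purchase history
    phone_located_near_atm: bool   # geolocation of the cardholder's phone
    social_post_from_country: bool # recent public post from that country


def single_signal_decision(ctx: WithdrawalContext) -> str:
    # Crude rule: one data point decides everything.
    return "block" if ctx.foreign_country else "allow"


def multi_signal_decision(ctx: WithdrawalContext) -> str:
    # Count independent sources that corroborate "the cardholder is travelling".
    if not ctx.foreign_country:
        return "allow"
    corroborating = sum([
        ctx.travel_purchase_found,
        ctx.phone_located_near_atm,
        ctx.social_post_from_country,
    ])
    if corroborating >= 2:   # arbitrary illustrative threshold
        return "allow"
    if corroborating == 1:
        return "verify"      # e.g. ask for an extra confirmation step
    return "block"


if __name__ == "__main__":
    traveller = WithdrawalContext(True, True, True, False)
    print(single_signal_decision(traveller))  # the crude rule blocks the card
    print(multi_signal_decision(traveller))   # extra context allows it
```

The point of the sketch is only that each additional source turns a binary guess into a graded decision; a production system would weigh far more signals and learn the weights from data.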
Big data is often exhaustive in scope, striving to capture entire populations or systems and all relevant data points. Take for example a recent project initiated by the U.S. Securities and Exchange Commission to capture and analyze every single US financial market event, every day. The goal of the project, called the Consolidated Audit Trail or CAT, is to track every life cycle event, every tick, every trade and every piece of data involved in the US market in one place.[5] The aim is to build a next-generation system that will allow regulators to understand, in a reasonable amount of time, what is happening in the market. This involves taking data from all the different silos, across all the banks, the broker-dealers, the executions and the dark pools, and bringing it into a single system.
The system has to ingest between 50 and 100 billion market events per day. That is 15 terabytes of data a day that needs to be processed within four hours and made available for running queries on the whole data set, so that any trade from that day can be traced from its origins all the way through to completion. Note that we are no longer looking at a very limited number of historical snapshots; all market events are available for analysis. Previously, due to limitations of storage and computation, we would compute on samples and then make inferences, but the hope of this exhaustive approach is that with big data there may be no sampling error.
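A quick back-of-the-envelope calculation shows what those figures imply for sustained throughput. This sketch assumes decimal terabytes (10^12 bytes), that the full 15 TB applies across the whole 50 to 100 billion range, and that processing is spread evenly over the four-hour window; the function and key names are illustrative only.

```python
TERABYTE = 10**12          # assumption: decimal terabytes
WINDOW_SECONDS = 4 * 3600  # the four-hour processing window


def required_throughput(events_per_day: int,
                        bytes_per_day: int = 15 * TERABYTE,
                        window_seconds: int = WINDOW_SECONDS) -> dict:
    """Average rates needed to process one day's data inside the window."""
    return {
        "events_per_second": events_per_day / window_seconds,
        "bytes_per_second": bytes_per_day / window_seconds,
        "avg_event_bytes": bytes_per_day / events_per_day,
    }


if __name__ == "__main__":
    for events in (50_000_000_000, 100_000_000_000):
        rates = required_throughput(events)
        print(f"{events:,} events/day: "
              f"{rates['events_per_second']:,.0f} events/s, "
              f"{rates['bytes_per_second'] / 1e9:.2f} GB/s, "
              f"about {rates['avg_event_bytes']:.0f} bytes per event")
```

Under these assumptions the system must sustain several million events and roughly a gigabyte of data per second for four hours, with each event averaging only a few hundred bytes.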
Big data is characterized by being generated continuously and by seeking to be exhaustive and fine-grained in scope. Examples of the production of such data include: digital CCTV; the recording of retail purchases; digital devices that record and communicate the history of their own use, such as mobile phones; the logging of transactions and interactions across digital networks like email or online banking; measurements from sensors embedded into objects or environments; social media postings; and the scanning of machine-readable objects such as travel passes or barcodes.
The scale and complexity of big data in turn require a change in computing paradigm, both in how we structure data and in how we process it. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a reasonable time. Until just a decade ago, database systems were almost entirely structured relational databases, with data organized into tables and columns. With the rise of big data has come the evolution of databases into a “nonrelational” form, often referred to as NoSQL. A NoSQL database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases.[6]
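The contrast between tabular and non-tabular modeling can be illustrated with Python's standard library alone. In this sketch an in-memory SQLite table stands in for a relational database, and a plain list of JSON strings stands in for a document-style NoSQL store; the table, field and function names are hypothetical, and a real NoSQL system would of course add indexing, distribution and querying on top.

```python
import json
import sqlite3


def relational_insert_with_extra_column():
    """Try to store a record with a field the table schema does not define."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Ada"))
    error = None
    try:
        # The 'city' column was never declared, so the database rejects this.
        conn.execute(
            "INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
            (2, "Lin", "Beijing"),
        )
    except sqlite3.OperationalError as exc:
        error = str(exc)
    rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    conn.close()
    return rows, error


def document_insert_with_extra_field():
    """Store differently shaped records side by side as JSON documents."""
    collection = []  # stand-in for a document collection
    collection.append(json.dumps({"id": 1, "name": "Ada"}))
    collection.append(json.dumps(
        {"id": 2, "name": "Lin", "city": "Beijing", "tags": ["driver"]}
    ))
    return [json.loads(doc) for doc in collection]


if __name__ == "__main__":
    rows, error = relational_insert_with_extra_column()
    print("relational rows:", rows, "| rejected insert:", error)
    print("documents:", document_insert_with_extra_field())
```

The relational table refuses the second record until its schema is altered, whereas the document collection simply stores both shapes, which is the flexibility, and the loss of guarantees, that the non-tabular model trades on.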
In contrast to relational databases, where data schemas are carefully designed before the database is built, NoSQL systems use a flexible data schema or no schema at all. For example, one type of NoSQL structure is graph storage. Graph data storage organizes data as nodes, which are like records in a relational database, and edges, which represent connections between nodes. Because the graph system stores the relationships between nodes, it can support richer representations of data relationships. Also, unlike relational models reliant on strict schemas, the graph data model can evolve over time. Additional technologies being applied to big data include massively parallel-processing databases. Multidimensional big data can also be represented effectively as tensors, which makes it more efficient to handle with tensor-based computation.
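The node-and-edge model described above can be sketched as a tiny in-memory graph store. This is a minimal illustration, not how any particular graph database is implemented; the class `Graph`, its methods and the sample data are all hypothetical.

```python
from collections import defaultdict, deque


class Graph:
    """Minimal property graph: nodes carry properties, edges carry a relation."""

    def __init__(self):
        self.nodes = {}                 # node_id -> dict of properties
        self.edges = defaultdict(list)  # node_id -> [(relation, target_id)]

    def add_node(self, node_id, **props):
        # No fixed schema: any node may gain new properties at any time.
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges[source].append((relation, target))

    def neighbors(self, node_id, relation=None):
        """Directly connected nodes, optionally filtered by relation type."""
        return [t for r, t in self.edges.get(node_id, [])
                if relation is None or r == relation]

    def reachable(self, start):
        """All nodes reachable from `start` by following edges (breadth-first)."""
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for target in self.neighbors(current):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        seen.discard(start)
        return seen


def build_example():
    g = Graph()
    g.add_node("alice", kind="person")
    g.add_node("bob", kind="person")
    g.add_node("carol", kind="person")
    g.add_node("book", kind="product")
    g.add_edge("alice", "KNOWS", "bob")
    g.add_edge("bob", "KNOWS", "carol")
    g.add_edge("alice", "BOUGHT", "book")
    # The model evolves: a property is added later without any migration.
    g.add_node("carol", city="Lagos")
    return g


if __name__ == "__main__":
    g = build_example()
    print(g.neighbors("alice", "KNOWS"))
    print(sorted(g.reachable("alice")))
```

Because relationships are stored directly with the nodes, a question such as "who is connected to alice, at any distance?" becomes a simple traversal rather than a chain of table joins.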
1. Google Books. (2018). The Dark Web: Breakthroughs in Research and Practice. [online] Available at: https://goo.gl/xjcDTB [Accessed 8 Feb. 2018].
2. MartinHilbert.net. (2017). The World’s Technological Capacity to Store, Communicate, and Compute Information. [online] Available at: http://www.martinhilbert.net/worldinfocapacity-html/ [Accessed 8 Feb. 2018].
3. MartinHilbert.net. (2017). The World’s Technological Capacity to Store, Communicate, and Compute Information. [online] Available at: http://www.martinhilbert.net/worldinfocapacity-html/ [Accessed 8 Feb. 2018].
4. Google Books. (2018). The Data Revolution. [online] Available at: https://goo.gl/JiwJJz [Accessed 8 Feb. 2018].
5. YouTube. (2018). DATA & ANALYTICS: Analyzing 25 billion stock market events in an hour with NoOps on GCP. [online] Available at: https://www.youtube.com/watch?v=fqOpaCS117Q [Accessed 8 Feb. 2018].
6. YouTube. (2018). Graph Databases Will Change Your Freakin’ Life (Best Intro Into Graph Databases). [online] Available at: https://www.youtube.com/watch?v=GekQqFZm7mA&t=1279s [Accessed 8 Feb. 2018].