The digital universe, consisting of all the data we create annually, is currently doubling in size every two years. According to research by IDC, it is expected to reach 44 zettabytes (44 trillion gigabytes) by 2020 and will contain nearly as many digital bits as there are stars in the universe.[1] What’s more, these projections may actually be conservative given the rise of IoT. Likewise, it is estimated that by 2030 more than 90% of this data will be unstructured. This explosion of data is, of course, far outstripping our capacity to use it. A small fraction is in a traditional structured form that is easily accessible and usable by organizations; a larger portion is unstructured but at least somewhat accessible; and the vast majority is simply hidden altogether, going unseen and unused. This we can call dark data.
As Alessandro Curioni of IBM notes,[2] “80% of all this data that is created is dark, unstructured data, data that the computers we have developed in the last 40 years are not able to analyze efficiently… we miss 80% of the knowledge inside of this data.” Few organizations have been able to explore non-traditional data sources such as audio, image, and video files; the growing flow of machine and sensor information generated by the Internet of Things; and the enormous stores of raw data found in the unexplored depths of the ‘deep web.’ These all constitute dark data. Supply chain data is one example. A recent Gartner survey found that 85% of respondents felt that supply chain complexity is now a significant and growing challenge for their operations. Supply chain is a data-driven industry: spanning a network of global suppliers, distribution channels, and customers, it churns out data in large volumes. Given that only an estimated 5% of this data is being used, there is ample opportunity for big data technologies to bring the remaining 95% to light.
Dark data is a term coined by the IT consulting firm Gartner, which defines it as[3] “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes – for example, analytics, business relationships and direct monetizing. Similar to dark matter in physics, dark data often comprises most organizations’ universe of information assets.” Data may be dark for a number of different reasons: because it is unstructured, because it sits behind a firewall, because of its speed or volume, or because people simply have not made the connections between different datasets. In many organizations, large collections of both structured and unstructured data sit idle. On the structured side, this is typically because connections haven’t been easy to make between disparate data sets that may have meaning, especially information that lives outside of a given system, business unit or function. “Traditional” unstructured data, such as emails, messages, documents, logs, and notifications, is often text-based and resides within organizational firewalls but remains largely untapped. This may be because it does not reside in a relational database or because, until relatively recently, the tools and techniques needed to leverage it efficiently did not exist. Buried within these unstructured data assets could be valuable information.
The second dark analytics dimension focuses on a different category of unstructured data that cannot be mined using traditional analytics techniques, such as audio, still image, and video files from other sources. Much of the world’s information is now being created in rich media such as images and video, but computer scientists have long viewed video as the dark matter of the internet universe because they did not have the tools to analyze it. In 2016 alone we took an estimated trillion photographs. In the past, these were simply unstructured data we couldn’t use: they were just collections of colored dots. The same was true of videos: unless someone attached a text tag describing the contents, a video was effectively dark data. Only very recently, with advances in machine learning and image recognition methods, has this begun to change. Google’s Cloud Video Intelligence API can now go through every scene in a video and identify specific elements in those scenes, such as a dog, a birthday cake, a mountain, or a house. A search engine can then be built over these videos to find specific features and when they appear, thus converting the dark data into light data.
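The search step described above can be sketched as a small inverted index that maps each detected label to the scenes where it appears. This is only an illustrative sketch: the `SceneLabel` record, the `build_index` and `search` helpers, and the sample annotations are all hypothetical, and no real video-analysis service is called. In practice the annotations would come from whatever labelling tool is used.

```python
from collections import defaultdict
from typing import NamedTuple


class SceneLabel(NamedTuple):
    """One detected element in one scene (hypothetical record shape)."""
    video_id: str
    start_s: float
    end_s: float
    label: str


def build_index(annotations):
    """Map each lowercased label to the sorted scenes in which it was detected."""
    index = defaultdict(list)
    for ann in annotations:
        index[ann.label.lower()].append((ann.video_id, ann.start_s, ann.end_s))
    for hits in index.values():
        hits.sort()
    return dict(index)


def search(index, label):
    """Return (video_id, start_s, end_s) tuples for a label, or an empty list."""
    return index.get(label.lower(), [])


# Made-up annotations standing in for the output of a video-labelling service.
annotations = [
    SceneLabel("party.mp4", 12.0, 19.5, "birthday cake"),
    SceneLabel("hike.mp4", 3.0, 41.0, "mountain"),
    SceneLabel("party.mp4", 0.0, 8.0, "dog"),
    SceneLabel("hike.mp4", 55.0, 60.0, "dog"),
]

index = build_index(annotations)
for video_id, start_s, end_s in search(index, "Dog"):
    print(f"{video_id}: {start_s:.1f}s to {end_s:.1f}s")
```

Once the labels are indexed this way, a query such as "dog" returns the videos and time ranges directly, without anyone having tagged the footage by hand.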
As a major dimension of dark analytics, the deep web holds what may be considered the largest body of untapped information. The deep web consists of information that is not indexed by publicly accessible search engines. According to a study published in Nature, Google indexes only about 16 percent of the surface web. Popular Science described it as “like fishing in the top two feet of the ocean.” It is impossible to accurately calculate the deep web’s size, but by some estimates, it is 500 times larger than the surface web that most people search daily. The domain’s sheer size and distinct lack of structure make exploring it a daunting and complex undertaking. It includes data curated by academics, consortia, government agencies, communities, and other third-party domains, as well as medical records, legal documents, scientific documentaries, multilingual databases, financial information, government resources, and organization-specific databases, all hidden from outside use and largely unknown to anyone but their owners.
To date, companies have explored only a tiny fraction of the digital universe for analytic value. The term “dark analytics” refers to turning dark data into intelligence and insight that an organization can use. Dark analytics seeks to remove that limit by casting a much wider data net that can capture a corpus of currently untapped signals. Recent advances in computer vision, pattern recognition, and cognitive analytics are making it possible for companies to shine a light on these untapped sources and derive insights. New dark analytics companies are emerging: Deep Web Technologies builds search tools for retrieving and analyzing data that would be inaccessible to standard search engines, and Lattice Data, recently purchased by Apple, applies an AI-enabled inference engine to take unstructured, “dark” data and turn it into structured, more usable information.
1. Emc.com. (2018). [online] Available at: https://www.emc.com/collateral/analyst-reports/idc-digital-universe-2014.pdf [Accessed 9 Feb. 2018].
2. YouTube. (2018). Morning Session: The Rise of the Machines – Revolution or Evolution? [online] Available at: https://www.youtube.com/watch?v=MEu5A9aAhHc [Accessed 9 Feb. 2018].
3. Gartner IT Glossary. (2013). Dark Data – Gartner IT Glossary. [online] Available at: https://www.gartner.com/it-glossary/dark-data [Accessed 9 Feb. 2018].