Takeaways
- Dark data is data that is collected but never analyzed, and it's everywhere.
- Approximately 55% of enterprise data is dark.
- 47% of dark data could already be living in content services or ECM, waiting for extraction.
- Intelligent Document Processing rapidly makes sense of dark, unstructured data, preparing it with the structure needed for analysis in a data lake.
Organizations collect huge volumes of information but seldom, if ever, analyze all of it. Unanalyzed “dark data” is hidden everywhere, from PDFs and spreadsheets to Teams chats and nearly every place where humans exchange ideas. It is commonly accepted that 80-90% of enterprise data is unstructured [IDC], and roughly half of enterprise data goes unused in decision-making [Splunk], so there are some insanely valuable insights locked away.
In “The Adventure of the Copper Beeches,” Sherlock Holmes exclaimed, “Data! Data! Data! I cannot make bricks without clay!” Data is paramount to informed decisions, and even Mr. Holmes, a man capable of making the most astute, albeit absurd, deductions, needs data to succeed. Your organization is no different. But while the amount of dark data can be alarming when put in terms like zettabytes or compared to lost floppy disks, 47% of it could already be in an ECM or content services system [IDC]. Those are our proverbial mineshafts. That’s where the gold is. It’s time to go spelunking.
What is Dark Data?
Dark data is any data, structured or unstructured, that is collected but not utilized to inform business decisions.
While structured data stored in legacy systems, personal devices, private spreadsheets, and department chats can contribute to dark data buildup, unstructured document/content data wins as the biggest offender of data gone dark, and it’s not even close.
How does unstructured data go dark?
Unstructured data can go “dark” when it gets lost in the shuffle of data silos, legacy systems, poor lifecycle management, or generally poor document storage practices. Even properly stored data can still become dark if it is too complex to parse into data lakes for analysis, or if it is loaded directly into a lake in its raw format.
Unstructured data is difficult to master primarily because it is often human-generated, arriving in many different document and content types, including emails, paper files, social posts, images, or any document without a consistent format or layout.
Some unstructured data statistics and our analysis
Let’s exchange some unstructured data stats and make some absurd Holmesian deductions, starting with the assumptions that a zettabyte is an insane amount of data and a floppy disk holds 1.44 MB.
Today, about 175 ZB of data is created, replicated, and consumed each year, and that figure is expected to grow exponentially.¹
That's about 122 quadrillion floppy disks, in case you were wondering...
Based on IDC's global datasphere predictions, yearly world data will reach almost 400 ZB by 2028.² That's more than double the data... and floppy disks. And yes, we translated the article to English to find those numbers.
Of the 393 ZB of world data in 2028, 81% will be generated by enterprises hunting for data analysis and gen AI experiences.² That's 318 ZB of data.
Taking the commonly cited statistic that 80-90% of the world's enterprise data is unstructured,³ and being conservative with the 80% figure, that's roughly 255 ZB. By 2028, enterprises alone will generate more unstructured data than the entire world generates in data of any kind today.
Current reports assess that 55% of enterprise data is unanalyzable, or "dark".⁴ Applied to 318 ZB, that's about 175 ZB of dark data, the same size as today's entire yearly datasphere. So by 2028, you'd be better off taking 122 quadrillion floppy disks containing important enterprise data and recycling them for plasticware at the office. It's better than wasting all that data.
Of the unstructured enterprise data out there, nearly half is exchanged via a central content repository like an ECM or content services platform.³
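For anyone who wants to check our Holmesian arithmetic, here is a minimal sketch in Python. It assumes decimal units (1 ZB = 10^21 bytes, 1.44 MB = 1.44 × 10^6 bytes) and takes the low end of the 80-90% range. The variable and function names are ours, chosen for illustration.

```python
# Assumptions: decimal units, and a floppy disk holds exactly 1.44 MB.
ZB = 10**21            # bytes in one zettabyte
FLOPPY_BYTES = 1.44e6  # bytes on one floppy disk


def floppies(zettabytes: float) -> float:
    """Number of 1.44 MB floppy disks needed to hold the given zettabytes."""
    return zettabytes * ZB / FLOPPY_BYTES


# Figures quoted in the stats above.
world_today_zb = 175
world_2028_zb = 393
enterprise_share = 0.81
unstructured_share = 0.80  # conservative end of the 80-90% range
dark_share = 0.55

enterprise_2028_zb = world_2028_zb * enterprise_share
unstructured_2028_zb = enterprise_2028_zb * unstructured_share
dark_2028_zb = enterprise_2028_zb * dark_share

print(f"Floppies today:            {floppies(world_today_zb) / 1e15:.0f} quadrillion")
print(f"Enterprise data in 2028:   {enterprise_2028_zb:.0f} ZB")
print(f"Unstructured share, 2028:  {unstructured_2028_zb:.0f} ZB")
print(f"Dark share, 2028:          {dark_2028_zb:.0f} ZB")
print(f"Dark data in floppies:     {floppies(dark_2028_zb) / 1e15:.0f} quadrillion")
```

Under those assumptions, the script reproduces the figures above: about 122 quadrillion floppies today, 318 ZB of enterprise data in 2028, and roughly 175 ZB of it dark, which is where the second pile of 122 quadrillion floppies comes from.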
How can I better utilize my content data?
Start by understanding where unstructured data lives. Because almost half of dark data is likely unstructured data sitting in a centralized content repository, that’s a great and easy place to start. Unstructured content flows into your organization through many ingestion points: inbound communication channels like email or chat, uploads and sharing, APIs and integrations, and automated systems. Ideally, that workflow ends with the content landing in some form of centralized system. So that’s where the gold is.
Use AI better — find your data gold
Intelligent Document Processing (IDP) combines natural language processing, machine learning, and a variety of capture methods to make sense of and organize unstructured data. It’s flexible: it can rapidly capture, label, index, and route data as it comes into your organization, whether received via scan, fax, or email, or scraped from social media and websites. Or it can be wielded to take the nearly 50% of dark content data sitting in your repository and unlock it for data analysis.
How does IDP enable data analysis from unstructured data?
Unstructured data is really hard to parse automatically and slow to process via scripts. That’s because humans don’t think like robots. Instead of speaking in numbers and scripts, we tend to speak in words and context clues that become riddles to Python. Humans also tend to create many variations in how content is organized, and that’s not ideal for extraction.
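To see why layout variation trips up scripts, consider a toy sketch. It is not IDP, just a rigid pattern of the kind a quick script might use. The pattern, the `extract` helper, and the sample messages are all made up for illustration.

```python
import re

# Hypothetical pattern written for exactly one layout:
#   "Invoice #12345 Total: $1,250.00"
PATTERN = re.compile(r"Invoice #(\d+) Total: \$([\d,]+\.\d{2})")


def extract(text: str):
    """Return (invoice_number, total) if the text fits the layout, else None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    number, total = match.groups()
    return number, float(total.replace(",", ""))


# Three made-up messages carrying the same facts in different human phrasing.
samples = [
    "Invoice #12345 Total: $1,250.00",
    "Hi team, attached is invoice 12345. We owe them 1250 dollars by Friday.",
    "INV-12345 / amount due: USD 1,250.00",
]

for text in samples:
    print(extract(text))
```

Only the first message matches. The other two say the same thing, but the script returns `None` for both, and every new variation would need another hand-written rule.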
Because of the way IDP analyzes documents, focusing on context, natural language patterns, and learning over time, IDP is able to take content that’s difficult to parse, translate it to a finer semantic layer, and provide the necessary structure for data analysis. IDP takes dark data and shines the necessary light on it for data extraction and analysis to succeed.