The following is a guest article by David Trainer and Sam McBride of New Constructs, an independent research firm that uses natural language processing to extract data from the unstructured portions of financial filings to create valuation models and research products. The article is excerpted from a series of articles on artificial intelligence which can be found here.
We are awash in an ocean of data that grows bigger by the second. And it’s a complete and utter mess. Only by close collaboration between analysts and technologists can the problem be solved.
The total size of all global data hit 20 zettabytes in 2017. For 99% of people, that number probably means nothing, so picture this: if every 64-gigabyte iPhone were a brick, we could build 80 Great Walls of China with the iPhones needed to store all the world’s data.
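The comparison above can be checked with back-of-the-envelope arithmetic. The brick count for the Great Wall is our own assumption (a commonly cited estimate of roughly 3.9 billion bricks), not a figure from the article:

```python
# Back-of-the-envelope check of the iPhone / Great Wall comparison.
# Assumptions (ours, not the authors'): decimal units (1 ZB = 1e21 bytes),
# and a commonly cited estimate of ~3.87 billion bricks in the Great Wall.
TOTAL_DATA_BYTES = 20e21      # 20 zettabytes of global data
IPHONE_BYTES = 64e9           # 64 gigabytes per iPhone
GREAT_WALL_BRICKS = 3.87e9    # rough estimate, assumed for illustration

iphones_needed = TOTAL_DATA_BYTES / IPHONE_BYTES   # ~3.1e11 phones
walls = iphones_needed / GREAT_WALL_BRICKS         # ~80 walls

print(f"{iphones_needed:.3g} iPhones, about {walls:.0f} Great Walls")
```

Under those assumptions, storing 20 zettabytes takes on the order of 310 billion 64-gigabyte phones, which works out to roughly 80 walls' worth of bricks.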
90% of web data is unstructured, meaning it’s in a format that cannot be easily searched and understood by machines. Poor data quality costs the US economy $3.1 trillion a year according to IBM. We have become a society that is excellent at producing, storing and sharing data, but we’re lousy at making it useful.
The Size of the Data Management Problem
Poor data quality is a familiar problem for those who analyze data for a living. A recent survey found that 60% of data scientists devote the majority of their time to cleaning and organizing data, as shown in Figure 1.
Cleaning Data Takes the Most Time
Comparatively, just 9% of data scientists devote the bulk of their time to mining data for patterns. Cleaning and organizing data has become such a big task that it leaves precious little time for analysis.
When people predict that AI will make human workers obsolete anytime in the near future, they are ignoring the data quality problem. AI and machine learning may be able to replace the 9% of data scientists who mine data for patterns, but they will still need the nearly 80% who collect, clean, and organize data.
More importantly, data scientists need to re-orient their thinking around data quality. More time needs to be spent upfront, collecting data in a high-integrity manner, rather than retroactively “cleaning” it. We are not sure it is even possible to retroactively clean data well enough to meet the needs of useful AI. If you cannot validate the data back to its source, how do you know it is clean? And if you are going back to the source to validate, you might as well collect it from the source.
Structuring Financial Data: Not as Easy as Most Think
In theory, financial data in filings should be structured and standardized, or at least easy to make so. We have centralized bodies (the FASB and SEC) that govern financial reporting standards, and public companies employ teams of accountants and lawyers to conform to those standards.
In reality, the data remains highly unstructured and variable, and we expect that it will only get worse. The most prominent effort to make financial data machine readable, XBRL, remains riddled with errors 10 years after its initial deployment. While companies are required to submit XBRL filings, they’re not required to verify them, and only 8% of companies carry out voluntary audits. Until XBRL is strictly enforced by the SEC, it does not stand a chance of being reliable.
Structuring Data: More About Team Than Technology
As long as financial data remains unstructured, existing machine learning tools cannot process it effectively. Meanwhile, the cost of employing the highly-trained analysts needed to manually structure data remains prohibitive.
Our solution is close collaboration between technologists and analysts. Analysts and programmers work together to anticipate and address problems from multiple perspectives from the outset. Through frequent iteration and rigorous joint testing, we teach machines one small step at a time. However small each step may be, it means less work for human analysts.
We also enhance our data collection processes by using algorithms to validate the data points collected by the machine in real time. We identify data that’s potentially wrong – from values that are too big or too small, to data points that show up in the wrong places, to data relationships that don’t make financial sense – and direct analysts to make decisions about how the data should be collected.
The machine also tags items it doesn’t recognize so that analysts can help. If the machine is unsure or lacks adequate precedent for a decision, analysts are called in. This process frees analysts from mundane work so they can focus on new and more difficult problems.
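The two preceding paragraphs describe rule-based validation plus escalation to analysts. A minimal sketch of that idea, with field names, thresholds, and the balance-sheet identity check as our own illustrative assumptions rather than the authors' actual system:

```python
# Hypothetical sketch of real-time, rule-based validation of collected
# data points: range checks, relationship checks, and tagging of
# unrecognized items for analyst review. All rules here are assumptions.

def validate(record):
    """Return a list of issues; an empty list means the record passes."""
    issues = []

    # Range check: flag values that are implausibly large or small.
    revenue = record.get("revenue")
    if revenue is not None and not (0 <= revenue < 1e12):
        issues.append("revenue out of plausible range")

    # Relationship check: assets should equal liabilities plus equity
    # (within a small tolerance), or the data doesn't make financial sense.
    a, l, e = (record.get(k) for k in ("assets", "liabilities", "equity"))
    if None not in (a, l, e) and abs(a - (l + e)) > 0.01 * abs(a):
        issues.append("balance sheet identity violated")

    # Unknown-item check: tag fields without precedent so an analyst
    # can decide how they should be collected.
    known_fields = {"revenue", "assets", "liabilities", "equity"}
    for field in record:
        if field not in known_fields:
            issues.append(f"unrecognized field: {field}")

    return issues

record = {"revenue": 5e9, "assets": 100.0, "liabilities": 60.0,
          "equity": 40.0, "defered_tax": 1.2}  # note the unfamiliar field
for issue in validate(record):
    print("flag for analyst review:", issue)
```

Each flagged issue is routed to a human rather than silently corrected, which is the human-in-the-loop pattern the article describes: the machine handles the precedented cases and escalates the rest.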
The scale of our process has a virtuous effect on our automatic data validation capabilities. The more models we build, the more potential data anomalies or errors we can find and feed back into the machine. The more we do, the more we can teach the machine, and, in turn, rely on it to do more. This approach gives us a significant advantage over systems or analysts who can only view a few models at a time.
Working with machines presents many new challenges to our society. It is not something we’ve done before and, not surprisingly, we have a lot to learn, and so do the machines. One thing we know for sure is that the people who are best at teaching machines will have the best machines, and the people with the best machines will have the upper hand.