The following is a guest article by Evan Schnidman, Founder and CEO of Prattle, a research automation firm specializing in text analytics.
Is alternative data incomprehensible technobabble rumored to somehow produce alpha? Or is it clean data that can be additive to existing models and streamline the broader research process? Unfortunately for most investors, the current alternative data landscape makes it difficult to distinguish between these two extremes.
The very best alternative data is backed by extensive, proprietary technology systems that go largely unnoticed by the broader finance community. Creating predictive datasets requires sophisticated technology infrastructure. This article explores the nuances of the technical infrastructure needed to generate valuable alternative data signals, using Prattle as an example.
Early vs. Modern Alternative Data
The very earliest alternative data was simply excess data, or “data exhaust,” from various corporate processes. Perhaps the most significant, market-moving example of this data exhaust is the ADP Employment Report, released days before the official Bureau of Labor Statistics reports each month. The report is derived from the payroll data processed by ADP, a private payroll firm. Other forms of data exhaust have created a foundation for the modern alternative data market, though data sources like geolocation and financial transactions remain unwieldy and difficult to clean and utilize in an investment process.
Rather than cleaning the messy yet useful signals already at their disposal, the next wave of alternative data providers began crawling publicly available information. News, tweets and other simple text were counted, weighted and sold as signals. This data was typically more structured than the very earliest alternative data, but it also suffered from overfitting because vendors mined for correlations without considering whether the signal had any causal effect on asset prices. The result was that much of this data performed very well—as long as the correlations held up—but fell apart in the medium and long term due to a lack of underlying theory.
The solution to the issue of sub-par alternative data is to employ sound theory coupled with robust technical infrastructure in order to generate superior signals. But as anyone familiar with financial markets knows, that is easier said than done.
Many investors have come to realize that utilizing alternative data is “not an alternative anymore.” As a result, virtually every financial institution on the planet has announced its use of alternative data in some way, shape or form. The trouble is, many financial institutions are not using robust, theory-based modern alternative data. While these institutions have been talking up their technical sophistication, they have been falling behind by relying on simple strategies based on outdated signals with minimal efficacy.
Despite this, startups and small data providers have relentlessly innovated, creating a bevy of ingenious solutions for gaining deeper insight into future price movements. This profusion of options combined with an increasing use of buzzwords and marketing spin in an attempt to be “the next big thing” has created confusion among alternative data buyers. Amid all of the hype, an important narrative has been lost: the best modern alternative data is produced using extremely sophisticated technology, which is often the last thing potential buyers consider.
Technical Infrastructure: A Case Study
Before we go any further, it is worth noting that although Prattle is very proud of our architecture, which we consider to be near the technological frontier, the purpose of the following example is not to claim we have built the perfect or most advanced system but rather to illustrate just how much technology is necessary to consistently derive signal from noise in a manner that is both theory-consistent and directly tradable.
At Prattle, our final product is a singular quantitative score representing the likely market impact of every publicly available primary source communication from every publicly traded company in the United States. This score is coupled with metadata on each speaker as well as deeper analytics and algorithmically extracted core comments that directly quote the most salient remarks in each communication. We frame this information as both tradable quantitative signals and automated research reports, but that frame is not fully representative of the vast technical infrastructure underpinning the system.
In order to produce our data in a reliable and timely manner, Prattle has built, among other things, a proprietary backend data science platform, a suite of customized neural network models and a proprietary sentiment analysis engine. Let’s review each of these.
Ingesting and Cleansing
The first alternative data challenge is ingesting the data. Our systems are built from the ground up on a proprietary backend data science platform. This cloud-based architecture was built to allow data scientists to go from static models in languages like R or Python to fully deployed, production-level code simply by navigating a series of dropdown menus. This micro-ETL framework serves as its own scheduled queue and allows quants to contribute directly to the core code base without taxing developer resources. In addition to speeding up development cycles, this tool allows us to efficiently iterate on model development while minimizing server load. At this point we are ingesting nearly five million documents each day.
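To make the micro-ETL idea concrete, here is a minimal, hypothetical sketch of such a framework: small, independently registered jobs form a queue that processes a batch of documents in order. The class and job names are illustrative inventions, not Prattle’s actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EtlJob:
    name: str
    run: Callable[[list], list]  # takes a batch of records, returns a batch

class MicroEtlQueue:
    """A tiny job registry: each transform is registered once,
    then the queue runs them in order over each batch."""
    def __init__(self) -> None:
        self.jobs: List[EtlJob] = []

    def register(self, name):
        def decorator(fn):
            self.jobs.append(EtlJob(name, fn))
            return fn
        return decorator

    def process(self, records):
        for job in self.jobs:
            records = job.run(records)
        return records

queue = MicroEtlQueue()

@queue.register("dedupe")
def dedupe(records):
    # Drop documents we have already seen in this batch.
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

@queue.register("lowercase_text")
def lowercase_text(records):
    # Normalize casing before downstream text models.
    return [{**r, "text": r["text"].lower()} for r in records]

docs = [{"id": 1, "text": "Apple EARNINGS Call"},
        {"id": 1, "text": "Apple EARNINGS Call"},
        {"id": 2, "text": "Fed Statement"}]
print(queue.process(docs))
```

In a production system each job would of course be deployed and scheduled independently; the point is only that small composable transforms let quants add steps without touching the core pipeline.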
The next step is cleansing the ingested content. Our team employs a suite of proprietary neural network models to sort and categorize communications based on named entities. This technology, technically a suite of bidirectional long short-term memory (LSTM) models, not only recognizes entities such as people, places and companies, but employs novel voting software to accurately resolve spelling errors, nicknames and abbreviations. For example, our system identifies not only the difference between an article about Tim Cook from Apple and an article about cooking apples, but that Tim Cook and Timothy Cook (as well as all other variants) are the same person tied to the same permanent identifier. Another layer of neural network models determines attribution. After all, an article about Tim Cook by Apple is not the same thing as an article about Tim Cook released by any other organization.
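The voting idea behind entity resolution can be illustrated with a toy sketch: several cheap matchers each nominate a canonical identifier for a raw mention, and the majority vote wins. The entity table, matchers and identifier scheme below are hypothetical stand-ins for what would in practice be learned models.

```python
from collections import Counter
from difflib import SequenceMatcher

# Toy canonical-entity table: permanent identifier -> known aliases.
ENTITIES = {
    "PERSON:tim_cook": ["tim cook", "timothy cook", "timothy d. cook"],
    "ORG:apple": ["apple", "apple inc", "apple inc."],
}

def exact_matcher(mention):
    # Vote only when the mention is a known alias verbatim.
    for eid, aliases in ENTITIES.items():
        if mention in aliases:
            return eid
    return None

def fuzzy_matcher(mention):
    # Vote for the closest alias, tolerating typos and abbreviations.
    best, score = None, 0.0
    for eid, aliases in ENTITIES.items():
        for alias in aliases:
            s = SequenceMatcher(None, mention, alias).ratio()
            if s > score:
                best, score = eid, s
    return best if score > 0.8 else None

def resolve(mention):
    mention = mention.lower().strip()
    votes = Counter()
    for matcher in (exact_matcher, fuzzy_matcher):
        eid = matcher(mention)
        if eid:
            votes[eid] += 1
    return votes.most_common(1)[0][0] if votes else None

print(resolve("Timothy Cook"))  # both matchers agree on PERSON:tim_cook
print(resolve("Aple Inc"))      # the fuzzy matcher catches the typo
```

A real system would replace these matchers with LSTM-based taggers, but the voting layer that arbitrates between their candidate identifiers works on the same principle.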
Analyzing and Extracting Insights
Once the data inputs have been properly tagged and sorted, they can then be analyzed. At Prattle, this means processing the text through a proprietary sentiment analysis engine built to map the dyadic relationship between every linguistic unit in each communication. Our engine identifies how every word, phrase, sentence and paragraph relates to the others in an effort to map every potentially salient linguistic pattern.
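As a toy illustration of dyadic mapping, the sketch below counts how often pairs of content words co-occur in the same sentence. A production engine would score far richer relationships (phrases, cross-sentence links, syntax), but the underlying object is the same: a weighted set of pairwise relations between linguistic units. Everything here, including the stopword list, is an invented simplification.

```python
from itertools import combinations
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "we", "in", "our"}

def dyads(text):
    """Count co-occurring content-word pairs within each sentence."""
    pairs = Counter()
    for sentence in text.lower().split("."):
        words = sorted({w for w in sentence.split() if w not in STOPWORDS})
        pairs.update(combinations(words, 2))  # every within-sentence dyad
    return pairs

text = "We raised guidance. Margins improved. We raised capital spending."
print(dyads(text).most_common(3))
```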
Prattle then marries these patterns to a financial model. In the case of earnings calls, Prattle’s standard model controls for all common quantitative factors that affect price movement and isolates the residual price movement unaffected by those factors. Using this residual price movement, Prattle then computes Cumulative Abnormal Return (CAR) as a measure of alpha. Finally, Prattle ties the linguistic patterns in each communication to the CAR over a specific time horizon, allowing us to determine which patterns consistently correlate with specific market responses.
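The CAR calculation itself is standard event-study arithmetic and can be sketched directly. Below, a simple market model (one factor, fit by ordinary least squares over a pre-event estimation window) stands in for Prattle’s richer multi-factor controls; the return figures are invented for illustration.

```python
def market_model(stock, market):
    """Fit r_stock = alpha + beta * r_market by least squares."""
    n = len(stock)
    mean_s, mean_m = sum(stock) / n, sum(market) / n
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(stock, market))
    var = sum((m - mean_m) ** 2 for m in market)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta

def car(stock_event, market_event, alpha, beta):
    """Sum the residual (abnormal) returns over the event window."""
    return sum(s - (alpha + beta * m)
               for s, m in zip(stock_event, market_event))

# Estimation window: daily returns before the earnings call (invented).
est_stock  = [0.010, -0.004, 0.006, 0.002, -0.008]
est_market = [0.008, -0.002, 0.005, 0.001, -0.006]
a, b = market_model(est_stock, est_market)

# Two-day event window around the call: the stock moves more than the
# market model predicts, and the excess is the abnormal return.
car_value = car([0.03, 0.01], [0.005, 0.002], a, b)
print(round(car_value, 4))
```

Tying each communication’s linguistic patterns to a CAR like this one, over a fixed horizon, is what lets a model separate language that merely co-occurs with price moves from language that consistently precedes them.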
We also use machine learning to continually update each entity-specific lexicon any time new language appears, or existing language is used in a new way. This means the system continues to learn, improve and evolve over time, just as human language does.
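One simple way to picture this continual updating is as an exponential moving average: when a new document’s market response is observed, each term’s score is nudged toward that response, and previously unseen terms enter the lexicon at a neutral score. The scores, learning rate and scaling below are invented for illustration and are not Prattle’s actual method.

```python
LEARNING_RATE = 0.1  # how strongly one new observation moves a score

def update_lexicon(lexicon, terms, observed_response):
    """Nudge each term's score toward the observed market response."""
    for term in terms:
        old = lexicon.get(term, 0.0)  # unseen terms start neutral
        lexicon[term] = old + LEARNING_RATE * (observed_response - old)
    return lexicon

lexicon = {"headwinds": -0.4, "record": 0.5}

# A new call uses "headwinds" and the previously unseen "tailwinds",
# and the market responds positively (a scaled response of +0.2).
update_lexicon(lexicon, ["headwinds", "tailwinds"], 0.2)
print(lexicon)
```

Under this scheme a term whose usage shifts meaning (say, “headwinds” used ironically in a bullish context) gradually drifts toward its newly observed market impact rather than keeping a stale score forever.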
Large Firms Are Jumping into the Deep End
While this may seem like a thorough account of Prattle technology, it barely scratches the surface of the day-to-day struggles of building and maintaining a system supporting alternative data generation. The key takeaway is that it takes a great deal of infrastructure to generate and maintain alternative data systems. It is not sufficient to say your company is “in alt data.” It takes work to build systems and integrate these signals into investment processes.
These lessons are lost on many asset managers, who have poured money into swanky conferences and tech-forward branding initiatives rather than building solid alternative data infrastructure. Many of these institutions have tried to jump right into evaluating and even building alternative data without realizing how difficult and technology-intensive the process actually is. These asset managers would have been better off spending more on data science personnel and modern computing infrastructure and less on marketing.
A big part of the problem is that financial institutions ranging from banks and large asset managers to smaller hedge funds are struggling to find capable data scientists and technologists to aid in updating their technology infrastructures. The result is that many institutions currently have in place expensive, insufficient and/or outdated architecture that can barely handle market data, let alone alternative data signals.
Although the picture appears grim right now, it is not all bad. Asset managers are coming to the realization that they need to invest in modern technology infrastructure and skills. Moreover, it is not too late to reap the benefits. The tech-savvy hedge funds are not as far out in front as some believe. Many quant funds are plagued by unwieldy garbage-can models stuffed with variables to pick up incremental variance. Standard testing procedures at many of these funds fail to account for the possibility that alternative data could replace many of these variables, improving performance and cutting gross data spend. Asset managers investing in new technology infrastructure can leapfrog these outdated and misguided systems, thereby outcompeting the legacy quant funds, especially those that are too confident in their legacy systems to evolve.