Building a Data Lake Ops Platform that Can Handle the Reality of Data
The following guest article was written by Thomas Combes, Chief Technology Officer at Eagle Alpha, an alternative data aggregation platform that also provides supporting advisory services for data buyers and vendors.
The Data Circus
Much has been said in the data-world about ideal states: “Data at your fingertips”, “Plug-and-play insights”, “Empower your whole organization with data”, “Democratizing insights from data”. The list is long, and to be honest, does sound great. There is value in setting such big-picture goals in an organization, but often the route to go from here to there is a lot harder than initially considered.
Data Warehouses such as Snowflake are often proposed as the solution that ushers in this step-change in organizational performance. If only your data was one well-defined, easy-to-understand, definitely maintained query away, everyone would be able to incorporate it into their day-to-day decisions.
At Eagle Alpha we think Data Warehouses can be useful, but often find ourselves in a chicken-and-egg situation with them. As mentioned, Data Warehouses do their best when the data is organized, but who is supposed to do that organizing and how? How do you know what to set up so that data users in your organization can hit the ground running, when all you have are 300,000 poorly named CSV files of unknown content scattered across 1500 sub-directories of sub-directories?
The ideal starts to drift pretty far away when that is the reality – a reality that many organizations face even with their internal data that they nominally can control the production of. Add in the requirement to incorporate external 3rd-party sources of data, and soon addressing the ambiguity and confusion that stems from this data circus becomes the all-consuming job of many data engineering teams.
What is a Data Lake, really?
From our experience, there’s a good chance data in a Data Warehouse experienced a part of its life as a file in a Data Lake somewhere. “Data Lake” is an intriguing term that could be the topic of an article itself (does the data float?), but for the sake of this discussion, I’ll define a Data Lake as underpinned by a cloud-based object storage system that has API-based access methods (Amazon S3 being the most recognizable example).
Because these systems expose APIs to interact with the objects in storage, many complex systems can be built off of them, including data query engines, event-queuing systems, and even entire web applications. Data-related tools such as Databricks that have been built up around these API endpoints combined with the storage system itself are what would constitute a Data Lake.
One of the best and worst aspects of a Data Lake is that there are very few restrictions on data structure. Whereas a Data Warehouse usually has Databases, Schemas, Tables, etc., the object storage systems of Data Lakes have only very base-level organizational concepts, mainly files and folders. While this might sound elegantly simple, having almost no restrictions on file formats, directory organization, or file naming conventions turns into a detriment when having to share your Data Lake with your colleagues, let alone having to share it with a 3rd party that has different ideas on how best to organize and present data.
Go From Here to There, Fast
Our own realization that not having guardrails could be a detriment came in 2020 when we started facilitating deliveries of trial data between data vendors and data buyers. We operate with a “just-in-time” posture when it comes to these data deliveries, meaning often we don’t hold the data in our Data Lake prior to a trial request being made.
Due to delays in us being provided access combined with requirements to make data accessible to the data buyer upon commencement of the trial, we have, on several occasions, had to perform an entire access-to-delivery turnaround in less than 24 hours for terabyte-size datasets with which we had little prior data structure awareness. In most cases, our delivery requirement on the data buyer side was a Data Warehouse such as Snowflake, whereas in almost every case, the data vendor gave us access to their object storage location. Our value-add is to provide the routing between these two disconnected data settings.
We found ourselves needing to know with a high degree of accuracy what we received access to, and to build up that awareness very rapidly, in order to fulfill these delivery requirements in tight timeframes. We tried a few 3rd party tools to “scan the data”, each of which broke, sometimes silently, from the assortment of edge cases we were experiencing, resulting in erroneous information on what we were given access to, and leading to costly and manual rework.
The parade of edge cases we ran into trying to translate a file/directory-based dataset into a database/table-based dataset was so eye-opening that we published a Data Style Guide on the topic. Still, we couldn’t just hope for the best that the 100s of new data providers we meet every year will all standardize on any one methodology, given that as mentioned, a Data Lake doesn’t have many guardrails.
Fast-forward to 2022, and our systems have matured to the point where, in addition to the trial facilitation service, we now offer self-service data validation and auto-documentation services to data vendors. A data vendor can now create a dataset profile on our platform and connect their production, ongoing data source directly to their data profile, allowing vendors to offer data-linked, source-of-truth documentation and ongoing 3rd party data validation to the market of data buyers. While this might seem quite different to fulfilling trial deliveries, we encountered more pronounced versions of many of the same challenges because we were building these systems to be automated and self-service to data vendors. We needed the data pipelines not only to build themselves but to build themselves to withstand the reality of data.
Edge-Case Development Methodology
To address these issues, we developed the priority needs that our data infrastructure would have to address. We had a need for:
- Speed of knowledge-building: we needed to generate in near-real-time a breadth and depth of knowledge metadata about a dataset being given to us in a file/directory form without pre-conceived notions of what the data should look like.
- Cloud Scale: we needed the system to scale to handle typical “big data” problems entirely in a cloud environment. This means not only that it could handle knowledge-building operations on terabyte-size datasets, but also datasets with 100,000+ files that are beyond the ability to logically bookkeep by hand.
- Support complex data structures: we needed a system that could automatically detect when a dataset contains multiple “tables” and, instead of just lumping everything together into one single schema, logically organize files under differentiated “tables” based on their contents.
- Robust to Data Variety: While we defined a few top-level file formats that we would support based on an 80% rule (CSV, Parquet, and JSON), we needed a system that would both derive accurate metadata and perform setup operations to achieve a “query-able state” for the numerous edge cases in data format flavors, column data types, directory structures, and compression formats we came across.
- Easily Modifiable: one of the more terrifying aspects to build for, we needed a system that could build and rebuild databases and tables on the fly if a vendor user wanted to rename a table, change a column type, merge two tables, or perform any other setup operations.
- Minimal Data Copying: from our experience, the majority of data provenance loss and confusion can be traced back to a transformed copy of a dataset written to files. It is also expensive to store for large datasets and to recompute when new data is published. Our system needed to perform any necessary transformations “one-shot” at query runtime without relying on chaining together multiple read-transform-export-read operations.
- Lazy Costs: due to the unpredictable nature of supporting requests from many 3rd parties on both data buyer and vendor sides, we needed a system that not only cost very little when not in use, but also one that we could optimize to be workload cost-efficient by capitalizing on our prior knowledge-building about each dataset.
Serverless for Data Operations
Arguably the most substantial paradigm shift in cloud infrastructure in the last 5 years is the advent of “serverless” technologies. Whereas first-generation cloud infrastructure looked and felt a lot like an on-prem setup running off-prem, serverless computation requires a process rethink and sometimes a complete rebuild to harness the advantages of the architecture.
At its core, “serverless” is inherently usage-based. You don’t pay to have servers running waiting for computation requests to be received, instead just purely once a requested computation is performed. Infrastructure management is vastly simplified because no longer are your computations running on specific machines that you have pre-defined specifications for.
While you hear about many interesting applications being built using serverless architectures, it is not very common to find truly serverless applications in the data space yet. Most data applications still use some form of server management, even if they increasingly can scale up and scale down “workers” computing on these servers and sometimes the servers themselves. Because some of the initial serverless applications were API endpoints powering web applications, generic serverless technologies are highly optimized for cold startup vs. container-based data applications running on large compute clusters which can take many minutes to initialize. When assessing our architectural needs as mentioned above, we found a good fit in the usage-based pricing, massive scalability and parallelism, and quick-start capabilities of serverless infrastructure for our use case.
Data Knowledge-Gathering with AWS Lambda
Because our native cloud environment is AWS, we utilized the AWS Lambda ecosystem for serverless functions, and the Serverless Framework for development and deployment of our serverless applications. We utilize Python-based Lambda functions for many of our “knowledge-gathering” operations, with some functions wrapped in API endpoints for communication with our web application front-end.
Serverless architecture is event-driven by nature, and we utilize several different event triggers to kick off function computations in our platform. One of the more challenging operations to architect has been the automated data sensing operation that we perform across all dataset files once we receive validated access from a data vendor. We take advantage of the parallel invocation capability of Lambda functions by distributing files to sense across different Lambda invocations. We utilize a delegator-style approach to managing the distribution (another Lambda function with an API Gateway that communicates with our front-end), so that if we are granted access to an S3 bucket containing 110,000 files that need to be sensed, we divide up the files into batches of 10, and invoke 11,000 Lambda functions that all run independent of one another.
In order to not explode costs due to reading large files (Lambda functions are priced on compute time and memory reservation), our data sensing operation takes advantage of the S3 API to read small chunks of files to sample the contents and derive file format, compression, data parsing specifications, column schema, and data types, among other pieces of information. Results of these sensing operations are queued as messages for processing by a “pooling” Lambda function that is event-triggered by messages arriving in a queue. Taking advantage of this parallelism, the sensing step of the operation for the complete file-set of most datasets takes less than a minute, meaning that data vendors can legitimately receive same-session feedback on their dataset based on sensing of their entire dataset.
Virtualized Datasets with AWS Athena
So once we’ve built up the file-level knowledge of a dataset, what do we do with that information? For us the ultimate goal of this process is to progress a dataset from access in a file/directory form to that dataset residing in a “query-able state” with as little manual setup as possible. Importantly, we are not looking to actually load any data in a database, but rather achieve query capability on the data at rest in the vendor storage location, addressing one of our stated needs of minimal data copying.
The technology we use to query this data is the data lake query engine AWS Athena, another mainstay serverless technology. At its core, Athena is a serverless “database” that queries data in AWS S3 (object storage) using standard SQL syntax. Pricing is based on the amount of data scanned for a query only, so unlike traditional databases, there is no ongoing running cost.
An important component of how we utilize Athena is that we construct entirely virtualized database tables and views based on the knowledge gathered in the above-mentioned sensing operations. As Athena is not running all the time, these are more instruction sets for how Athena interprets queries that are issued to it. We take advantage of the Glue API (the data catalog service that underpins Athena) to automatically set up database Tables (which manage the parsing instructions for the data files and have direct path links to directories in S3) and Views (which we recreate often based on transformation specifications).
An example of utilizing this virtualization technique is the detection of dates or timestamps in string-type columns. Many file formats do not allow native declaration of date or timestamp types (CSV being an example of this), however, many SQL-based operations rely on these columns being DATE or TIMESTAMP types in order to perform time series aggregations. In order to detect these columns, we 1) generate a preliminary data sample by querying the initial View, 2) scan the content returned for any date or timestamp-formatted string columns, and 3) reconstitute the View with SQL CAST to date or timestamp operations added to any detected columns.
A second example utilizing virtualization to avoid copying transformed data is in the combination of multiple schemas into single Tables. If our data sensing operation finds that 300 files share the same list of 13 columns, they get grouped into a single “schema” that we use to create a Table in our Athena database. However, if a new column is added to the newly published files in a dataset (sometimes called schema drift), these new files would be grouped under a different schema and therefore different Table. One way to combine these files to allow for querying the 13- and the 14-column data together is to transform and write a single new set of data files to storage and query that data. Because our goal is minimal data copying, we instead virtualize this process by setting up individual Tables tied to each schema, and set up a single View that contains the SQL operation necessary to union the two schemas. To an end user querying the View, the data appears to be “one table” even though that View is tied to multiple underlying Tables.
These are just two examples where we have used virtualization derived from upfront knowledge-building to avoid excessive data copying and data provenance loss. The instruction set virtualized in this way can help reduce the dependency to have intermediate transformation work completed prior to allowing an internal user to query the data.
Standing on the Lakeshore
Building out a data lake operations platform to withstand the wide variety of data that Eagle Alpha receives has been quite an interesting challenge, and one that I think many readers of this article can relate to. The data-world has an over-abundance of shiny objects for machine learning, AI, and advanced analytics, but the fuel needed for all of these tools to work is data that is understood. I hope that this article stirs some conversation about an area of the data-world that is quite often overlooked or considered “solved” but that we consider essential to the successful use of data.
A bit on Eagle Alpha as a technology organization: because we are a small team that needed to build this for ourselves, we are a relatively open-book when it comes to the technology that we’ve built. Please do feel welcome to reach out if you’d like to discuss our data operations in more detail, we are always happy to hear from like-minded data practitioners.