Engineers should write ELT
In 2016, Jeff Magnusson from Stitch Fix wrote a seminal blog post titled “Engineers Shouldn’t Write ETL”. The post laid out a compelling vision in which engineers build a platform that provides self-service tooling to “full-stack” data scientists, who are then responsible both for building data pipelines and for building advanced functionality on top of that data.
This model has the attractive quality that no group is stuck solely with the job of moving data from place to place. Ownership of data pipelines is shared amongst operators. Jeff writes that ETL is a soul-sucking job that no one wants to do, and not a burden that should be placed on a single group.
How has the world changed since 2016? In a lot of ways, it’s barely recognizable. ETL (extract, transform, load) has been replaced with ELT (extract, load, transform). The “Modern Data Stack” has emerged, underpinned by the rise of cloud-native data warehouses like Snowflake (my current employer) that finally provide scalable SQL on top of Internet-scale data. Fishtown Analytics’s dbt project has matured significantly, providing an orchestration layer that empowers anyone who can write SQL to be a Data Engineer. And finally, as a result of all this, Data Infrastructure roles have become less about engineering and more about operations.
Why was ETL such a soul-sucking job in 2016? Most ETL jobs at that time were data integration jobs -- move data from system A (e.g., Netsuite or Salesforce) to system B (e.g., Redshift). These jobs are relatively complex, and there’s a never-ending list of integrations to be built. In the startup spirit of “do the most impactful thing”, ETL engineers were often focused solely on getting data into the data warehouse and didn’t have time or energy to do anything with that data -- that task was left to “data scientists”, who were left with the unenviable job of sorting through yet another pile of raw data to make some sense of it. Oftentimes, these data scientists lacked any context about the source system, making this job all but impossible.
In today’s “ELT” world, “E” and “L” are often done with off-the-shelf tools like Fivetran that enable push-button integrations. The “T” step is now where the majority of engineering effort can focus, aided by tools like dbt. As such, I’ve noticed a renewed interest in data modeling techniques like Kimball or Data Vault. Several years ago, only large enterprises with lots of resources could afford a data engineer with the time or energy to build out a star schema. Today, it’s not uncommon for a single data engineer to own the entire stack, including applying best practices such as Kimball’s dimensional modeling.
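To make the “T” step concrete, here is a minimal sketch of a Kimball-style dimensional transform: splitting a raw, denormalized extract into a dimension table and a fact table joined by a surrogate key. Table and column names are hypothetical, and plain SQL via Python’s sqlite3 stands in for what would normally be dbt models against a warehouse.

```python
import sqlite3

# Hypothetical raw, denormalized extract -- the output of the "EL" step.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (
    order_id INTEGER, customer_email TEXT, customer_region TEXT, amount REAL
);
INSERT INTO raw_orders VALUES
    (1, 'a@x.com', 'EMEA', 100.0),
    (2, 'a@x.com', 'EMEA', 50.0),
    (3, 'b@y.com', 'APAC', 75.0);

-- The "T" step: split the raw table into a dimension and a fact,
-- with a surrogate key on the dimension.
CREATE TABLE dim_customer AS
SELECT ROW_NUMBER() OVER (ORDER BY customer_email) AS customer_key,
       customer_email, customer_region
FROM (SELECT DISTINCT customer_email, customer_region FROM raw_orders);

CREATE TABLE fact_orders AS
SELECT r.order_id, d.customer_key, r.amount
FROM raw_orders r
JOIN dim_customer d ON d.customer_email = r.customer_email;
""")

# A user with basic SQL can now answer business questions by joining
# the fact table to its dimensions.
rows = conn.execute("""
    SELECT d.customer_region, SUM(f.amount)
    FROM fact_orders f JOIN dim_customer d USING (customer_key)
    GROUP BY d.customer_region ORDER BY d.customer_region
""").fetchall()
print(rows)  # [('APAC', 75.0), ('EMEA', 150.0)]
```

In a real dbt project, each `CREATE TABLE ... AS SELECT` would live in its own model file and dbt would handle ordering and materialization; the modeling decision itself is the same.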
In general, I strongly believe that data modeling is the next “critical skill” that data engineers (or, as Fishtown likes to call them, Analytics Engineers) need to be successful. An ideal data warehouse is one where a user with basic SQL knowledge can easily navigate all the schemas and tables and, without any particular domain knowledge, expertise, or prior understanding of the warehouse’s structure, write queries that derive actual business insights. Just as important (if not more so), the data model should protect a naive, well-intentioned business analyst from writing perfectly sensible SQL and getting bad results anyway (as often happens, for example, with a single fact table storing values at varying levels of granularity). This sounds simple in theory, but in practice it requires a lot of knowledge and thoughtfulness. In short, engineers should write (and be proud of their) ELT. Or at least T.
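The mixed-granularity trap mentioned above is worth seeing in miniature. The sketch below (hypothetical table and column names, sqlite3 in place of a warehouse) shows a fact table that stores both order-level totals and their constituent line items, so a perfectly sensible `SUM` silently double-counts revenue:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- A badly modeled fact table mixing two grains: order-level totals
-- AND their constituent line items live in the same table.
CREATE TABLE fact_sales (grain TEXT, order_id INTEGER, amount REAL);
INSERT INTO fact_sales VALUES
    ('order',     1, 100.0),   -- order 1 total
    ('line_item', 1,  60.0),   -- order 1, item A
    ('line_item', 1,  40.0),   -- order 1, item B
    ('order',     2,  50.0),   -- order 2 total
    ('line_item', 2,  50.0);   -- order 2, single item
""")

# Perfectly sensible SQL, silently wrong: the sum spans both grains.
naive = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(naive)  # 300.0 -- actual revenue is 150.0

# Restricting to a single grain recovers the right answer; a properly
# modeled warehouse would enforce one grain per fact table so the
# naive query is correct in the first place.
correct = conn.execute(
    "SELECT SUM(amount) FROM fact_sales WHERE grain = 'line_item'"
).fetchone()[0]
print(correct)  # 150.0
```

The fix is not cleverer queries from analysts; it is a modeler deciding up front that each fact table has exactly one grain.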