Open the floodgates!
I’m talking about data pipelines, of course.
In today’s data-hungry world, business intelligence, analytics, and machine learning run on data coming from a large number of different sources. Volumes of data are being generated from sensors, business systems, CRMs, and mobile devices, to name just a few. Organizations from healthcare to retail are increasingly building secure data sharing capabilities to leverage each other’s data. And the trend towards more distributed data systems is only accelerating. According to Gartner, by 2025, 75% of enterprise-generated data will be created and processed outside a traditional centralized data center or cloud.
As more data is being piped across distributed systems, the ETL pipelines moving data from point A to point B and powering data integrations are growing in volume, scale, complexity… and importance.
What are ETL pipelines?
ETL is an acronym for “extract, transform, and load.” In the ETL process (or ELT, a variant in which loading happens before transformation), data engineers extract a copy of data from distributed sources, transform the raw data into a format that can be used by downstream intelligence, analytics, and machine learning applications, and then load the converted data into a data warehouse or data lake so it is accessible to those applications. ETL pipelines are the tools and activities built by those data engineers to automate the entire process.
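The three steps can be sketched in a few lines of Python. This is a minimal illustration using an in-memory “source” and “warehouse” with made-up field names; a real pipeline would read from databases or APIs and load into an actual warehouse or data lake.

```python
# Minimal extract-transform-load sketch (illustrative data and schema).

def extract(source):
    """Pull a copy of raw records from a source system."""
    return list(source)

def transform(records):
    """Normalize raw records into the schema downstream apps expect."""
    return [
        {"user_id": r["id"], "amount_usd": round(r["amount_cents"] / 100, 2)}
        for r in records
        if r.get("amount_cents") is not None  # drop incomplete rows
    ]

def load(warehouse, records):
    """Append the transformed records to the destination table."""
    warehouse.extend(records)

source = [{"id": 1, "amount_cents": 1250}, {"id": 2, "amount_cents": None}]
warehouse = []
load(warehouse, transform(extract(source)))
print(warehouse)  # → [{'user_id': 1, 'amount_usd': 12.5}]
```

In an ELT variant, the raw records would be loaded first and the `transform` step would run inside the warehouse instead.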
The simplest of pipelines will automatically pipe data from one data source to one data warehouse. But in reality, most data pipelines are much more complex, extracting and transforming data from a number of different sources, which exist in different formats and levels of cleanliness. To add to the complexity, many organizations are managing tens, if not hundreds, of pipelines and data integrations. Building, running, and managing these pipelines becomes costly. According to McKinsey, a typical mid-size financial institution spends $60-$90M per year on data access, which includes pipelines.
What drives ETL pipeline costs?
ETL process costs can be broken out into four basic categories:
- Engineering - the cost of your engineering team’s time and the tools they use to build and maintain pipelines
- Data movement - per-GB charges from cloud providers like AWS, Azure, or GCP when data moves between regions or off their cloud
- Data storage - charges from cloud providers, data warehouses, or data lakes to store a second copy of the data once it has been transformed and loaded
- Compute - costs to provision and utilize cloud infrastructure to execute ETL queries
For all four categories, you can see how scale and complexity can lead to runaway costs. As the number of data sources and the volume of data grows, you’re paying to move and store more data, you’re provisioning more infrastructure for the compute, and your engineering team is spending more time building and maintaining complex pipelines.
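That scaling behavior can be made concrete with a toy cost model. The rates below are purely illustrative, not real cloud prices, but they show how the four cost categories each grow with data volume and pipeline count.

```python
# Toy monthly cost model for ETL pipelines (all rates are illustrative).

def monthly_pipeline_cost(gb_moved, gb_stored, compute_hours, num_pipelines,
                          egress_per_gb=0.09, storage_per_gb=0.023,
                          compute_per_hour=0.50, eng_hours_per_pipeline=10,
                          eng_rate=100):
    movement = gb_moved * egress_per_gb          # data movement
    storage = gb_stored * storage_per_gb         # second copy of the data
    compute = compute_hours * compute_per_hour   # ETL query execution
    engineering = num_pipelines * eng_hours_per_pipeline * eng_rate
    return movement + storage + compute + engineering

small = monthly_pipeline_cost(gb_moved=100, gb_stored=100,
                              compute_hours=20, num_pipelines=2)
large = monthly_pipeline_cost(gb_moved=10_000, gb_stored=10_000,
                              compute_hours=2_000, num_pipelines=50)
print(small, large)  # costs grow with volume and pipeline count
```

Note that in this sketch the engineering term dominates at scale, which matches the intuition that maintaining many complex pipelines is the biggest line item.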
Whether you’re building your first pipeline or managing hundreds, right now is always the best time to start controlling your data pipeline costs. Pipelines will only continue to get more complex as your data grows, which will make them more costly. Getting into the habit of considering costs alongside pipeline health, latency, and throughput will pay big dividends in the long run.
ETL Best Practices
Here are five tips for managing your ETL pipeline costs:
- Make cost management part of your process - It’s a trite phrase, but it’s true - you can’t manage what you can’t measure. Allegro data engineering leader Marcin Kuthan describes this as applying FinOps discipline to your data engineering organization. Label all cloud resources used by data pipelines so that you can track pipeline costs in billing reports. Give your data engineering team ownership over cloud resource usage and costs, and ensure that cost analysis is part of all development work.
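Once resources are labeled, billing reports can be rolled up per pipeline with a simple aggregation. The sketch below assumes a hypothetical billing line-item format in which each resource carries a `pipeline` label; real cloud billing exports differ in shape but support the same idea.

```python
# Cost attribution via resource labels (hypothetical billing format).
from collections import defaultdict

billing_items = [
    {"resource": "warehouse-cluster-1", "labels": {"pipeline": "orders_etl"}, "cost": 120.50},
    {"resource": "bucket-raw",          "labels": {"pipeline": "orders_etl"}, "cost": 14.20},
    {"resource": "spark-job-7",         "labels": {"pipeline": "clicks_etl"}, "cost": 88.00},
    {"resource": "vm-legacy",           "labels": {},                         "cost": 30.00},
]

def cost_by_pipeline(items):
    """Sum billing line items by their pipeline label."""
    totals = defaultdict(float)
    for item in items:
        totals[item["labels"].get("pipeline", "untagged")] += item["cost"]
    return dict(totals)

print(cost_by_pipeline(billing_items))
```

An “untagged” bucket like the one above is useful in practice: it surfaces resources that slipped through your labeling policy.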
- Buy an ETL tool, instead of building - According to Fivetran and Wakefield research, a typical data engineering team will spend about $520,000 annually to build and manage custom ETL pipelines. There is a wide range of off-the-shelf ETL tools on the market that offer pipeline-building features, have pre-built connectors for dozens of databases, and support most data types, so your team doesn’t need to reinvent the wheel. Most ETL tools (dbt is one popular example) are priced on a subscription model, but depending on your team’s needs, the cost savings can offset the price of the subscription.
- Don’t move more data than necessary - Work with your stakeholders to understand what data is necessary for their BI, data analytics, or machine learning tasks, and then apply a filter to only load those columns or datasets that are needed. Consider newer technologies that don’t require copying data at all, like zero copy operations, if working within Snowflake or Databricks, or federated learning and federated analytics if working across data platforms and edge devices. By cutting down the data loaded in your pipeline, you’ll avoid incurring costs from moving the data and costs from storing a new copy of the data.
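Applying that filter can happen right at extraction time: project each row down to the agreed-upon columns before anything leaves the source. Column names here are illustrative.

```python
# Keep only the columns stakeholders need, so less data is moved and stored.

NEEDED_COLUMNS = {"order_id", "amount", "region"}

def project(rows, columns):
    """Drop every column not in the requested set from each extracted row."""
    return [{k: v for k, v in row.items() if k in columns} for row in rows]

raw_rows = [
    {"order_id": 1, "amount": 19.99, "region": "EU",
     "customer_email": "a@example.com", "internal_notes": "call back"},
]
slim_rows = project(raw_rows, NEEDED_COLUMNS)
print(slim_rows)  # email and notes columns never leave the source
```

The same projection expressed in SQL (`SELECT order_id, amount, region FROM …`) lets the source database do the trimming, which is usually even cheaper.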
- Only pay for the infrastructure you need - Leverage serverless and on-demand database, data warehouse, and data lake infrastructure like Databricks, Snowflake, or AWS Glue that automatically provisions only the exact resources needed for your job, then tears them down once the data is loaded. This way, you’re not paying for compute clusters that you’re not using.
- Optimize your transformation queries - Filter early in the pipeline to reduce overall data movement. For compute-intensive operations, make sure you’re working with the most efficient data types. And consider the appropriate partitioning strategies to minimize the communication needed between processing units. Your queries will be more efficient, run quicker, and use less costly compute power.
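The “filter early” advice is easy to see in miniature: applying the predicate before a join means the join only touches matching rows. The tables and predicate below are illustrative; in SQL, the same principle means pushing `WHERE` clauses ahead of joins where the optimizer doesn’t already do so.

```python
# Filtering before vs. after a join (illustrative in-memory tables).

orders = [{"id": i, "region": "EU" if i % 2 else "US", "customer": i % 3}
          for i in range(1000)]
customers = {c: {"name": f"c{c}"} for c in range(3)}

# Late filter: join all 1000 rows, then discard the non-EU half.
late = [row for row in ({**o, **customers[o["customer"]]} for o in orders)
        if row["region"] == "EU"]

# Early filter: filter first, so the join only touches 500 rows.
eu_orders = [o for o in orders if o["region"] == "EU"]
early = [{**o, **customers[o["customer"]]} for o in eu_orders]

assert early == late  # same result, half the join work
```

The result is identical either way; only the amount of data flowing through the expensive middle of the pipeline changes.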
ETL Pipeline Alternatives
It might sound heretical in an article about pipelines, but for some machine learning and analytics tasks there is no need to build complex and costly pipelines at all. Federated learning and federated analytics enable model training and data analytics across distributed data, so you can generate insights while avoiding the costs of moving large datasets.
It’s quick and easy to start experimenting with federated learning and analytics using integrate.ai’s developer tools, which manage all of the federated infrastructure, security, and data science tooling. You may find that you can replace some or all of your costly data pipelines with federation. Click here to learn more.