
You really care about the insights from your data, or the ML models you train with that data. And that forces you to care about the data itself. You may also care about how you clean, curate, and transform the raw data into usable data. The data pipeline, therefore, has at most a nuisance value for you: an unavoidable chore.

Considering this pyramid of what you really care about and what you are forced to care about, where do you place data pipeline orchestration tools? How much of your mind share does that choice occupy? My guess is that it is very little, almost nothing, or maybe even total indifference. Yet it is one of the most critical decisions in building your data and ML infra.

Let me lay out the choices so you can make informed decisions. But first, let's quickly recap data, ML, and MLOps pipelines:

- Data pipelines get data to the warehouse or lake.
- ML pipelines transform data before training or inference.
- MLOps pipelines automate the ML lifecycle: training, deployment, and monitoring.

The boundaries between these three are fluid and overlapping. In the future, they may converge and become one.

Let's examine the choices that open source and the various cloud ecosystems offer.

Open Source Data Pipeline Orchestration Tools

The Apache ecosystem was, and continues to be, an important part of the data stack. No wonder 2 out of the 3 tools listed here are Apache projects.

Apache Oozie

Apache Oozie has been around for quite a while for executing workflow DAGs. It integrates nicely with the Hadoop ecosystem. If your organization is already using it, moving away will be a big endeavor; it still has life left and can carry you some distance. But do not set up new projects or new infra with it.

Apache Airflow

At the moment, nobody gets fired for choosing Apache Airflow. It is the default choice, and a pretty good one too. No wonder both AWS and Google Cloud offer a managed Airflow. You can't go wrong by choosing Apache Airflow, but take a look at, and keep an eye on, the next tool in the list.

Flyte

As you expand pipeline orchestration from ETL to machine learning tasks, you may want to check whether Flyte suits your case better. It fills some of the gaps in Airflow w.r.t. machine learning workloads.
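
To make the comparison concrete, here is a minimal sketch of an Airflow DAG. The DAG id, task names, and callables are illustrative placeholders, not from any real pipeline:

```python
# Minimal Airflow 2.x DAG: extract -> transform -> load, run daily.
# All names and function bodies are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull raw data from the source")


def transform():
    print("clean and reshape the raw data")


def load():
    print("write curated data to the warehouse")


with DAG(
    dag_id="etl_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # use schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # declare the DAG edges
```

And here is the same three-step shape sketched with Flyte's flytekit. Inputs and outputs are typed, so the workflow graph is validated before anything runs, which hints at the ML-oriented gaps Flyte fills:

```python
# The same ETL shape in Flyte. Type annotations are required by flytekit;
# names and logic are placeholders.
from flytekit import task, workflow


@task
def extract() -> str:
    return "raw data"


@task
def transform(raw: str) -> str:
    return raw.upper()  # stand-in for real cleaning logic


@task
def load(curated: str) -> None:
    print(f"writing '{curated}' to the warehouse")


@workflow
def etl_workflow() -> None:
    # Flyte requires keyword arguments when wiring tasks in a workflow.
    load(curated=transform(raw=extract()))
```

For a trivial pipeline the two look very similar; the differences show up in typing, data passing between tasks, and ML-specific features.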

Data Pipeline Orchestration Tools on AWS

Amazon is a customer-first company, and its offerings reflect that. AWS services are easy to start with, but the one you pick first may not remain suitable as your use case expands. I suggest not blindly picking the easiest AWS tool, but pausing and thinking a little bit about your future needs. Data pipelines have an uncanny ability to quickly grow and become very complex, and the then-unavoidable migration will carry a good amount of outage risk.

AWS Step Functions

AWS Step Functions is an apt tool for automating business process workflows, but it can be used for building data pipelines too (see the sketch after this section). If your data pipeline is simple and consists of a few steps, this is probably the easiest to start with. But I am very reluctant to advise you to do so.

AWS Data Pipeline

AWS Data Pipeline is a service to move data from AWS and on-premises data sources to AWS compute services, run transformations, and store the results in a data warehouse or a data lake. You can knit together an AWS Data Pipeline with S3, RDS, DynamoDB, and Redshift as data storage, and EC2 and EMR as compute services. It is easy to use, and yet very powerful and versatile.

AWS Glue Workflow

AWS Glue Workflow is another tool on AWS for ETL workflows. It is unclear to me why Amazon has two tools with largely overlapping use cases. I am biased toward using AWS Data Pipeline by default, and Glue Workflow only if the whole data infra is built on AWS Glue and Amazon Athena.

Amazon Managed Workflows for Apache Airflow (MWAA)

If you want to stick with Apache Airflow, MWAA may suit you the most. It is a secure, highly available, managed workflow orchestration service for Apache Airflow. It is a vendor-independent option, and you don't need to master a new tool.
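
For completeness, here is a hedged boto3 sketch of the Step Functions route mentioned above: a two-step pipeline defined in Amazon States Language. The account ID, IAM role, and Lambda ARNs are placeholders you would replace with your own:

```python
# Sketch: create a two-step data pipeline as a Step Functions state
# machine. Assumes the extract/load Lambda functions and the IAM role
# already exist; all ARNs below are placeholders.
import json

import boto3

# Amazon States Language definition: run Extract, then Load.
definition = {
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
response = sfn.create_state_machine(
    name="simple-etl",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",
)
print(response["stateMachineArn"])
```

This terseness is exactly the appeal, and the trap: fine for a few steps, painful once the pipeline grows branches, retries, and data dependencies.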

Data Pipeline Orchestration Tools on Google Cloud

Google is a technology-first company, and it offers only two, clearly differentiable, choices. Both are built on open-source technology.

Cloud Data Fusion

Cloud Data Fusion is a fully managed GUI tool to define ETL/ELT data pipelines. It is based on CDAP, an open-source framework for building data analytics applications. If your team is not tech-heavy and is building mainly analytics applications, then Cloud Data Fusion will suffice.

Cloud Composer

Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. If your workflows are inching toward data science and span hybrid and multi-cloud environments, then Cloud Composer (which is Airflow under the hood) is a better choice.
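
Since Cloud Composer is Airflow under the hood, the DAG sketched earlier runs on it unchanged; deployment is just copying the file into the environment's DAGs bucket. A minimal sketch using the google-cloud-storage client, with a placeholder bucket name (find yours in the Composer environment details):

```python
# Sketch: deploy an Airflow DAG file to a Cloud Composer environment
# by uploading it to the environment's DAGs bucket. The bucket name
# and file path are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("us-central1-my-composer-bucket")  # placeholder
blob = bucket.blob("dags/etl_example.py")
blob.upload_from_filename("etl_example.py")
print(f"uploaded to gs://{bucket.name}/{blob.name}")
```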