Cloud Data Platform: Chapter 3: Part 3: Six-layer data platform architecture – Open source and commercial alternatives

These are my learning notes from the book Designing Cloud Data Platforms by Danil Zburivsky and Lynda Partner. Support the authors by buying Designing Cloud Data Platforms from Manning Publications.

Batch data ingestion

Data ingestion is a major component of any data platform, whether it’s a cloud data platform or a traditional on-premises warehouse. That’s why there are lots of tools, both open source and third-party commercial offerings, that can play this role.

Apache NiFi is one of the popular open source solutions that allows you to connect to various data sources and bring data into your cloud data platform. NiFi uses a pluggable architecture that allows you to create new connectors using Java, but also comes with a large library of existing connectors.
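To make the idea of a pluggable connector architecture concrete, here is a minimal Python sketch. It is not NiFi's API (NiFi connectors are written in Java); the `Connector` base class, the `register` decorator, and the `inline_csv` source type are all hypothetical names invented for illustration. The point is only that the ingestion code looks up connectors by name, so new sources can be added without changing it.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Type


class Connector(ABC):
    """Hypothetical base class: every source type implements read()."""

    @abstractmethod
    def read(self) -> Iterator[dict]:
        ...


# Registry mapping a source type name to its connector class.
CONNECTORS: Dict[str, Type[Connector]] = {}


def register(name: str):
    """Class decorator that plugs a connector into the registry."""
    def decorator(cls: Type[Connector]) -> Type[Connector]:
        CONNECTORS[name] = cls
        return cls
    return decorator


@register("inline_csv")
class InlineCsvConnector(Connector):
    """Toy connector that parses CSV text held in memory."""

    def __init__(self, text: str):
        self.text = text

    def read(self) -> Iterator[dict]:
        yield from csv.DictReader(io.StringIO(self.text))


def ingest(source_type: str, **options) -> List[dict]:
    """Look up the connector by name and pull all of its records."""
    connector = CONNECTORS[source_type](**options)
    return list(connector.read())


rows = ingest("inline_csv", text="id,name\n1,alice\n2,bob\n")
print(rows)
```

Adding a new source in this sketch means writing one more class with the `@register` decorator; `ingest` itself stays untouched.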

Talend is another popular ETL solution that you can use to implement a data ingestion layer. Talend uses an open-core model, where basic functionality is free and open source, but enterprise-level features require a commercial license. Talend is not just an ingestion tool, so it makes more sense to adopt it when you plan to use its whole ecosystem of solutions, which includes data profiling, scheduling, and so on.

There are also many existing third-party SaaS solutions that specialize in bringing data from various sources into your cloud environment. Alooma (acquired by Google in 2019) and Fivetran are two examples of such services. These SaaS services usually provide a very rich set of connectors and additional features such as monitoring or lightweight transformation of data.

The limitations of using SaaS providers for data ingestion include having to send your data through a third party, which may not always be acceptable from a security perspective. Also, these tools specialize in writing data directly into the warehouse, making it challenging to integrate them into a flexible data platform architecture.

Streaming data ingestion and real-time analytics

The fast message store, streaming data ingestion, and real-time data processing components are often implemented with open source solutions instead of cloud-native services, largely because Apache Kafka is a leading open source solution in this space. Kafka offers a fast message bus for your streaming data sources, and its Kafka Connect component lets you ingest data from various sources into Kafka.

Kafka also comes with Kafka Streams, a library for implementing real-time data processing or analytics applications. The reasons to choose Kafka over a cloud-native solution are performance, a richer feature set, and existing expertise. If you have existing investments in Kafka or very specific performance requirements, you may consider implementing your own streaming solution in the cloud using this technology.
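As a rough illustration of what a real-time analytics application computes, the sketch below counts events per key in fixed 60-second windows. It is plain Python, not the Kafka Streams API (which is a Java library), and it reads from an in-memory list rather than a Kafka topic; the window size, the event shape `(timestamp, key)`, and the function name are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed window size for this illustration


def count_per_window(
    events: Iterable[Tuple[int, str]],
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key), one event at a time."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        counts[(window_start, key)] += 1
    return dict(counts)


# Stand-in for a stream of (timestamp in seconds, event type) messages.
events = [(0, "page_view"), (15, "click"), (59, "page_view"), (61, "page_view")]
result = count_per_window(events)
print(result)
```

A real streaming application would also have to handle late-arriving events, state storage, and failures, which is the kind of functionality a framework provides.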

The downside, as with any self-managed open source solution, is that you will need to invest in managing your Kafka cluster.

Orchestration layer

Apache Airflow is a popular open source job orchestration tool. It allows you to construct complex job dependencies, and provides logging, alerting, and retry mechanisms. The Google Cloud Composer orchestration service is based on Apache Airflow technology.
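The three features mentioned above (job dependencies, logging, and retries) can be sketched in a few lines of standard-library Python. This is not Airflow code; `run_jobs`, its parameters, and the `extract`/`load` jobs are hypothetical names chosen for illustration, and the sketch assumes Python 3.9 or later for `graphlib`.

```python
import logging
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orchestrator")


def run_jobs(
    jobs: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
    max_retries: int = 2,
) -> List[str]:
    """Run jobs in dependency order, retrying each failed job."""
    completed: List[str] = []
    # Map every job to the set of jobs that must finish before it.
    graph = {name: depends_on.get(name, set()) for name in jobs}
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                jobs[name]()
                log.info("job %s succeeded on attempt %d", name, attempt)
                completed.append(name)
                break
            except Exception as exc:
                log.warning("job %s failed on attempt %d: %s", name, attempt, exc)
        else:
            # Reached only when every attempt failed.
            raise RuntimeError(f"job {name} exhausted its retries")
    return completed


attempts = {"load": 0}


def extract() -> None:
    pass  # placeholder for real extraction work


def load() -> None:
    """Fails once to simulate a transient error, then succeeds."""
    attempts["load"] += 1
    if attempts["load"] < 2:
        raise IOError("transient failure")


order = run_jobs(
    jobs={"load": load, "extract": extract},
    depends_on={"load": {"extract"}},
)
print(order)
```

Here `load` runs only after `extract` has finished, and its first failure is logged and retried rather than stopping the pipeline.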

The benefit of using Airflow over a cloud service is flexibility: Airflow job configurations are written in Python, which allows you to create dynamic job definitions. For example, you can create an Airflow job that changes its behavior based on external configuration or reaches out to an external service to fetch configuration parameters.
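The pattern behind dynamic job definitions is simply a loop over configuration. The sketch below shows it in plain Python without importing Airflow, so it runs anywhere; in an actual Airflow DAG file, the same kind of loop would create operator instances instead of plain functions. `EXTERNAL_CONFIG`, `make_job`, and `build_pipeline` are hypothetical names, and the inline JSON stands in for a file or an external configuration service.

```python
import json
from typing import Callable, Dict

# Stand-in for configuration fetched from a file or external service.
EXTERNAL_CONFIG = json.loads("""
{
  "tables": ["orders", "customers"],
  "run_quality_checks": true
}
""")


def make_job(action: str, table: str) -> Callable[[], str]:
    """Build one job; a real job would do the work instead of describing it."""
    def job() -> str:
        return f"{action}:{table}"
    return job


def build_pipeline(config: dict) -> Dict[str, Callable[[], str]]:
    """Generate job definitions in a loop driven by the configuration."""
    pipeline: Dict[str, Callable[[], str]] = {}
    for table in config["tables"]:
        pipeline[f"ingest_{table}"] = make_job("ingest", table)
        # Optional jobs appear only when the configuration asks for them.
        if config.get("run_quality_checks"):
            pipeline[f"check_{table}"] = make_job("check", table)
    return pipeline


pipeline = build_pipeline(EXTERNAL_CONFIG)
job_names = list(pipeline)
print(job_names)
```

Adding a table to the configuration adds its jobs to the pipeline with no change to the code, which is the flexibility the paragraph above describes.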
