Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. By using Docker, we can easily create a reproducible environment for running Airflow, making it simpler to manage dependencies and configurations.
This article provides a step-by-step guide to setting up Apache Airflow using Docker.
Prerequisites
Before you begin, ensure you have the following installed on your machine:
- Docker
- Docker Compose
Step 1: Download a Docker Compose File
Create a directory for your Airflow project and navigate into it. Then, download the compose file using the following curl command:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.1/docker-compose.yaml'
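For example, the full sequence might look like this (the directory name airflow-docker is only an illustration; any name works):
mkdir airflow-docker && cd airflow-docker
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.1/docker-compose.yaml'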
The downloaded docker-compose.yaml file looks like this:
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.10.1}
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    # yamllint disable rule:line-length
    # Use simple http server on scheduler for health checks
    # See https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#scheduler-health-check-server
    # yamllint enable rule:line-length
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
    # WARNING: Use _PIP_ADDITIONAL_REQUIREMENTS option ONLY for a quick checks
    # for other purpose (development, test and especially production usage) build/extend Airflow image.
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    # The following line can be used to set a custom config file, stored in the local config folder
    # If you want to use it, outcomment it and replace airflow.cfg with the name of your config file
    # AIRFLOW_CONFIG: '/opt/airflow/config/airflow.cfg'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  redis:
    # Redis is limited to 7.2-bookworm due to licencing change
    # https://redis.io/blog/redis-adopts-dual-source-available-licensing/
    image: redis:7.2-bookworm
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 30s
      retries: 50
      start_period: 30s
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      # yamllint disable rule:line-length
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}" || celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    environment:
      <<: *airflow-common-env
      # Required to handle warm shutdown of the celery workers properly
      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
      DUMB_INIT_SETSID: "0"
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    volumes:
      - ${AIRFLOW_PROJ_DIR:-.}:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

  # You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up
  # or by explicitly targeted on the command line e.g. docker-compose up flower.
  # See: https://docs.docker.com/compose/profiles/
  flower:
    <<: *airflow-common
    command: celery flower
    profiles:
      - flower
    ports:
      - "5555:5555"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

volumes:
  postgres-db-volume:
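Before moving on, you can ask Compose to validate the file and print the fully resolved configuration (the output depends on the environment variables you have set):
# Validate docker-compose.yaml and print the resolved configuration
docker compose config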
Step 2: Set Up an Environment Variable
Create a .env file in the same directory as docker-compose.yaml and set the host user ID that the Airflow containers should run as (AIRFLOW_GID stays 0):
AIRFLOW_UID=197613
AIRFLOW_GID=0
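On Linux or macOS you would normally use your own user ID rather than the value shown above. A minimal sketch, following the official quick-start instructions, that also creates the folders the compose file mounts:
mkdir -p ./dags ./logs ./plugins ./config
echo -e "AIRFLOW_UID=$(id -u)" > .env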
Step 3: Start Airflow
Now that you have your docker-compose.yaml file and the dags directory set up, you can start Airflow by running the following command:
docker compose up -d
This command starts every service defined in the compose file (webserver, scheduler, worker, triggerer, Postgres, and Redis) in detached mode; the airflow-init service runs first to migrate the database and create the default admin account.
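You can watch the stack come up and confirm that each container reports a healthy status with standard Compose commands (output varies by machine):
# Show the status and health of every service
docker compose ps
# Inspect the one-off initialization job if something looks wrong
docker compose logs airflow-init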
Step 4: Access the Airflow Web Interface
Once the containers are up and running, you can access the Airflow web interface by navigating to http://localhost:8080 in your web browser.
At the login screen, supply “airflow” as both the username and the password (the defaults set by _AIRFLOW_WWW_USER_USERNAME and _AIRFLOW_WWW_USER_PASSWORD in the compose file).
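If you prefer the command line, you can query the same health endpoint that the compose file uses for the webserver healthcheck (the exact JSON fields can vary between Airflow versions):
# Returns JSON describing the metadatabase and scheduler status
curl http://localhost:8080/health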
Step 5: Create Your First DAG
First, create a requirements.txt listing the Python packages your DAGs will need. Only apache-airflow is strictly required for the sample DAG below; the other entries are common data-science packages you may want for later experiments:
pandas==2.1.4
numpy==1.26.2
scikit-learn==1.5.1
apache-airflow
Install Airflow on your host machine so you can parse and verify DAG files locally (the containers already ship with Airflow):
pip3 install apache-airflow
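Note that pip3 only installs packages on your host. If tasks running inside the containers also need extra packages, the quickest option for experiments is the _PIP_ADDITIONAL_REQUIREMENTS variable already present in the compose file; for example, add a line like this to .env (for anything beyond quick tests, the compose file’s own comments recommend building an extended image instead):
_PIP_ADDITIONAL_REQUIREMENTS=pandas==2.1.4 numpy==1.26.2 scikit-learn==1.5.1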
Create a simple DAG file in the dags directory. For example, create a file named sample_dags.py with the following content:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 9, 5),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG(
    'An_example_python_bash_dag',
    default_args=default_args,
    description='A simple DAG with Python and Bash tasks',
    schedule=timedelta(days=1),
)

# Python function to be executed
def print_hello():
    return 'Hello from Python!'

# Define tasks
t1 = PythonOperator(
    task_id='print_hello',
    python_callable=print_hello,
    dag=dag,
)

t2 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

t3 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    dag=dag,
)

t4 = BashOperator(
    task_id='print_end',
    bash_command='echo "DAG has finished!"',
    dag=dag,
)

# Set task dependencies
# t1 runs first, then t2 and t3 in parallel, finally t4
t1 >> [t2, t3] >> t4
The provided script defines a simple Airflow DAG (Directed Acyclic Graph) that consists of Python and Bash tasks. Here’s a breakdown of its functionality:
DAG Definition and Default Arguments:
- Creates a DAG named An_example_python_bash_dag with default arguments specifying the owner, start date, email notifications, retries, and retry delay.
Python Function:
- Defines a Python function print_hello that returns the string “Hello from Python!”.
Tasks:
- t1 (PythonOperator): Executes the print_hello function.
- t2 (BashOperator): Runs the date command to print the current date.
- t3 (BashOperator): Executes sleep 5, pausing the DAG for 5 seconds.
- t4 (BashOperator): Runs the echo "DAG has finished!" command to print a message at the end.
Task Dependencies:
- Defines the order in which the tasks should run:
- t1 runs first.
- t2 and t3 run in parallel after t1 finishes.
- t4 runs after both t2 and t3 complete.
Overall, this DAG demonstrates a basic workflow combining Python and Bash tasks, with defined scheduling and dependencies.
Step 6: Verify Your DAG
Check that the file parses cleanly by running it with Python; if it exits without errors, the DAG definition is syntactically valid:
python3 sample_dags.py
Once the file is in the dags directory, Airflow parses it and creates a DAG object based on the configuration in sample_dags.py. It checks the DAG’s schedule (defined using schedule, or schedule_interval in older Airflow versions). When the schedule is due, or when you trigger the DAG manually from the web interface, the run proceeds to the next step.
Airflow starts by running the first task in the DAG (based on task dependencies). If it’s a PythonOperator, the specified Python function (print_hello in sample_dags.py) is executed. If it’s a BashOperator, the defined Bash command (e.g., date) is run on the worker.
Next, Airflow follows the defined task dependencies. For example, if t1 needs to finish before t2 starts, t2 won’t begin until t1 completes successfully.
Once all tasks in the DAG have finished running successfully (based on success criteria within the tasks), the DAG is marked as complete.
Output from print statements in the Python function or from Bash commands appears in the task logs, which you can view in the web interface. The echo statement in the final task (t4 in sample_dags.py) prints “DAG has finished!”, indicating successful completion.
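You can also verify the DAG from inside the running environment using the Airflow CLI in the scheduler container; a sketch, assuming the service names from the compose file above:
# List registered DAGs and surface any import errors
docker compose exec airflow-scheduler airflow dags list
docker compose exec airflow-scheduler airflow dags list-import-errors
# Run a single task in isolation (airflow tasks test does not record state in the database)
docker compose exec airflow-scheduler airflow tasks test An_example_python_bash_dag print_hello 2024-09-05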
Conclusion
You have successfully set up Apache Airflow using Docker and created your first DAG. This setup allows you to easily manage and scale your workflows. You can now explore more complex DAGs and integrate various operators to suit your needs.