
Running Airflow in a Docker container


Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. By using Docker, we can easily create a reproducible environment for running Airflow, making it simpler to manage dependencies and configurations.

This article provides a step-by-step guide to setting up Apache Airflow using Docker.

Prerequisites

Before you begin, ensure you have the following installed on your machine:

  • Docker Engine
  • Docker Compose v2 (the docker compose plugin)

The compose setup also expects a reasonable amount of resources allocated to Docker; its init container warns if you have less than 4 GB of memory, 2 CPUs, or 10 GB of free disk space.

Step 1: Download a Docker Compose File

Create a directory for your Airflow project and navigate into it. Then, download the compose file using the following curl command:

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.1/docker-compose.yaml'

The downloaded docker-compose.yaml defines every service Airflow needs, including PostgreSQL, Redis, the webserver, scheduler, Celery worker, triggerer, and a one-off initialization job:

x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.10.1}
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    # yamllint disable rule:line-length
    # Use simple http server on scheduler for health checks
    # See https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#scheduler-health-check-server
    # yamllint enable rule:line-length
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
    # WARNING: Use _PIP_ADDITIONAL_REQUIREMENTS option ONLY for a quick checks
    # for other purpose (development, test and especially production usage) build/extend Airflow image.
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    # The following line can be used to set a custom config file, stored in the local config folder
    # If you want to use it, outcomment it and replace airflow.cfg with the name of your config file
    # AIRFLOW_CONFIG: '/opt/airflow/config/airflow.cfg'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  redis:
    # Redis is limited to 7.2-bookworm due to licencing change
    # https://redis.io/blog/redis-adopts-dual-source-available-licensing/
    image: redis:7.2-bookworm
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 30s
      retries: 50
      start_period: 30s
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      # yamllint disable rule:line-length
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}" || celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    environment:
      <<: *airflow-common-env
      # Required to handle warm shutdown of the celery workers properly
      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
      DUMB_INIT_SETSID: "0"
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    volumes:
      - ${AIRFLOW_PROJ_DIR:-.}:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

  # You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up
  # or by explicitly targeted on the command line e.g. docker-compose up flower.
  # See: https://docs.docker.com/compose/profiles/
  flower:
    <<: *airflow-common
    command: celery flower
    profiles:
      - flower
    ports:
      - "5555:5555"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

volumes:
  postgres-db-volume:

Step 2: Set Up Environment Variables

On Linux, the containers should run as your host user so that files created in the mounted dags, logs, and plugins folders are owned by you rather than by root. Create a .env file next to docker-compose.yaml with the following entries (197613 is just an example value; use the output of id -u on your own machine):

AIRFLOW_UID=197613
AIRFLOW_GID=0
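A minimal sketch of this step on Linux or macOS, assuming you are in the project directory that contains docker-compose.yaml:

# Create the folders the compose file mounts into the containers
mkdir -p ./dags ./logs ./plugins ./config

# Write your host user ID into .env so files created by the containers are owned by you
echo "AIRFLOW_UID=$(id -u)" > .env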

Step 3: Start Airflow

Now that the docker-compose.yaml file, the mounted directories, and the .env file are in place, you can start Airflow by running the following command:

docker compose up -d

This command starts all of the services in detached mode. The airflow-init container runs first to migrate the database and create the default admin user; the webserver, scheduler, worker, and triggerer start once it completes successfully.
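To confirm that everything came up, list the services and their health status; the health checks defined in the compose file should report the webserver, scheduler, worker, and triggerer as healthy after a minute or so:

# Show each service and its health status
docker compose ps

# Optionally, inspect the output of the one-off initialization job
docker compose logs airflow-init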

Step 4: Access the Airflow Web Interface

Once the containers are up and running, you can access the Airflow web interface by navigating to http://localhost:8080 in your web browser.

You will be greeted by a login screen. Sign in with “airflow” as both the username and password; these defaults come from _AIRFLOW_WWW_USER_USERNAME and _AIRFLOW_WWW_USER_PASSWORD in the compose file.
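The same credentials also work against Airflow’s REST API, since the compose file enables the basic_auth backend. For example, a quick check that lists the registered DAGs (assuming the default credentials above):

curl --user "airflow:airflow" http://localhost:8080/api/v1/dags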

Step 5: Create Your First DAG

First, create a requirements.txt with the following entries (apache-airflow is the only package the sample DAG below needs; pandas, numpy, and scikit-learn are included as examples of the kind of extras your own DAGs might require):

pandas==2.1.4
numpy==1.26.2
scikit-learn==1.5.1
apache-airflow

Install the modules locally so that you can parse and validate DAG files outside the containers:

pip3 install -r requirements.txt
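Plain pip installs of Airflow can pull in incompatible dependency versions. A safer sketch, following the constraint-file pattern from the official installation docs (adjust the Airflow version to match the image used in the compose file):

AIRFLOW_VERSION=2.10.1
# Derive the local Python major.minor version, e.g. "3.12"
PYTHON_VERSION="$(python3 --version | cut -d ' ' -f 2 | cut -d '.' -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip3 install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"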

Create a simple DAG file in the dags directory. For example, create a file named sample_dags.py with the following content:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 9, 5),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG(
    'An_example_python_bash_dag',
    default_args=default_args,
    description='A simple DAG with Python and Bash tasks',
    schedule=timedelta(days=1),
)

# Python function to be executed
def print_hello():
    return 'Hello from Python!'

# Define tasks
t1 = PythonOperator(
    task_id='print_hello',
    python_callable=print_hello,
    dag=dag,
)

t2 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

t3 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    dag=dag,
)

t4 = BashOperator(
    task_id='print_end',
    bash_command='echo "DAG has finished!"',
    dag=dag,
)

# Set task dependencies
# t1 runs first, then t2 and t3 in parallel, finally t4
t1 >> [t2, t3] >> t4

The provided script defines a simple Airflow DAG (Directed Acyclic Graph) that consists of Python and Bash tasks. Here’s a breakdown of its functionality:

DAG Definition and Default Arguments:

  • Creates a DAG named An_example_python_bash_dag with default arguments specifying the owner, start date, email notifications, retries, and retry delay.

Python Function:

  • Defines a Python function print_hello that returns the string “Hello from Python!”.

Tasks:

  • t1 (PythonOperator): Executes the print_hello function.
  • t2 (BashOperator): Runs the date command to print the current date.
  • t3 (BashOperator): Executes sleep 5, pausing that task for 5 seconds.
  • t4 (BashOperator): Runs the echo "DAG has finished!" command to print a message at the end.

Task Dependencies:

  • Defines the order in which the tasks should run:
    • t1 runs first.
    • t2 and t3 run in parallel after t1 finishes.
    • t4 runs after both t2 and t3 complete.

Overall, this DAG demonstrates a basic workflow combining Python and Bash tasks, with defined scheduling and dependencies.
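The same dependencies can also be declared with the chain helper from airflow.models.baseoperator, which some find more readable for longer pipelines. This is a sketch equivalent to the bit-shift expression in the DAG above:

from airflow.models.baseoperator import chain

# t1 first, then t2 and t3 in parallel, then t4 once both finish
chain(t1, [t2, t3], t4)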

Step 6: Verify Your DAG

Run the DAG file directly with Python to make sure it parses cleanly:

python3 sample_dags.py


Running the file this way does not execute any tasks; it simply imports the module, which makes it a quick check for syntax errors, missing imports, and mistakes in the DAG definition. If the command exits without errors, the file is valid.

Because the dags folder is mounted into the containers, the scheduler picks up sample_dags.py automatically and the DAG appears in the web interface. DAGs are paused at creation in this setup, so unpause it from the UI before it will run on its schedule, or trigger it manually.

Once a run starts, Airflow executes the tasks in the order defined by the dependencies: t1 runs first and calls the print_hello Python function; t2 and t3 then run in parallel, printing the current date and sleeping for 5 seconds respectively; t4 runs last and echoes “DAG has finished!”. Each task’s output, including the final echo message, is available in that task’s logs in the web interface. When every task succeeds, the DAG run is marked as successful.
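You can also drive the DAG from the Airflow CLI inside one of the containers; any Airflow service container will do, and the scheduler is used below as an arbitrary choice:

# Confirm the DAG was picked up from the mounted dags folder
docker compose exec airflow-scheduler airflow dags list

# Unpause it (DAGs are paused at creation in this setup) and trigger a run
docker compose exec airflow-scheduler airflow dags unpause An_example_python_bash_dag
docker compose exec airflow-scheduler airflow dags trigger An_example_python_bash_dag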

Conclusion

You have successfully set up Apache Airflow using Docker and created your first DAG. This setup allows you to easily manage and scale your workflows. You can now explore more complex DAGs and integrate various operators to suit your needs.
