
Docker for Data Science: An Introduction

5 min read

Docker is an open platform designed to simplify developing, shipping, and running containerized applications. By isolating applications from the underlying infrastructure, Docker enables faster and more efficient software delivery.

Docker helps manage infrastructure similarly to how you manage applications, reducing the time between writing code and deploying it to production. Imagine you’re a chef who needs to cook a perfect dish in any kitchen around the world. Each kitchen might have different equipment, ingredients, or conditions that could affect the outcome. Docker is like a high-tech cooking kit that you carry with you. It contains everything you need to prepare your dish—utensils, ingredients, and recipes—so you can cook the same perfect meal no matter which kitchen you’re in.

In the world of software, Docker packages your application and all its dependencies into a single, portable container, ensuring that it runs reliably and consistently across any environment, just like your cooking kit guarantees the same great meal every time. Every piece of software faces the same problem as the chef: it has to run perfectly in whatever "kitchen" it lands in, and that is exactly the problem containers solve.

Docker Architecture

The Docker Platform

Docker packages applications into containers, which are lightweight, isolated environments that contain everything needed to run the application. These containers can run simultaneously on the same host without interfering with each other. Docker’s tools allow you to manage the entire container lifecycle, from development to production. Whether your environment is on-premises, in the cloud, or a hybrid, Docker ensures that your applications run consistently everywhere.

Use cases for Docker

  • Fast, Consistent Application Delivery: Docker standardizes the development environment, making it easier for teams to collaborate. Containers are integral to CI/CD workflows, enabling smooth transitions from development to testing and finally to production.
  • Responsive Deployment and Scaling: Docker containers are portable and lightweight, making it easy to scale applications up or down as needed, whether running on local machines, data centers, or the cloud.
  • Efficient Use of Resources: Docker’s lightweight nature allows you to run more workloads on the same hardware compared to traditional virtual machines, making it ideal for environments that need to maximize resource efficiency.

Docker uses a client-server model in which the Docker client talks to the Docker daemon (dockerd), which builds, runs, and manages your containers. The client and daemon communicate via a REST API. Docker Compose is a companion tool for defining and running multi-container applications; a minimal Compose file is sketched after the list of components below.

  • Docker Daemon: Manages Docker objects like images, containers, networks, and volumes and can interact with other daemons.
  • Docker Client: The main interface for users, which sends commands to the Docker daemon.
  • Docker Desktop: A user-friendly application for Mac, Windows, and Linux that includes all necessary Docker components.
  • Docker Registries: Store Docker images, with Docker Hub being the default public registry. Users can also set up private registries.
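
As promised above, here is a minimal Docker Compose sketch. It is illustrative only; the service names, images, and commands are assumptions, not from the original tutorial:

# docker-compose.yml (hypothetical example)
services:
  app:
    image: python:3.9
    # The command a service runs can be set right here
    command: python -c "print('hello from the app container')"
  db:
    image: postgres:15   # a second container, managed alongside the first
    environment:
      POSTGRES_PASSWORD: example

Running docker compose up starts both containers together with a single command, and docker compose down tears them down again.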

Docker Objects

Docker’s key objects include images, containers, networks, and volumes, which together enable the packaging, running, and scaling of applications in isolated environments.

What are VMs, containers, and images?

A Docker image is essentially a blueprint for creating Docker containers. It’s a static file containing everything needed—application code, libraries, and configurations—to build a container. When a container is started, it’s based on this image, which means the container will have all the necessary components to run the application.

Docker simplifies the process of developing and deploying applications by allowing developers to package their apps and all their dependencies into a container. This container can then be deployed on any system with Docker installed, ensuring consistent behaviour across different environments.

Think of Docker as a way to prepare a pre-configured workstation for a project. Instead of setting up software and files from scratch on each new machine, you create a Docker image with everything included. Then, you can deploy this setup anywhere with Docker, similar to handing over a ready-to-go workstation.

In terms of virtualization, Docker and virtual machines (VMs) both aim to create isolated environments. However, while VMs run entire operating systems and provide strong isolation, Docker containers share the host OS’s kernel, making them lighter and more resource-efficient. This approach allows Docker to quickly and consistently deploy applications across various environments, whereas VMs are more suited for running different operating systems on the same hardware.

Docker for Data Science: A Practical Tutorial

Prerequisites:

  1. Install Docker Desktop: Download and install Docker Desktop for your operating system (Windows, macOS, or Linux) from the official website.
  2. Install Visual Studio Code: Download and install Visual Studio Code from the official website: https://code.visualstudio.com/

Step 1: Set Up Your Project Directory

Create a project directory for your data science project. Inside the project directory, create a Python script (e.g., intro_1.py) that contains your data science code.

Let us suppose you have a file named intro_1.py that contains:
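
The listing below is a hypothetical stand-in, since the original script is not reproduced here; it assumes pandas as the script's only dependency.

# intro_1.py - a small demonstration script (hypothetical example)
import pandas as pd

# Create a tiny in-memory dataset
df = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0],
    "target": [2.1, 3.9, 6.2, 8.1],
})

# Print summary statistics and a simple correlation
print(df.describe())
print("correlation:", df["feature"].corr(df["target"]))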

Please note that this Python script is used purely for demonstration purposes in this data science project.

Step 2: Dockerfile

Next, create a file named Dockerfile in the project directory. It contains the instructions used to build the image for your data science application. Note that the file is called simply Dockerfile, with no extension.
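
A minimal Dockerfile matching the script above might look like this (the python:3.9 base image and the pandas dependency are assumptions, since the original file is not shown):

# Dockerfile (hypothetical sketch)
FROM python:3.9

# Work inside /app in the image
WORKDIR /app

# Install the dependency the demo script needs
RUN pip install --no-cache-dir pandas

# Copy the script into the image
COPY intro_1.py .

# Default command executed when a container starts
CMD ["python", "intro_1.py"]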

Step 3: Build the Docker Image

Open a terminal on Linux or macOS, or a command prompt (cmd) on Windows, and navigate to your project directory. You can then build the image with a single command, shown in the next step.

Step 4: Run the docker build Command

To build the image, use the docker build command:

docker build -t myimagename:1.0 .

This builds an image and stores it on your local machine. The -t flag names the image `myimagename` and gives it the tag 1.0, while the trailing . tells Docker to use the current directory as the build context. To list all local images, run:

docker images

REPOSITORY    TAG      IMAGE ID       CREATED       SIZE
<none>        <none>   85eb1ea6d4be   6 days ago    1.9GB
myimagename   1.0      ff732d925c6e   6 days ago    1.9GB
myimagename   1.1      ff732d925c6e   6 days ago    1.9GB
myimagename   latest   ff732d925c6e   6 days ago    1.9GB
python        3.9      f88f0508dc46   13 days ago   412MB

Once you start a container from this image (covered next), the Python script executes inside the container and you will see its output in the terminal.

Docker container

Returning to the chef analogy, containers are the kitchen kit actually in use. A kit left in the cupboard achieves nothing; the chef has to take it out and cook with it. In the same way, an image does nothing on its own until you start a container from it and give it a task to perform.

The instructions can be baked into the image or provided just in time before starting the container. Let’s do the latter.
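
For example, assuming the image built earlier, you can pass the command to run just in time when starting the container:

docker run --rm myimagename:1.0 python intro_1.py

The command given on the docker run line overrides the image's default, and --rm removes the container once the script finishes.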

Step 5: Integration with Code Editors

VS Code

  1. Install the “Remote – Containers” Extension: This extension allows you to work with development containers directly from VS Code.
  2. Open Project in VS Code: Navigate to your project directory and open it in VS Code.
  3. Reopen in Container: Click the green icon at the bottom-left corner of the VS Code window, and select Remote-Containers: Reopen in Container.
  4. Work and Run: Open your intro_1.py script, start coding, and run it via the integrated terminal.

PyCharm

  1. Install Docker Plugin: Ensure the Docker plugin is installed and enabled in PyCharm.
  2. Configure Docker Interpreter: Go to Settings > Project Interpreter > Add Interpreter > Docker, and configure it to use your Docker container.
  3. Open Project: Open your project, and PyCharm will use the Docker container as the environment.
  4. Work and Run: Edit your intro_1.py script and run it directly in PyCharm.

Other Editors (e.g., Sublime Text, Atom)

  1. Use Docker CLI: Since these editors do not have direct Docker integration, use the Docker CLI for container interaction.
  2. Run Code in Container: Open a terminal, and navigate to your project directory.
  3. Execute in Docker: Use docker exec -it <container_name> python /path/to/intro_1.py to run your script within the container, as sketched below.
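
Note that docker exec needs an already-running container. A quick sketch, with an illustrative container name:

# Start a long-lived container from the image built earlier
docker run -d --name ds_dev myimagename:1.0 sleep infinity

# Run the script inside it (it was copied to /app in the Dockerfile sketch)
docker exec -it ds_dev python /app/intro_1.py

# Remove the container when finished
docker rm -f ds_dev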

Step 6: Set Up a Docker Container Registry (Docker Hub)

  1. Create an account on Docker Hub.
  2. Log in to Docker Hub using the following command in your terminal or command prompt:

docker login

Step 7: Tag and Push the Docker Image to Docker Hub

After building the image, tag it with your Docker Hub username and a repository name. The repository name can be anything you choose, but it is good practice to include a version number as well.

To tag and push, use the following commands. These examples use collabnix12 as the Docker Hub ID; substitute your own.
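
A sketch of the standard tag-then-push pattern, reusing the image name from the build step:

docker tag myimagename:1.0 collabnix12/myimagename:1.0
docker push collabnix12/myimagename:1.0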

Step 8: Pull and Run the Docker Image from Docker Hub

Now, let us demonstrate how to pull the Docker image from Docker Hub on a different machine or another environment:

  1. On the target machine, install Docker and make sure the daemon is running.
  2. Pull the Docker image from Docker Hub using the command below.
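
A sketch, assuming the tag pushed in the previous step:

docker pull collabnix12/myimagename:1.0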

Docker prints the download progress layer by layer; once the pull completes, the image appears in your local docker images list.
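
You can then start a container from the pulled image; again a sketch with the same hypothetical tag:

docker run --rm collabnix12/myimagename:1.0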

This will run the same script that was originally executed within the container, resulting in the same output as previously observed.

Docker: A Data Scientist’s Best Friend

Docker is a game-changer for data scientists. Its ability to package applications and their dependencies into self-contained units, called containers, offers unparalleled advantages. By encapsulating your entire data science environment within a Docker container, you ensure consistent results across different machines. This eliminates the “works on my machine” problem, a common headache in collaborative projects. Moreover, Docker’s lightweight nature surpasses traditional virtual machines in terms of speed and efficiency. Docker Hub, a centralized repository, simplifies project sharing. You can easily distribute your containerized applications to colleagues or the wider community. To delve deeper into Docker’s capabilities and explore a wide range of commands essential for data scientists, check out our comprehensive guide on Docker commands.

All the code used in this tutorial can be accessed at Docker Labs Tutorials.

Have Queries? Join https://launchpass.com/collabnix

Adesoji Alu brings a proven ability to apply machine learning (ML) and data science techniques to solve real-world problems. He has experience working with a variety of cloud platforms, including AWS, Azure, and Google Cloud Platform. He has strong skills in software engineering, data science, and machine learning, and he is passionate about using technology to make a positive impact on the world.