Web scraping has become increasingly popular in recent years as businesses try to stay competitive and relevant in an ever-changing digital landscape. Today, the success of a business depends not only on technical execution but also on serious analytical work and planning. To avoid losing time and money, marketers and businesses need to study the market up front: analyze demand, competitors, the target audience and external factors, so that decisions are based on real market opportunities rather than hunches.
Web scraping can have a positive impact on businesses by providing real-time data to inform decisions. It helps businesses gain insights into customer preferences and behaviours, understand market trends and identify potential competitors. It can also be used to monitor prices, gather product information from various sources, and automate processes such as product listings and email marketing campaigns. Ultimately, web scraping helps businesses increase efficiency and reduce costs while supplying valuable data for future decisions.
Web scrapers are the tools used to carry out the web scraping process. They are typically automated programs written in a scripting language such as Python, or built on a browser automation tool such as Selenium. These tools can scrape data from any website that serves HTML, such as e-commerce sites, social media networks, and search engines. Web scraping is used for a variety of purposes, including collecting data for research and analysis, building customer databases, tracking customer behaviour, collecting market intelligence, and monitoring competitor activity. Among the most popular tools for the job are Python libraries such as Beautiful Soup and browser automation frameworks such as Selenium.
For more difficult use cases, there are other automated HTML scraping and extraction tools, like this example, that can save hours of coding but are not free to use.
Why is Web Scraping so popular?
Web scraping is popular because it provides a way to quickly and easily obtain structured and unstructured data from websites without having to manually copy and paste the data. It is also a cost-effective solution as it eliminates the need to pay for expensive data extraction services. Additionally, web scraping provides a way to quickly access and aggregate data from multiple sources, which is useful for data analysis.
Architecture of Web Scraping
Web Scraping architecture consists of three components:
1. Web Crawler: The web crawler is responsible for gathering the required data from the web page. It visits the web pages, parses the HTML/XML and extracts the required data.
2. Parser: The parser is responsible for interpreting the HTML/XML and extracting the data. It can be used to traverse the DOM tree and extract the required data.
3. Data Store: The data store is responsible for storing the extracted data. It can be a database, a file or any other form of persistent storage.
How Does a Web Scraper Work?
The web crawler visits the web pages and fetches the HTML/XML. The parser then interprets the HTML/XML and extracts the required data. Finally, the data store persists the extracted data.
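To make these three components concrete, here is a minimal sketch (not part of the hubscraper project covered later) that uses the requests library as the crawler, Beautiful Soup as the parser, and a CSV file as the data store. The URL and the tag being extracted are placeholders.

# Minimal crawler / parser / data-store sketch (illustrative only)
import csv

import requests                  # crawler: fetches the web page
from bs4 import BeautifulSoup    # parser: interprets the HTML

# 1. Crawler: fetch the page (example.com is a placeholder URL)
response = requests.get("https://example.com", timeout=10)

# 2. Parser: build a DOM tree and extract the required data
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h1")]

# 3. Data store: persist the extracted data to a CSV file
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])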
Why Python for Web Scraping?
The most common language used for web scraping is Python, a popular general-purpose programming language used for data analysis, web scraping, machine learning, and other applications. It is relatively easy to learn, which makes it a natural choice for scraping, and it offers mature parsing libraries such as BeautifulSoup, lxml, and html5lib that can be used to parse and extract data from web pages. Other languages such as Java, PHP, and Ruby can also be used for web scraping.
Why Selenium and Beautiful Soup for Web Scraping?
Selenium is an automation tool that can be used for web scraping. It allows you to write scripts that can control a web browser, navigate web pages, locate elements on a page, and interact with them. It is useful for automating web scraping tasks.
Selenium is used for web scraping because it can automate web browsers and interact with web pages. It can simulate user actions such as clicking, filling in forms, and navigating between pages, which makes it ideal for scraping websites that require interaction, such as login forms, search boxes, and dynamic content. Because Selenium drives a real browser, it can also execute JavaScript and capture dynamically rendered HTML that never appears in the initial page source.
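As a quick illustration of that interaction model, here is a small, self-contained Selenium 4 sketch (separate from the hubscraper script shown later). The element names and IDs are placeholders rather than selectors from any real site, so treat it as a pattern, not a ready-to-run scraper.

# Illustrative Selenium 4 interaction sketch (placeholder selectors)
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()          # Selenium Manager resolves the driver in Selenium 4.6+
driver.get("https://example.com")    # navigate to a page

# Locate a hypothetical search field and submit button, then interact with them
search_box = driver.find_element(By.NAME, "q")        # placeholder element name
search_box.send_keys("docker")                        # simulate typing
driver.find_element(By.ID, "submit-button").click()   # placeholder id; simulate a click

print(driver.title)                  # the page title after the interaction
driver.quit()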
Beautiful Soup is a Python library that allows developers to parse HTML and XML documents. It builds a parse tree for each parsed page, which can then be traversed to extract data, making it well suited to web scraping. It has a simple syntax, is easy to learn and use, and can parse HTML and XML documents quickly and efficiently. It can also be used to process markup returned by APIs, social media sites, and other web sources, making it a great tool for web scraping, data mining, and web automation.
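The following minimal sketch shows the parse-tree idea on an inline HTML snippet (invented for illustration), so it does not depend on any website:

# Parsing a small inline HTML document with Beautiful Soup
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="repo"><a href="/r/alpine">alpine</a><span>1B+ pulls</span></div>
  <div class="repo"><a href="/r/nginx">nginx</a><span>1B+ pulls</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Traverse the parse tree and extract data from tags and attributes
for div in soup.find_all("div", class_="repo"):
    link = div.find("a")
    print(link.text, link["href"], div.find("span").text)
# alpine /r/alpine 1B+ pulls
# nginx /r/nginx 1B+ pulls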
Why Containerize Web Scraping Tools?
Docker helps you containerize your Web Scraping tools — letting you bundle together your web scraping scripts into self-contained containers that can be deployed anywhere. This makes it easier to scale and manage web scraping tasks. This includes everything needed to ship a cross-platform, multi-architecture web scraping application.
Docker can be used for web scraping in a couple of ways. The first is to use a pre-built image designed for web scraping, which lets you set up and run scraping jobs without manually installing and configuring software and libraries. The second is to create custom images and containers for more complex scraping jobs, so that developers can quickly build and deploy purpose-built scraping containers.
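As an example of the first approach, the Selenium project publishes ready-made browser images such as selenium/standalone-chrome on Docker Hub. A rough sketch of wiring a Python script to such a container (this is not how the hubscraper in this post is packaged) might look like this:

# Sketch: driving a browser that runs inside a pre-built container
# Start the container first, for example:
#   docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Connect to the Selenium server exposed by the container on port 4444
driver = webdriver.Remote(command_executor="http://localhost:4444", options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()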
Running the Web Scraper Tool
In this blog, we will see how to implement Web scraping using Python, Selenium and Docker.
Key Components:
- Python
- Selenium
- Chrome Driver
- Docker
Deploying a Web Scraper app is a fast process. You’ll clone the repository, install the Chrome driver, and then bring up the Python script. Let’s jump in.
Pre-requisites
- Download the Chrome driver for your platform from this link. As I am using an Apple M1 Pro, I downloaded this driver to match my platform.
- Ensure that Python 3.6+ is installed on your system
Clone the repository
git clone https://github.com/collabnix/hubscraper/
Install the required modules
Change the directory to hubscraper and run the command below to install the required Python modules:
pip3 install -r requirements.txt
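The repository's requirements.txt defines the exact dependencies; based on the imports used in the script below, it needs at least packages along these lines (an illustrative guess, not the repository's actual file):

# illustrative requirements.txt contents (versions omitted)
selenium
beautifulsoup4
webdriver-manager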
Navigating the script
First, we need to import the base Python libraries. The Selenium 4 library is used to load the Chrome browser driver, which in turn loads the required HTML pages.
# Base Libraries
import os
import time
import csv
import re

# Selenium 4 for loading the Browser Driver
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
Beautiful Soup is a Python library used for parsing HTML documents.
# BeautifulSoup Library used for Parsing the HTML
from bs4 import BeautifulSoup
The code below uses webdriver-manager to fetch a matching ChromeDriver binary and initialise the Chrome driver.
# Web Driver Manager fetches the matching ChromeDriver binary
from webdriver_manager.chrome import ChromeDriverManager

# Initialising the Chrome Driver
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
The code snippet below selects official Docker images on Docker Hub that are built for Linux. The search query filters on three criteria: Official Docker Images, the Linux operating system, and the arm64 architecture. The core loop loads each page of search results from Docker Hub and renders it in the Chrome browser, and the Beautiful Soup library then parses the loaded HTML page and reads the text from the HTML tags.
As shown in the following example, we are scraping
- Name of the Docker Hub image
- Number of Docker Pulls
- Popularity of the Docker Hub image marked with “stars”
Finally, we write the output to a .csv file.
The code automatically navigates through all the pages and records the entries into the CSV file. It terminates when it hits the last page.
# Image types which have to be filtered from Docker Hub
images = ["official"]
verifiedImages = list()
officialImages = list()

for i in images:
    counter = 1
    while True:
        # Load the Docker Hub HTML page
        driver.get(
            "https://hub.docker.com/search?q=&type=image&image_filter=" + i +
            "&operating_system=linux&architecture=arm64&page=" + str(counter))

        # Delay to allow the contents of the HTML page to load
        time.sleep(2)

        # Parse the rendered page with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, features="html.parser")

        nextCheck = soup.find('p', attrs={'class': 'styles__limitedText___HDSWL'})
        if not isinstance(nextCheck, type(None)):
            break

        results = soup.find(id="searchResults")
        if isinstance(results, type(None)):
            print("Error: results is NoneType")
            break

        imagesList = results.find_all('a', attrs={'data-testid': 'imageSearchResult'})
        if len(imagesList) == 0:
            break  # Stop parsing when no images are found

        for image in imagesList:
            # Getting the name of the image
            image_name = image.find('span', {"class": re.compile('.*MuiTypography-root.*')}).text
            counts = image.find_all('p', {"class": re.compile('.*MuiTypography-root MuiTypography-body1.*')})

            # Download count
            if len(counts) <= 1:
                download_count = "0"
                stars_count = counts[0].text
            else:
                download_count = counts[0].text
                # Stars count
                stars_count = counts[1].text

            # Writing the image name, download count and stars count to the file
            writer.writerow([image_name, download_count, stars_count])

        if len(imagesList) == 0:
            break
        counter += 1

# Closing the CSV file handle
csv_file.close()

# Closing the Chrome driver
driver.quit()
Here’s the complete Python script:
# Base Libraries
import os
import time
import csv
import re

# Selenium 4 for loading the Browser Driver
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Web Driver Manager
from webdriver_manager.chrome import ChromeDriverManager

# BeautifulSoup Library used for Parsing the HTML
from bs4 import BeautifulSoup

# Change the base_dir with your path.
base_dir = '/Users/ajeetraina/Downloads' + os.sep

# Opening the CSV file handle
csv_file = open('results.csv', 'w')

# Create the CSV writer
writer = csv.writer(csv_file)

# Writing the headers for the CSV file
writer.writerow(['Image Name', 'Downloads', 'Stars'])

# Initialising the Chrome Driver
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Image types which have to be filtered from Docker Hub
images = ["official"]
verifiedImages = list()
officialImages = list()

for i in images:
    counter = 1
    while True:
        # Load the Docker Hub HTML page
        driver.get(
            "https://hub.docker.com/search?q=&type=image&image_filter=" + i +
            "&operating_system=linux&architecture=arm64&page=" + str(counter))

        # Delay to allow the contents of the HTML page to load
        time.sleep(2)

        # Parse the rendered page with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, features="html.parser")

        nextCheck = soup.find('p', attrs={'class': 'styles__limitedText___HDSWL'})
        if not isinstance(nextCheck, type(None)):
            break

        results = soup.find(id="searchResults")
        if isinstance(results, type(None)):
            print("Error: results is NoneType")
            break

        imagesList = results.find_all('a', attrs={'data-testid': 'imageSearchResult'})
        if len(imagesList) == 0:
            break  # Stop parsing when no images are found

        for image in imagesList:
            # Getting the name of the image
            image_name = image.find('span', {"class": re.compile('.*MuiTypography-root.*')}).text
            counts = image.find_all('p', {"class": re.compile('.*MuiTypography-root MuiTypography-body1.*')})

            # Download count
            if len(counts) <= 1:
                download_count = "0"
                stars_count = counts[0].text
            else:
                download_count = counts[0].text
                # Stars count
                stars_count = counts[1].text

            # Writing the image name, download count and stars count to the file
            writer.writerow([image_name, download_count, stars_count])

        if len(imagesList) == 0:
            break
        counter += 1

# Closing the CSV file handle
csv_file.close()

# Closing the Chrome driver
driver.quit()
You will need to adjust the following lines to match your own directory structure and driver location:
# Change the base_dir with your path.
base_dir = '/Users/ajeetraina/Downloads' + os.sep

# MS Edge Driver
# driver = webdriver.Edge(service=Service(EdgeChromiumDriverManager().install()))

# Safari Driver

csv_file = open('results.csv', 'w')

# Create the CSV writer
writer = csv.writer(csv_file)
writer.writerow(['Image Name', 'Downloads', 'Stars'])

driver = webdriver.Chrome(executable_path="/Users/ajeetraina/Downloads/chromedriver\ 3")
Execute the script
python3 scraper.py
Once you run this script, it will start scraping the Docker Hub repository for Docker Official Images.
It will start dumping the results into a CSV file as shown:
ajeetraina@Docker-Ajeet-Singh-Rainas-MacBook-Pro hubscraper % cat results.csv
Image Name,Downloads,Stars
alpine,1B+,9.5K
busybox,1B+,2.8K
nginx,1B+,10K+
ubuntu,1B+,10K+
python,1B+,8.2K
postgres,1B+,10K+
redis,1B+,10K+
httpd,1B+,4.3K
node,1B+,10K+
mongo,1B+,9.3K
mysql,1B+,10K+
memcached,1B+,2
You can open the CSV file in a spreadsheet or viewer to see the results in a much prettier format.
Containerising the Web Scraping Tool
Docker helps you containerize your web scraping — letting you bundle together your complete hub scraper application, runtime, configuration, and OS-level dependencies. This includes everything needed to ship a cross-platform, multi-architecture web application.
Docker uses a Dockerfile to create each image’s layers. Each layer stores important changes stemming from your base image’s standard configuration. Let’s create an empty Dockerfile in the root of our project repository.
Let us try to create a complete Dockerfile that includes all our dependencies in a single text file:
FROM --platform=linux/amd64 python:3.9

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

WORKDIR /app
COPY . /app

# Install Google Chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'
RUN apt-get update -qqy --no-install-recommends && apt-get install -qqy --no-install-recommends google-chrome-stable

# Install ChromeDriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# Set display port to avoid crash
ENV DISPLAY=:99

RUN pip install --upgrade pip
RUN pip install -r /app/requirements.txt
In the above Dockerfile, we picked Python 3.9 as the base image and chose /app as the working directory inside the container. Next, we instructed Docker to install headless Google Chrome and ChromeDriver. Finally, we included requirements.txt to install the required Python modules.
Building the Hub Scraper Docker Image
git clone https://github.com/collabnix/hubscraper/
docker build -t ajeetraina/hubscraper .
Running the Hubscraper in a Docker container
docker run --platform=linux/amd64 -it -w /app -v $(pwd):/app ajeetraina/hubscraper bash
root@960e8b9fa2c2:/usr/workspace# python scraper.py
[WDM] - Downloading: 100%|███████████████████████████████████████████████████████████████| 6.96M/6.96M [00:00<00:00, 8.90MB/s]
Conclusion
Web scraping is a powerful tool for businesses, allowing them to gather data from different sources and analyze it to gain valuable insights. By using tools like Hubscraper, developers can identify popular, verified Docker images from a trusted source rather than picking up random images from Docker Hub.