Web scraping has become increasingly popular in recent years as businesses try to stay competitive and relevant in an ever-changing digital landscape. Today, the success of a business depends not only on technical execution but also on serious analytical work and planning. To avoid losing time and money, marketers and businesses need to study the market up front: analyze demand, competitors, the target audience and external factors, so that decisions are based on real market opportunities rather than hunches.
Web scraping can have a positive impact on businesses by providing real-time data to inform decisions. It helps businesses gain insights into customer preferences and behaviours, understand market trends and identify potential competitors. It can also be used to monitor prices, gather product information from various sources, and automate processes such as product listings and email marketing campaigns. Ultimately, web scraping helps businesses increase efficiency and reduce costs while supplying valuable data for future decisions.
Web scrapers are the tools used to carry out the web scraping process. They are typically automated programs written in a scripting language such as Python, or built on a browser automation tool such as Selenium. These tools can scrape data from any website that serves HTML, such as e-commerce sites, social media networks, and search engines. Web scraping is used for a variety of purposes, including collecting data for research and analysis, building customer databases, tracking customer behaviour, collecting market intelligence, and monitoring competitor activity. Among the most popular tools for the job are Python libraries such as Beautiful Soup and browser automation frameworks such as Selenium.
For more difficult use cases, there are other automated HTML scraping and extraction tools, like this example, that can save hours of coding but are not free to use.
Why is Web Scraping so popular?
Web scraping is popular because it provides a way to quickly and easily obtain structured and unstructured data from websites without having to manually copy and paste the data. It is also a cost-effective solution as it eliminates the need to pay for expensive data extraction services. Additionally, web scraping provides a way to quickly access and aggregate data from multiple sources, which is useful for data analysis.
Architecture of Web Scraping
Web Scraping architecture consists of three components:
1. Web Crawler: The web crawler is responsible for gathering the required data from the web page. It visits the web pages, parses the HTML/XML and extracts the required data.
2. Parser: The parser is responsible for interpreting the HTML/XML and extracting the data. It can be used to traverse the DOM tree and extract the required data.
3. Data Store: The data store is responsible for storing the extracted data. It can be a database, a file or any other form of persistent storage.
How Does a Web Scraper Work?
The web crawler visits the web pages and fetches the HTML/XML. The parser then interprets the HTML/XML and extracts the required data. Finally, the data store persists the extracted data.
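To make these three components concrete, here is a minimal sketch (not part of the hubscraper project covered later) that uses the requests library as the crawler, Beautiful Soup as the parser, and a CSV file as the data store. The URL and the tag being extracted are placeholders.

# Minimal crawler / parser / data-store sketch (illustrative only)
import csv

import requests                  # crawler: fetches the web page
from bs4 import BeautifulSoup    # parser: interprets the HTML

# 1. Crawler: fetch the page (example.com is a placeholder URL)
response = requests.get("https://example.com", timeout=10)

# 2. Parser: build a DOM tree and extract the required data
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h1")]

# 3. Data store: persist the extracted data to a CSV file
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])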
Why Python for Web Scraping?
The most common language used for web scraping is Python, a popular general-purpose programming language used for data analysis, web scraping, machine learning, and other applications. It is relatively easy to learn, which makes it a natural choice for scraping, and it offers mature parsing libraries such as BeautifulSoup, lxml, and html5lib that can be used to parse and extract data from web pages. Other languages such as Java, PHP, and Ruby can also be used for web scraping.
Why Selenium and Beautiful Soup for Web Scraping?
Selenium is an automation tool that can be used for web scraping. It allows you to write scripts that can control a web browser, navigate web pages, locate elements on a page, and interact with them. It is useful for automating web scraping tasks.
Selenium is used for web scraping because it can automate web browsers and interact with web pages. It can simulate user actions such as clicking, filling in forms, and navigating between pages, which makes it ideal for scraping websites that require interaction, such as login forms, search boxes, and dynamic content. Because Selenium drives a real browser, it can also execute JavaScript and capture dynamically rendered HTML that never appears in the initial page source.
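As a quick illustration of that interaction model, here is a small, self-contained Selenium 4 sketch (separate from the hubscraper script shown later). The element names and IDs are placeholders rather than selectors from any real site, so treat it as a pattern, not a ready-to-run scraper.

# Illustrative Selenium 4 interaction sketch (placeholder selectors)
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()          # Selenium Manager resolves the driver in Selenium 4.6+
driver.get("https://example.com")    # navigate to a page

# Locate a hypothetical search field and submit button, then interact with them
search_box = driver.find_element(By.NAME, "q")        # placeholder element name
search_box.send_keys("docker")                        # simulate typing
driver.find_element(By.ID, "submit-button").click()   # placeholder id; simulate a click

print(driver.title)                  # the page title after the interaction
driver.quit()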
Beautiful Soup is a Python library that allows developers to parse HTML and XML documents. It builds a parse tree for each parsed page, which can then be traversed to extract data, making it well suited to web scraping. It has a simple syntax, is easy to learn and use, and can parse HTML and XML documents quickly and efficiently. It can also be used to process markup returned by APIs, social media sites, and other web sources, making it a great tool for web scraping, data mining, and web automation.
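The following minimal sketch shows the parse-tree idea on an inline HTML snippet (invented for illustration), so it does not depend on any website:

# Parsing a small inline HTML document with Beautiful Soup
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="repo"><a href="/r/alpine">alpine</a><span>1B+ pulls</span></div>
  <div class="repo"><a href="/r/nginx">nginx</a><span>1B+ pulls</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Traverse the parse tree and extract data from tags and attributes
for div in soup.find_all("div", class_="repo"):
    link = div.find("a")
    print(link.text, link["href"], div.find("span").text)
# alpine /r/alpine 1B+ pulls
# nginx /r/nginx 1B+ pulls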
Why Containerize Web Scraping Tools?
Docker helps you containerize your Web Scraping tools — letting you bundle together your web scraping scripts into self-contained containers that can be deployed anywhere. This makes it easier to scale and manage web scraping tasks. This includes everything needed to ship a cross-platform, multi-architecture web scraping application.
Docker can be used for web scraping in a couple of ways. The first is to use a pre-built image designed for web scraping, which lets you set up and run scraping jobs without manually installing and configuring software and libraries. The second is to create custom images and containers for more complex scraping jobs, so that developers can quickly build and deploy purpose-built scraping containers.
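As an example of the first approach, the Selenium project publishes ready-made browser images such as selenium/standalone-chrome on Docker Hub. A rough sketch of wiring a Python script to such a container (this is not how the hubscraper in this post is packaged) might look like this:

# Sketch: driving a browser that runs inside a pre-built container
# Start the container first, for example:
#   docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Connect to the Selenium server exposed by the container on port 4444
driver = webdriver.Remote(command_executor="http://localhost:4444", options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()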
Running the Web Scraper Tool
In this blog, we will see how to implement Web scraping using Python, Selenium and Docker.
Key Components:
- Python
- Selenium
- Chrome Driver
- Docker
Deploying a Web Scraper app is a fast process. You’ll clone the repository, install the Chrome driver, and then bring up the Python script. Let’s jump in.
Pre-requisites
- Download the Chrome driver for your platform from this link. As I am using an Apple M1 Pro, I downloaded this driver to match my platform.
- Ensure that Python 3.6+ is installed on your system
Clone the repository
git clone https://github.com/collabnix/hubscraper/
Install the required modules
Change the directory to hubscraper and run the command below to install the required Python modules:
pip3 install -r requirements.txt
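The repository's requirements.txt defines the exact dependencies; based on the imports used in the script below, it needs at least packages along these lines (an illustrative guess, not the repository's actual file):

# illustrative requirements.txt contents (versions omitted)
selenium
beautifulsoup4
webdriver-manager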
Navigating the script
First, we need to import the base Python libraries. The Selenium 4 library is used to load the Chrome browser driver, which in turn loads the required HTML pages.
# Base Libraries
import os
import time
import csv
import re

# Selenium 4 for loading the Browser Driver
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
Beautiful Soup is a Python library used for parsing HTML documents.
# BeautifulSoup Library used for Parsing the HTML
from bs4 import BeautifulSoup
The code below uses webdriver-manager to fetch a matching ChromeDriver binary and initialise the Chrome driver.
# Web Driver Manager fetches the matching ChromeDriver binary
from webdriver_manager.chrome import ChromeDriverManager

# Initialising the Chrome Driver
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
The code snippet below selects official Docker images on Docker Hub that are built for Linux. The search query filters on three criteria: Official Docker Images, the Linux operating system, and the arm64 architecture. The core loop loads each page of search results from Docker Hub and renders it in the Chrome browser, and the Beautiful Soup library then parses the loaded HTML page and reads the text from the HTML tags.
As shown in the following example, we are scraping
- Name of the Docker Hub image
- Number of Docker Pulls
- Popularity of the Docker Hub image marked with “stars”
Finally, we write the output to a .csv file.
The code automatically navigates through all the pages and records the entries into the CSV file. It terminates when it hits the last page.
# Image types which have to be filtered from Docker Hub
images = ["official"]
verifiedImages = list()
officialImages = list()

for i in images:
    counter = 1
    while True:
        # Load the Docker Hub HTML page
        driver.get(
            "https://hub.docker.com/search?q=&type=image&image_filter=" + i +
            "&operating_system=linux&architecture=arm64&page=" + str(counter))

        # Delay to allow the contents of the HTML page to load
        time.sleep(2)

        # Parse the rendered page with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, features="html.parser")

        nextCheck = soup.find('p', attrs={'class': 'styles__limitedText___HDSWL'})
        if not isinstance(nextCheck, type(None)):
            break

        results = soup.find(id="searchResults")
        if isinstance(results, type(None)):
            print("Error: results is NoneType")
            break

        imagesList = results.find_all('a', attrs={'data-testid': 'imageSearchResult'})
        if len(imagesList) == 0:
            break  # Stop parsing when no images are found

        for image in imagesList:
            # Getting the name of the image
            image_name = image.find('span', {"class": re.compile('.*MuiTypography-root.*')}).text
            counts = image.find_all('p', {"class": re.compile('.*MuiTypography-root MuiTypography-body1.*')})

            # Download count
            if len(counts) <= 1:
                download_count = "0"
                stars_count = counts[0].text
            else:
                download_count = counts[0].text
                # Stars count
                stars_count = counts[1].text

            # Writing the image name, download count and stars count to the file
            writer.writerow([image_name, download_count, stars_count])

        if len(imagesList) == 0:
            break
        counter += 1

# Closing the CSV file handle
csv_file.close()

# Closing the Chrome driver
driver.quit()
Here’s the complete Python script:
# Base Libraries
import os
import time
import csv
import re

# Selenium 4 for loading the Browser Driver
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Web Driver Manager
from webdriver_manager.chrome import ChromeDriverManager

# BeautifulSoup Library used for Parsing the HTML
from bs4 import BeautifulSoup

# Change the base_dir with your path.
base_dir = '/Users/ajeetraina/Downloads' + os.sep

# Opening the CSV file handle
csv_file = open('results.csv', 'w')

# Create the CSV writer
writer = csv.writer(csv_file)

# Writing the headers for the CSV file
writer.writerow(['Image Name', 'Downloads', 'Stars'])

# Initialising the Chrome Driver
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Image types which have to be filtered from Docker Hub
images = ["official"]
verifiedImages = list()
officialImages = list()

for i in images:
    counter = 1
    while True:
        # Load the Docker Hub HTML page
        driver.get(
            "https://hub.docker.com/search?q=&type=image&image_filter=" + i +
            "&operating_system=linux&architecture=arm64&page=" + str(counter))

        # Delay to allow the contents of the HTML page to load
        time.sleep(2)

        # Parse the rendered page with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, features="html.parser")

        nextCheck = soup.find('p', attrs={'class': 'styles__limitedText___HDSWL'})
        if not isinstance(nextCheck, type(None)):
            break

        results = soup.find(id="searchResults")
        if isinstance(results, type(None)):
            print("Error: results is NoneType")
            break

        imagesList = results.find_all('a', attrs={'data-testid': 'imageSearchResult'})
        if len(imagesList) == 0:
            break  # Stop parsing when no images are found

        for image in imagesList:
            # Getting the name of the image
            image_name = image.find('span', {"class": re.compile('.*MuiTypography-root.*')}).text
            counts = image.find_all('p', {"class": re.compile('.*MuiTypography-root MuiTypography-body1.*')})

            # Download count
            if len(counts) <= 1:
                download_count = "0"
                stars_count = counts[0].text
            else:
                download_count = counts[0].text
                # Stars count
                stars_count = counts[1].text

            # Writing the image name, download count and stars count to the file
            writer.writerow([image_name, download_count, stars_count])

        if len(imagesList) == 0:
            break
        counter += 1

# Closing the CSV file handle
csv_file.close()

# Closing the Chrome driver
driver.quit()
You will need to adjust the following lines to match your own directory structure and driver location:
# Change the base_dir with your path.
base_dir = '/Users/ajeetraina/Downloads' + os.sep

# MS Edge Driver
# driver = webdriver.Edge(service=Service(EdgeChromiumDriverManager().install()))

# Safari Driver

csv_file = open('results.csv', 'w')

# Create the CSV writer
writer = csv.writer(csv_file)
writer.writerow(['Image Name', 'Downloads', 'Stars'])

driver = webdriver.Chrome(executable_path="/Users/ajeetraina/Downloads/chromedriver\ 3")
Execute the script
python3 scraper.py
Once you run this script, it will start scraping the Docker Hub repository for Docker Official Images.
It will start dumping the results into a CSV file as shown:
ajeetraina@Docker-Ajeet-Singh-Rainas-MacBook-Pro hubscraper % cat results.csv
Image Name,Downloads,Stars
alpine,1B+,9.5K
busybox,1B+,2.8K
nginx,1B+,10K+
ubuntu,1B+,10K+
python,1B+,8.2K
postgres,1B+,10K+
redis,1B+,10K+
httpd,1B+,4.3K
node,1B+,10K+
mongo,1B+,9.3K
mysql,1B+,10K+
memcached,1B+,2
You can open the CSV file in a spreadsheet or viewer to see the results in a much prettier format.
Containerising the Web Scraping Tool
Docker helps you containerize your web scraping — letting you bundle together your complete hub scraper application, runtime, configuration, and OS-level dependencies. This includes everything needed to ship a cross-platform, multi-architecture web application.
Docker uses a Dockerfile to create each image’s layers. Each layer stores important changes stemming from your base image’s standard configuration. Let’s create an empty Dockerfile in the root of our project repository.
Let us try to create a complete Dockerfile that includes all our dependencies in a single text file:
FROM --platform=linux/amd64 python:3.9

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

WORKDIR /app
COPY . /app

# Install Google Chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'
RUN apt-get update -qqy --no-install-recommends && apt-get install -qqy --no-install-recommends google-chrome-stable

# Install ChromeDriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/

# Set display port to avoid crash
ENV DISPLAY=:99

RUN pip install --upgrade pip
RUN pip install -r /app/requirements.txt
In the above Dockerfile, we picked Python 3.9 as the base image and chose /app as the working directory inside the container. Next, we instructed Docker to install headless Google Chrome and ChromeDriver. Finally, we included requirements.txt to install the required Python modules.
Building the Hub Scraper Docker Image
git clone https://github.com/collabnix/hubscraper/
docker build -t ajeetraina/hubscraper .
Running the Hubscraper in a Docker container
docker run --platform=linux/amd64 -it -w /app -v $(pwd):/app ajeetraina/hubscraper bash
root@960e8b9fa2c2:/usr/workspace# python scraper.py
[WDM] - Downloading: 100%|███████████████████████████████████████████████████████████████| 6.96M/6.96M [00:00<00:00, 8.90MB/s]
Conclusion
Web scraping is a powerful tool for businesses, allowing them to gather data from different sources and analyze it to gain valuable insights. By using tools like Hubscraper, developers can identify popular, verified Docker images from a trusted source rather than picking up random images from Docker Hub.