Building a Dockerized Python Web Scraper with MySQL Integration
Introduction
Web scraping is a powerful technique for extracting data from websites, but efficiently storing and managing that data requires a robust setup. In this project, we built a Python web scraper that extracts quotes from a website and saves them in a MySQL database, all running inside Docker containers. This approach ensures seamless container communication using a custom Docker network, making deployment easier and more consistent across different environments.
Project Overview
Our setup consists of:
Web Scraper: A Python script using requests and BeautifulSoup to extract quotes.
MySQL Database: A dedicated MySQL container to store the extracted data.
Dockerized Setup: Both the scraper and database run in separate containers, communicating securely through a custom bridge network.
Environment Variables: Database credentials are passed in at run time rather than hardcoded in the script.
Scalability & Portability: With Docker, the same environment can be replicated anywhere, ensuring consistency.
Step 1: Set Up MySQL in Docker
First, we need a MySQL database to store the scraped data. We will run a MySQL container using Docker.
1.1 Create a Custom Docker Network
docker network create --driver bridge scraper_net
This user-defined bridge network lets our containers reach each other by name while staying isolated from unrelated containers.
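You can confirm the network was created with docker network inspect scraper_net.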
1.2 Run the MySQL Container
docker run -d \
--name mysql_container \
-e MYSQL_USER=faizan \
-e MYSQL_PASSWORD=redhat \
-e MYSQL_ROOT_PASSWORD=redhat \
-e MYSQL_DATABASE=scraper_db \
--network scraper_net \
-p 3306:3306 \
mysql:latest
This command runs a MySQL container, creates a database called scraper_db, and creates the user faizan with access to it. Publishing port 3306 is optional here, since the scraper reaches MySQL over the Docker network, but it lets you connect from tools on the host.
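MySQL takes a few seconds to initialize on first start; running docker logs mysql_container and waiting for the "ready for connections" message tells you when it is safe to connect.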
Step 2: Create a Table in MySQL
Now that MySQL is running, we need to create a table to store the scraped quotes.
2.1 Connect to the MySQL Container
docker exec -it mysql_container mysql -uroot -predhat
2.2 Create the quotes Table
CREATE TABLE quotes (
id INT AUTO_INCREMENT PRIMARY KEY,
text TEXT NOT NULL,
author VARCHAR(255) NOT NULL
);
id: Auto-incremented primary key.
text: Stores the quote text.
author: Stores the name of the quote's author.
Step 3: Dockerize the Web Scraper
Next, we containerize the Python web scraper to ensure it runs consistently in any environment.
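The article doesn't show scraper.py itself, so here is a minimal sketch of what it might look like. The target site (quotes.toscrape.com), its CSS selectors, and the mysql-connector-python driver are assumptions here; adapt them to your actual source.

# scraper.py — a minimal sketch, not the exact script from this project.
# Assumed: target site quotes.toscrape.com, driver mysql-connector-python.
import os

import mysql.connector
import requests
from bs4 import BeautifulSoup

# Read connection details from the environment (set via docker run -e ...)
db = mysql.connector.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)
cursor = db.cursor()

# Fetch the page and parse each quote block
response = requests.get("https://quotes.toscrape.com")
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    # Parameterized query keeps scraped content from breaking the SQL
    cursor.execute(
        "INSERT INTO quotes (text, author) VALUES (%s, %s)",
        (text, author),
    )

db.commit()
cursor.close()
db.close()
print("Scraping complete.")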
3.1 Create a Dockerfile for the Scraper
# Start from an official Python base image
FROM python:3.9
# Work inside /app in the container
WORKDIR /app
# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy the scraper and run it when the container starts
COPY scraper.py .
CMD ["python", "scraper.py"]
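The Dockerfile also copies a requirements.txt, which isn't shown in the article. For the sketch above it would need these packages (pinning versions is up to you):

requests
beautifulsoup4
mysql-connector-python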
3.2 Build the Scraper Docker Image
docker build -t scraper-app:v1 .
This command creates a Docker image named scraper-app:v1.
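Running docker images afterward should list scraper-app with the v1 tag.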
Step 4: Run the Scraper Container
Now, we launch the scraper container, ensuring it can communicate with the MySQL database.
docker run -d \
-e DB_HOST=mysql_container \
-e DB_USER=faizan \
-e DB_PASSWORD=redhat \
-e DB_NAME=scraper_db \
--network scraper_net \
scraper-app:v1
Explanation:
-e DB_HOST=mysql_container: Uses the MySQL container's name as the hostname; Docker's embedded DNS resolves it on the custom network.
-e DB_USER, -e DB_PASSWORD, -e DB_NAME: Match the credentials set when the MySQL container was started.
--network scraper_net: Connects both containers to the same network.
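One practical wrinkle: if the two containers are started back to back, MySQL may still be initializing when the scraper first tries to connect. A small retry loop in scraper.py absorbs that delay; a sketch, again assuming mysql-connector-python:

import os
import time

import mysql.connector
from mysql.connector import Error

def connect_with_retry(attempts=10, delay=3):
    # Retry because the MySQL container needs a moment to initialize
    for _ in range(attempts):
        try:
            return mysql.connector.connect(
                host=os.environ["DB_HOST"],
                user=os.environ["DB_USER"],
                password=os.environ["DB_PASSWORD"],
                database=os.environ["DB_NAME"],
            )
        except Error:
            time.sleep(delay)
    raise RuntimeError("Could not reach MySQL after several attempts")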
Step 5: Verify Data Insertion
After running the scraper, we check if the data was successfully inserted.
5.1 Connect to the MySQL Database
docker exec -it mysql_container mysql -uroot -predhat
5.2 Check the Stored Data
SHOW DATABASES;
USE scraper_db;
SHOW TABLES;
SELECT * FROM quotes;
This will display the extracted quotes stored in the MySQL database.
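If the table comes back empty, check the scraper's output with docker logs (docker ps -a lists the exited container), since the scraper container stops once the script finishes.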
Conclusion
By leveraging Docker, we created an efficient, portable, and scalable web scraping setup. This project demonstrates how containerization improves consistency across development and production environments. The use of a custom Docker network ensures secure communication between services, and environment variables enhance security by avoiding hardcoded credentials.
💡 Whether you're automating data collection or building a larger data pipeline, this approach lays a strong foundation for managing web-scraped data efficiently.
#WebScraping #Docker #Python #MySQL #DataEngineering