Building a Dockerized Python Web Scraper with MySQL Integration
Introduction
Web scraping is a powerful technique for extracting data from websites, but efficiently storing and managing that data requires a robust setup. In this project, we built a Python web scraper that extracts quotes from a website and saves them in a MySQL database, all running inside Docker containers. This approach ensures seamless container communication using a custom Docker network, making deployment easier and more consistent across different environments.
Project Overview
Our setup consists of:
Web Scraper: A Python script using requests and BeautifulSoup to extract quotes.
MySQL Database: A dedicated MySQL container to store the extracted data.
Dockerized Setup: Both the scraper and database run in separate containers, communicating securely through a custom bridge network.
Environment Variables: Database credentials are passed in at run time rather than hardcoded in the script.
Scalability & Portability: With Docker, the same environment can be replicated anywhere, ensuring consistency.
Step 1: Set Up MySQL in Docker
First, we need a MySQL database to store the scraped data. We will run a MySQL container using Docker.
1.1 Create a Custom Docker Network
docker network create --driver bridge scraper_net
This user-defined bridge network lets our containers reach each other by name while staying isolated from unrelated containers.
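You can confirm the network was created with docker network inspect scraper_net.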
1.2 Run the MySQL Container
docker run -d \
--name mysql_container \
-e MYSQL_USER=faizan \
-e MYSQL_PASSWORD=redhat \
-e MYSQL_ROOT_PASSWORD=redhat \
-e MYSQL_DATABASE=scraper_db \
--network scraper_net \
-p 3306:3306 \
mysql:latest
This command runs a MySQL container, creates a database called scraper_db, and creates the user faizan with access to it. Publishing port 3306 is optional here, since the scraper reaches MySQL over the Docker network, but it lets you connect from tools on the host.
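MySQL takes a few seconds to initialize on first start; running docker logs mysql_container and waiting for the "ready for connections" message tells you when it is safe to connect.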
Step 2: Create a Table in MySQL
Now that MySQL is running, we need to create a table to store the scraped quotes.
2.1 Connect to the MySQL Container
docker exec -it mysql_container mysql -uroot -predhat
2.2 Create the quotes Table
CREATE TABLE quotes (
id INT AUTO_INCREMENT PRIMARY KEY,
text TEXT NOT NULL,
author VARCHAR(255) NOT NULL
);
id: Auto-incremented primary key.
text: Stores the quote text.
author: Stores the name of the quote's author.
Step 3: Dockerize the Web Scraper
Next, we containerize the Python web scraper to ensure it runs consistently in any environment.
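The article doesn't show scraper.py itself, so here is a minimal sketch of what it might look like. The target site (quotes.toscrape.com), its CSS selectors, and the mysql-connector-python driver are assumptions here; adapt them to your actual source.

# scraper.py — a minimal sketch, not the exact script from this project.
# Assumed: target site quotes.toscrape.com, driver mysql-connector-python.
import os

import mysql.connector
import requests
from bs4 import BeautifulSoup

# Read connection details from the environment (set via docker run -e ...)
db = mysql.connector.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
)
cursor = db.cursor()

# Fetch the page and parse each quote block
response = requests.get("https://quotes.toscrape.com")
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    # Parameterized query keeps scraped content from breaking the SQL
    cursor.execute(
        "INSERT INTO quotes (text, author) VALUES (%s, %s)",
        (text, author),
    )

db.commit()
cursor.close()
db.close()
print("Scraping complete.")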
3.1 Create a Dockerfile for the Scraper
# Start from an official Python base image
FROM python:3.9
# Work inside /app in the container
WORKDIR /app
# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install -r requirements.txt
# Copy the scraper and run it when the container starts
COPY scraper.py .
CMD ["python", "scraper.py"]
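The Dockerfile also copies a requirements.txt, which isn't shown in the article. For the sketch above it would need these packages (pinning versions is up to you):

requests
beautifulsoup4
mysql-connector-python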
3.2 Build the Scraper Docker Image
docker build -t scraper-app:v1 .
This command creates a Docker image named scraper-app:v1.
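Running docker images afterward should list scraper-app with the v1 tag.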
Step 4: Run the Scraper Container
Now, we launch the scraper container, ensuring it can communicate with the MySQL database.
docker run -d \
-e DB_HOST=mysql_container \
-e DB_USER=faizan \
-e DB_PASSWORD=redhat \
-e DB_NAME=scraper_db \
--network scraper_net \
scraper-app:v1
Explanation:
-e DB_HOST=mysql_container: Uses the MySQL container's name as the hostname; Docker's embedded DNS resolves it on the custom network.
-e DB_USER, -e DB_PASSWORD, -e DB_NAME: Match the credentials set when the MySQL container was started.
--network scraper_net: Connects both containers to the same network.
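One practical wrinkle: if the two containers are started back to back, MySQL may still be initializing when the scraper first tries to connect. A small retry loop in scraper.py absorbs that delay; a sketch, again assuming mysql-connector-python:

import os
import time

import mysql.connector
from mysql.connector import Error

def connect_with_retry(attempts=10, delay=3):
    # Retry because the MySQL container needs a moment to initialize
    for _ in range(attempts):
        try:
            return mysql.connector.connect(
                host=os.environ["DB_HOST"],
                user=os.environ["DB_USER"],
                password=os.environ["DB_PASSWORD"],
                database=os.environ["DB_NAME"],
            )
        except Error:
            time.sleep(delay)
    raise RuntimeError("Could not reach MySQL after several attempts")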
Step 5: Verify Data Insertion
After running the scraper, we check if the data was successfully inserted.
5.1 Connect to the MySQL Database
docker exec -it mysql_container mysql -uroot -predhat
5.2 Check the Stored Data
SHOW DATABASES;
USE scraper_db;
SHOW TABLES;
SELECT * FROM quotes;
This will display the extracted quotes stored in the MySQL database.
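If the table comes back empty, check the scraper's output with docker logs (docker ps -a lists the exited container), since the scraper container stops once the script finishes.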
Conclusion
By leveraging Docker, we created an efficient, portable, and scalable web scraping setup. This project demonstrates how containerization improves consistency across development and production environments. The use of a custom Docker network ensures secure communication between services, and environment variables enhance security by avoiding hardcoded credentials.
💡 Whether you're automating data collection or building a larger data pipeline, this approach lays a strong foundation for managing web-scraped data efficiently.
#WebScraping #Docker #Python #MySQL #DataEngineering