
An In-Depth Guide to Web Scraping with Python and Beautiful Soup

Learn how to scrape data using Python and Beautiful Soup through a practical tutorial that covers data extraction, manipulation, and visualization.

May 2, 2025 • 15 min read


Web scraping extracts large volumes of data from websites, letting data scientists, engineers, and analysts gather the information their projects need.

With Python and the Beautiful Soup library, you can extract, manipulate, and visualize that data efficiently. This guide walks through the process using a specific dataset: the results of a 10K race held in Hillsboro, OR.

Understanding Python for Web Scraping

Web scraping with Python can be broken down into several key components:

Data Extraction: Using Beautiful Soup to gather data from HTML.
Data Manipulation: Cleaning and editing the extracted data with Pandas.
Data Visualization: Employing Matplotlib to illustrate the results.

Setting Up Your Environment

To follow this tutorial, ensure you have Jupyter Notebook installed. The Anaconda distribution is a convenient way to install Jupyter along with the packages used in this tutorial.

Start your Jupyter Notebook and import the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Next, import the modules for web scraping:

from urllib.request import urlopen
from bs4 import BeautifulSoup

Extracting Data from HTML

To scrape data, specify the URL of the desired dataset and call the urlopen function to obtain the HTML source:

url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
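The call above waits indefinitely if the server is slow and raises a raw URLError on failure. As a minimal sketch, assuming you want a timeout and a clearer error message, the request could be wrapped in a small helper. The name fetch_html, the 10-second default timeout, and the User-Agent string are illustrative assumptions, not part of the original tutorial.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at `url` (hypothetical helper)."""
    # Assumption: some servers reject requests without a User-Agent header
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"})
    try:
        # The context manager closes the connection once the body is read
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except URLError as err:
        # HTTPError is a subclass of URLError, so HTTP status errors land here too
        raise RuntimeError(f"Could not fetch {url}: {err}") from err
```

The returned bytes can be passed to BeautifulSoup in place of the urlopen result, for example html = fetch_html(url).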

Construct a Beautiful Soup object from the received HTML. The 'lxml' parser used here requires the lxml package to be installed:

soup = BeautifulSoup(html, 'lxml')

This soup object allows access to various HTML elements, including the page title:

title = soup.title
print(title)
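Printing soup.title shows the whole tag, markup included. The sketch below, run on a made-up HTML string rather than the race page, shows how the text alone might be pulled out. It uses Python's built-in "html.parser" so that it runs without lxml; sample_html and the other variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page
sample_html = "<html><head><title>Race results</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python; the 'lxml' parser needs the lxml package
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title               # the whole <title> element
title_text = sample_soup.title.get_text()   # only the text inside it

print(title_tag)
print(title_text)
```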

Extracting Relevant Data

To gather data, use the find_all method, which returns every tag with a given name. For instance, to extract all hyperlinks:

soup.find_all('a')
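find_all returns tag objects, not URLs. As a rough sketch on a made-up snippet (the links below are invented for illustration), the link targets can be read from each tag's href attribute:

```python
from bs4 import BeautifulSoup

# Made-up markup: two links with targets and one anchor without an href
sample_html = """
<a href="/results/2017GPTR10K">10K results</a>
<a href="mailto:timing@example.com">Contact</a>
<a>No target</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# .get("href") returns None when the attribute is missing
links = [a.get("href") for a in sample_soup.find_all("a")]

# href=True restricts the search to anchors that have an href attribute
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(links)
print(hrefs)
```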

To pull the table rows instead, search for tr tags and print the first ten to inspect them:

rows = soup.find_all('tr')
print(rows[:10])

Transforming Data into a DataFrame

The objective is to convert the table into a Pandas DataFrame, which simplifies data manipulation. To do this, extract and clean the data from the rows:

list_rows = []
for row in rows:
    cells = row.find_all('td')
    # get_text(strip=True) returns each cell's text without tags or surrounding whitespace
    clean = [cell.get_text(strip=True) for cell in cells]
    list_rows.append(clean)
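To make the result of this loop concrete, here is the same idea applied to a tiny made-up table (the runner names and times are invented). Note that the header row contains th cells only, so it produces an empty list, which is one reason the data needs cleaning afterwards.

```python
from bs4 import BeautifulSoup

# Made-up table with one header row and two result rows
sample_html = """
<table>
  <tr><th>Place</th><th>Name</th><th>Chip Time</th></tr>
  <tr><td>1</td><td> RUNNER A </td><td>36:21</td></tr>
  <tr><td>2</td><td>RUNNER B</td><td>1:02:15</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for row in sample_soup.find_all("tr"):
    cells = row.find_all("td")  # the header row has no <td> cells
    sample_rows.append([cell.get_text(strip=True) for cell in cells])

print(sample_rows)
```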

With the raw data stored in list_rows, convert this list into a DataFrame:

df = pd.DataFrame(list_rows)
df.head()

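Because each entry in list_rows is already a list of cell texts, the DataFrame has one column per table cell, so no further splitting is needed. Rows that are not race results, such as the header row (which has no td cells) and any shorter rows, contain missing values. Drop them:

df = df.dropna(how='any').reset_index(drop=True)

Cleaning the Data

The dataset often requires further cleaning. Strip stray whitespace from every cell and take the column names from the table's header cells. This assumes the page has a single header row whose th cells line up with the result columns:

df = df.apply(lambda col: col.str.strip())
df.columns = [th.get_text(strip=True) for th in soup.find_all('th')]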
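To see the path from rows to a labelled DataFrame end to end, here is a small self-contained sketch on a made-up table. The data and variable names are invented, and it assumes a single table whose th cells match the data columns, which may not hold for every page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table standing in for the race results
sample_html = """
<table>
  <tr><th>Place</th><th>Name</th><th>Chip Time</th></tr>
  <tr><td>1</td><td> RUNNER A </td><td>36:21</td></tr>
  <tr><td>2</td><td>RUNNER B</td><td>1:02:15</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# One list of cell texts per <tr>; the header row yields an empty list
sample_rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in sample_soup.find_all("tr")
]

sample_df = pd.DataFrame(sample_rows)

# Short rows are padded with missing values, so dropna removes the header row
sample_df = sample_df.dropna(how="any").reset_index(drop=True)

# Column names come from the <th> cells
sample_df.columns = [th.get_text(strip=True) for th in sample_soup.find_all("th")]

print(sample_df)
```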

Data Analysis and Visualization

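Now that the data is usable, analyze it to draw conclusions. For instance, what was the average finishing time for the runners? Convert each chip time to minutes, then take the mean:

df.columns = df.columns.str.strip()  # tolerate stray whitespace in headers, e.g. ' Chip Time'
parts = df['Chip Time'].str.split(':').apply(lambda p: [0] * (3 - len(p)) + [int(v) for v in p])  # pad "mm:ss" with 0 hours
df['Runner_mins'] = parts.apply(lambda p: p[0] * 60 + p[1] + p[2] / 60)  # hours, minutes, seconds -> minutes
df['Runner_mins'].mean()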
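The conversion can also be written as a plain function, which is easier to test in isolation. The sketch below is one possible version. The function name and the sample times are made up, and it assumes times appear as either "h:mm:ss" or "mm:ss".

```python
from statistics import mean


def chip_time_to_minutes(chip_time):
    """Convert an 'h:mm:ss' or 'mm:ss' string to minutes as a float."""
    parts = [int(p) for p in chip_time.strip().split(":")]
    if len(parts) == 2:
        parts = [0] + parts  # no hours field: treat as 0 hours
    if len(parts) != 3:
        raise ValueError(f"Unexpected time format: {chip_time!r}")
    hours, minutes, seconds = parts
    return hours * 60 + minutes + seconds / 60


# Made-up finishing times, not taken from the race data
sample_times = ["36:21", "1:02:15", "0:58:30"]
sample_minutes = [chip_time_to_minutes(t) for t in sample_times]
average_minutes = mean(sample_minutes)

print(sample_minutes)
print(round(average_minutes, 2))
```

With pandas, the same function could be applied to a column of chip-time strings using .apply(chip_time_to_minutes).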

Create a box plot to show how finishing times are distributed:

df.boxplot(column='Runner_mins')
plt.grid(True, axis='y')
plt.ylabel('Chip Time (minutes)')
plt.show()
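A box plot draws its box from the first to the third quartile, with a line at the median. As a quick sketch on made-up values (not the race data), the same numbers can be computed directly with pandas:

```python
import pandas as pd

# Made-up finishing times in minutes
sample_minutes = pd.Series([40.0, 50.0, 60.0, 70.0, 80.0], name="Runner_mins")

q1 = sample_minutes.quantile(0.25)   # lower edge of the box
median = sample_minutes.median()     # line inside the box
q3 = sample_minutes.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                        # box height (interquartile range)

print(q1, median, q3, iqr)
```

Calling describe() on the column reports the same quartiles together with the count, mean, minimum, and maximum.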

Conclusion

This tutorial walked through web scraping with Python and Beautiful Soup, from fetching a page to analyzing 10K race results. The same steps (extract, clean, visualize) carry over to other pages that publish data in HTML tables.

Chirag Jakhariya

Founder and CEO

Founder and tech expert with over 10 years of experience, helping global clients solve complex problems, build scalable solutions, and deliver high-quality software and data systems.
