In today’s data-driven world, manual data collection is outdated. With an internet-connected computer on nearly every desk, the web has become a huge source of data, and web scraping is the more efficient, time-saving way to collect it. For web scraping in Python, one popular library is Beautiful Soup. In this post, I will walk you through installing Beautiful Soup and getting started with web scraping.

Before installing and working with Beautiful Soup, let’s find out why you should go for it.

What is Beautiful Soup?

Let’s pretend you’re researching “COVID’s impact on people’s health” and have found a few web pages containing relevant data. But what if they don’t offer a single-click download option for that data? This is where Beautiful Soup comes into play.

Beautiful Soup is a Python library for pulling data out of targeted sites. It is designed for retrieving data from HTML and XML pages.

Leonard Richardson introduced Beautiful Soup for scraping the web in 2004, and he continues to contribute to the project to this day. He announces each new Beautiful Soup release on his Twitter account.

Beautiful Soup 4, the current major version, runs on Python 3. Support for Python 2 ended with release 4.9.3, so older Python 2 projects need to stay on that version or earlier.

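Websites often use CAPTCHAs and other bot-detection measures to protect their data from automated tools. Beautiful Soup only parses markup, so any workaround happens in the HTTP client that fetches the page: setting a browser-like ‘User-Agent’ header on the request, or using a CAPTCHA-solving API, can make the scraper look like a regular browser to the detection tool.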
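
As a minimal sketch of the header change mentioned above: the header is set on the HTTP request, not in Beautiful Soup. The example uses the standard library’s urllib.request.Request. The URL and the User-Agent string are illustrative assumptions, and nothing is sent over the network until the request is passed to urlopen.

```python
from urllib.request import Request

# Hypothetical values for illustration; substitute your own target and UA string.
url = "https://example.com/"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0"}

# Build the request object; no network traffic happens at this point.
request = Request(url, headers=headers)

# urllib stores header names via str.capitalize(), hence "User-agent" here.
print(request.get_header("User-agent"))

# To actually fetch and parse the page you would then do something like:
#   from urllib.request import urlopen
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(urlopen(request), "html.parser")
```

Building the request separately from sending it makes it easy to inspect the headers before any page is fetched.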

However, if you don’t have time to explore Beautiful Soup, or want the scraping done efficiently and with ease, check out this web scraping API, where you just provide a URL and get the data back.

If you are already a programmer, using Beautiful Soup for scraping won’t be daunting, thanks to its straightforward syntax for navigating web pages and extracting the desired data based on conditions you specify. At the same time, it’s beginner-friendly too.

Although Beautiful Soup is not built for advanced scraping, it works well for extracting data from files written in markup languages.

Clear and detailed documentation is another point in Beautiful Soup’s favor.

Let’s find an easy way to get Beautiful Soup onto your machine.

How to Install Beautiful Soup for Web Scraping?

Pip, a Python package manager first released in 2008, is now the standard tool among developers for installing Python libraries and their dependencies.

Pip is included by default with recent Python versions. Thus, if you have a recent Python version installed on your system, you are good to go.

Open the command prompt and type the following pip command to install Beautiful Soup.

pip install beautifulsoup4

You will see something similar to the following screenshot on your display.

[Screenshot: output of the pip install beautifulsoup4 command]

Make sure pip is updated to the latest version to avoid common errors.

The command to update pip is:

pip install --upgrade pip

We’ve successfully covered half the ground in this post.

Now you have Beautiful Soup installed on your machine, so let’s dive into how to use it for web scraping.

How to Import and Work with Beautiful Soup for Web Scraping?

Add the following statement to your Python script to import Beautiful Soup.

from bs4 import BeautifulSoup

Beautiful Soup is now available in your Python file for scraping.
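
To check that the import works before touching a live site, here is a small sketch that parses an HTML string held in memory. The markup and variable names are made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Confirm which version of the library was installed.
print(bs4.__version__)

# A tiny, made-up HTML document used only to confirm the library works.
html_doc = (
    "<html><head><title>Test page</title></head>"
    "<body><p>Hello, soup!</p></body></html>"
)

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.text)  # text of the <title> tag
print(soup.p.text)      # text of the first <p> tag
```

If both lines print without an ImportError, the installation is ready for the examples that follow.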

Let’s look at a code example to learn how to extract the desired data with Beautiful Soup.

We can tell Beautiful Soup to look for specific HTML tags in the source website and scrape the data present in those tags.

In this piece, I will be using marketwatch.com, which publishes real-time stock prices of various companies. Let’s pull some data from this website to get familiar with the Beautiful Soup library.

Import “urlopen” from “urllib.request” to load the web page from its URL, along with the “requests” package, which sends HTTP requests and returns the responses. This first example only needs “urlopen”; “requests” is used in the IMDb example later in the post.

from urllib.request import urlopen
import requests

Save the web page link in a variable so that you can easily access it later.

url = 'https://www.marketwatch.com/investing/stock/amzn'

Next, pass the URL to the “urlopen” function from the “urllib” library and save the returned page in a variable.

page = urlopen(url)

Create a Beautiful Soup object and parse the desired web page using “html.parser”.

soup_obj = BeautifulSoup(page, 'html.parser')

Now the parsed HTML of the targeted web page is stored in the ‘soup_obj’ variable.

Before proceeding, let’s look at the page’s source code to learn more about its HTML and tags.

Right-click anywhere on the web page. You will find an Inspect option, as displayed below.

[Screenshot: browser context menu with the Inspect option]

Click Inspect to view the source code.

[Screenshot: page source shown in the browser’s developer tools]

In the source code, you can find tags, classes, and more specific information about every element visible on the website’s interface.

The “find” method in Beautiful Soup searches for the requested HTML tag and retrieves its data. To do this, we pass the tag name and class name of the element we want to the method.

For instance, “Amazon.com Inc.” shown on the web page sits in an ‘h1’ tag with the class name ‘company__name’. We can pass this information to the ‘find’ method to extract the relevant HTML snippet into a variable.

name = soup_obj.find('h1', attrs={'class': 'company__name'})

Let’s print the HTML snippet stored in the variable “name”, followed by just its text.

print(name)

print(name.text)

[Screenshot: the printed h1 snippet and the extracted company name]

You can see the extracted data printed on the screen.
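
Live pages change their markup over time, so “find” may return None when the tag or class is no longer there. The sketch below runs the same lookup against a simplified, made-up HTML string (not MarketWatch’s real markup) and guards against a missing element. The helper name extract_company_name is hypothetical.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page; the real site's markup may differ.
sample_html = """
<div class="header">
  <h1 class="company__name">Amazon.com Inc.</h1>
  <span class="price">n/a</span>
</div>
"""

def extract_company_name(html):
    """Return the text of <h1 class="company__name">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h1", attrs={"class": "company__name"})
    # find() returns None when nothing matches, so check before using .text
    return tag.text.strip() if tag is not None else None

print(extract_company_name(sample_html))
print(extract_company_name("<p>no heading here</p>"))
```

Checking for None first avoids an AttributeError when the element is missing.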

Web Scrape the IMDb website

Many of us look for movie ratings on IMDb’s site before watching a movie. This demonstration produces a list of top-rated movies and helps you get used to Beautiful Soup for web scraping.

Step 1: Import the Beautiful Soup and requests libraries.

from bs4 import BeautifulSoup
import requests

Step 2: Use the “requests” package to get the HTML page from the URL, and store the response in a variable called ‘url’ for easy access in the code.

url = requests.get('https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating')

Step 3: In the following code snippet, we parse the HTML text of the response to create a Beautiful Soup object.

soup_obj = BeautifulSoup(url.text, 'html.parser')

The variable “soup_obj” now contains the entire parsed HTML of the desired web page, as in the following image.

[Screenshot: contents of soup_obj]

Let’s inspect the source code of the web page to find the HTML of the data we want to scrape.

Hover the cursor over the web page element you want to extract. Next, right-click on it and choose the Inspect option to view the source code of that specific element. The following visuals show this.

[Screenshot: inspecting a movie entry on the IMDb page]

The class ‘lister-list’ contains all the top-rated movie data as sub-divisions in successive div tags.

In each movie card’s HTML, under the class ‘lister-item mode-advanced’, there is an ‘h3’ tag that stores the movie name, rank, and year of release, as highlighted in the image below.

[Screenshot: the h3 tag highlighted in a movie card’s source]

Note: The “find” method in Beautiful Soup returns the first tag that matches the given input. In contrast, the “find_all” method returns all the tags that match.
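
The difference is easiest to see side by side. This is a small sketch on made-up markup; the tag and class names are assumptions chosen for the illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup with three matching tags.
html_doc = """
<ul>
  <li class="movie">First</li>
  <li class="movie">Second</li>
  <li class="movie">Third</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find() stops at the first match and returns a single tag.
first_match = soup.find("li", attrs={"class": "movie"})

# find_all() keeps going and returns every match in document order.
all_matches = soup.find_all("li", attrs={"class": "movie"})

print(first_match.text)
print([tag.text for tag in all_matches])
```

Because “find_all” returns a list-like result, you can loop over it or index into it like any other Python list.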

Step 4: You can use the “find” and “find_all” methods to save the HTML of every movie’s name, rank, and year in a list variable.

top_movies = soup_obj.find('div', attrs={'class': 'lister-list'}).find_all('h3')

Step 5: Loop through the list of movies stored in the variable “top_movies” and extract the name, rank, and year of each movie as text from its HTML using the code below.

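for movie in top_movies:
    # The <a> tag inside each <h3> holds the movie title
    movie_name = movie.a.text
    # The first <span> holds the rank, e.g. "1."
    rank = movie.span.text.rstrip('.')
    year = movie.find('span', attrs={'class': 'lister-item-year text-muted unbold'})
    # Strip the surrounding parentheses from the year, e.g. "(1994)"
    year = year.text.strip('()')
    print(movie_name + " ", rank + " ", year + " ")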

In the output screenshot, you can see the list of movies with their name, rank, and year of release.

[Screenshot: printed list of movie names, ranks, and years]
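
The loop from Step 5 can also be tried offline. The sketch below runs the same lookups against a simplified, made-up HTML string that follows the class names described above, and collects the results in a list instead of printing them. The movie titles, links, and the helper name parse_movies are hypothetical, and the live IMDb markup may differ.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup following the class names described in the text.
sample_html = """
<div class="lister-list">
  <div class="lister-item mode-advanced">
    <h3 class="lister-item-header">
      <span class="lister-item-index unbold text-primary">1.</span>
      <a href="/title/tt0000001/">Example Movie One</a>
      <span class="lister-item-year text-muted unbold">(1994)</span>
    </h3>
  </div>
  <div class="lister-item mode-advanced">
    <h3 class="lister-item-header">
      <span class="lister-item-index unbold text-primary">2.</span>
      <a href="/title/tt0000002/">Example Movie Two</a>
      <span class="lister-item-year text-muted unbold">(1972)</span>
    </h3>
  </div>
</div>
"""

def parse_movies(html):
    """Return a list of {'rank', 'name', 'year'} dicts, one per <h3> found."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": "lister-list"})
    if container is None:
        # The expected container is missing, e.g. the page layout changed.
        return []
    movies = []
    for header in container.find_all("h3"):
        year_tag = header.find(
            "span", attrs={"class": "lister-item-year text-muted unbold"}
        )
        movies.append({
            "rank": header.span.text.rstrip("."),
            "name": header.a.text,
            "year": year_tag.text.strip("()") if year_tag is not None else "",
        })
    return movies

for movie in parse_movies(sample_html):
    print(movie["rank"], movie["name"], movie["year"])
```

Returning a list of dictionaries rather than printing directly keeps the scraped values available for later steps, such as saving them to a file.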

You can move the printed data into an Excel sheet with some Python code and use it for your analysis.
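
One simple route to Excel is a CSV file, which Excel opens directly. This is a minimal sketch using Python’s built-in csv module. The rows, the helper name movies_to_csv, and the file name in the comment are made up for illustration.

```python
import csv
import io

# Made-up rows in the shape produced by the scraping loop (rank, name, year).
movies = [
    {"rank": "1", "name": "Example Movie One", "year": "1994"},
    {"rank": "2", "name": "Example Movie, Two", "year": "1972"},
]

def movies_to_csv(rows):
    """Serialize a list of movie dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["rank", "name", "year"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = movies_to_csv(movies)
print(csv_text)

# To save it for Excel, write the text to a file, for example:
#   with open("top_movies.csv", "w", newline="", encoding="utf-8") as f:
#       f.write(csv_text)
```

The csv module quotes values that contain commas, so a title such as the second one above still lands in a single column.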

Final Words

This post guided you through installing Beautiful Soup for web scraping, and the scraping examples should help you get started with it.

If you want to know more about web scraping using Python, I highly recommend checking out this comprehensive guide.