
Let's Build a Web Scraper with Python & BeautifulSoup4

thecodingpie . . 17 min read

Ever wondered how to automate the process of scraping a website, collecting data, and exporting it to a useful format like CSV? If you are doing data science/machine learning then you may have been in this situation several times.

That's why I wrote this tutorial. In it, you will learn all about web scraping by building a Python script that scrapes a movie website, fetches useful information, and finally exports the collected data to a CSV (Comma Separated Values) file.

And the good thing is that you don't have to do web scraping by hand anymore!

Sounds interesting? Then let's jump right in.

You can download the finished code from my GitHub repo - Web Scraper.

What is Web Scraping

Web Scraping is the process of collecting useful/needed information from any website on the internet. Like any other process, there are two ways to do it: one is to manually copy-paste the needed data from the website. And the other way, the way of legends, is to smartly automate it!

I hope you want to be in the second category. But there are some challenges in doing so...

The first challenge is that not all website owners like the process of scraping their website. So if you are going to do web scraping on a website, then please make sure that they allow you to do so.
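One common first check is the site's robots.txt file. Here is a minimal sketch using Python's built-in urllib.robotparser. The robots.txt rules and the example.com URLs below are made up for illustration, they are not taken from any real site. Also keep in mind that robots.txt is only one signal; a site's terms of service may say more.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, used here only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network request is made here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) tells us whether the rules allow fetching that URL
allowed_public = parser.can_fetch("*", "https://example.com/top_movies/")
allowed_private = parser.can_fetch("*", "https://example.com/private/data.html")

print(allowed_public)   # True
print(allowed_private)  # False
```

For a real website you would call parser.set_url() with the address of its robots.txt and then parser.read() instead of parse().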

The second challenge is that not all websites are alike. A script you wrote for one website usually can't be used on another, because the two sites are structured differently. You may not even be able to use the same script on the same website a few days later, because web developers change their layouts often, sometimes specifically to make scraping harder.

Web Scraping Alternative

If there are so many challenges, is there any alternative? Yes, there is: an API (Application Programming Interface). When a website offers one, its API is usually the most stable and most clearly permitted way of getting data from that site.

Many websites provide an API through which you can get the data you want in a friendlier format like JSON or XML. But there's a catch: you may have to pay. There may be a free plan, but it is often limited, and heavier use typically requires a paid plan.

That's where the concept of web scraping comes in handy!

What We are Going to Build

We will learn all about Web Scraping using Python and BeautifulSoup4 by building a real-world project.

I don't want to give you a headache by teaching you how to scrape an ever-changing dynamic website. So, I built a static movie website, named TopMovies, which contains a list of the top 25 IMDb movies. This is the website we are going to scrape. So before moving forward please examine it first - TopMovies.

As you can see, the TopMovies website has a list of the top 25 IMDb movies. Each movie has the following details:

  • title
  • genre
  • rating
  • length - movie runtime
  • year
  • budget
  • gross
  • img

We are going to scrape those details from the TopMovies website. Then, after getting all those details, we are going to export them to a useful format like CSV so that you can later import the data into your data science project and run your predictions on it.

In short, we are going to scrape the following website:

Figure 1: We will scrape this website and...

Export the scraped data into a CSV file like this:

Figure 2: ...generate a CSV file like this

And later, if you want, you can read it as a Pandas DataFrame inside a Jupyter notebook, like below, and do all your analysis and predictions easily!

Figure 3: Generated CSV file read as a Pandas DataFrame inside a Jupyter notebook.

If you are not a Data Science/Machine Learning person, then don't worry about this last image, just forget it!

By doing this simple project, you will learn the skills to build a web scraper for almost any website you wish to scrape. You will also learn how to generate CSV files using Python.

How are we going to do that?

It's very straightforward:

  • First, we will fetch the web page we want using the requests library.
  • Then, we will turn that page into a BeautifulSoup object with the help of a suitable parser like lxml. This will make the scraping process a lot easier.
  • Then we will scrape all the needed data from that soup object.
  • Finally, we will export all the scraped data into a file called top25.csv with the help of the csv module.

That's it!

Prerequisites

  • You should be comfortable with Python 3.
  • You should have a decent understanding of HTML and a little bit of CSS.
  • You should have Python 3.4 or a higher version installed on your computer. You can read this post to learn how to set up Python 3 on any operating system - https://realpython.com/installing-python/
  • You should have venv installed.
  • Finally, you will need a modern code editor like Visual Studio Code. You can download Visual Studio Code for your operating system from here - https://code.visualstudio.com/download.

With these things set up, now let's get started.

Initial Setups

  • First, create a folder named web_scraper anywhere on your computer.
  • Then open it inside Visual Studio Code.
  • Now let's create a new virtual environment using venv and activate it. To do that:

    • From within your text editor, Open Terminal > New Terminal.
    • Then type:
💾
python3 -m venv venv

This command will create a virtual environment named venv for us.

  • To activate it, if you are on Windows, type the following:
💾
venv\Scripts\activate.bat
  • If you are on Linux/Mac, then type this instead:
💾
source venv/bin/activate

Now you should see something like this:

Figure 4: The (venv) prefix at the start of the prompt means that you have successfully activated the virtual environment

  • Finally, create a new file named scraper.py directly inside the web_scraper folder:

Now you should have a file structure similar to this:

Figure 5: The final file structure

Note: If you are still confused about how to set up a virtual environment, then read this Quick Guide.

That's it, you have done your initial setup. Now it's time for the fun stuff!

Getting the WebPage

Let me ask you a question. What would you do first in order to scrape a website manually, I mean, to copy-paste the data from a website?

First, you need to open up the web browser and type the URL, right? Because in order to get data from a web page, you must load it first. And that's exactly what we are going to do here.

First, we need to load the web page from the website. But we are not going to use the web browser at all. Instead, we are going to use a Python module called requests.

So, type the following command in the terminal and install the requests module:

💾
pip install requests

Then, in the scraper.py file type:

💾
import requests

# fetch the web page
page = requests.get('https://the-coding-pie.github.io/top_movies/')
  • This code fetches the URL https://the-coding-pie.github.io/top_movies/ and stores the whole Response object in the page variable. But we want the web page content itself, right?
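In practice a request can fail or hang, so it can be worth failing loudly instead of scraping an error page. Here is a minimal sketch of that idea. The helper name fetch_page and the 10-second timeout are my own assumptions for illustration; they are not part of the original script.

```python
import requests

URL = 'https://the-coding-pie.github.io/top_movies/'


def fetch_page(url, timeout=10):
    """Fetch a URL and return the Response, raising on HTTP errors.

    fetch_page and the default timeout are hypothetical choices for this sketch.
    """
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response

# usage (makes a real network request, so it is left commented out here):
# page = fetch_page(URL)
```

If you wanted to use it, you would replace the plain requests.get(...) line with page = fetch_page(URL). The rest of the tutorial works the same either way.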

In order to get the web page content, access the content attribute of the page variable, i.e. page.content:

Figure 6: page.content holds the page's content, i.e. the whole HTML.

If you add print(page.content) at the bottom of the file:

Figure 7: print(page.content) line at the bottom, below the old code

then you will see the HTML of the web page like this:

Figure 8: HTML structure of the web page in "bytes" format in the terminal

But there's a problem there. If you take a look at the type by adding print(type(page.content)):

Figure 9: print(type(page.content)) line at the very bottom of the file

Then you can see it is of the type bytes:

Figure 10: type "bytes" printed in the terminal

Raw bytes on their own aren't much use to us here! They are just one long blob of data with no structure, so to pick out individual elements we first need to turn them into something we can navigate.
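To make the difference concrete, here is a small sketch with a hard-coded bytes value standing in for page.content (the HTML snippet is made up). Decoding gives you a normal string, which requests also exposes as page.text, but a string still has no structure you can search by element.

```python
# a made-up stand-in for page.content
raw = b'<h3> The Godfather </h3>'

print(type(raw))    # <class 'bytes'>

# decode the bytes into a normal Python string
html = raw.decode('utf-8')

print(type(html))   # <class 'str'>
print(html)         # <h3> The Godfather </h3>
```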

What should we do now?

BeautifulSoup for the Rescue!

Beautiful Soup is a Python library for pulling data out of HTML and XML documents like the one above.

BeautifulSoup, with the help of a parser, transforms a complex HTML document into a complex tree of Python objects.

Note: I don't want to go in-depth about how BeautifulSoup works in this tutorial. If you are curious to know that, then please use this link - Official Beautiful Soup Docs.

In short, with the help of BeautifulSoup and a parser, we can easily navigate, search, scrape, and modify HTML/XML content like the bytes above by treating everything in it as a Python object!

So, let's install the BeautifulSoup4 and a parser like lxml. Type the following commands in the terminal window:

💾
pip install beautifulsoup4
💾
pip install lxml
  • lxml is the parser recommended in the Beautiful Soup documentation, mainly for its speed. There are alternatives like html5lib and Python's built-in html.parser, but we are going to stick with the lxml parser.

Now type the following code in the scraper.py file at the very top:

💾
from bs4 import BeautifulSoup
  • Here we are importing the BeautifulSoup class from the bs4 package (that's the package pip install beautifulsoup4 installed).

Then below this line - page = requests.get('https://the-coding-pie.github.io/top_movies/'), type the following:

💾
# turn page into a BeautifulSoup object
soup = BeautifulSoup(page.content, 'lxml')
  • Here, we are converting our page.content which is of type bytes to a BeautifulSoup object.
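Here is a minimal sketch of what this conversion gives you, using a small hand-written HTML snippet instead of the real page (the snippet is an assumption for illustration). It uses Python's built-in 'html.parser' so that it runs even without lxml; with lxml installed you can pass 'lxml' exactly as we did above.

```python
from bs4 import BeautifulSoup

# hand-written stand-in for page.content
sample = b'<html><body><h3> Movie One </h3></body></html>'

sample_soup = BeautifulSoup(sample, 'html.parser')

# the soup is a tree, so tags can be reached like attributes
first_h3 = sample_soup.h3
title_text = first_h3.string.strip()

print(type(sample_soup))
print(first_h3)
print(title_text)
```

Notice that we never decoded the bytes ourselves. BeautifulSoup accepts bytes and works out the encoding for us.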

Let's Scrape the Page

Now we have that whole web page in our hands (in a useful format). One of the two jobs left is to scrape it. So let's do that. We need to scrape the following things from the web page:

  • titles - all the movie titles
  • genres - all the genres
  • ratings - all the movie ratings
  • lengths - all the movie runtimes
  • years - all the years the movie was released
  • budgets - all the budgets
  • grosses - all the gross information
  • img_urls - src URLs of all the images.

So let's do them one by one.

First, let's scrape all the titles:

The title we are looking for is inside an HTML element called <h3>. Wait, how do I know that?

It's simple:

  • Open the URL you want to scrape inside a browser.

In our case, open this TopMovies website. Then:

  • Inspect the data with the help of Developer Tools on your browser. In my case, I am using Chrome, so

    • right-click on the element you want to scrape,
    • And click on Inspect

Figure 11: Right-click on the element and click Inspect

  • Now a new box will pop up like this:

Figure 12: This is the Chrome Developer Tools

  • And see, I told you that the title we are looking for is inside an HTML element called <h3>:

Figure 13: The title data located inside the <h3> </h3> element

Now we know where our data is sitting, let's scrape it. Type this below the last line you typed:

💾
""" first, scraping using find_all() method """
# scrape all the titles
titles = [] 
for h3 in soup.find_all('h3'):
  titles.append(h3.string.strip())

Here's the line by line explanation of the above code:

  • We are going to store all our titles inside a list called titles, and that's what we are doing in the first line: creating that titles list.
  • Then we use the find_all() method of the soup object we created earlier to find all the h3 elements. find_all() returns a list-like ResultSet, so we can loop through all the h3 elements it found.
  • In the last line, inside the for loop, we take the h3.string value. Why the string value? Because each h3 as a whole looks like this: <h3> Title inside </h3>. But we only need the text inside it, right? So we use h3.string. After taking it, we .strip() it to remove the leading and trailing whitespace. Then we .append() it to the titles list.
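One caveat worth knowing: .string only returns text when the tag has a single child. If an <h3> contained a nested tag next to its text, .string would be None and calling .strip() on it would fail. The sketch below shows the difference on a made-up snippet and uses get_text() as a more forgiving alternative. The TopMovies page doesn't need this, so treat it as a side note.

```python
from bs4 import BeautifulSoup

# made-up snippet: the second <h3> has a nested <em> tag
sample = """
<h3> Movie One </h3>
<h3> Movie <em>Two</em> </h3>
"""
sample_soup = BeautifulSoup(sample, 'html.parser')

for h3 in sample_soup.find_all('h3'):
    # .string is None for the second <h3>, because it has more than one child
    print(repr(h3.string), '|', h3.get_text(' ', strip=True))

# get_text(' ', strip=True) strips each piece of text and joins them with a space
sample_titles = [h3.get_text(' ', strip=True) for h3 in sample_soup.find_all('h3')]
print(sample_titles)
```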

Whoo, there's a lot going on in there. So please take a moment to understand it. This is the exact step we are going to repeat from here on to scrape all the other data.

The soup object we created from the HTML bytes gives us many built-in methods for navigating and scraping the HTML tree. find_all() is just one of them. We will explore a few more as we go down the road.

And the reason we are storing our scraped data in Python lists is that it will be a lot easier to convert those lists into a CSV file later.

Now we have scraped all the titles we need. Now let's move on to scraping all the genres. Type the following code:

💾
# genres
genres = []
for genre in soup.find_all('p', class_='genre'):
  genres.append(genre.string.strip())
  • It's very similar, but this time we are finding all the <p> elements that have the class genre. class is a reserved keyword in Python, so it can't be used as an argument name, and that's why Beautiful Soup uses class_ with a trailing underscore (_) instead.

The rest is self-explanatory.

Now let's scrape all the ratings, but using a different method called select():

💾
""" scraping using css_selector eg: select('span.class_name') """
# ratings, selecting all span with class="rating"
ratings = []
for rating in soup.select('span.rating'):
  ratings.append(rating.string.strip())
  • The select() method finds all the elements matching a CSS selector. Here we are selecting every span with the class rating using span.rating. Then we store their text in a Python list named ratings.
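For comparison, here is a sketch on a made-up snippet showing that select('span.rating') and find_all('span', class_='rating') pick the same elements. It also shows select_one(), which returns only the first match. The div.movie wrapper is my own assumption for the example, not necessarily how the TopMovies page is structured.

```python
from bs4 import BeautifulSoup

# made-up snippet with two movies
sample = """
<div class="movie"><span class="rating"> 9.2 </span><span class="year"> 1972 </span></div>
<div class="movie"><span class="rating"> 9.0 </span><span class="year"> 2008 </span></div>
"""
sample_soup = BeautifulSoup(sample, 'html.parser')

# same elements, two different ways of asking for them
by_selector = [s.string.strip() for s in sample_soup.select('span.rating')]
by_find_all = [s.string.strip() for s in sample_soup.find_all('span', class_='rating')]

# select_one() returns the first match only; selectors can be nested like in CSS
first_year = sample_soup.select_one('div.movie span.year').string.strip()

print(by_selector)
print(by_find_all)
print(first_year)
```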

Now it's time for a small exercise. Using the select() method, you have to scrape all the lengths (movie runtimes) and years (the year each movie was released). I can give you two hints:

  • Each movie length is inside a span with the class length (span.length).
  • And each year is inside a span with the class year (span.year).

The code will be very similar to the above piece of code. You just have to change the corresponding parts.

If you did it, then congratulations! Make sure to cross-check your code with the solution below. If you couldn't, no worries, just type the following solution.

The solution:

💾
# lengths, selecting all span with class="length"
lengths = []
for length in soup.select('span.length'):
  lengths.append(length.string.strip())

# years, selecting all span with class="year"
years = []
for year in soup.select('span.year'):
  years.append(year.string.strip())
  • I think no explanation is needed here.

The remaining things to scrape are the budgets, grosses, and img_urls. Here we are going to use the good old find_all() method to do that:

💾
""" scraping by navigating through elements eg: div.span.string """
# budget
budgets = []
for budget in soup.find_all('div', class_='budget'):
  # from <div class="budget"></div>, get the span.string
  budgets.append(budget.span.string.strip())

# gross
grosses = []
for gross in soup.find_all('div', class_='gross'):
  grosses.append(gross.span.string.strip())


""" parsing all the "src" attribute's value of <img /> tag """
img_urls = []
for img in soup.find_all('img', class_='poster'):
  img_urls.append(img.get('src').strip())
  • The one thing to note here is that, in the last few lines, we get each img's src attribute, because that's where the image's URL is located. To access any attribute of an element, we can use the .get() method after finding that element. Beautiful Soup stores each element's attributes as a Python dictionary when it parses the page, and that's why .get() works here just like it does on a dictionary.
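Here is a small sketch of both ideas, navigating from a div down to its span and reading attributes, on a made-up snippet (the budget value and the image path are invented for the example).

```python
from bs4 import BeautifulSoup

# made-up snippet with one budget and one poster image
sample = (
    '<div class="budget"><span> $6 million </span></div>'
    '<img class="poster" src=" posters/one.jpg " alt="Movie One">'
)
sample_soup = BeautifulSoup(sample, 'html.parser')

# navigating: from the <div>, step down to its first <span>
budget = sample_soup.find('div', class_='budget').span.string.strip()

img = sample_soup.find('img', class_='poster')
src = img.get('src').strip()   # same value as img['src'].strip()
width = img.get('width')       # missing attribute: .get() returns None

print(budget)
print(img.attrs)               # all of the tag's attributes as a dictionary
print(src)
print(width)
```

The difference between img['src'] and img.get('src') is the same as with a normal dictionary: the first raises a KeyError when the attribute is missing, the second returns None.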

And that's it, we have successfully scraped all the needed data.

Now let's export that data to a CSV file.

Creating a CSV file

In order to generate CSV files using Python, we need a module named csv. It's a built-in module, so you don't have to install it. You just have to import it at the very top of the scraper.py file.

So type this at the very top:

💾
import csv

Now at the very bottom of the file, type the following code:

💾
""" writing data to CSV """

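# open top25.csv file in "write" mode
# newline='' is recommended by the csv docs (avoids blank lines between rows on Windows)
with open('top25.csv', 'w', newline='', encoding='utf-8') as file: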
  # create a "writer" object
  writer = csv.writer(file, delimiter=',')

  # use "writer" obj to write 
  # you should give a "list"
  writer.writerow(["title", "genre", "ratings", "length", "year", "budget", "gross", "img_url"])

  for i in range(25):
    writer.writerow([
      titles[i], 
      genres[i], 
      ratings[i], 
      lengths[i], 
      years[i], 
      budgets[i], 
      grosses[i], 
      img_urls[i]
    ])
  • First, we open the file in 'w' mode ('w' for write). If no file with the given filename exists, Python will create one. If such a file already exists, it will be overwritten. Here we are opening/creating a file named top25.csv.
  • Then we create a csv.writer() object by giving it the file and the comma ',' as the delimiter character.
  • Then, using that writer object, we call writer.writerow(). The first row we write holds the captions; you can think of them as the table headings.
  • Finally, we loop 25 times, and in each iteration we write one row. Each row is all about a single movie.
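The loop above assumes there are exactly 25 items in every list. A slightly more defensive sketch is to let zip() pair the lists up, so the number of rows follows the data. The sample lists below are made up and only use three of the columns. The sketch writes to an in-memory io.StringIO buffer so nothing is created on disk; with a real file you would pass the file object from open() instead.

```python
import csv
import io

# made-up sample data standing in for the scraped lists
sample_titles = ['Movie One', 'Movie Two']
sample_genres = ['Crime, Drama', 'Action']
sample_ratings = ['9.2', '9.0']

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',')
writer.writerow(['title', 'genre', 'ratings'])

# zip() yields one tuple per movie and stops at the shortest list
for row in zip(sample_titles, sample_genres, sample_ratings):
    writer.writerow(row)

# the genre containing a comma gets quoted automatically by the writer
print(buffer.getvalue())
```

One thing to watch: because zip() stops at the shortest list, a list that came back short (for example, one movie without a genre) would silently drop rows, so it is worth comparing the list lengths first.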

That's it, let's try to run our script. I hope your terminal (inside your code editor) is already open and your venv is active. Now type this:

💾
python scraper.py

If everything went smoothly, then you should have a new file named top25.csv in the same directory, and it will contain data like this:

Figure 14: top25.csv
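If you want to double-check the result from Python instead of opening the file by hand, csv.DictReader can read it back with the header row as keys. This sketch reads a made-up three-column sample from memory rather than the real top25.csv.

```python
import csv
import io

# made-up stand-in for the contents of top25.csv
sample_csv = io.StringIO(
    'title,genre,ratings\r\n'
    'Movie One,"Crime, Drama",9.2\r\n'
    'Movie Two,Action,9.0\r\n'
)

# DictReader uses the first row as the keys for every following row
rows = list(csv.DictReader(sample_csv))

print(len(rows))
print(rows[0]['title'], '|', rows[0]['genre'])
```

For the real file you would use with open('top25.csv', newline='', encoding='utf-8') as file: and pass file to csv.DictReader. You should then get 25 rows.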

If you got any errors, then please make sure that the code you typed up to this point inside the scraper.py file is exactly like the final code below...

Final Code

💾
import requests
from bs4 import BeautifulSoup
import csv

# fetch the web page
page = requests.get('https://the-coding-pie.github.io/top_movies/')

# turn page into a BeautifulSoup object
soup = BeautifulSoup(page.content, 'lxml')


""" first, scraping using find_all() method """
# scrape all the titles
titles = [] 
for h3 in soup.find_all('h3'):
  titles.append(h3.string.strip())

# genres
genres = []
for genre in soup.find_all('p', class_='genre'):
  genres.append(genre.string.strip())


""" scraping using css_selector eg: select('span.class_name') """
# ratings, selecting all span with class="rating"
ratings = []
for rating in soup.select('span.rating'):
  ratings.append(rating.string.strip())

# lengths, selecting all span with class="length"
lengths = []
for length in soup.select('span.length'):
  lengths.append(length.string.strip())

# years, selecting all span with class="year"
years = []
for year in soup.select('span.year'):
  years.append(year.string.strip())


""" scraping by navigating through elements eg: div.span.string """
# budget
budgets = []
for budget in soup.find_all('div', class_='budget'):
  # from <div class="budget"></div>, get the span.string
  budgets.append(budget.span.string.strip())

# gross
grosses = []
for gross in soup.find_all('div', class_='gross'):
  grosses.append(gross.span.string.strip())


""" parsing all the "src" attribute's value of <img /> tag """
img_urls = []
for img in soup.find_all('img', class_='poster'):
  img_urls.append(img.get('src').strip())


""" writing data to CSV """

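# open top25.csv file in "write" mode
# newline='' is recommended by the csv docs (avoids blank lines between rows on Windows)
with open('top25.csv', 'w', newline='', encoding='utf-8') as file: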
  # create a "writer" object
  writer = csv.writer(file, delimiter=',')

  # use "writer" obj to write 
  # you should give a "list"
  writer.writerow(["title", "genre", "ratings", "length", "year", "budget", "gross", "img_url"])

  for i in range(25):
    writer.writerow([
      titles[i], 
      genres[i], 
      ratings[i], 
      lengths[i], 
      years[i], 
      budgets[i], 
      grosses[i], 
      img_urls[i]
    ])

Wrapping Up

I hope you enjoyed this tutorial. In some places, I intentionally skipped the explanation because the code was simple and self-explanatory. That's why I left it to you to decode it on your own.

True learning takes place when you try things on your own. Simply following a tutorial won't make you a better programmer. You have to use your own brain.

If you still have any errors, first try to solve them on your own by googling them.

Only if you can't find a solution, comment about it below. You should know how to find and resolve a bug on your own; that's a skill every programmer should have!

And that's it. Thank you ;)

About Me

Hey folks, my name is Aravind, and I am the man behind this website. To know more about me, check out the About Me page. If you like and enjoy my content, then please consider supporting what I do through - Buy Me a coffee.
