Let's Build a Web Scraper with Python & BeautifulSoup4
Ever wondered how to automate the process of scraping a website, collecting data, and exporting it to a useful format like CSV? If you are doing data science/machine learning then you may have been in this situation several times.
That's why I wrote this tutorial. In it, you will learn all about web scraping by building a Python script that scrapes a movie website, fetches useful information, and finally exports the collected data to a CSV (Comma Separated Values) file.
And the good thing is that you won't have to do web scraping by hand anymore!
Sounds interesting? Then let's jump right in.
You can download the finished code from my GitHub repo - Web Scraper.
What is Web Scraping?
Web Scraping is the process of collecting useful/needed information from any website on the internet. Like any other process, there are two ways to do it: one is to manually copy-paste the needed data from the website. And the other way, the way of legends, is to smartly automate it!
I hope you want to be in the second category. But there are some challenges in doing so...
The first challenge is that not all website owners like the process of scraping their website. So if you are going to do web scraping on a website, then please make sure that they allow you to do so.
The second challenge is that not all websites are alike. The script you wrote for one website can't be reused on another, because the structures of the two sites are entirely different. You may not even be able to use the same script on the same website after a few days, because web developers change their sites' layouts all the time, partly to fight off web scrapers.
Web Scraping Alternative
If there are so many challenges, is there an alternative? Yes: an API. An Application Programming Interface is the official and stable way of getting data from a website.
Most websites provide an API through which you can get the data you want in a cleaner format like JSON or XML. But there's a catch: you may have to pay. There may be a free plan, but in the long run you will probably have to pay to keep using their precious data.
That's where the concept of web scraping comes in handy!
What We are Going to Build
We will learn all about Web Scraping using Python and BeautifulSoup4 by building a real-world project.
I don't want to give you a headache by teaching you how to scrape an ever-changing dynamic website. So, I built a static movie website, named TopMovies, which contains a list of the top 25 IMDb movies. This is the website we are going to scrape. So before moving forward please examine it first - TopMovies.
As you can see, the TopMovies website has a list of the top 25 IMDb movies. Each movie entry holds the following details:
- title - the movie title
- genre - the movie's genres
- rating - the IMDb rating
- length - the movie runtime
- year - the year the movie was released
- budget - the movie's budget
- gross - the gross earnings
- poster - the poster image
We are going to scrape those details from the TopMovies website. Then, after collecting them all, we will export them to a useful format like CSV so that you can later import the data into your data science project and do some predictions without any worry!
In short, we are going to scrape the following website:
Export the scraped data into a CSV file like this:
And, later, if you want, you can read it as a Pandas DataFrame like below inside Jupyter notebook and you can do all your analysis and predictions easily!:
If you are not a Data Science/Machine Learning person, then don't worry about this last image, just forget it!
By doing this simple project, you will learn the skills to build any sort of web scraper capable of scraping almost any website you wish to scrape. And you will also learn how to generate CSV files using Python.
How are we going to do that?
It's very straightforward:
- First, we will fetch the web page we want using the `requests` module.
- Then, we will turn that page into a `BeautifulSoup` object with the help of a suitable parser like `lxml`. This will make the scraping process a lot easier.
- Then we will scrape all the needed data from that soup object.
- Finally, we will export all the scraped data into a file called `top25.csv` with the help of the built-in `csv` module.
Prerequisites
- You should be comfortable with Python 3.
- You should have a decent understanding of HTML and a little bit of CSS.
- You should have Python 3.4 or a higher version installed on your computer. You can read this post to learn how to set up Python 3 on any operating system - https://realpython.com/installing-python/
- You should have venv available (it ships with Python 3 on most platforms).
- Finally, you will need a modern code editor like Visual Studio Code. You can download Visual Studio Code for your operating system here - https://code.visualstudio.com/download.
With these things set up, now let's get started.
- First, create a folder named `web_scraper` anywhere on your computer.
- Then open it inside Visual Studio Code.
Now let's create a new virtual environment using venv and activate it. To do that:
- From within your text editor, Open Terminal > New Terminal.
- Then type:

```
python3 -m venv venv
```
This command will create a virtual environment named venv for us.
- To activate it, if you are on Windows, type the following: `venv\Scripts\activate`
- If you are on Linux/Mac, then type this instead: `source venv/bin/activate`
Now you should see something like this:
- Finally, create a new file named `scraper.py` directly inside the `web_scraper` folder.
Now you should have a file structure similar to this:
Note: If you are still confused about how to set up a virtual environment, then read this Quick Guide.
That's it, you have finished the initial setup. Now it's time for the fun stuff!
Getting the Web Page
Let me ask you a question. What would you do first in order to scrape a website manually, that is, to copy-paste the data from it?
First, you need to open up the web browser and type the URL, right? Because in order to get data from a web page, you must load it first. And that's exactly what we are going to do here.
First, we need to load the web page from the website. But we are not going to use the web browser at all. Instead, we are going to use a Python module called `requests`.

So, type the following command in the terminal to install the `requests` module:

```
pip install requests
```

Then, in the `scraper.py` file, type:
```python
import requests

# fetch the web page
page = requests.get('https://the-coding-pie.github.io/top_movies/')
```
- This code will fetch the whole `Response` object from the URL https://the-coding-pie.github.io/top_movies/. But we want the web page itself, or rather, the web page's content.
In order to get the web page's content, you have to access the `content` attribute on the `page` variable, like this: `page.content`. If you print it, you will see the HTML of the web page, like this:
But there's a problem. If you take a look at the type of `page.content` (for example with `print(type(page.content))`), you can see it is of type `bytes`. We can't parse that `bytes` value directly! Bytes are not useful until we convert them into some other format/type.
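To see the difference, here is a tiny self-contained sketch (the HTML fragment is made up for illustration, not taken from the real page) of how a `bytes` value relates to a normal `str`:

```python
# a made-up fragment of HTML, as raw bytes (like page.content)
html_bytes = b"<h3> The Shawshank Redemption </h3>"
print(type(html_bytes))   # <class 'bytes'>

# decoding turns it into a normal string we can work with
html_text = html_bytes.decode('utf-8')
print(type(html_text))    # <class 'str'>
```

BeautifulSoup handles this decoding step (and the actual parsing) for us, which is one reason to use it instead of working with raw bytes.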
What should we do now?
BeautifulSoup for the Rescue!
Beautiful Soup is a Python library for pulling data out of HTML and XML content like the above.
BeautifulSoup, with the help of a parser, transforms a complex HTML document into a tree of Python objects.
Note: I don't want to go in-depth about how BeautifulSoup works in this tutorial. If you are curious, then please use this link - Official Beautiful Soup Docs.
In short, with the help of BeautifulSoup and a parser, we can easily navigate, search, scrape, and modify the parsed HTML/XML content (like our bytes value above) by treating everything in it as a Python object!
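As a quick taste, here is a minimal sketch of that idea. The HTML snippet below is invented for illustration, not the real TopMovies markup, and it uses Python's built-in `html.parser` so it runs without extra installs:

```python
from bs4 import BeautifulSoup

# a made-up bytes snippet, shaped like page.content
html = b'<div class="budget"><span>$25,000,000</span></div>'

# html.parser ships with Python; lxml works the same way once installed
soup = BeautifulSoup(html, 'html.parser')

# the parsed tree is navigable with plain attribute access
div = soup.find('div', class_='budget')
budget = div.span.string
print(budget)
```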
So, let's install `BeautifulSoup4` and a parser like `lxml`. Type the following commands in the terminal window:

```
pip install beautifulsoup4
pip install lxml
```

`lxml` is the parser recommended by the BeautifulSoup community. There are alternatives like `html5lib`, but we are going to stick with `lxml`.
Now type the following code in the `scraper.py` file at the very top:

```python
from bs4 import BeautifulSoup
```

- Here we are importing `BeautifulSoup` from the `bs4` package.
Then, below this line - `page = requests.get('https://the-coding-pie.github.io/top_movies/')` - type the following:
```python
# turn page into a BeautifulSoup object
soup = BeautifulSoup(page.content, 'lxml')
```
- Here, we are converting our `page.content`, which is of type `bytes`, into a `BeautifulSoup` object.
Let's Scrape the Page
Now we have that whole web page in our hands (in a useful format). One of the two jobs left is to scrape it. So let's do that. We need to scrape the following things from the web page:
- `titles` - all the movie titles
- `genres` - all the genres
- `ratings` - all the movie ratings
- `lengths` - all the movie runtimes
- `years` - all the years the movies were released
- `budgets` - all the budgets
- `grosses` - all the gross earnings
- `img_urls` - the src URLs of all the poster images
So let's do them one by one.
First, let's scrape all the titles. Each title we are looking for is inside an HTML element called `<h3>`. Wait, how do I know that?
- Open up the URL you want to scrape inside a browser. In our case, open the TopMovies website.
- Inspect the data with the help of the Developer Tools in your browser. In my case, I am using Chrome, so:
- Right-click on the element you want to scrape,
- And click on Inspect.
- Now a new box will pop up like this:
- And see, I told you that the title we are looking for is inside an HTML element called `<h3>`.
Now we know where our data is sitting, let's scrape it. Type this below the last line you typed:
""" first, scraping using find_all() method """ # scrape all the titles titles =  for h3 in soup.find_all('h3'): titles.append(h3.string.strip())
Here's the line-by-line explanation of the above code:
- We are going to store all our titles inside a list called `titles`, and that's what the first line does: it creates that empty list.
- Then we use the `find_all()` method on the `soup` object we created earlier, to find all the `h3` elements we need. `find_all()` returns an iterable list, so we loop through all the found `h3` elements.
- And in the last line, inside the `for` loop, we take the `string` value. Why the `string` value? Because each whole `h3` looks like this: `<h3> Title inside </h3>`. But we only need the innermost string, right? So we use `h3.string`. After taking it, we `.strip()` it to remove the leading and trailing whitespace. Then we `.append()` it to the `titles` list.
Whoo, there's a lot going on in there. So please take a moment to understand it. This is the exact step we are going to repeat from here on to scrape all the other data.
The `soup` object we initially created from the HTML bytes gives us many built-in methods to easily navigate and scrape the HTML tree. `find_all()` is just one of them; we will explore a few more as we go down the road.
And the reason why we are storing our scraped data inside Python lists is that it will be a lot easier to convert those lists into a CSV file later.
Now we have scraped all the `titles` we need. Next, let's scrape all the `genres`. Type the following code:
```python
# genres
genres = []
for genre in soup.find_all('p', class_='genre'):
    genres.append(genre.string.strip())
```
- It's very similar, but this time we are finding all the `<p>` elements with the class `genre`. `class` is a reserved keyword in Python, so we can't use it directly, and that's why BeautifulSoup expects `class_` with an underscore (`_`) after the word. The rest is self-explanatory.
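To see the `class_` filter in action on its own, here is a minimal sketch (the markup below is invented for illustration):

```python
from bs4 import BeautifulSoup

html = '<p class="genre">Drama</p><p class="plot">Ignore me</p>'
soup = BeautifulSoup(html, 'html.parser')  # built-in parser; lxml works too

# class_ (note the underscore) matches the HTML "class" attribute
genres = [p.string for p in soup.find_all('p', class_='genre')]
print(genres)
```

Only the `<p>` whose class is `genre` is matched; the other `<p>` is ignored.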
Now let's scrape all the `ratings`, but using a different method called `select()`:
""" scraping using css_selector eg: select('span.class_name') """ # ratings, selecting all span with class="rating" ratings =  for rating in soup.select('span.rating'): ratings.append(rating.string.strip())
- The `select()` method finds elements using CSS-selector-like syntax. Here we are selecting all the `span` elements with the class `rating`, like this: `span.rating`. Then we store them in a Python list named `ratings`.
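Here is a small self-contained sketch of `select()` with a couple of selector styles (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<span class="rating">9.3</span>
<div class="movie"><span class="year">1994</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# tag.class selector, just like in a CSS stylesheet
ratings = [s.string for s in soup.select('span.rating')]

# descendant selectors work too
years = [s.string for s in soup.select('div.movie span.year')]
```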
Now it's time for a small exercise. Using the `select()` method, you have to scrape all the `lengths` (movie runtimes) and `years` (the years the movies were released). I can give you two hints:
- Each movie length is inside a `span` with the class `length`.
- And each year is inside a `span` with the class `year`.
The code will be very similar to the above piece of code. You just have to change the corresponding parts.
If you did it, then congratulations! Make sure to cross-check your code with the solution below. If you were unable to do that, then no worry, just type the following solution.
```python
# lengths, selecting all span with class="length"
lengths = []
for length in soup.select('span.length'):
    lengths.append(length.string.strip())

# years, selecting all span with class="year"
years = []
for year in soup.select('span.year'):
    years.append(year.string.strip())
```
- I think no explanation is needed here.
The remaining things to scrape are the `budgets`, the `grosses`, and the `img_urls`. Here we are going to use the good old `find_all()` method to do that:
""" scraping by navigating through elements eg: div.span.string """ # budget budgets =  for budget in soup.find_all('div', class_='budget'): # from <div class="budget"></div>, get the span.string budgets.append(budget.span.string.strip()) # gross grosses =  for gross in soup.find_all('div', class_='gross'): grosses.append(gross.span.string.strip()) """ parsing all the "src" attribute's value of <img /> tag """ img_urls =  for img in soup.find_all('img', class_='poster'): img_urls.append(img.get('src').strip())
- The one thing to note here is that, in the last few lines, we grab each img's `src` attribute, because that's where the image's URL is located. To access any attribute of an element, we can use the `.get()` method after finding that particular element. Beautiful Soup stores each element's attributes as a Python dictionary when it converts the page into a `BeautifulSoup` object, and that's why we can use the `.get()` method to access the values in that dictionary.
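A quick self-contained sketch of attribute access (the `<img>` markup and file path are invented for illustration):

```python
from bs4 import BeautifulSoup

html = '<img class="poster" src="posters/shawshank.jpg" />'
soup = BeautifulSoup(html, 'html.parser')
img = soup.find('img', class_='poster')

src = img.get('src')   # like dict.get(): the value, or None if absent
alt = img.get('alt')   # no alt attribute here, so this is None
```

Note that `img['src']` also works, but it raises a `KeyError` when the attribute is missing, which is why `.get()` is the safer choice when scraping pages you don't control.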
And that's it, we have successfully scraped all the needed data.
Now let's export those data to a CSV file.
Creating a CSV file
In order to generate CSV files using Python, we need a module named `csv`. It's a built-in module, so you don't have to install it. You just have to import it at the very top of the `scraper.py` file.
So type this at the very top: `import csv`
Now at the very bottom of the file, type the following code:
""" writing data to CSV """ # open top25.csv file in "write" mode with open('top25.csv', 'w') as file: # create a "writer" object writer = csv.writer(file, delimiter=',') # use "writer" obj to write # you should give a "list" writer.writerow(["title", "genre", "ratings", "length", "year", "budget", "gross", "img_url"]) for i in range(25): writer.writerow([ titles[i], genres[i], ratings[i], lengths[i], years[i], budgets[i], grosses[i], img_urls[i] ])
- First, we open the file in `'w'` (write) mode. If no file with the given filename exists, one will be created; if such a file exists, it will be overwritten. Here we are opening/creating a new file named `top25.csv`.
- Then we create a `csv.writer()` object, giving it the `file` and the comma delimiter.
- Then, using that writer object, we call `writer.writerow()`. The first row we write holds the captions; you can think of them as the table headings.
- Finally, we loop 25 times, and in each iteration we write one row. Each row is all about a single movie.
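If you want to sanity-check what the `csv` module produces, here is a tiny self-contained sketch. It writes and reads a throwaway file in the temp directory (not the real top25.csv), with invented sample data:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'csv_demo.csv')

# write a header row plus one data row
with open(path, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(["title", "year"])
    writer.writerow(["The Shawshank Redemption", "1994"])

# read it back: each row comes out as a list of strings
with open(path, newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```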
That's it, let's try to run our script. I hope your terminal (inside your code editor) is still open and your venv is active. Now type this: `python scraper.py`
If everything went smoothly, then you should have a new file named `top25.csv` in the same directory, and it will contain data like this:
If you got any errors, then please make sure that the code you typed up to this point inside the `scraper.py` file exactly matches the final code below:
```python
import requests
from bs4 import BeautifulSoup
import csv

# fetch the web page
page = requests.get('https://the-coding-pie.github.io/top_movies/')

# turn page into a BeautifulSoup object
soup = BeautifulSoup(page.content, 'lxml')

""" first, scraping using find_all() method """

# scrape all the titles
titles = []
for h3 in soup.find_all('h3'):
    titles.append(h3.string.strip())

# genres
genres = []
for genre in soup.find_all('p', class_='genre'):
    genres.append(genre.string.strip())

""" scraping using css_selector eg: select('span.class_name') """

# ratings, selecting all span with class="rating"
ratings = []
for rating in soup.select('span.rating'):
    ratings.append(rating.string.strip())

# lengths, selecting all span with class="length"
lengths = []
for length in soup.select('span.length'):
    lengths.append(length.string.strip())

# years, selecting all span with class="year"
years = []
for year in soup.select('span.year'):
    years.append(year.string.strip())

""" scraping by navigating through elements eg: div.span.string """

# budget
budgets = []
for budget in soup.find_all('div', class_='budget'):
    # from <div class="budget"><span>...</span></div>, get the span.string
    budgets.append(budget.span.string.strip())

# gross
grosses = []
for gross in soup.find_all('div', class_='gross'):
    grosses.append(gross.span.string.strip())

""" parsing all the "src" attribute's value of <img /> tag """

img_urls = []
for img in soup.find_all('img', class_='poster'):
    img_urls.append(img.get('src').strip())

""" writing data to CSV """

# open top25.csv file in "write" mode
# (newline='' prevents blank lines between rows on Windows)
with open('top25.csv', 'w', newline='') as file:
    # create a "writer" object
    writer = csv.writer(file, delimiter=',')
    # use the "writer" obj to write rows
    writer.writerow(["title", "genre", "ratings", "length", "year", "budget", "gross", "img_url"])
    for i in range(25):
        writer.writerow([
            titles[i],
            genres[i],
            ratings[i],
            lengths[i],
            years[i],
            budgets[i],
            grosses[i],
            img_urls[i]
        ])
```
I hope you enjoyed this tutorial. In some places, I intentionally skipped the explanation because the code was simple and self-explanatory; I left it for you to decode on your own.
True learning takes place when you try things yourself. Simply following a tutorial won't make you a better programmer. You have to use your own brain.
If you still have an error, first try to solve it on your own by googling it.
Only if you can't find a solution, comment below. Knowing how to find and resolve a bug on your own is a skill every programmer should have!
And that's it. Thank you ;)