When you hear terms like “Python coding”, “data science”, and “web scraping”, your mind might conjure up images of complex code being run by expert hackers. It’s actually far simpler than that. Thanks to how easy and intuitive Python is to learn, you can master the art of web scraping in a relatively short amount of time.
When getting into web scraping with Python, you might think it’s all about using RegEx, but it’s more about how you understand the data structure and layout of the website, much of which can be found in the CSS and HTML. So, if you have experience in HTML / CSS for website design, you’re halfway there.
This is because what you’re really doing with Python is telling it to display only the useful bits of information from a website’s HTML / CSS source code, depending on what type of information you need to extract.
There are a lot of scenarios where web scraping makes sense, such as:
- Checking retail websites for discounts and promos or competing products.
- Compiling trending topics from aggregation websites.
- Scraping contact details of businesses and individuals.
- Finding how many times a keyword is used on a page.
Getting ready to scrape
One of the first things you should do when planning to scrape is to analyze the target website, because most of what you need to know about its data structure and layout is in its HTML and CSS. Our basic tutorial below will cover some very simple HTML extraction, but you can learn a lot more in-depth stuff from this tutorial (bookmark it for reference).
For example, say you want to scrape the data from page headers, or buttons of different colors. These are usually identified by their HTML tags and by the CSS classes or IDs attached to them, so analyzing the website’s architecture before you spend a lot of time on your code will be very helpful.
Basic scraping tutorial with Beautiful Soup
To give an extremely basic example of scraping with Python, we’ll be grabbing the title from a website. It’s best to try this on your own website, even if that means setting up a free WordPress blog just for practice.
You’ll need a few (free) packages for this, which are:
- Requests
- Beautiful Soup
- LXML
You can install these easily with ‘pip install requests’, ‘pip install beautifulsoup4’, and ‘pip install lxml’ on Windows, or the equivalent ‘pip3 install’ commands on Mac. Follow this with “import requests” and “import bs4” in your script or console to get started.
The reason we need these packages together is that Beautiful Soup cannot fetch a webpage by itself, so we need the Requests package to download the page for it. LXML is a feature-rich library for processing XML/HTML in Python, and Beautiful Soup can use it as its parser, so it’s also very useful to have.
Create a response object with “res = requests.get('https://websitename.com')”.
Now if you just type “res.text” it will display literally the entire contents of the webpage, kind of like what you would see if you clicked “View page source” in your browser. This isn’t really usable, so you need to extract the useful information from it, which is where Beautiful Soup comes in.
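Before handing the page to Beautiful Soup, it can help to make the request step a little more defensive. The following is only a minimal sketch: the function name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML source as a string.

    fetch_html and the default timeout are hypothetical choices
    for this sketch, not something the tutorial prescribes.
    """
    # Same call as in the tutorial, plus a timeout so the script
    # does not wait forever on an unresponsive server.
    res = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of silently
    # parsing an error page.
    res.raise_for_status()
    return res.text
```

Calling fetch_html('https://websitename.com') should give you the same string as “res.text” above, or raise an error if the page could not be retrieved.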
So, type the following in your console:
soup = bs4.BeautifulSoup(res.text, 'lxml')
hi = soup.select('title')
So, what we did here was create a variable that holds the result of asking Beautiful Soup to extract the ‘title’ tags from the webpage’s source code. You could also pass the anchor tag (‘a’) instead, for example, but that alone is not a favorable strategy for extracting all of the links from a website.
So now if you simply type ‘hi’ in your console, it should output a list of the matching tags, something like:
[<title>Your website title here</title>]
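If you only want the text inside the tag rather than the tag itself, you can take the first match and read its text. This is a small sketch under a couple of assumptions: the html string stands in for “res.text” so it runs without a network request, and it uses Python’s built-in 'html.parser' instead of 'lxml' so it works even if LXML is not installed.

```python
import bs4

# Stand-in for res.text (an assumption so the sketch runs offline).
html = (
    "<html><head><title>Your website title here</title></head>"
    "<body></body></html>"
)

# 'html.parser' ships with Python; use 'lxml' as in the tutorial
# if you have it installed.
soup = bs4.BeautifulSoup(html, "html.parser")

# select() returns a list of every tag matching the CSS selector.
hi = soup.select("title")

# Take the first match and keep only the text between the tags.
title_text = hi[0].get_text()
print(title_text)
```

On a real page it is worth checking that the list is not empty before indexing into it, since not every page has the tag you are looking for.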
And so now it’s just a matter of experimenting and going through the data you extract in this manner, looking for patterns. For example, you could extract and display all of the header tags (H1, H2, H3) and find keywords being focused on, much faster than actually visiting the website in your browser and scrolling through the page yourself.
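As a rough sketch of the header idea, the snippet below collects the H1, H2, and H3 tags and counts the words used in them. The sample HTML is made up for illustration and stands in for “res.text”, and the built-in 'html.parser' is used in place of 'lxml' so the example runs without extra installs.

```python
from collections import Counter

import bs4

# Made-up page source standing in for res.text.
html = """
<html><body>
<h1>Python web scraping basics</h1>
<h2>Scraping with Beautiful Soup</h2>
<h3>Python packages</h3>
<p>Body text is ignored here.</p>
</body></html>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# One CSS selector can match several tag names at once.
headers = [tag.get_text(strip=True) for tag in soup.select("h1, h2, h3")]

# Count how often each lowercase word appears across the headers.
words = Counter(word for header in headers for word in header.lower().split())

print(headers)
print(words.most_common(3))
```

Words that appear in several headers rise to the top of the count, which gives you a quick hint about the keywords a page is focusing on.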
So, this is just an extremely basic example tutorial of what you can do with Python for scraping data. To go even deeper and get into the more complex stuff, you should experiment on your own website, and maybe take Python courses online that are specifically focused on web scraping.