Introduction
In this blog, we will learn what web scraping is and how to perform it using Beautiful Soup (BS4). First, we will cover basic web scraping, and then we will see how to scrape a table and store its data in a CSV file.
What is web scraping?
Web scraping is the practice of extracting data from websites using automated tools or scripts. It involves parsing the HTML or XML structure of a webpage and extracting the relevant information, such as text, images, links, and more. Web scraping enables you to gather data from multiple sources, analyze it, and use it for various purposes, including research, analysis, and business intelligence.
What is Beautiful Soup?
Beautiful Soup is a Python library designed for web scraping purposes. It provides a convenient way to navigate, search, and modify the parsed HTML or XML documents. Beautiful Soup makes it easy to extract data from complex and nested structures by providing intuitive methods and powerful querying capabilities.
Installation
To start using Beautiful Soup, you need to have Python installed on your system. You can install Beautiful Soup by running the following command:
pip install beautifulsoup4
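To confirm the installation worked, you can check the installed version from Python:
import bs4
print(bs4.__version__)  # prints the installed Beautiful Soup version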
Basic Usage
Beautiful Soup provides a simple and intuitive interface for parsing HTML or XML documents. You can pass the HTML content or a file to the BeautifulSoup constructor and start working with the parsed document.
Here's a basic example that demonstrates how to extract the title of a webpage using Beautiful Soup:
from bs4 import BeautifulSoup
import requests
# Make a request to the webpage
response = requests.get("https://www.python.org/")
# Create a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# Extract the title
title = soup.title.text
print(title)
The above code sends a GET request to the specified URL, retrieves the HTML content, and uses Beautiful Soup to parse the document. It then extracts the title of the webpage and prints it.
Navigating the document
Once you have parsed a document with Beautiful Soup, you can navigate through its structure using various methods and attributes. Here are some commonly used ones:
Tag Names
You can access elements by their tag names. For example, to find all the <a> tags in the document, you can use the find_all() method:
links = soup.find_all("a")
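find_all() returns a list of matching Tag objects. As a quick sketch, reusing the soup object from the python.org example above, you can loop over the results and read each link's text and href attribute:
# Print the text and destination of every link found above
for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))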
CSS Selectors
Beautiful Soup also supports CSS selectors for querying elements. You can use the select() method to find elements that match a CSS selector:
paragraphs = soup.select("p")
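select() accepts full CSS selector syntax, not just tag names. For example (the "introduction" class below is purely illustrative, not taken from a real page):
# Combine tags, classes, and hierarchy in one selector
# ("div.introduction" is a hypothetical class used for illustration)
intro_links = soup.select("div.introduction a")
headings = soup.select("h1, h2")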
Traversing the Tree
You can traverse the document tree by accessing parent, sibling, or child elements. For example, to get the parent of an element, you can use the parent attribute:
first_link = soup.find("a")
parent_element = first_link.parent
Extracting a table and storing it in CSV
In this section, we will extract the names of Indian companies and their ticker symbols from the following website, https://indiancompanies.in/listed-companies-in-nse-with-symbol/, and store them in a CSV file.
import requests
from bs4 import BeautifulSoup
import csv
Importing the necessary libraries: the import requests statement imports the requests module, which allows making HTTP requests to retrieve web content. The from bs4 import BeautifulSoup line imports the BeautifulSoup class from the bs4 module; BeautifulSoup is a library used for parsing HTML or XML documents. The import csv statement imports the csv module, which provides functionality for reading and writing CSV files.
url = "https://indiancompanies.in/listed-companies-in-nse-with-symbol/"
r = requests.get(url)
This code assigns the URL "https://indiancompanies.in/listed-companies-in-nse-with-symbol/" to the variable url. The requests.get(url) line then sends an HTTP GET request to that URL and stores the response in the variable r.
count = 0
quotes = [""]
quote = {}
soup = BeautifulSoup(r.content, "html.parser")
The count variable is initialized to 0. The quotes list is initialized with a single empty string element, and the quote dictionary is initialized as an empty dictionary. The BeautifulSoup(r.content, "html.parser") line creates a BeautifulSoup object named soup by parsing the content of the HTTP response (r.content) using the "html.parser" parser.
The idea here is to store each company's name and its ticker symbol in the quote dictionary, then append that dictionary to the quotes list. Once this is done for one company, the dictionary is re-initialized to empty and the same process is repeated for the other companies.
for row in soup.find_all('td'):
    if count == 1:
        # Second column: company name
        quote['cmp'] = row.text
        count = count + 1
    elif count == 2:
        # Third column: ticker symbol; the row is complete, so save it
        quote['ticker'] = row.text
        count = count - 2
        quotes.append(quote)
    else:
        # First column: serial number (SNO.), which we skip
        count = count + 1
        quote = {}
The table has 3 columns: the first contains the serial number (SNO.), the second the name of the company, and the third the symbol. We want to skip SNO. because we don't need it.
Since SNO. is in the first column, it is fetched first, followed by the company name and the symbol. So we initialize a count variable to zero: when count is 0, SNO. is being fetched; when it is 1, the company name is being fetched; and when it is 2, the symbol is being fetched.
This loop iterates over each <td> tag found in the soup object. Inside the loop, it checks the value of the count variable. If count is 1, it assigns the text content of the current <td> tag to the 'cmp' key of the quote dictionary. If count is 2, it assigns the text content of the current <td> tag to the 'ticker' key of the quote dictionary and appends the quote dictionary to the quotes list. If count is neither 1 nor 2, it increments the count variable by 1 and resets the quote dictionary to an empty dictionary.
del quotes[0]
del quotes[0]
keys = quotes[0].keys()
fileName = 'ticker.csv'
The two del quotes[0] lines remove the first two elements from the quotes list: the empty string we added when initializing the list, and the header-row placeholders (SNO, NAME OF COMPANY, SYMBOL). The keys variable is assigned the keys of the first dictionary element in the quotes list using the .keys() method. The fileName variable is assigned the string 'ticker.csv'.
with open(fileName, 'w', newline='') as f:
    w = csv.DictWriter(f, keys)
    w.writeheader()
    for dictionary in quotes:
        w.writerow(dictionary)
Now we have our list containing company names and symbols, so it's time to store it in a CSV file.
This code block opens the file 'ticker.csv' in write mode using a context manager (with open(...)). The csv.DictWriter class is instantiated with the file object f and the keys variable, which represents the column names in the CSV file. The writeheader() method writes the column names as the header row in the CSV file. The loop then iterates over each dictionary in the quotes list and writes it as a row in the CSV file using the writerow() method of the csv.DictWriter object.
After doing this, we will have a CSV file containing the company names and ticker symbols.
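As a quick sanity check, you can read the file back and print the first few rows:
# Read the CSV back to verify its contents (first three lines)
with open('ticker.csv') as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i == 2:
            break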
Advanced Usage
Beautiful Soup offers many advanced features that allow you to manipulate and transform the parsed documents. Here are a few notable features:
Modifying the Document
You can modify the document by adding, removing, or modifying elements. Beautiful Soup provides methods like append(), insert(), extract(), and replace_with() to manipulate the document structure.
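For example, here is a minimal sketch that removes a tag, creates a new one, and appends it (the snippet and tag names are made up for illustration):
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
soup.b.extract()               # remove the <b> tag from the tree
new_tag = soup.new_tag("i")    # create a new <i> tag
new_tag.string = "everyone"
soup.p.append(new_tag)         # append it inside the paragraph
print(soup)                    # <p>Hello <i>everyone</i></p>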
Searching with Regular Expressions
Beautiful Soup supports searching for elements using regular expressions. You can pass a compiled regular expression object (created with the re module) to the find_all() method. This allows for more advanced and flexible searching capabilities.
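For instance, a short sketch that finds every heading tag by matching tag names against a regular expression (the sample HTML is made up for illustration):
import re
from bs4 import BeautifulSoup

html = "<h1>Title</h1><h2>Subtitle</h2><p>Body</p>"
soup = BeautifulSoup(html, "html.parser")

# Match any tag whose name is h1 through h6
headings = soup.find_all(re.compile(r"^h[1-6]$"))
print([tag.name for tag in headings])  # ['h1', 'h2']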
Handling Different Encodings
Web pages often use different character encodings, and Beautiful Soup can handle them seamlessly. When parsing a document, Beautiful Soup automatically detects and converts the character encoding to Unicode, ensuring that you can work with the text consistently.
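For example, a minimal sketch feeding Beautiful Soup bytes in a non-UTF-8 encoding (the exact encoding reported can vary with the detection libraries installed):
from bs4 import BeautifulSoup

# Bytes encoded as Latin-1; Beautiful Soup converts them to Unicode
html_bytes = "<p>café</p>".encode("latin-1")
soup = BeautifulSoup(html_bytes, "html.parser")
print(soup.p.text)             # café
print(soup.original_encoding)  # the encoding Beautiful Soup detected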
Dealing with Broken HTML
Beautiful Soup has a built-in HTML parser that can handle imperfect and broken HTML. It can parse HTML documents even if they are not well-formed, allowing you to extract data from a wide range of web pages.
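For example, a small sketch parsing a deliberately malformed snippet with unclosed tags:
from bs4 import BeautifulSoup

# Malformed HTML: unclosed <p> and <b> tags, no <html> or <body>
broken_html = "<p>First paragraph<p>Second <b>bold text"
soup = BeautifulSoup(broken_html, "html.parser")
for p in soup.find_all("p"):
    print(p.get_text())  # "First paragraph", then "Second bold text"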