Introduction
In this blog, we will learn what web scraping is and how to perform it using Beautiful Soup (BS4). First, we will cover basic web scraping, and then we will see how to scrape a table and store its data in a CSV file.
What is web scraping?
Web scraping is the practice of extracting data from websites using automated tools or scripts. It involves parsing the HTML or XML structure of a webpage and extracting the relevant information, such as text, images, links, and more. Web scraping enables you to gather data from multiple sources, analyze it, and use it for various purposes, including research, analysis, and business intelligence.
What is Beautiful Soup?
Beautiful Soup is a Python library designed for web scraping purposes. It provides a convenient way to navigate, search, and modify the parsed HTML or XML documents. Beautiful Soup makes it easy to extract data from complex and nested structures by providing intuitive methods and powerful querying capabilities.
Installation
To start using Beautiful Soup, you need to have Python installed on your system. You can install Beautiful Soup by running the following command:
pip install beautifulsoup4
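To confirm the installation worked, you can check the installed version from Python:
import bs4
print(bs4.__version__)  # prints the installed Beautiful Soup version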
Basic Usage
Beautiful Soup provides a simple and intuitive interface for parsing HTML or XML documents. You can pass the HTML content or a file to the BeautifulSoup constructor and start working with the parsed document.
Here's a basic example that demonstrates how to extract the title of a webpage using Beautiful Soup:
from bs4 import BeautifulSoup
import requests
# Make a request to the webpage
response = requests.get("https://www.python.org/")
# Create a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# Extract the title
title = soup.title.text
print(title)
The above code sends a GET request to the specified URL, retrieves the HTML content, and uses Beautiful Soup to parse the document. It then extracts the title of the webpage and prints it.
Navigating the document
Once you have parsed a document with Beautiful Soup, you can navigate through its structure using various methods and attributes. Here are some commonly used ones:
Tag Names
You can access elements by their tag names. For example, to find all the <a> tags in the document, you can use the find_all() method:
links = soup.find_all("a")
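find_all() returns a list of matching Tag objects. As a quick sketch, reusing the soup object from the python.org example above, you can loop over the results and read each link's text and href attribute:
# Print the text and destination of every link found above
for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))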
CSS Selectors
Beautiful Soup also supports CSS selectors for querying elements. You can use the select() method to find elements that match a CSS selector:
paragraphs = soup.select("p")
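select() accepts full CSS selector syntax, not just tag names. For example (the "introduction" class below is purely illustrative, not taken from a real page):
# Combine tags, classes, and hierarchy in one selector
# ("div.introduction" is a hypothetical class used for illustration)
intro_links = soup.select("div.introduction a")
headings = soup.select("h1, h2")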
Traversing the Tree
You can traverse the document tree by accessing parent, sibling, or child elements. For example, to get the parent of an element, you can use the parent attribute:
first_link = soup.find("a")
parent_element = first_link.parent
Extracting a table and storing it in CSV
In this section, we will extract the names of Indian companies and their ticker symbols from the following website, https://indiancompanies.in/listed-companies-in-nse-with-symbol/, and store them in a CSV file.
import requests
from bs4 import BeautifulSoup
import csv
Importing the necessary libraries: the import requests statement imports the requests module, which allows making HTTP requests to retrieve web content. The from bs4 import BeautifulSoup line imports the BeautifulSoup class from the bs4 module; BeautifulSoup is a library used for parsing HTML or XML documents. The import csv statement imports the csv module, which provides functionality for reading and writing CSV files.
url = "https://indiancompanies.in/listed-companies-in-nse-with-symbol/"
r = requests.get(url)
This code assigns the URL "https://indiancompanies.in/listed-companies-in-nse-with-symbol/" to the variable url. The requests.get(url) line then sends an HTTP GET request to that URL and stores the response in the variable r.
count = 0
quotes = [""]
quote = {}
soup = BeautifulSoup(r.content, "html.parser")
The count variable is initialized to 0. The quotes list is initialized with a single empty string element, and the quote dictionary is initialized as an empty dictionary. The BeautifulSoup(r.content, "html.parser") line creates a BeautifulSoup object named soup by parsing the content of the HTTP response (r.content) using the "html.parser" parser.
The idea here is to store each company's name and its ticker symbol in the quote dictionary, then append that dictionary to the quotes list. Once this is done for one company, the dictionary is re-initialized to empty and the same process is repeated for the other companies.
for row in soup.find_all('td'):
    if count == 1:
        # Second column: company name
        quote['cmp'] = row.text
        count = count + 1
    elif count == 2:
        # Third column: ticker symbol; the row is complete, so save it
        quote['ticker'] = row.text
        count = count - 2
        quotes.append(quote)
    else:
        # First column: serial number (SNO.), which we skip
        count = count + 1
        quote = {}
The table has 3 columns: the first contains the serial number (SNO.), the second the name of the company, and the third the symbol. We want to skip SNO. because we don't need it.
Since SNO. is in the first column, it is fetched first, followed by the company name and the symbol. So we initialize a count variable to zero: when count is 0, SNO. is being fetched; when it is 1, the company name is being fetched; and when it is 2, the symbol is being fetched.
This loop iterates over each <td> tag found in the soup object. Inside the loop, it checks the value of the count variable. If count is 1, it assigns the text content of the current <td> tag to the 'cmp' key of the quote dictionary. If count is 2, it assigns the text content of the current <td> tag to the 'ticker' key of the quote dictionary and appends the quote dictionary to the quotes list. If count is neither 1 nor 2, it increments the count variable by 1 and resets the quote dictionary to an empty dictionary.
del quotes[0]
del quotes[0]
keys = quotes[0].keys()
fileName = 'ticker.csv'
The two del quotes[0] lines remove the first two elements from the quotes list: the empty string we added when initializing the list, and the header-row placeholders (SNO, NAME OF COMPANY, SYMBOL). The keys variable is assigned the keys of the first dictionary element in the quotes list using the .keys() method. The fileName variable is assigned the string 'ticker.csv'.
with open(fileName, 'w', newline='') as f:
    w = csv.DictWriter(f, keys)
    w.writeheader()
    for dictionary in quotes:
        w.writerow(dictionary)
Now we have our list containing company names and symbols, so it's time to store it in a CSV file.
This code block opens the file 'ticker.csv' in write mode using a context manager (with open(...)). The csv.DictWriter class is instantiated with the file object f and the keys variable, which represents the column names in the CSV file. The writeheader() method writes the column names as the header row in the CSV file. The loop then iterates over each dictionary in the quotes list and writes it as a row in the CSV file using the writerow() method of the csv.DictWriter object.
After doing this, we will have a CSV file containing the company names and ticker symbols.
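As a quick sanity check, you can read the file back and print the first few rows:
# Read the CSV back to verify its contents (first three lines)
with open('ticker.csv') as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i == 2:
            break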
Advanced Usage
Beautiful Soup offers many advanced features that allow you to manipulate and transform the parsed documents. Here are a few notable features:
Modifying the Document
You can modify the document by adding, removing, or modifying elements. Beautiful Soup provides methods like append(), insert(), extract(), and replace_with() to manipulate the document structure.
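For example, here is a minimal sketch that removes a tag, creates a new one, and appends it (the snippet and tag names are made up for illustration):
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
soup.b.extract()               # remove the <b> tag from the tree
new_tag = soup.new_tag("i")    # create a new <i> tag
new_tag.string = "everyone"
soup.p.append(new_tag)         # append it inside the paragraph
print(soup)                    # <p>Hello <i>everyone</i></p>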
Searching with Regular Expressions
Beautiful Soup supports searching for elements using regular expressions. You can pass a compiled regular expression object (created with the re module) to the find_all() method. This allows for more advanced and flexible searching capabilities.
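For instance, a short sketch that finds every heading tag by matching tag names against a regular expression (the sample HTML is made up for illustration):
import re
from bs4 import BeautifulSoup

html = "<h1>Title</h1><h2>Subtitle</h2><p>Body</p>"
soup = BeautifulSoup(html, "html.parser")

# Match any tag whose name is h1 through h6
headings = soup.find_all(re.compile(r"^h[1-6]$"))
print([tag.name for tag in headings])  # ['h1', 'h2']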
Handling Different Encodings
Web pages often use different character encodings, and Beautiful Soup can handle them seamlessly. When parsing a document, Beautiful Soup automatically detects and converts the character encoding to Unicode, ensuring that you can work with the text consistently.
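For example, a minimal sketch feeding Beautiful Soup bytes in a non-UTF-8 encoding (the exact encoding reported can vary with the detection libraries installed):
from bs4 import BeautifulSoup

# Bytes encoded as Latin-1; Beautiful Soup converts them to Unicode
html_bytes = "<p>café</p>".encode("latin-1")
soup = BeautifulSoup(html_bytes, "html.parser")
print(soup.p.text)             # café
print(soup.original_encoding)  # the encoding Beautiful Soup detected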
Dealing with Broken HTML
Beautiful Soup has a built-in HTML parser that can handle imperfect and broken HTML. It can parse HTML documents even if they are not well-formed, allowing you to extract data from a wide range of web pages.
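For example, a small sketch parsing a deliberately malformed snippet with unclosed tags:
from bs4 import BeautifulSoup

# Malformed HTML: unclosed <p> and <b> tags, no <html> or <body>
broken_html = "<p>First paragraph<p>Second <b>bold text"
soup = BeautifulSoup(broken_html, "html.parser")
for p in soup.find_all("p"):
    print(p.get_text())  # "First paragraph", then "Second bold text"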