Python Web Scraping Tutorial

Last Updated : 02 Jan, 2025

In today’s digital world, data is the key to unlocking valuable insights, and much of this data is available on the web. But how do you gather large amounts of data from websites efficiently? That’s where Python web scraping comes in. Web scraping, the process of extracting data from websites, has emerged as a powerful technique for gathering information from the vast expanse of the internet.

In this tutorial, we’ll explore various Python libraries and modules commonly used for web scraping and look at why Python 3 is the preferred choice for this task. You will also see how to use tools like Requests, BeautifulSoup, lxml, and Selenium to scrape websites.

Essential Packages and Tools for Python Web Scraping

The latest version of Python offers a rich set of tools and libraries for web scraping, making it easier than ever to retrieve data from the web efficiently and effectively.

Requests Module

The requests library is used for making HTTP requests to a specific URL and returns the response. It provides built-in functionality for managing both the request and the response.

pip install requests

Example: Making a Request

The Python requests module has several built-in methods for making HTTP requests to a specified URI using GET, POST, PUT, PATCH, or HEAD. An HTTP request is meant either to retrieve data from a specified URI or to push data to a server.
HTTP works as a request-response protocol between a client and a server. Here we will use a GET request, which retrieves information from the server at the given URI. With GET, any parameters are URL-encoded and appended to the URL as a query string.

import requests

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received
# success code - 200
print(r.status_code)

# print content of request
print(r.content)

Output

Python requests making GET request
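
To make the point about query strings concrete, here is a minimal sketch of how requests appends parameters to a URL. The endpoint https://example.com/search and the parameter names are placeholders chosen for illustration, not part of the original example. The request is only prepared, not sent, so the snippet runs without network access.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
base_url = "https://example.com/search"
params = {"q": "python", "page": 2}

# Build the request without sending it, so we can inspect the final URL.
prepared = requests.Request("GET", base_url, params=params).prepare()
print(prepared.url)


def fetch(url, params=None):
    """Send a GET request and return the body text, raising on HTTP errors."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # turns 4xx/5xx responses into exceptions
    return response.text
```

The fetch() helper is defined but not called here. In a real script you would call it with the URL you want to scrape. Setting a timeout and calling raise_for_status() are common precautions rather than requirements.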

For more information, refer to our Python Requests Tutorial .

BeautifulSoup Library

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings unless the document doesn’t specify an encoding and Beautiful Soup can’t detect one. Then you just have to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try different parsing strategies or trade speed for flexibility.

pip install beautifulsoup4

Example

  1. Importing Libraries: The code imports the requests library for making HTTP requests and the BeautifulSoup class from the bs4 library for parsing HTML.
  2. Making a GET Request: It sends a GET request to ‘https://www.geeksforgeeks.org/python-programming-language/’ and stores the response in the variable r.
  3. Checking Status Code: It prints the status code of the response, typically 200 for success.
  4. Parsing the HTML : The HTML content of the response is parsed using BeautifulSoup and stored in the variable soup.
  5. Printing the Prettified HTML: It prints the prettified version of the parsed HTML content for readability and analysis.
import requests
from bs4 import BeautifulSoup


# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received
# success code - 200
print(r.status_code)

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

Output

Python BeautifulSoup Parsing HTML

Finding Elements by Class

Now, we would like to extract some useful data from the HTML content. The soup object holds the whole document as a nested structure that can be searched programmatically. The page we want to scrape contains a lot of text, so let’s extract all of it.
First, let’s inspect the webpage we want to scrape.


(Image: inspecting the page, showing the div with class entry-content)


In the above image, we can see that all the content of the page is under the div with class entry-content. We will use the find method, which returns the first tag that matches the given name and attributes. In our case, it will find the div whose class is entry-content.

We can see that the text of the page is inside <p> tags. Now we have to find all the p tags inside this div. We can use the find_all method of BeautifulSoup for this.

import requests
from bs4 import BeautifulSoup


# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Find the content div, then collect only the <p> tags inside it
s = soup.find('div', class_='entry-content')
content = s.find_all('p')

print(content)

Output:

find_all bs4
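
As a self-contained illustration of find followed by find_all, the sketch below parses a small inline HTML string instead of the live page. The markup is invented for this example. It only assumes the same entry-content class name used above, and it shows how to pull the text out of each paragraph with get_text().

```python
from bs4 import BeautifulSoup

# Invented sample markup; only the entry-content div should be scraped.
sample_html = """
<div class="header"><p>Navigation text</p></div>
<div class="entry-content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
container = soup.find("div", class_="entry-content")

paragraphs = []
if container is not None:
    # find_all() on the container searches only inside that div.
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]

print(paragraphs)
```

Calling find_all on the container rather than on soup is what keeps the paragraph in the header div out of the result.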

For more information, refer to our Python BeautifulSoup .

Selenium

Selenium is a popular Python module used for automating web browsers. It allows developers to control web browsers programmatically, enabling tasks such as web scraping, automated testing, and web application interaction.
Selenium supports various web browsers, including Chrome, Firefox, Safari, and Edge, making it a versatile tool for browser automation.

Example 1: For Firefox

In this specific example, we’re directing the browser to the Google search page with the query parameter “geeksforgeeks”. The browser will load this page, and we can then proceed to interact with it programmatically using Selenium.
This interaction could involve tasks like extracting search results, clicking on links, or scraping specific content from the page.

# import webdriver 
from selenium import webdriver 

# create webdriver object 
driver = webdriver.Firefox() 

# open the Google search results for "geeksforgeeks"
driver.get("https://google.co.in/search?q=geeksforgeeks")

Output

for-firefox

Example 2: For Chrome

  1. We import webdriver and By from Selenium, along with ChromeDriverManager from the webdriver_manager package.
  2. Selenium needs a driver executable that matches your browser. Instead of downloading ChromeDriver and specifying its path by hand, we let ChromeDriverManager().install() download it and return the path.
  3. We create a Chrome browser instance with webdriver.Chrome(), using the driver that ChromeDriverManager installed.
  4. We navigate to each page by calling the get() method on the driver object and passing the URL of the page.
  5. We extract information with find_elements(), locating the title, price, description, and rating elements by class name and reading the text attribute of each.
  6. Finally, we print the collected list and close the browser.
# importing necessary packages
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# for holding the resultant list
element_list = []

# create the driver once and reuse it for every page (Selenium 4 style)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

for page in range(1, 3):

    page_url = "https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page=" + str(page)
    driver.get(page_url)
    title = driver.find_elements(By.CLASS_NAME, "title")
    price = driver.find_elements(By.CLASS_NAME, "price")
    description = driver.find_elements(By.CLASS_NAME, "description")
    rating = driver.find_elements(By.CLASS_NAME, "ratings")

    for i in range(len(title)):
        element_list.append([title[i].text, price[i].text, description[i].text, rating[i].text])

print(element_list)

# closing the browser and ending the driver session
driver.quit()

For more information, refer to our Python Selenium .

Lxml

The lxml module in Python is a powerful library for processing XML and HTML documents. It provides high-performance XML and HTML parsing along with a simple, Pythonic API. lxml is widely used in Python web scraping due to its speed, flexibility, and ease of use.

pip install lxml

Example

Here’s a simple example demonstrating how to use the lxml module for Python web scraping:

  1. We import the html module from lxml along with the requests module for sending HTTP requests.
  2. We define the URL of the website we want to scrape.
  3. We send an HTTP GET request to the website using the requests.get() function and retrieve the HTML content of the page.
  4. We parse the HTML content using the html.fromstring() function from lxml, which returns an HTML element tree.
  5. We use XPath expressions to extract specific elements from the HTML tree. In this case, we’re extracting the text content of all the <a> (anchor) elements on the page.
  6. We iterate over the extracted link titles and print them out.
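
The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original listing: the function names extract_link_texts and scrape_link_texts and the sample markup are made up for illustration. The parsing step is demonstrated on an inline HTML string so the snippet runs offline.

```python
import requests
from lxml import html


def extract_link_texts(page_source):
    """Parse HTML and return the non-empty text of every <a> element."""
    tree = html.fromstring(page_source)
    # XPath: the text nodes directly inside any <a> element in the document
    texts = tree.xpath("//a/text()")
    return [text.strip() for text in texts if text.strip()]


def scrape_link_texts(url):
    """Fetch a page over HTTP and extract its link texts (needs network)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_link_texts(response.content)


# Offline demonstration on invented markup.
sample_html = """
<html><body>
  <a href="/python">Python Tutorial</a>
  <a href="/scraping">Web Scraping</a>
  <p>Not a link</p>
</body></html>
"""

for link_title in extract_link_texts(sample_html):
    print(link_title)
```

To run it against a real site, call scrape_link_texts() with the URL you want to scrape.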

More information...

Urllib Module

The urllib module in Python is a built-in library that provides functions for working with URLs.
It allows you to interact with web pages by fetching URLs (Uniform Resource Locators), opening and reading data from them, and performing other URL-related tasks like encoding and parsing. Urllib is a package that collects several modules for working with URLs, such as:

  • urllib.request for opening and reading URLs
  • urllib.parse for parsing URLs
  • urllib.error for the exceptions raised by urllib.request
  • urllib.robotparser for parsing robots.txt files

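urllib is part of Python’s standard library, so there is nothing to install. Note that urllib3 is a separate third-party package (installed with pip install urllib3) and is not a replacement for the built-in urllib module.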

Example

Here’s a simple example demonstrating how to use the urllib module to fetch the content of a web page:

  1. We define the URL of the web page we want to fetch.
  2. We use urllib.request.urlopen() function to open the URL and obtain a response object.
  3. We read the content of the response object using the read() method.
  4. Since the content is returned as bytes, we decode it to a string using the decode() method with ‘utf-8’ encoding.
  5. Finally, we print the HTML content of the web page.
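
A minimal sketch of these steps is shown below. It is a reconstruction rather than the original listing. So that it runs without network access, the demo uses a data: URL that embeds a tiny HTML document, which is an assumption made purely for this example. Replace demo_url with a real address such as https://example.com to fetch an actual page.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url):
    """Open the URL, read the raw bytes, and decode them as UTF-8 text."""
    with urlopen(url, timeout=10) as response:
        raw_bytes = response.read()
    return raw_bytes.decode("utf-8")


# Stand-in URL for offline use: a data: URL carrying a small HTML snippet.
demo_url = "data:text/html;charset=utf-8," + quote("<h1>Hello, scraper</h1>")

html_text = fetch_html(demo_url)
print(html_text)
```

Decoding with utf-8 follows the steps above. Pages served in another encoding would need that encoding instead.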

PyautoGUI

The pyautogui module in Python is a cross-platform GUI automation library that enables developers to control the mouse and keyboard to automate tasks.
While it’s not specifically designed for web scraping, it can be used in conjunction with other web scraping libraries like Selenium to interact with web pages that require user input or simulate human actions.

pip3 install pyautogui

Example

In this example, pyautogui is used to perform scrolling and take a screenshot of the search results page obtained by typing a query into the search input field and clicking the search button using Selenium.

Schedule

The schedule module in Python is a simple library that allows you to schedule Python functions to run at specified intervals.
It’s particularly useful in web scraping in Python when you need to regularly scrape data from a website at predefined intervals, such as hourly, daily, or weekly.

Example

  • We import the schedule and time modules.
  • We define a function func() that holds the task to run. Here it only prints a message; in a scraper, this function would send a GET request to the target website (for example ‘https://example.com’), parse the HTML content using BeautifulSoup, extract the desired data, and print it.
  • We schedule func() to run every minute using schedule.every(1).minutes.do(func). An hourly job would use schedule.every().hour.do(...) instead.
  • We enter a main loop that continuously checks for pending scheduled tasks using schedule.run_pending() and sleeps for 1 second between iterations to prevent the loop from consuming too much CPU.
import schedule 
import time 

def func(): 
    print("Geeksforgeeks") 

schedule.every(1).minutes.do(func) 

while True: 
    schedule.run_pending() 
    time.sleep(1) 

Why Python3 for Web Scraping?

Python’s popularity for web scraping stems from several factors:

Ease of Use : Python’s clean and readable syntax makes it easy to understand and write code, even for beginners. This simplicity accelerates the development process and reduces the learning curve for web scraping tasks.

Rich Ecosystem : Python boasts a vast ecosystem of libraries and frameworks tailored for web scraping. Libraries like Requests, BeautifulSoup, and Scrapy simplify fetching pages, parsing HTML, and extracting data.

Versatility : Python is a versatile language that can be used for a wide range of tasks beyond web scraping. Its flexibility allows developers to integrate web scraping seamlessly into larger projects, such as data analysis, machine learning, or web development.

Community Support : Python has a large and active community of developers who contribute to its libraries and provide support through forums, tutorials, and documentation. This wealth of resources ensures that developers have access to assistance and guidance when tackling web scraping challenges.

Conclusion

This tutorial has shown you the basics of using Python for web scraping. With the tools we’ve discussed, you can start collecting data from the internet quickly and easily. Whether you need this data for a project, research, or just for fun, Python makes it possible.
Remember to always scrape data responsibly and follow the rules set by websites. If you’re excited to learn more about Python and web scraping, check out our Python Course. It’s a great resource to deepen your understanding and enhance your skills, all while having fun exploring the power of Python.

Python Web Scraping – FAQs

1. What is Python web scraping?

Python web scraping refers to the process of extracting data from websites using Python programming. It involves fetching HTML content from a web page and parsing it to gather specific information.

2. Is web scraping legal?

Web scraping is legal as long as you comply with the website’s terms of service and avoid scraping personal or sensitive data. Always check the site’s robots.txt file to ensure you’re following the rules.
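
Checking robots.txt can be automated with the standard library’s urllib.robotparser. The sketch below feeds the parser a small set of made-up rules inline so that it runs offline. The rules and the example.com URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, supplied inline so the example runs offline.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/articles/python"))
print(is_allowed("https://example.com/private/data"))
```

For a live site, you would instead point the parser at the real file with parser.set_url("https://example.com/robots.txt") followed by parser.read() before calling can_fetch().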

3. What is the difference between BeautifulSoup and Scrapy?

BeautifulSoup is a simpler library for beginners focused on HTML parsing and extraction, whereas Scrapy is a more advanced web scraping framework that can handle complex tasks like crawling large datasets or handling pagination automatically.

4. What are some common use cases for Python web scraping?

Common use cases include extracting data for price comparison, content aggregation, job listings, real estate data, and sentiment analysis. Web scraping helps gather structured data from websites for various business and research purposes.



