Sunday, September 8, 2024

How to Create a Web Scraper with Python, Scrapy, and PostgreSQL

Web scraping has become an essential skill for anyone who needs to extract information from websites. In this article, we'll explore how to use Scrapy, a powerful open-source framework, to scrape data from a bookstore website and store it in a PostgreSQL database. We'll even scrape the book covers!

Prerequisites

Before we begin, ensure you have the following installed:

  • Python 3.x: The foundation for running Scrapy.

  • Scrapy: The open-source framework for web scraping.

  • PostgreSQL: The relational database management system for storing our data.

  • psycopg2: The PostgreSQL adapter for Python, allowing Python to interact with PostgreSQL.

Setting Up Scrapy

First, let's create a virtual environment to isolate our project dependencies:

  1. Create a Project Directory:

          mkdir books_to_scrape
          cd books_to_scrape

  2. Create a Virtual Environment:

          python3 -m venv my-spider
        

  3. Activate the Environment:

          source my-spider/bin/activate
        

  4. Install Scrapy and psycopg2:

          pip install scrapy psycopg2-binary

    Scrapy is the scraping framework itself; psycopg2-binary ships a pre-built wheel of the psycopg2 adapter we'll use later to talk to PostgreSQL.

With our virtual environment set up and Scrapy installed, we're ready to build our spider.

Creating the Spider

We'll create a new Scrapy project using the following command:

      scrapy startproject books
    

This generates a directory structure for our project:

      books/
          scrapy.cfg            # deploy configuration file
          books/                # project's Python module; you'll import your code from here
              __init__.py
              items.py          # project items definition file
              middlewares.py    # project middlewares file
              pipelines.py      # project pipelines file
              settings.py       # project settings file
              spiders/          # a directory where you'll later put your spiders
                  __init__.py

Defining the Items

Before writing our spider code, we need to define the data structure it will use. This is done in the items.py file.

  1. Purpose of Items:

    Items define the structure and fields of the data we want to scrape. Think of them as containers for the specific information you wish to extract from web pages. They help organize and store the data efficiently.

  2. Creating the Items File:

    Here's what the items.py file looks like:

      import scrapy


      class BooksItemInfo(scrapy.Item):
          title = scrapy.Field()
          category = scrapy.Field()
          price = scrapy.Field()
          # product_description = scrapy.Field()
          product_type = scrapy.Field()
          price_excl_tax = scrapy.Field()
          price_incl_tax = scrapy.Field()
          tax = scrapy.Field()
          availability = scrapy.Field()
          number_of_reviews = scrapy.Field()
          stars = scrapy.Field()
          image_urls = scrapy.Field()
          images = scrapy.Field()

    Explanation:

    • title: The title of the book.

    • category: The category or genre of the book.

    • price: The price of the book.

    • product_type: The type of product (e.g., book format).

    • price_excl_tax: Price excluding tax.

    • price_incl_tax: Price including tax.

    • tax: Tax amount.

    • availability: Availability status.

    • number_of_reviews: Number of reviews the book has received.

    • stars: Rating of the book, typically in star format.

    • image_urls: List of URLs for book cover images.

    • images: Details about downloaded images (used for image pipelines).
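
    Scrapy items behave much like Python dictionaries, which is what both the spider and the pipelines rely on later. Here's a quick illustrative sketch (the values are placeholders, not scraped data):

      from books.items import BooksItemInfo

      item = BooksItemInfo()
      item["title"] = "Some Book Title"   # placeholder value
      item["price"] = "£51.77"            # raw price string; the pipeline cleans it later
      print(item["title"])                # field access works like a dict lookup
      print(dict(item))                   # items convert cleanly to a plain dict
      # item["publisher"] = "..."         # would raise KeyError: only declared fields are allowed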

Writing the Spider

Now that we've defined our items, let's create a spider to scrape data from the website.

  1. Setting Up the Spider:

    We'll set up a spider within the spiders directory. This spider will be responsible for navigating the website and extracting the data according to our item definitions.

  2. Spider Code Breakdown:

    Here's the code for our spider:

      import scrapy
      from books.items import BooksItemInfo


      class BookSpider(scrapy.Spider):
          name = "book"

          def start_requests(self):
              URL = "https://books.toscrape.com"
              yield scrapy.Request(url=URL, callback=self.parse)

          def parse(self, response):
              books = response.css("article.product_pod")
              for book in books:
                  book_page = book.css("h3 a::attr(href)").get()
                  if book_page:
                      yield response.follow(book_page, callback=self.parse_book_page)

              next_page = response.css("li.next a::attr(href)").get()
              if next_page:
                  yield response.follow(next_page, callback=self.parse)

          def parse_book_page(self, response):
              book_info = BooksItemInfo()
              table_info = response.css("table tr")

              # Book Info
              book_info["title"] = response.css(".product_main h1::text").get()
              book_info["category"] = response.xpath(
                  "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
              ).get()
              book_info["price"] = response.css("p.price_color ::text").get()
              """
              book_info["product_description"] = response.xpath(
                  "//div[@id='product_description']/following-sibling::p/text()"
              ).get()
              """
              book_info["product_type"] = table_info[1].css("td ::text").get()
              book_info["price_excl_tax"] = table_info[2].css("td ::text").get()
              book_info["price_incl_tax"] = table_info[3].css("td ::text").get()
              book_info["tax"] = table_info[4].css("td ::text").get()
              book_info["availability"] = table_info[5].css("td ::text").get()
              book_info["number_of_reviews"] = table_info[6].css("td ::text").get()
              book_info["stars"] = response.css("p.star-rating").attrib["class"]

              # Books images
              book_info["image_urls"] = [
                  response.urljoin(response.css(".active img::attr(src)").get())
              ]

              yield book_info

    Explanation:

    • name: Identifies the spider. It must be unique within a project.

    • start_requests: Initiates requests to the starting URL. This method is called when the spider is opened.

    • parse: Handles the main page, extracts book links, and follows pagination. It processes the response from the starting URL and yields requests for book details and next pages.

    • parse_book_page: Extracts detailed information from individual book pages and yields it as items. It collects data based on the item fields we defined earlier and stores it in the BooksItemInfo item.
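
    As a side note, using start_requests with a single URL is equivalent to declaring start_urls; either form works here. A minimal, illustrative variant of the same spider skeleton:

      import scrapy


      class BookSpider(scrapy.Spider):
          name = "book"
          start_urls = ["https://books.toscrape.com"]

          # With start_urls, Scrapy generates the initial request for us and
          # sends each response to parse() by default, so start_requests is
          # not needed.
          def parse(self, response):
              ...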

Extracting Data

Now that our spider is set up, let's dive into how it extracts data from the bookstore website.

  1. Understanding the Extraction Process:

    Our spider navigates through the website, identifies book links on each page, follows these links to the individual book pages, and extracts relevant data.

  2. Detailed Walkthrough of Data Extraction:

    Main Page Parsing: Extracting book links and pagination.

      def parse(self, response):
          books = response.css("article.product_pod")
          for book in books:
              book_page = book.css("h3 a::attr(href)").get()
              if book_page:
                  yield response.follow(book_page, callback=self.parse_book_page)

          next_page = response.css("li.next a::attr(href)").get()
          if next_page:
              yield response.follow(next_page, callback=self.parse)

    Explanation:

    • books: Selects all books on the page using CSS selectors.

    • book_page: Extracts the link to each individual book page.

    • response.follow: Follows each book link and calls parse_book_page to extract data.

    • next_page: Identifies the link to the next page of books and follows it to continue scraping.
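
    Before wiring selectors into the spider, it helps to try them in Scrapy's interactive shell. A session might look like the sketch below; the outputs are only indicative of what the demo site returns and may differ:

      scrapy shell "https://books.toscrape.com"
      >>> response.css("article.product_pod h3 a::attr(href)").get()
      # e.g. 'catalogue/a-light-in-the-attic_1000/index.html' (a relative URL)
      >>> response.css("li.next a::attr(href)").get()
      # e.g. 'catalogue/page-2.html'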

    Book Page Parsing: Extracting book details

      def parse_book_page(self, response):
          book_info = BooksItemInfo()
          table_info = response.css("table tr")

          # Book Info
          book_info["title"] = response.css(".product_main h1::text").get()
          book_info["category"] = response.xpath(
              "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
          ).get()
          book_info["price"] = response.css("p.price_color ::text").get()
          """
          book_info["product_description"] = response.xpath(
              "//div[@id='product_description']/following-sibling::p/text()"
          ).get()
          """
          book_info["product_type"] = table_info[1].css("td ::text").get()
          book_info["price_excl_tax"] = table_info[2].css("td ::text").get()
          book_info["price_incl_tax"] = table_info[3].css("td ::text").get()
          book_info["tax"] = table_info[4].css("td ::text").get()
          book_info["availability"] = table_info[5].css("td ::text").get()
          book_info["number_of_reviews"] = table_info[6].css("td ::text").get()
          book_info["stars"] = response.css("p.star-rating").attrib["class"]

          # Books images
          book_info["image_urls"] = [
              response.urljoin(response.css(".active img::attr(src)").get())
          ]
          yield book_info

    Explanation:

    • title: Extracts the book title.

    • category: Extracts the book category using XPath.

    • price: Extracts the book price.

    • product_type, price_excl_tax, price_incl_tax, tax, availability, number_of_reviews: Extracts additional details from the book’s information table.

    • stars: Extracts the class attribute of the star-rating element (e.g. "star-rating Three"); the pipeline later converts this to a number.

    • image_urls: Extracts the URL of the book cover image.
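
    To make the field mapping concrete, a single yielded item looks roughly like the sketch below before the pipeline cleans it up. All values are placeholders that follow the demo site's general format, not real scraped data:

      {
          "title": "Some Book Title",
          "category": "Poetry",
          "price": "£51.77",
          "product_type": "Books",
          "price_excl_tax": "£51.77",
          "price_incl_tax": "£51.77",
          "tax": "£0.00",
          "availability": "In stock (22 available)",
          "number_of_reviews": "0",
          "stars": "star-rating Three",
          "image_urls": ["https://books.toscrape.com/media/cache/..."],
      }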

Extracting Images

To extract images, we need to enable the images pipeline in Scrapy settings and ensure our spider is set up to download images.

  1. Enabling Images Pipeline:

    In the settings.py file, enable the built-in ImagesPipeline and tell Scrapy where to store the downloaded files (note that the ImagesPipeline requires the Pillow library). This entry will be merged into the full ITEM_PIPELINES dictionary together with our own pipelines later on:

      ITEM_PIPELINES = {
          'scrapy.pipelines.images.ImagesPipeline': 1,
      }
      IMAGES_STORE = 'books/books/images'

  2. Image URL Extraction in Spider:

    We already extract the image URLs in our spider:

    book_info["image_urls"] = [
        response.urljoin(response.css(".active img::attr(src)").get())
    ]
        
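    With the ImagesPipeline enabled, Scrapy downloads every URL listed in image_urls into IMAGES_STORE, using a hash of the URL as the file name, and fills the images field with metadata about each download. The result on an item is roughly shaped like this sketch (values are placeholders):

      # After a successful download, book_info["images"] contains entries like:
      [
          {
              "url": "https://books.toscrape.com/media/cache/...",  # original image URL
              "path": "full/<hash-of-url>.jpg",                     # saved relative to IMAGES_STORE
              "checksum": "...",                                    # hash of the downloaded file
          }
      ]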

Scrapy Pipeline

A Scrapy pipeline is a component that processes the data extracted by the spider before it’s stored or further processed. Pipelines allow you to clean, validate, or transform the data, as well as store it in a database, export it to a file, or perform other tasks.

How Pipelines Work:

Once the spider yields an item, the item passes through each enabled pipeline in turn. Pipelines are defined in the pipelines.py file, and each pipeline class implements a process_item() method that is executed for every item.

In our project, we have two main pipelines:

  1. BooksPipeline: This cleans and processes the extracted data.

  2. SaveDataToPostgres: This pipeline connects to the PostgreSQL database and stores the processed data.
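
Before looking at the project's pipelines, the skeleton below is a minimal, illustrative pipeline (not part of the project code) showing the hooks that Scrapy calls on a pipeline component:

      from itemadapter import ItemAdapter
      from scrapy.exceptions import DropItem


      class ExamplePipeline:
          def open_spider(self, spider):
              # Called once when the spider starts: open files, connections, etc.
              pass

          def close_spider(self, spider):
              # Called once when the spider finishes: release resources here.
              pass

          def process_item(self, item, spider):
              # Called for every item the spider yields; must return the item
              # (or raise DropItem to discard it).
              adapter = ItemAdapter(item)
              if not adapter.get("title"):
                  raise DropItem("Missing title")
              return item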

Storing Data in PostgreSQL

Processing the Data

After the Scrapy spider has extracted the data from the website, it passes this data to the pipelines for further processing. In the pipelines.py file, we have two main classes:

  1. BooksPipeline: This class is responsible for cleaning and transforming the data before it is saved. Here’s what it does:

    • Price Formatting: It removes the currency symbol from the price fields and converts them to floating-point numbers.

    • Availability Extraction: It extracts numerical values from availability strings.

    • Star Rating Conversion: It converts star ratings from textual descriptions to numerical values.

      import re

      from itemadapter import ItemAdapter


      class BooksPipeline:
          def process_item(self, item, spider):
              adapter = ItemAdapter(item)

              # Remove £ from all price columns and convert them to floats
              price_keys = ["price", "price_excl_tax", "price_incl_tax", "tax"]
              for price in price_keys:
                  value = adapter.get(price)
                  value = value.replace("£", "")
                  adapter[price] = float(value)

              # Availability: keep only the number, e.g. "In stock (22 available)" -> "22"
              avail = adapter.get("availability")
              value = re.findall(r"\d+", avail)
              adapter["availability"] = value[0]

              # Stars: map the class name ("star-rating Three") to a number
              stars = adapter.get("stars")
              rate = stars.split()[-1]
              rates = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
              adapter["stars"] = rates.get(rate, None)

              return item

Saving Data to PostgreSQL

Once the data is processed, it needs to be saved to a database. In our project, this is done using the SaveDataToPostgres class, which handles the connection to PostgreSQL and the insertion of data. Here’s how it works:

  1. Database Connection:

    The class initializes a connection to the PostgreSQL database using psycopg2.
    It creates a table named books if it doesn’t already exist.

      import psycopg2


      class SaveDataToPostgres:
          def __init__(self) -> None:
              try:
                  self.conn = psycopg2.connect(
                      database="web-scraping",
                      user="user",
                      host="localhost",
                      password="password",
                      port=5432,
                  )
                  self.cur = self.conn.cursor()

                  # Create the books table if it does not exist yet
                  self.cur.execute(
                      """
                      CREATE TABLE IF NOT EXISTS books(
                          id SERIAL PRIMARY KEY,
                          title TEXT,
                          category VARCHAR(255),
                          price FLOAT,
                          product_type VARCHAR(255),
                          price_excl_tax DECIMAL,
                          price_incl_tax DECIMAL,
                          tax DECIMAL,
                          availability INT,
                          number_of_reviews INT,
                          stars INT
                      )
                      """
                  )
              except Exception as e:
                  print(f"Error initializing database: {e}")
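
    One assumption worth calling out: psycopg2 connects to an existing database, so the web-scraping database (and the user and password shown above) must be created before the spider runs. A minimal sketch of creating it from Python, assuming the default postgres maintenance database is reachable with the same credentials:

      import psycopg2

      # Connect to the default maintenance database to create the target database.
      conn = psycopg2.connect(
          database="postgres", user="user", host="localhost", password="password", port=5432
      )
      conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
      with conn.cursor() as cur:
          cur.execute('CREATE DATABASE "web-scraping"')
      conn.close()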

  2. Inserting Data:

    The process_item method inserts the cleaned data into the books table, using parameterized %s placeholders so that psycopg2 handles quoting and type conversion safely.

      def process_item(self, item, spider):
          try:
              self.cur.execute(
                  """
                  INSERT INTO books(
                      title,
                      category,
                      price,
                      product_type,
                      price_excl_tax,
                      price_incl_tax,
                      tax,
                      availability,
                      number_of_reviews,
                      stars
                  )
                  VALUES (
                      %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
                  )
                  """,
                  (
                      item["title"],
                      item["category"],
                      item["price"],
                      item["product_type"],
                      item["price_excl_tax"],
                      item["price_incl_tax"],
                      item["tax"],
                      item["availability"],
                      item["number_of_reviews"],
                      item["stars"],
                  ),
              )

              self.conn.commit()
          except Exception as e:
              print(f"Error processing item: {e}")

          return item

  3. Closing the Connection:

    The close_spider method is used to close the database connection once the spider finishes its execution. This ensures that all resources are properly released and the connection is terminated.

      def close_spider(self, spider):
          try:
              # close cursor and connection to the database
              self.cur.close()
              self.conn.close()
          except Exception as e:
              print(f"Error closing database connection: {e}")

Enabling the Pipeline in settings.py

To ensure that our pipeline runs after scraping, we need to enable it in the settings.py file. In Scrapy, the settings.py file is where you configure various aspects of your project, including how data pipelines are handled.

      # settings.py

      # Define the pipelines used in your project (lower numbers run first)
      ITEM_PIPELINES = {
          'scrapy.pipelines.images.ImagesPipeline': 1,   # Image downloading
          'books.pipelines.BooksPipeline': 2,            # Data cleaning and transformation
          'books.pipelines.SaveDataToPostgres': 3,       # Data storage
      }

In this configuration:

  • scrapy.pipelines.images.ImagesPipeline has the lowest value (1), so it runs first and downloads the book covers.

  • books.pipelines.BooksPipeline is assigned a priority of 2, so it runs before the SaveDataToPostgres pipeline.

  • books.pipelines.SaveDataToPostgres is assigned a priority of 3, so it runs last and stores the cleaned data. Priority values are integers from 0 to 1000, and items flow through the pipelines in ascending order.

Running the Spider

Finally, we run the spider to scrape data and store it in the PostgreSQL database.

  1. Running the Spider:

    Run the spider using the Scrapy command:

          scrapy crawl book
        
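
Once the crawl finishes, a quick way to confirm that the data landed in PostgreSQL is to query the books table directly. A minimal check, assuming the same connection settings as the pipeline:

      import psycopg2

      conn = psycopg2.connect(
          database="web-scraping", user="user", host="localhost", password="password", port=5432
      )
      with conn.cursor() as cur:
          cur.execute("SELECT COUNT(*) FROM books")
          print("books stored:", cur.fetchone()[0])  # books.toscrape.com lists 1000 books in total
          cur.execute("SELECT title, price, stars FROM books ORDER BY stars DESC LIMIT 5")
          for row in cur.fetchall():
              print(row)
      conn.close()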

Conclusion

In this article, we covered setting up Scrapy, defining items, writing a spider, and processing data with a pipeline. We also explored storing data in PostgreSQL and running the spider to automate data extraction. These steps provide a comprehensive guide to efficiently scrape and manage web data.
