Web scraping has become an essential skill for anyone who needs to extract information from websites. In this article, we'll explore how to use Scrapy, a powerful and open-source framework, to scrape data from a bookstore website and store it in a PostgreSQL database. We'll even include scraping the covers of the books!
Prerequisites
Python 3.x: The foundation for running Scrapy.
Scrapy: The open-source framework for web scraping.
PostgreSQL: The relational database management system for storing our data.
psycopg2: The PostgreSQL adapter for Python, allowing Python to interact with PostgreSQL.
Setting Up Scrapy
Create a Project Directory:
mkdir books_to_scrape
cd books_to_scrape
Create a Virtual Environment: python3 -m venv my-spider
Activate the Environment: source my-spider/bin/activate
Install Scrapy: pip install scrapy
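The prerequisites also list psycopg2 for the PostgreSQL pipeline later on. A minimal install, assuming the pre-built psycopg2-binary wheel works on your platform (building the source psycopg2 package is an alternative if the PostgreSQL development headers are installed):

pip install psycopg2-binary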
Creating the Spider
From inside the books_to_scrape directory, generate the Scrapy project skeleton:

scrapy startproject books

This creates the following structure:
books/
scrapy.cfg # deploy configuration file
books/ # project's Python module, you'll import your code from here
__init__.py
items.py # project items definition file
middlewares.py # project middlewares file
pipelines.py # project pipelines file
settings.py # project settings file
spiders/ # a directory where you'll later put your spiders
__init__.py
Defining the Items
Purpose of Items: Items define the structure and fields of the data we want to scrape. Think of them as containers for the specific information you wish to extract from web pages. They help organize and store the data efficiently.

Creating the Items File: Here's what the items.py file looks like:

import scrapy


class BooksItemInfo(scrapy.Item):
    title = scrapy.Field()
    category = scrapy.Field()
    price = scrapy.Field()
    # product_description = scrapy.Field()
    product_type = scrapy.Field()
    price_excl_tax = scrapy.Field()
    price_incl_tax = scrapy.Field()
    tax = scrapy.Field()
    availability = scrapy.Field()
    number_of_reviews = scrapy.Field()
    stars = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
Explanation:
title: The title of the book.
category: The category or genre of the book.
price: The price of the book.
product_type: The type of product (e.g., book format).
price_excl_tax: Price excluding tax.
price_incl_tax: Price including tax.
tax: Tax amount.
availability: Availability status.
number_of_reviews: Number of reviews the book has received.
stars: Rating of the book, typically in star format.
image_urls: List of URLs for book cover images.
images: Details about downloaded images (used for image pipelines).
Writing the Spider
Setting Up the Spider: We'll set up a spider within the spiders directory. This spider will be responsible for navigating the website and extracting the data according to our item definitions.

Spider Code Breakdown: Here's the code for our spider:

import scrapy

from books.items import BooksItemInfo


class BookSpider(scrapy.Spider):
    name = "book"

    def start_requests(self):
        URL = "https://books.toscrape.com"
        yield scrapy.Request(url=URL, callback=self.parse)

    def parse(self, response):
        books = response.css("article.product_pod")
        for book in books:
            book_page = book.css("h3 a::attr(href)").get()
            if book_page:
                yield response.follow(book_page, callback=self.parse_book_page)

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book_page(self, response):
        book_info = BooksItemInfo()
        table_info = response.css("table tr")

        # Book info
        book_info["title"] = response.css(".product_main h1::text").get()
        book_info["category"] = response.xpath(
            "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
        ).get()
        book_info["price"] = response.css("p.price_color ::text").get()
        """
        book_info["product_description"] = response.xpath(
            "//div[@id='product_description']/following-sibling::p/text()"
        ).get()
        """
        book_info["product_type"] = table_info[1].css("td ::text").get()
        book_info["price_excl_tax"] = table_info[2].css("td ::text").get()
        book_info["price_incl_tax"] = table_info[3].css("td ::text").get()
        book_info["tax"] = table_info[4].css("td ::text").get()
        book_info["availability"] = table_info[5].css("td ::text").get()
        book_info["number_of_reviews"] = table_info[6].css("td ::text").get()
        book_info["stars"] = response.css("p.star-rating").attrib["class"]

        # Book images
        book_info["image_urls"] = [
            response.urljoin(response.css(".active img::attr(src)").get())
        ]

        yield book_info
Explanation:
name: Identifies the spider. It must be unique within a project.
start_requests: Initiates requests to the starting URL. This method is called when the spider is opened.
parse: Handles the main page, extracts book links, and follows pagination. It processes the response from the starting URL and yields requests for book details and next pages.
parse_book_page: Extracts detailed information from individual book pages and yields it as items. It collects data based on the item fields we defined earlier and stores it in the BooksItemInfo item.
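As a side note, because our spider has a single, fixed starting URL, Scrapy would also let us drop start_requests entirely and rely on the built-in start_urls attribute; responses to those initial requests are passed to parse by default. A minimal sketch of that variant (not the version used in this article):

import scrapy

from books.items import BooksItemInfo


class BookSpider(scrapy.Spider):
    name = "book"
    # Scrapy's default start_requests() turns each URL here into a request
    # whose response is handed to parse()
    start_urls = ["https://books.toscrape.com"]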
Extracting Data
Understanding the Extraction Process: Our spider navigates through the website, identifies book links on each page, follows these links to the individual book pages, and extracts relevant data.

Detailed Walkthrough of Data Extraction:

Main Page Parsing: Extracting book links and pagination.

def parse(self, response):
    books = response.css("article.product_pod")
    for book in books:
        book_page = book.css("h3 a::attr(href)").get()
        if book_page:
            yield response.follow(book_page, callback=self.parse_book_page)

    next_page = response.css("li.next a::attr(href)").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
Explanation:
books: Selects all books on the page using CSS selectors.
book_page: Extracts the link to each individual book page.
response.follow: Follows each book link and calls parse_book_page to extract data.
next_page: Identifies the link to the next page of books and follows it to continue scraping.
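Before wiring selectors into the spider, it's worth testing them interactively with Scrapy's shell. A quick session against the start page might look like this (the output shown is illustrative):

scrapy shell "https://books.toscrape.com"
>>> len(response.css("article.product_pod"))       # books listed on the page
20
>>> response.css("li.next a::attr(href)").get()    # pagination link
'catalogue/page-2.html'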
Book Page Parsing: Extracting book details.

def parse_book_page(self, response):
    book_info = BooksItemInfo()
    table_info = response.css("table tr")

    # Book info
    book_info["title"] = response.css(".product_main h1::text").get()
    book_info["category"] = response.xpath(
        "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
    ).get()
    book_info["price"] = response.css("p.price_color ::text").get()
    """
    book_info["product_description"] = response.xpath(
        "//div[@id='product_description']/following-sibling::p/text()"
    ).get()
    """
    book_info["product_type"] = table_info[1].css("td ::text").get()
    book_info["price_excl_tax"] = table_info[2].css("td ::text").get()
    book_info["price_incl_tax"] = table_info[3].css("td ::text").get()
    book_info["tax"] = table_info[4].css("td ::text").get()
    book_info["availability"] = table_info[5].css("td ::text").get()
    book_info["number_of_reviews"] = table_info[6].css("td ::text").get()
    book_info["stars"] = response.css("p.star-rating").attrib["class"]

    # Book images
    book_info["image_urls"] = [
        response.urljoin(response.css(".active img::attr(src)").get())
    ]

    yield book_info
Explanation:
title: Extracts the book title.
category: Extracts the book category using XPath.
price: Extracts the book price.
product_type, price_excl_tax, price_incl_tax, tax, availability, number_of_reviews: Extract additional details from the book's information table.
stars: Extracts the book's star rating.
image_urls: Extracts the URL of the book cover image.
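Indexing table_info by position relies on the product information table always listing its rows in the same order (the first row, the UPC, is skipped). If you prefer not to depend on row order, a hedged alternative sketch, assuming each row contains a th header and a td value, is to key the table by header text inside parse_book_page:

# build a {header: value} mapping from the product information table
table = {
    row.css("th::text").get(): row.css("td::text").get()
    for row in response.css("table tr")
}
# e.g. table.get("Price (excl. tax)"), table.get("Availability")
# (the header strings above are assumptions about the site's markup)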
Extracting Images
Enabling Images Pipeline: In the settings.py file, add the following lines (note that Scrapy's images pipeline requires the Pillow library, installable with pip install Pillow):

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = 'books/books/images'
Image URL Extraction in Spider: We already extract the image URLs in our spider:

book_info["image_urls"] = [
    response.urljoin(response.css(".active img::attr(src)").get())
]
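With the pipeline enabled, Scrapy downloads every URL listed in image_urls and records the results in the images field of the item. The value below is illustrative (not captured from a real run), but follows the shape Scrapy's ImagesPipeline uses for downloaded files:

book_info["images"]
# [{'url': 'https://books.toscrape.com/media/cache/....jpg',
#   'path': 'full/<sha1 of the url>.jpg',   # relative to IMAGES_STORE
#   'checksum': '<md5 of the image file>'}]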
Scrapy Pipeline
We define two custom pipelines in pipelines.py:

BooksPipeline: Cleans and processes the extracted data.
SaveDataToPostgres: Connects to the PostgreSQL database and stores the processed data.
Storing Data in PostgreSQL
Processing the Data
BooksPipeline: This class is responsible for cleaning and transforming the data before it is saved. Here's what it does:

Price Formatting: Removes the currency symbol from the price fields and converts them to floating-point numbers.
Availability Extraction: Extracts the numerical value from the availability string.
Star Rating Conversion: Converts star ratings from textual descriptions ("One" through "Five") to numerical values.
import re

from itemadapter import ItemAdapter


class BooksPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Remove £ from all price fields and convert them to floats
        price_keys = ["price", "price_excl_tax", "price_incl_tax", "tax"]
        for price in price_keys:
            value = adapter.get(price)
            value = value.replace("£", "")
            adapter[price] = float(value)

        # Keep only the number of copies in stock from the availability string
        avail = adapter.get("availability")
        value = re.findall(r"\d+", avail)
        adapter["availability"] = value[0]

        # Convert the star-rating class (e.g. "star-rating Three") to an integer
        stars = adapter.get("stars")
        rate = stars.split()[-1]
        rates = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        adapter["stars"] = rates.get(rate, None)

        return item
Saving Data to PostgreSQL
Database Connection: The class initializes a connection to the PostgreSQL database using psycopg2 and creates a table named books if it doesn't already exist.

import psycopg2


class SaveDataToPostgres:
    def __init__(self) -> None:
        try:
            self.conn = psycopg2.connect(
                database="web-scraping",
                user="user",
                host="localhost",
                password="password",
                port=5432,
            )
            self.cur = self.conn.cursor()

            # create the books table if it does not exist
            self.cur.execute(
                """
                CREATE TABLE IF NOT EXISTS books(
                    id SERIAL PRIMARY KEY,
                    title TEXT,
                    category VARCHAR(255),
                    price FLOAT,
                    product_type VARCHAR(255),
                    price_excl_tax DECIMAL,
                    price_incl_tax DECIMAL,
                    tax DECIMAL,
                    availability INT,
                    number_of_reviews INT,
                    stars INT
                )
                """
            )
        except Exception as e:
            print(f"Error initializing database: {e}")
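Note that psycopg2 can only connect to a database that already exists: the code above creates the books table, but not the web-scraping database itself. Assuming a local PostgreSQL server and sufficient privileges, the database can be created once up front:

createdb web-scraping
# or, from psql:
# CREATE DATABASE "web-scraping";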
Inserting Data: The process_item method inserts the cleaned data into the books table.

    def process_item(self, item, spider):
        try:
            self.cur.execute(
                """
                INSERT INTO books(
                    title, category, price, product_type, price_excl_tax,
                    price_incl_tax, tax, availability, number_of_reviews, stars
                ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                """,
                (
                    item["title"],
                    item["category"],
                    item["price"],
                    item["product_type"],
                    item["price_excl_tax"],
                    item["price_incl_tax"],
                    item["tax"],
                    item["availability"],
                    item["number_of_reviews"],
                    item["stars"],
                ),
            )
            self.conn.commit()
        except Exception as e:
            print(f"Error processing item: {e}")

        return item
Closing the Connection: The close_spider method closes the database connection once the spider finishes its execution, ensuring that all resources are properly released.

    def close_spider(self, spider):
        try:
            # close the cursor and the connection to the database
            self.cur.close()
            self.conn.close()
        except Exception as e:
            print(f"Error closing database connection: {e}")
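A small design note: instead of opening the connection in __init__, a Scrapy pipeline can implement open_spider(self, spider), which Scrapy calls once when the spider starts; together with close_spider this keeps the connect/close lifecycle symmetrical. A minimal sketch of that variant (the table-creation code would move there as well):

class SaveDataToPostgres:
    def open_spider(self, spider):
        # called once when the spider starts
        self.conn = psycopg2.connect(
            database="web-scraping", user="user",
            host="localhost", password="password", port=5432,
        )
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        # called once when the spider finishes
        self.cur.close()
        self.conn.close()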
Enabling the Pipeline in settings.py
# settings.py
# Define the pipelines used in your project
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,  # Image downloading
    'books.pipelines.BooksPipeline': 2,           # Data cleaning and transformation
    'books.pipelines.SaveDataToPostgres': 3,      # Data storage
}
Since settings.py should contain only one ITEM_PIPELINES dictionary, the images pipeline from the previous section is listed here as well. Lower numbers run first: the ImagesPipeline (1) runs first, books.pipelines.BooksPipeline (2) cleans the data next, and books.pipelines.SaveDataToPostgres (3) stores it last.
Running the Spider
Run the spider from the project directory using the Scrapy command:

scrapy crawl book
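Once the crawl finishes, a quick sanity check is to query the books table directly; for example, with psql (using the same database and column names defined in the pipeline):

psql -d web-scraping -c "SELECT title, price, stars FROM books LIMIT 5;"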