Extracting Essential Data from Amazon Listings with Puppeteer
Introduction to Data Extraction
Welcome to the second part of my tutorial series focused on web scraping Amazon listings using NodeJS and Puppeteer. If you haven't already read the first part, I recommend checking it out here:
Web Scraping With NodeJS and Puppeteer: Part 1
A tutorial series introducing the fundamentals of web scraping with NodeJS and Puppeteer.
The aim of this tutorial is to show you how to extract essential information from a list of search results on Amazon. We'll specifically focus on retrieving the title, price, image, and page URL for various laptop listings.
Overview of the Script
In this part, we will continue building on the index.js file from Part 1, implementing the core of our web scraper. The script will traverse all search results on a single Amazon search results page.
For each search result, we will gather the title, price, image URL, and page URL. These details will be encapsulated in an object, which will then be collected in a list. Ultimately, we will save this list into a CSV file that compiles all the search results.
Let's take a look at the final output below, and then we will walk through the code step-by-step to explore its functionality. Feel free to copy the code into your index.js file and follow along with the explanations.
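Below is a condensed sketch of the finished script, assembled from the steps walked through in the rest of this article. Treat it as a sketch rather than a definitive implementation: the search URL, the laptops.csv filename, and the bare-bones CSV export using Node's built-in fs module are assumptions here, so adapt them to your own setup.

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false, // Make the browser visible
    defaultViewport: false, // Adjust to user's screen resolution
    userDataDir: './tmp' // Cache previous actions
  });

  const page = await browser.newPage();

  // Example search URL (assumption): any Amazon search results page will work
  await page.goto('https://www.amazon.com/s?k=gaming+laptop');

  // Collect every search result element on the page
  const laptops = await page.$$('.s-result-item');

  let transformedLaptops = [];

  for (const laptop of laptops) {
    let title = null;
    let priceDollar = null;
    let priceCent = null;
    let image = null;
    let url = null;

    try {
      title = await page.evaluate(el => el.querySelector('h2 > a > span').textContent, laptop);
    } catch (err) {
      console.log('no title found for this element');
    }

    try {
      priceDollar = await page.evaluate(el => el.querySelector('span.a-price-whole').textContent, laptop);
      priceCent = await page.evaluate(el => el.querySelector('span.a-price-fraction').textContent, laptop);
    } catch (err) {
      console.log('no price found for this element');
    }

    try {
      image = await page.evaluate(el => el.querySelector('.s-image').getAttribute('src'), laptop);
    } catch (err) {
      console.log('no image found for element');
    }

    try {
      url = await page.evaluate(el => el.querySelector('.a-link-normal.s-underline-text.s-underline-link-text.s-link-style.a-text-normal').href, laptop);
    } catch (err) {
      console.log('No URL found for this element');
    }

    if (title !== null) {
      transformedLaptops.push({
        title: title,
        price: `${priceDollar}${priceCent}`,
        imageUrl: image,
        pageUrl: url
      });
    }
  }

  // Assumption: a minimal CSV export using Node's built-in fs module
  const header = 'title,price,imageUrl,pageUrl';
  const rows = transformedLaptops.map(l =>
    [l.title, l.price, l.imageUrl, l.pageUrl]
      .map(v => `"${String(v).replace(/"/g, '""')}"`)
      .join(',')
  );
  fs.writeFileSync('laptops.csv', [header, ...rows].join('\n'));

  await browser.close();
})();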
Stepping Through the Initial Script
We'll begin by examining the very first line of our script:
const puppeteer = require('puppeteer');
This line imports the Puppeteer library, along with all of its functions, for use in our project. As Puppeteer is a third-party library, this require statement is necessary to include it and its dependencies.
Next, we encapsulate all of our script’s contents within an asynchronous function as follows:
(async () => {
})();
This structure means that we are embedding our code within a self-contained async function that executes immediately when the script runs. While this is sufficient for this example, in more complex projects you may prefer to define a named function and call it where needed, as in the sketch below.
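A minimal sketch of that alternative, assuming an arbitrary function name:

// Equivalent structure using a named async function instead of an
// immediately invoked one (the name "scrapeAmazon" is just an example)
const scrapeAmazon = async () => {
  // scraping logic goes here
};

scrapeAmazon();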
Exploring Puppeteer’s launch Function
Next, we initialize a browser instance with the following code:
const browser = await puppeteer.launch({
  headless: false, // Make the browser visible
  defaultViewport: false, // Adjust to user's screen resolution
  userDataDir: './tmp' // Cache previous actions
});
The browser object we create here represents the actual browser window that will appear when our script runs. Our code will interact with this browser to fetch the necessary elements.
The launch function from Puppeteer accepts an object parameter with several properties that we configure:
- headless: A boolean value that determines whether the web browser is visible during execution. In our case, we set it to false so we can observe the actions taking place.
- defaultViewport: By setting this to false, we override Chromium’s small default viewport size, allowing us to use the full resolution of the user's screen.
- userDataDir: This property defines a directory where the browser profile (cookies, local storage, cache) is saved between runs, which is particularly useful for remembering that a captcha has already been solved so it doesn't reappear on every run.
Exploring Puppeteer’s newPage and goto Functions
After initializing the browser, we create a page object by calling the newPage() function:
const page = await browser.newPage();
With our page object ready, we navigate to the Amazon search results page for “gaming laptop” using the goto() function.
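A minimal sketch of that navigation step, assuming a standard Amazon search URL (swap in whichever search results page you want to scrape):

// The exact URL is an example; any Amazon search results page will work
await page.goto('https://www.amazon.com/s?k=gaming+laptop');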
Now that we're on the correct page, we want to collect all of the search results.
To retrieve multiple elements from a page, we can use the $$ function on our page object, passing a selector string as an argument:
const laptops = await page.$$('.s-result-item');
The string .s-result-item represents the shared CSS class among each laptop search result entry. The laptops array will store all elements matching this selector.
Identifying Selectors for Web Elements
To determine the selectors for our laptop search results, right-click on the desired web element and select "Inspect".
Once the element is highlighted in the HTML source, hover over the HTML until you see a blue selection box covering the entire web element. Look for a common CSS class shared by all search results.
In our case, we found the class .s-result-item, which we will use to extract all search results.
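One quick way to sanity-check a selector before wiring it into the script is to run it directly in the browser's DevTools console on the search results page. The class name below is simply the one identified above:

// Run in the DevTools console on the results page
// to confirm roughly how many elements the selector matches
document.querySelectorAll('.s-result-item').length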
Iterating Through the Extracted Data
Now that we know how to grab the desired elements, we can iterate through the results and clean the data to capture only the key points. We'll initialize an array for our transformed laptops:
let transformedLaptops = [];
This array will hold objects representing each laptop's key data, which we will use to create our final CSV report.
Next, we loop through the laptops array:
for (const laptop of laptops) {
}
Within this loop, we initialize our key variables to null. These will store the data we want to extract for each laptop:
let title = null;
let priceDollar = null;
let priceCent = null;
let image = null;
let url = null;
Extracting the Laptop Title
Next, we will extract the title for each search result using the following code:
try {
  title = await page.evaluate(el => el.querySelector('h2 > a > span').textContent, laptop);
} catch (err) {
  console.log('no title found for this element');
  console.log(err);
}
Here, page.evaluate() runs the arrow function in the browser context. We pass the laptop element handle as the second argument, so it becomes el inside the function; querySelector('h2 > a > span') then looks up the title element within that single result, and textContent retrieves its inner text. If no title element exists, the call throws and the catch block logs the error.
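As a side note, Puppeteer's element handles also expose an $eval() helper that bundles the querySelector call and the evaluation into a single step. A minimal sketch of the equivalent title lookup (an alternative, not what the script above uses):

// Equivalent title lookup using ElementHandle.$eval()
title = await laptop.$eval('h2 > a > span', el => el.textContent);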
To find this selector, right-click the laptop title on the page and select "Inspect", then right-click the highlighted element in the Elements panel and choose Copy > Copy selector.
Extracting Key Price Fields
Next, we'll follow a similar approach to extract the price of the laptop, utilizing two variables for the dollar and cent amounts:
try {
  priceDollar = await page.evaluate(el => el.querySelector('span.a-price-whole').textContent, laptop);
  priceCent = await page.evaluate(el => el.querySelector('span.a-price-fraction').textContent, laptop);
} catch (err) {
  console.log('no price found for this element');
}
Again, we use page.evaluate() and querySelector() to grab the price elements.
Extracting the Image URL
Capturing the image URL is almost identical to retrieving the title and prices, but we use getAttribute('src') to access the source URL:
try {
  image = await page.evaluate(el => el.querySelector('.s-image').getAttribute('src'), laptop);
} catch (err) {
  console.log('no image found for element');
}
Extracting the Page URL
Finally, we extract the page URL for each laptop, which will allow us to navigate to the details page later:
try {
  url = await page.evaluate(el => el.querySelector('.a-link-normal.s-underline-text.s-underline-link-text.s-link-style.a-text-normal').href, laptop);
} catch (err) {
  console.log('No URL found for this element');
}
Creating the Array of Transformed Data
With all the key information extracted, we will populate our transformedLaptops array with objects containing the cleaned data:
if (title !== null) {
  transformedLaptops.push({
    title: title,
    price: `${priceDollar}${priceCent}`,
    imageUrl: image,
    pageUrl: url
  });
}
We only push entries that have a title, which filters out the empty placeholder elements that match .s-result-item but contain no product data.
Conclusion
Thank you for following this tutorial! Your feedback is always welcome. In the next part of this series, we’ll enhance the script to handle pagination, allowing us to scrape entries across multiple pages. You can find Part 3 here:
Web Scraping With NodeJS and Puppeteer: Part 3
Part 3 of my web scraping series using NodeJS and Puppeteer, focusing on pagination handling.
For further information, please refer to the official Puppeteer documentation and GitHub repository.
For video walkthroughs of similar techniques, see "Ultimate Guide To Web Scraping - Node.js & Python (Puppeteer & Beautiful Soup)" and "Web Scraping with Puppeteer & Node.js: Chrome Automation," both of which cover automating web scraping with Puppeteer and Node.js.