Table of contents

Quick Summary:

This blog demonstrates how Creole Studios utilizes Node.js and the Playwright library to create a web scraper for extracting data from websites. The tutorial employs the NestJS framework and provides a step-by-step guide, including project setup, resource creation, and browser automation for scraping dynamic content. Key features include URL crawling, text extraction, data cleaning, and recursive functionality to gather data from all relevant pages. The blog concludes with instructions to make a POST request to trigger the scraper and retrieve data in JSON format, showcasing how to build a robust web scraper effectively.

Introduction:

At Creole Studios, we’re always excited to share insights and practical demonstrations that can help developers enhance their skills. As a leading Node.js development company, we take pride in building efficient and scalable solutions. In this blog, we’ll walk you through creating a web scraper using Node.js, a task that allows you to extract data from websites efficiently.

A web scraper is essentially a program designed to extract data from websites. To follow along with this demo, a basic understanding of Node.js and JavaScript is recommended. For this demonstration, we’ll be using Playwright, a powerful browser automation library. Experienced developers can skip the foundational steps and jump directly to step 8 for the core logic. Let’s dive in!

Step-by-Step Guide:

1. Install the NestJS CLI Tool

First, install the NestJS CLI tool globally with the following command:

npm i -g @nestjs/cli

2. Create a New Project

Create a new project with your desired name using the command below:

nest new nodejs-web-scraper

The CLI will scaffold the project files and print a success message once the project has been created.

3. Navigate to the Project Folder

Move to the project directory and run the project using these commands:

cd nodejs-web-scraper  
npm run start:dev

You’ll see the project running in the terminal window.

4. Open the Project in a Code Editor

Open the project in your preferred code editor (we recommend VS Code). A freshly scaffolded NestJS project will look roughly like this:
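src/
  app.controller.spec.ts
  app.controller.ts
  app.module.ts
  app.service.ts
  main.ts
test/
nest-cli.json
package.json
tsconfig.json

(The exact files may differ slightly depending on your NestJS CLI version.)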

5. Add a Resource Named Scraper

Use the following command to generate a new resource named scraper:

nest g resource scraper

After you run the command, the CLI asks a couple of questions (transport layer, CRUD entry points) and then generates the scraper module, controller, and service files under src/scraper.

6. Install the Playwright Library

The core idea behind scraping a site is to fetch the HTML of the page you want to scrape and then read the required data out of that HTML. Sometimes the HTML is rendered dynamically (for example, content behind pagination or tabs), so we need to perform those interactions by clicking the relevant elements. In other words, we want to open a browser and automate it. Several browser automation libraries exist, such as Selenium, Puppeteer, and Playwright; we’ll use Playwright here (a minimal standalone sketch follows after step 7).

To automate browser interactions, install the Playwright library:

npm i playwright

7. Install Browsers for Playwright

Next, install the browsers that Playwright will use with the command:

npx playwright install
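Before wiring anything into NestJS, you can sanity-check the setup with a minimal standalone script. This is just an illustrative sketch (the file name try-playwright.ts is hypothetical, and you can run it with ts-node); the full NestJS implementation follows in the next step.

// try-playwright.ts: minimal sketch, open a page and print its text
import { chromium } from 'playwright';

async function main() {
  // Launch a visible browser window and open a new page
  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();
  // Visit a page and wait for network activity to settle
  await page.goto('https://www.creolestudios.com', { waitUntil: 'networkidle' });
  // Grab all visible text from the body and print the first 200 characters
  const text = await page.innerText('body');
  console.log(text.slice(0, 200));
  await browser.close();
}

main().catch(console.error);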

8. Implement Core Scraping Logic

Now, let’s create a function in scraper.service.ts that launches a browser, visits the provided URL, and keeps crawling until it has extracted the text content of every page whose URL starts with the given base URL. Here’s the code snippet:

//scraper.service.ts
import { Injectable } from '@nestjs/common';
import { Page, chromium } from 'playwright';
@Injectable()
export class ScraperService {
 //main scraping function which will be called from the controller.
 async scrape(baseUrl: string) {
   // Launch a browser and create a new page
   // headless is set to false to see the browser in action
   const browser = await chromium.launch({ headless: false });
   // Create a new browser context and a new page
   const context = await browser.newContext();
   const page = await context.newPage();
   //maintain a set of visited urls and a queue of pending urls
   const visitedUrls = new Set<string>();
   const pendingUrls = [baseUrl];
   //maintain an array of scraped data
   const scrapedData: { url: string; cleanedText: string }[] = [];
   //loop through the pending urls and scrape the data
   while (pendingUrls.length > 0) {
     const url = pendingUrls.shift();
     if (!url || visitedUrls.has(url)) continue;
     console.log('crawling', url);
     visitedUrls.add(url);
     try {
       await page.goto(url, {
         waitUntil: 'networkidle',
         timeout: 360000,
       });
       const textContent = await page.innerText('body', { timeout: 360000 });
       const cleanedText = this.cleanUpTextContent(textContent);
       scrapedData.push({ url, cleanedText });
       console.log('scrapedData length', scrapedData.length);
       const newUrls = await this.extractUrls(page, baseUrl);
       newUrls.forEach((newUrl) => {
         if (!visitedUrls.has(newUrl)) {
           pendingUrls.push(newUrl);
         }
       });
     } catch (error) {
       console.error(`Error loading ${url}:`, error);
     }
   }
   // Close the browser
   await page.close();
   await context.close();
   await browser.close();
   console.log('scrapedData length', scrapedData.length);
   return scrapedData;
 }
 cleanUpTextContent(text: string): string {
   // Remove extra whitespace and irrelevant text using regular expressions
   const cleanedText = text.replace(/\s+/g, ' ').trim(); // Replace multiple spaces with a single space and trim
   return cleanedText;
 }
 async extractUrls(page: Page, baseUrl: string): Promise<string[]> {
   const hrefs = await page.$$eval(
     'a',
     (links, baseUrl) => {
       // Function to add or remove 'www' subdomain based on baseUrl
       const adjustWwwSubdomain = (url: string, baseUrl: string) => {
         const urlObj = new URL(url);
         const baseObj = new URL(baseUrl);
         if (baseObj.hostname.startsWith('www.')) {
           // If baseUrl has 'www' subdomain, ensure 'www' in extracted URLs
           if (!urlObj.hostname.startsWith('www.')) {
             urlObj.hostname = 'www.' + urlObj.hostname;
           }
         } else {
           // If baseUrl doesn't have 'www' subdomain, remove 'www' in extracted URLs
           urlObj.hostname = urlObj.hostname.replace(/^www\./, '');
         }
         return urlObj.href;
       };
       return links.map((link) => {
         try {
           let href = link.href;
           // Ignore empty hrefs or hash-only hrefs
           if (!href || href === '#' || href.startsWith('javascript:')) {
             return null;
           }
           // Convert relative URLs to absolute URLs
           if (href.startsWith('/')) {
             const protocol = baseUrl.startsWith('https://')
               ? 'https://'
               : 'http://';
             href = protocol + new URL(href, baseUrl).hostname + href;
           }
           // Handle protocol-relative URLs
           if (href.startsWith('//')) {
             const protocol = baseUrl.startsWith('https://')
               ? 'https:'
               : 'http:';
             href = protocol + href;
           }
           const fragment = href.split('/').pop().startsWith('#');
           if (fragment) {
             const arr = href.split('#');
             href = arr[0];
           }
           const includesHash =
             !href.split('/').pop().startsWith('#') &&
             href.split('/').pop().includes('#');
           if (includesHash) {
             return null;
           }
           // Ensure 'www' subdomain consistency
           href = adjustWwwSubdomain(href, baseUrl);
           return href;
         } catch (error) {
           console.log('Error extracting URL:', error);
           return null; // Ignore invalid URLs
         }
       });
     },
     baseUrl,
   );
    // Keep only valid URLs that belong to the site being scraped
    const filteredUrls = hrefs.filter(
      (href): href is string => href !== null && href.startsWith(baseUrl),
    );
    return filteredUrls;
 }
}
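A quick note on the design: pendingUrls works as a first-in, first-out queue and visitedUrls prevents the same page from being scraped twice, so the crawl proceeds breadth-first across the site. extractUrls only returns links whose href starts with the provided base URL, which keeps the crawler on the target domain.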

9. Create a Controller

Define a controller to handle API calls for scraping. Here’s the scraper.controller.ts file:

//scraper.controller.ts
import { Body, Controller, Post } from '@nestjs/common';
import { ScraperService } from './scraper.service';
import { ScraperDto } from './dto/scrape.dto';
@Controller()
export class ScraperController {
    constructor(private readonly scraperService: ScraperService) {}
    @Post('scraper')
    async scrape(@Body() scraperDto: ScraperDto) {
        return this.scraperService.scrape(scraperDto.url);
    }
}
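Since @Controller() is used without a route prefix, the path comes entirely from @Post('scraper'), so the endpoint is POST /scraper. The nest g resource command from step 5 has already registered ScraperController and ScraperService in scraper.module.ts, so no additional wiring is needed.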

10. Create a DTO for Input Validation

Here’s the dto/scrape.dto.ts file (the path matches the import in the controller above) defining the input structure:

//scrape.dto.ts
export class ScraperDto {
    url: string;
}
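As written, this DTO only describes the expected shape of the request body; nothing is validated at runtime. If you want real input validation, one common approach (an optional addition, not part of the setup above) is to combine class-validator decorators with Nest’s ValidationPipe. A minimal sketch, assuming you install class-validator and class-transformer:

npm i class-validator class-transformer

//scrape.dto.ts (with validation)
import { IsUrl } from 'class-validator';

export class ScraperDto {
    // Reject requests whose body does not contain a fully qualified URL
    @IsUrl({ require_protocol: true })
    url: string;
}

Then enable the pipe globally in main.ts:

//main.ts
import { ValidationPipe } from '@nestjs/common';
// inside bootstrap(), after creating the app:
app.useGlobalPipes(new ValidationPipe({ whitelist: true }));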

11. Make a POST Request

Now, make a POST request to the /scraper endpoint with the URL you want to scrape in the JSON body.
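Assuming the app is still running locally on NestJS’s default port 3000, a curl request looks like this (any REST client such as Postman works just as well):

curl -X POST http://localhost:3000/scraper \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.creolestudios.com"}'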

12. Watch the Browser in Action

Because headless is set to false, a browser window will open and you can watch it visit and extract text from every page whose URL starts with the provided base URL (e.g., https://www.creolestudios.com).

13. Receive Data in JSON Format

The scraped data will be returned in JSON format, containing the URLs and the extracted content.
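Each entry in the response contains a page URL and its cleaned text. An illustrative (and heavily shortened) response might look like this:

[
  {
    "url": "https://www.creolestudios.com/",
    "cleanedText": "..."
  },
  {
    "url": "https://www.creolestudios.com/blog",
    "cleanedText": "..."
  }
]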

Conclusion:

Web scraping is a powerful technique to extract valuable data for analysis, business insights, and automation. With tools like Node.js, NestJS, and Playwright, it’s easier than ever to build scalable and efficient scrapers. At Creole Studios, we leverage modern technologies and frameworks to create tailored solutions for complex business challenges. If you’re looking to hire expert Node.js developers for web scraping or automation services, feel free to reach out to us for expert guidance and development solutions!


Node JS

Shivam Kshirsagar
Software Engineer