Table of contents

Quick Summary:

This blog demonstrates how Creole Studios utilizes Node.js and the Playwright library to create a web scraper for extracting data from websites. The tutorial employs the NestJS framework and provides a step-by-step guide, including project setup, resource creation, and browser automation for scraping dynamic content. Key features include URL crawling, text extraction, data cleaning, and recursive functionality to gather data from all relevant pages. The blog concludes with instructions to make a POST request to trigger the scraper and retrieve data in JSON format, showcasing how to build a robust web scraper effectively.

Introduction:

At Creole Studios, we’re always excited to share insights and practical demonstrations that can help developers enhance their skills. As a leading Node.js development company, we take pride in building efficient and scalable solutions. In this blog, we’ll walk you through creating a web scraper using Node.js, a task that allows you to extract data from websites efficiently.

A web scraper is essentially a program designed to extract data from websites. To follow along with this demo, a basic understanding of Node.js and JavaScript is recommended. For this demonstration, we’ll be using Playwright, a powerful browser automation library. Experienced developers can skip the foundational steps and jump directly to step 8 for the core logic. Let’s dive in!

Step-by-Step Guide:

1. Install the NestJS CLI Tool

First, install the NestJS CLI tool globally with the following command:

npm i -g @nestjs/cli

2. Create a New Project

Create a new project with your desired name using the command below:

nest new nodejs-web-scraper

The CLI will scaffold the project files and print a success message once the project has been created.

3. Navigate to the Project Folder

Move to the project directory and run the project using these commands:

cd nodejs-web-scraper  
npm run start:dev

You’ll see the project running in the terminal window.

4. Open the Project in a Code Editor

Open the project in your preferred code editor (we recommend VS Code). A freshly scaffolded NestJS project will look roughly like this:
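src/
  app.controller.spec.ts
  app.controller.ts
  app.module.ts
  app.service.ts
  main.ts
test/
nest-cli.json
package.json
tsconfig.json

(The exact files may differ slightly depending on your NestJS CLI version.)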

5. Add a Resource Named Scraper

Use the following command to generate a new resource named scraper:

nest g resource scraper

After you run the command, the CLI asks a couple of questions (transport layer, CRUD entry points) and then generates the scraper module, controller, and service files under src/scraper.

6. Install the Playwright Library

The core idea behind scraping a site is to fetch the HTML of the page you want to scrape and then read the required data out of that HTML. Sometimes the HTML is rendered dynamically (for example, content behind pagination or tabs), so we need to perform those interactions by clicking the relevant elements. In other words, we want to open a browser and automate it. Several browser automation libraries exist, such as Selenium, Puppeteer, and Playwright; we’ll use Playwright here (a minimal standalone sketch follows after step 7).

To automate browser interactions, install the Playwright library:

npm i playwright

7. Install Browsers for Playwright

Next, install the browsers that Playwright will use with the command:

npx playwright install
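Before wiring anything into NestJS, you can sanity-check the setup with a minimal standalone script. This is just an illustrative sketch (the file name try-playwright.ts is hypothetical, and you can run it with ts-node); the full NestJS implementation follows in the next step.

// try-playwright.ts: minimal sketch, open a page and print its text
import { chromium } from 'playwright';

async function main() {
  // Launch a visible browser window and open a new page
  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();
  // Visit a page and wait for network activity to settle
  await page.goto('https://www.creolestudios.com', { waitUntil: 'networkidle' });
  // Grab all visible text from the body and print the first 200 characters
  const text = await page.innerText('body');
  console.log(text.slice(0, 200));
  await browser.close();
}

main().catch(console.error);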

8. Implement Core Scraping Logic

Now, let’s create a function in scraper.service.ts that launches a browser, visits the provided URL, and keeps crawling until it has extracted the text content of every page whose URL starts with the given base URL. Here’s the code snippet:

//scraper.service.ts
import { Injectable } from '@nestjs/common';
import { Page, chromium } from 'playwright';
@Injectable()
export class ScraperService {
 //main scraping function which will be called from the controller.
 async scrape(baseUrl: string) {
   // Launch a browser and create a new page
   // headless is set to false to see the browser in action
   const browser = await chromium.launch({ headless: false });
   // Create a new browser context and a new page
   const context = await browser.newContext();
   const page = await context.newPage();
   //maintain a set of visited urls and a queue of pending urls
   const visitedUrls = new Set<string>();
   const pendingUrls = [baseUrl];
   //maintain an array of scraped data
   const scrapedData: { url: string; cleanedText: string }[] = [];
   //loop through the pending urls and scrape the data
   while (pendingUrls.length > 0) {
     const url = pendingUrls.shift();
     if (!url || visitedUrls.has(url)) continue;
     console.log('crawling', url);
     visitedUrls.add(url);
     try {
       await page.goto(url, {
         waitUntil: 'networkidle',
         timeout: 360000,
       });
       const textContent = await page.innerText('body', { timeout: 360000 });
       const cleanedText = this.cleanUpTextContent(textContent);
       scrapedData.push({ url, cleanedText });
       console.log('scrapedData length', scrapedData.length);
       const newUrls = await this.extractUrls(page, baseUrl);
       newUrls.forEach((newUrl) => {
         if (!visitedUrls.has(newUrl)) {
           pendingUrls.push(newUrl);
         }
       });
     } catch (error) {
       console.error(`Error loading ${url}:`, error);
     }
   }
   // Close the browser
   await page.close();
   await context.close();
   await browser.close();
   console.log('scrapedData length', scrapedData.length);
   return scrapedData;
 }
 cleanUpTextContent(text: string): string {
   // Remove extra whitespace and irrelevant text using regular expressions
   const cleanedText = text.replace(/\s+/g, ' ').trim(); // Replace multiple spaces with a single space and trim
   return cleanedText;
 }
 async extractUrls(page: Page, baseUrl: string): Promise<string[]> {
   const hrefs = await page.$$eval(
     'a',
     (links, baseUrl) => {
       // Function to add or remove 'www' subdomain based on baseUrl
       const adjustWwwSubdomain = (url: string, baseUrl: string) => {
         const urlObj = new URL(url);
         const baseObj = new URL(baseUrl);
         if (baseObj.hostname.startsWith('www.')) {
           // If baseUrl has 'www' subdomain, ensure 'www' in extracted URLs
           if (!urlObj.hostname.startsWith('www.')) {
             urlObj.hostname = 'www.' + urlObj.hostname;
           }
         } else {
           // If baseUrl doesn't have 'www' subdomain, remove 'www' in extracted URLs
           urlObj.hostname = urlObj.hostname.replace(/^www\./, '');
         }
         return urlObj.href;
       };
       return links.map((link) => {
         try {
           let href = link.href;
           // Ignore empty hrefs or hash-only hrefs
           if (!href || href === '#' || href.startsWith('javascript:')) {
             return null;
           }
           // Convert relative URLs to absolute URLs
           if (href.startsWith('/')) {
             const protocol = baseUrl.startsWith('https://')
               ? 'https://'
               : 'http://';
             href = protocol + new URL(href, baseUrl).hostname + href;
           }
           // Handle protocol-relative URLs
           if (href.startsWith('//')) {
             const protocol = baseUrl.startsWith('https://')
               ? 'https:'
               : 'http:';
             href = protocol + href;
           }
           const fragment = href.split('/').pop().startsWith('#');
           if (fragment) {
             const arr = href.split('#');
             href = arr[0];
           }
           const includesHash =
             !href.split('/').pop().startsWith('#') &&
             href.split('/').pop().includes('#');
           if (includesHash) {
             return null;
           }
           // Ensure 'www' subdomain consistency
           href = adjustWwwSubdomain(href, baseUrl);
           return href;
         } catch (error) {
           console.log('Error extracting URL:', error);
           return null; // Ignore invalid URLs
         }
       });
     },
     baseUrl,
   );
    // Keep only valid URLs that belong to the site being scraped
    const filteredUrls = hrefs.filter(
      (href): href is string => href !== null && href.startsWith(baseUrl),
    );
    return filteredUrls;
 }
}
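A quick note on the design: pendingUrls works as a first-in, first-out queue and visitedUrls prevents the same page from being scraped twice, so the crawl proceeds breadth-first across the site. extractUrls only returns links whose href starts with the provided base URL, which keeps the crawler on the target domain.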

9. Create a Controller

Define a controller to handle API calls for scraping. Here’s the scraper.controller.ts file:

//scraper.controller.ts
import { Body, Controller, Post } from '@nestjs/common';
import { ScraperService } from './scraper.service';
import { ScraperDto } from './dto/scrape.dto';
@Controller()
export class ScraperController {
    constructor(private readonly scraperService: ScraperService) {}
    @Post('scraper')
    async scrape(@Body() scraperDto: ScraperDto) {
        return this.scraperService.scrape(scraperDto.url);
    }
}
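Since @Controller() is used without a route prefix, the path comes entirely from @Post('scraper'), so the endpoint is POST /scraper. The nest g resource command from step 5 has already registered ScraperController and ScraperService in scraper.module.ts, so no additional wiring is needed.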

10. Create a DTO for Input Validation

Here’s the dto/scrape.dto.ts file (the path matches the import in the controller above) defining the input structure:

//scrape.dto.ts
export class ScraperDto {
    url: string;
}
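As written, this DTO only describes the expected shape of the request body; nothing is validated at runtime. If you want real input validation, one common approach (an optional addition, not part of the setup above) is to combine class-validator decorators with Nest’s ValidationPipe. A minimal sketch, assuming you install class-validator and class-transformer:

npm i class-validator class-transformer

//scrape.dto.ts (with validation)
import { IsUrl } from 'class-validator';

export class ScraperDto {
    // Reject requests whose body does not contain a fully qualified URL
    @IsUrl({ require_protocol: true })
    url: string;
}

Then enable the pipe globally in main.ts:

//main.ts
import { ValidationPipe } from '@nestjs/common';
// inside bootstrap(), after creating the app:
app.useGlobalPipes(new ValidationPipe({ whitelist: true }));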

11. Make a POST Request

Now, make a POST request to the /scraper endpoint with the URL you want to scrape in the JSON body.
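Assuming the app is still running locally on NestJS’s default port 3000, a curl request looks like this (any REST client such as Postman works just as well):

curl -X POST http://localhost:3000/scraper \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.creolestudios.com"}'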

12. Watch the Browser in Action

Because headless is set to false, a browser window will open and you can watch it visit and extract text from every page whose URL starts with the provided base URL (e.g., https://www.creolestudios.com).

13. Receive Data in JSON Format

The scraped data will be returned in JSON format, containing the URLs and the extracted content.
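Each entry in the response contains a page URL and its cleaned text. An illustrative (and heavily shortened) response might look like this:

[
  {
    "url": "https://www.creolestudios.com/",
    "cleanedText": "..."
  },
  {
    "url": "https://www.creolestudios.com/blog",
    "cleanedText": "..."
  }
]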

Conclusion:

Web scraping is a powerful technique to extract valuable data for analysis, business insights, and automation. With tools like Node.js, NestJS, and Playwright, it’s easier than ever to build scalable and efficient scrapers. At Creole Studios, we leverage modern technologies and frameworks to create tailored solutions for complex business challenges. If you’re looking to hire expert Node.js developers for web scraping or automation services, feel free to reach out to us for expert guidance and development solutions!


Node JS

Shivam Kshirsagar
Software Engineer