Guide to Puppeteer: Web Scraping Using a Headless Browser

Puppeteer, crafted by Google’s Chrome team, is a Node.js library. It offers a high-level API to command headless Chrome or Chromium browsers, which operate without a visible interface, ideal for background tasks.

This tool lets you interact with web pages programmatically for tasks like web scraping, automated testing, capturing screenshots or PDFs, automating form submissions, and more. Puppeteer supports navigation, DOM manipulation, network interception, and JavaScript execution within a web page’s context.

Getting started with this Puppeteer tutorial requires some basic tools to be installed.

You’ll need just two software components:

  • Node.js (including npm – the package manager for Node.js)
  • Any code editor

Node.js is a JavaScript runtime, which lets JavaScript code run outside a browser environment.

Initiating a Node.js project for web scraping involves these steps:

  1. Folder Setup: Start by establishing a dedicated folder to house your JavaScript files. All Puppeteer code is contained within .js files and executed through Node.js.
  2. Directory Navigation: After creating the folder, access it in your terminal or command prompt.
  3. Initialization Command: Execute the initialization command:

npm init -y

As a result, a package.json file will be generated in the directory, containing information about the installed packages within this folder. The subsequent step involves installing Node.js packages within this designated folder.

Installing Puppeteer is a simple process.

To install it, run the npm install command shown below from the terminal. Make sure you are working in the directory that contains the package.json file.

npm install puppeteer

Keep in mind that Puppeteer downloads its own browser during installation: a Chromium build in older releases, and Chrome for Testing in recent ones. The downloaded build is pinned to a version known to work with the corresponding Puppeteer version.
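
One quick way to confirm the install was recorded is to check that package.json now lists puppeteer. The sketch below assumes npm's default behaviour of saving installed packages under "dependencies"; the helper names hasDependency and puppeteerIsInstalled are made up for illustration and are not part of npm or Puppeteer.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns true if a parsed package.json object
// lists `name` under dependencies or devDependencies.
function hasDependency(pkg, name) {
    const deps = { ...(pkg.dependencies || {}), ...(pkg.devDependencies || {}) };
    return Object.prototype.hasOwnProperty.call(deps, name);
}

// Reads package.json from the given folder and checks for Puppeteer.
function puppeteerIsInstalled(projectDir) {
    const file = path.join(projectDir, 'package.json');
    const pkg = JSON.parse(fs.readFileSync(file, 'utf8'));
    return hasDependency(pkg, 'puppeteer');
}
```

Calling puppeteerIsInstalled(process.cwd()) from the project folder should return true after a successful install.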

Setting up a Proxy Server

         
const puppeteer = require('puppeteer');

(async () => {
    const proxyServer = 'gw.dataimpulse.com:823';
    const proxyUsername = 'your-username';
    const proxyPassword = 'your-password';

    // Launch Puppeteer with proxy configuration
    const browser = await puppeteer.launch({
        headless: true,
        args: [`--proxy-server=${proxyServer}`,
                '--disable-sync']
    });

    const page = await browser.newPage();

    // Authenticate proxies
    await page.authenticate({
        username: proxyUsername,
        password: proxyPassword
    });

    // Navigate to a website
    const response = await page.goto('https://dataimpulse.com/');
    const text = await response.text();
    console.log(text);

    await browser.close();
})();

Make sure to replace ‘your-username’ with your DataImpulse username and ‘your-password’ with the corresponding password.

  • First, Puppeteer is launched using the puppeteer.launch() method.
  • Within the ‘args’ parameter, we include the ‘--proxy-server’ flag, followed by the proxy server’s address and port from the ‘proxyServer’ variable.
  • This tells the browser to route every request it makes through the designated proxy server.
  • Once the browser is running, we create a new page using browser.newPage().
  • We then supply the proxy credentials with page.authenticate(), which answers the proxy’s authentication challenge.
  • To navigate to a specific website, like ‘https://dataimpulse.com/’, we call the page.goto() method.
  • Replace the URL with the website you want to scrape or automate.

From here you can continue scraping or automating within the page’s context, interacting with the website as required. When you are done, remember to close the browser with browser.close() to terminate the Puppeteer instance.
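
Because a script that throws before reaching browser.close() can leave a browser process running, one common pattern is to wrap the work in try/finally. The following is a minimal sketch rather than the only way to do it; buildLaunchOptions and withProxiedPage are hypothetical helper names, and the launch options simply mirror the example above.

```javascript
// Builds the launch options used in this guide.
// `proxyServer` is a "host:port" string such as 'gw.dataimpulse.com:823'.
function buildLaunchOptions(proxyServer) {
    return {
        headless: true,
        args: [`--proxy-server=${proxyServer}`, '--disable-sync']
    };
}

// Hypothetical helper: opens an authenticated page behind the proxy, runs
// `task(page)`, and always closes the browser, even if the task throws.
async function withProxiedPage({ proxyServer, username, password }, task) {
    const puppeteer = require('puppeteer'); // loaded lazily, only when called
    const browser = await puppeteer.launch(buildLaunchOptions(proxyServer));
    try {
        const page = await browser.newPage();
        await page.authenticate({ username, password });
        return await task(page);
    } finally {
        await browser.close();
    }
}
```

A call might look like withProxiedPage(config, async (page) => { await page.goto('https://dataimpulse.com/'); return page.title(); }), where config holds your proxy address and credentials.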

IP Rotation with Puppeteer

IP rotation means sending your requests from a series of different IP addresses rather than from a single one. This helps avoid IP blocking or detection by websites, so you can gather more data or automate more tasks with a lower risk of identification.

To configure IP rotation through a proxy server with Puppeteer, adhere to these steps:

  1. Select a Reliable Proxy Provider: Opt for a reputable proxy service offering rotating IP addresses, which assign fresh IPs per request or within specific intervals. Ensure the chosen service aligns with your preferred protocol (HTTP, HTTPS, or SOCKS) and supports IP rotation.
  2. Configure Proxy Server: Register with your chosen proxy provider and gather the connection details: proxy address, port, username, password, and authentication method. You need these to establish a connection to the proxy server.
  3. Test Proxy Server: Prior to integrating the proxy server with Puppeteer, validate its functionality. Tools like cURL or browser extensions such as FoxyProxy can help verify connectivity and responses from the proxy server.
  4. Initiate Puppeteer with Proxy Settings: Within your Puppeteer code, call puppeteer.launch() and pass the proxy server’s address and port through the --proxy-server command-line argument. If authentication is required, supply the username and password with page.authenticate() rather than in the launch arguments, because Chromium does not accept credentials in --proxy-server.

const puppeteer = require('puppeteer');

(async () => {
  const proxyServer = 'gw.dataimpulse.com:823';
  const proxyUsername = 'your-username';
  const proxyPassword = 'your-password';

  // Count of IP rotation
  const countOfRotation = 3;

  for (let i = 0; i < countOfRotation; i++) {
    // Launch Puppeteer with proxy configuration
    const browser = await puppeteer.launch({
      headless: true,
      args: [`--proxy-server=${proxyServer}`,
            '--disable-sync']
    });

    const page = await browser.newPage();
    // Authenticate proxies
    await page.authenticate({
      username: proxyUsername,
      password: proxyPassword
    });

    // Navigate to a website
    const response = await page.goto('https://api.ipify.org/');
    const text = await response.text();
    console.log(text);

    await browser.close();
  }
})();

Please remember to replace ‘your-username’ and ‘your-password’ with your actual DataImpulse credentials.
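
The loop above only prints each IP address. To check that rotation actually happened, you could collect the addresses and count the distinct ones. This is a rough sketch: summarizeRotation and checkRotation are hypothetical names, and fetchIp stands for any async function of your own that returns the current exit IP as text (for instance, one that wraps the launch, goto, and close steps shown above). Keep in mind that a rotating gateway may occasionally hand out the same IP twice.

```javascript
// Hypothetical helper: summarises the IP addresses seen across runs.
function summarizeRotation(ips) {
    const unique = new Set(ips.map((ip) => ip.trim()));
    return { total: ips.length, unique: unique.size, rotated: unique.size > 1 };
}

// Collects one IP per call to `fetchIp`, then reports how many were distinct.
// `fetchIp` is an assumption: an async function returning the exit IP as a string.
async function checkRotation(fetchIp, count) {
    const ips = [];
    for (let i = 0; i < count; i++) {
        ips.push(await fetchIp());
    }
    return summarizeRotation(ips);
}
```

If the summary reports rotated as false over several runs, the proxy is probably not rotating for your account or endpoint.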

Setting Up Proxy Server in Puppeteer: When you launch Puppeteer with proxy settings, it automatically directs all browser traffic through your chosen proxy server. You won’t have to configure the proxy for every request in your Puppeteer code.

Because every HTTP request from the browser goes through the proxy server, the rotating proxy can swap the exit IP behind the scenes, which is what achieves IP rotation.

    
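// Example Puppeteer code using the proxy
// (assumes browser, proxyUsername, and proxyPassword from the example above,
// and that this runs inside an async function)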

const page = await browser.newPage();

await page.authenticate({
    username: proxyUsername,
    password: proxyPassword
});

const response = await page.goto('https://dataimpulse.com/');

// Continue with your automation tasks
// ...

Within your Puppeteer code, use browser.newPage() to create a new page and page.goto() to open your chosen website. Puppeteer routes all of the browser’s requests, including subsequent navigation and API calls, through the proxy server specified at launch.

By following these steps, you can set up IP rotation in Puppeteer using a proxy server. The browser handles request routing through the proxy, so your web scraping or automation tasks can run from different IP addresses.

This approach helps you avoid IP blocks, anti-scraping countermeasures, and detection during web data collection or automation.

Navigating Puppeteer Proxy Woes: A Troubleshooter’s Toolkit

Unraveling the Debugging Labyrinth:

Here’s your guide to navigate through potential proxy hurdles:

  1. Cross-Check Proxy Setup:
    Ensure your Puppeteer code aligns with proxy server settings. Double-check proxy details such as address, port, and credentials. Be certain that you’re seamlessly integrating proxy options into puppeteer.launch().
  2. Test Connectivity:
    Dip your toes with other tools like curl or telnet. Connect to the proxy server directly to pinpoint if the issue dwells in Puppeteer or the proxy itself. Validate the proxy’s operational state and rule out network restrictions.
  3. Scan Proxy Response:
    Put proxies through their paces with simple HTTP requests via curl or browser extensions. Confirm the anticipated response and inspect headers for any unexpected alterations.
  4. Spotlight on Puppeteer:
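    Unveil deeper insights by enabling verbose logging. Set the DEBUG environment variable to "puppeteer:*" when running your script to log Puppeteer’s internal activity, or pass dumpio: true to puppeteer.launch() to pipe the browser’s own output to your terminal. For a visual check, opt for puppeteer.launch({ headless: false, devtools: true }) to scrutinize the console and network panels.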
  5. Decipher Error Clues:
    Detect error cues that spotlight the proxy or network. Address authentication flops, timeouts, or proxy-related blunders within the error messages.
    Amidst this detective work, remember that each solved hiccup polishes your Puppeteer proficiency.
  6. Trial Without a Proxy:
    Experience the proxy-free realm temporarily. Eliminate the proxy settings from your Puppeteer code and execute your script sans a proxy server. If your script performs flawlessly, it signifies that the trouble orbits the proxy setup.
    Venture further by experimenting with an alternate proxy server. This exploration aids in differentiating whether the issue is confined to the proxy server itself or if a broader concern lingers.
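
To make step 5 a little more concrete, here is a small sketch that maps error text to a likely cause. The codes are Chromium network error names that can appear in the message of an error thrown by page.goto(); the hint wording and the names PROXY_ERROR_HINTS and diagnoseProxyError are assumptions for illustration, and the real cause may differ from the hint.

```javascript
// Hypothetical lookup table: error fragments that commonly show up in
// page.goto() failures behind a proxy, mapped to a likely cause.
const PROXY_ERROR_HINTS = [
    ['ERR_PROXY_CONNECTION_FAILED', 'Proxy unreachable: check the address and port.'],
    ['ERR_TUNNEL_CONNECTION_FAILED', 'Proxy refused the tunnel: check credentials or account limits.'],
    ['ERR_INVALID_AUTH_CREDENTIALS', 'Proxy rejected the username or password.'],
    ['ERR_NO_SUPPORTED_PROXIES', 'Proxy scheme not supported: check the --proxy-server value.'],
    ['ERR_TIMED_OUT', 'Connection timed out: the proxy or target site may be slow or blocked.'],
    ['Navigation timeout', 'Page did not finish loading in time: try a longer timeout or another proxy.']
];

// Returns the first matching hint, or a generic fallback suggestion.
function diagnoseProxyError(message) {
    const match = PROXY_ERROR_HINTS.find(([fragment]) => message.includes(fragment));
    return match ? match[1] : 'No proxy-specific hint: retry without a proxy to compare.';
}
```

You could call it from a catch block around page.goto(), for example console.error(diagnoseProxyError(err.message)).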

Scraping data

With the proxy in place, you can use Puppeteer to extract information from websites precisely and efficiently. Routing the browser through a rotating proxy helps you get past IP-blocking hurdles and keeps your automation running more reliably.

Scraping Different HTML Websites

To scrape several HTML websites with Puppeteer, you can loop over a list of URLs and route each visit through the proxy server. The following code demonstrates how to iterate through multiple URLs while using proxy authentication:

        
const puppeteer = require('puppeteer');

(async () => {
    const proxyServer = 'gw.dataimpulse.com:823';
    const proxyUsername = 'your-username';
    const proxyPassword = 'your-password';

    const urls = [
        "https://example.com/",
        "https://example.net/",
        "https://example.org/",
        // add your next URLs here
    ];

    for (let i = 0; i < urls.length; i++) {
        const url = urls[i];
        // Launch Puppeteer with proxy configuration
        const browser = await puppeteer.launch({
            headless: true,
            args: [`--proxy-server=${proxyServer}`,
                '--disable-sync']
        });

        const page = await browser.newPage();
        // Authenticate with the proxy
        await page.authenticate({
            username: proxyUsername,
            password: proxyPassword
        });

        // Navigate to a website
        const response = await page.goto(url);
        const text = await response.text();
        console.log(text);

        // Restart the browser on each iteration so that connections from the previous run are closed
        await page.close();
        await browser.close();
    }
})();

Make sure to replace ‘your-username’ and ‘your-password’ with your actual DataImpulse credentials.
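
The example above prints the raw HTML of each page. In practice you will often want a few specific fields instead, and a URL list without duplicates or typos. The sketch below shows one possible approach; cleanUrlList and extractSummary are hypothetical helper names, and the fields chosen (title, first h1, link count) are only examples.

```javascript
// Hypothetical helper: drops invalid or duplicate entries from a URL list
// so the scraping loop does not launch a browser for them.
function cleanUrlList(urls) {
    const seen = new Set();
    const result = [];
    for (const raw of urls) {
        let href;
        try {
            href = new URL(raw).href; // throws on malformed input
        } catch {
            continue;
        }
        if (!seen.has(href)) {
            seen.add(href);
            result.push(href);
        }
    }
    return result;
}

// Extracts a few fields from an already-open page instead of the raw HTML.
// Assumes `page` is a Puppeteer Page on which page.goto() has completed.
async function extractSummary(page) {
    return page.evaluate(() => ({
        title: document.title,
        heading: document.querySelector('h1')?.textContent.trim() ?? null,
        linkCount: document.querySelectorAll('a[href]').length
    }));
}
```

Inside the loop you could replace the console.log(text) call with console.log(await extractSummary(page)), and iterate over cleanUrlList(urls) rather than urls.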

Everything is ready! You can now enjoy your proxy connection.