
As a platform-independent language with robust libraries, multithreading capabilities, and effective error-handling mechanisms, Java is a good choice for web scraping. In this detailed tutorial, we cover the tools you need and the steps to take to streamline scraping in Java.

Web scraping with Java: key points

While simpler languages like Python are more popular for web scraping, Java has strong points of its own, for example:

  • Libraries like JSoup or HtmlUnit simplify HTML parsing 
  • Performance, speed, and efficient memory management make Java good for large-scale projects 
  • Being a strongly typed language, Java allows for compile-time error-catching 

Besides, if your infrastructure already relies on Java or you are aiming for a long-term project, Java is a natural option. At the same time, web scraping often comes with challenges like IP bans, geo-based limitations, and security concerns. Backing your Java project with proxies helps you overcome those obstacles and get the data you need.

Before we head into coding, let’s see what tools you are going to need:

  • Visual Studio Code (Visual Studio 2022 or any other IDE that supports Java also works)
  • The Coding Pack for Java
  • Extension Pack for Java
  • Apache Maven 

If you’re new to Visual Studio Code or Java, here is documentation with detailed instructions for installing the Coding Pack and the Extension Pack. You may also find this tutorial on installing Apache Maven helpful.

Note: Only the “hostname:port” format is supported, so you will need to whitelist your IP address before executing the code. You can do this in your DataImpulse account: go to the relevant proxy plan, select “Manage Whitelist IPs” from the top-right menu, and whitelist your address. If you need help or run into problems, please see our detailed guide on managing your DataImpulse account.
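If you would rather authenticate with a username and password than whitelist an IP, Java’s built-in java.net.Authenticator can supply proxy credentials. Here is a minimal sketch; the login and password are placeholders, and note that recent JDKs disable Basic authentication for HTTPS tunneling through proxies by default, which the system property below re-enables:

import java.net.Authenticator;
import java.net.PasswordAuthentication;

public class ProxyAuthSetup {
    public static void main(String[] args) {
        // Re-enable Basic auth for HTTPS tunneling (the JDK disables it by default)
        System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");

        // Placeholder credentials: replace with the login and password from your proxy plan
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication("your_login", "your_password".toCharArray());
            }
        });
    }
}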

Getting started with HttpClientApp 

To route traffic via a proxy server, you must create an HTTP client. This app sends requests to the designated URL and receives responses. You may use the code below:


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.*;


public class HttpClientApp {


    public static final String proxyHost = "gw.dataimpulse.com";
    public static final int proxyPort = 823;


    public static void main(String[] args) {
        System.out.println("Performing request through proxy...");


        try {
            String response = request("https://api.ipify.org");
            System.out.println("Your IP address: " + response);
        } catch (IOException e) {
            System.err.println("Request error: " + e.getMessage());
        }
    }


    public static String request(String url) throws IOException {
        // Configure the proxy without authentication
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));


        // Create and configure the connection
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection(proxy);
        connection.setRequestMethod("GET");
        connection.setRequestProperty("User-Agent", "JavaHttpURLConnection");


        // Handle the response
        int responseCode = connection.getResponseCode();
        String responseMessage = connection.getResponseMessage();


        if (responseCode == 200) {
            // Successful response
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()))) {
                StringBuilder response = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    response.append(line);
                }
                return response.toString();
            }
        } else {
            // Error response with detailed output
            StringBuilder errorResponse = new StringBuilder();
            if (connection.getErrorStream() != null) {
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(connection.getErrorStream()))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        errorResponse.append(line);
                    }
                }
            }
            throw new IOException("HTTP error: " + responseCode + " " + responseMessage +
                    (errorResponse.length() > 0 ? "\nError details: " + errorResponse : ""));
        }
    }
}


The output should look like this:

[Screenshot: console output showing the IP address returned through the proxy]
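On Java 11 and newer, you can get the same result with the built-in java.net.http.HttpClient, which accepts a proxy through a ProxySelector. A minimal sketch using the same gateway as above:

import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ModernHttpClientApp {
    public static void main(String[] args) throws Exception {
        // Route every request made by this client through the proxy gateway
        HttpClient client = HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress("gw.dataimpulse.com", 823)))
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.ipify.org"))
                .header("User-Agent", "JavaHttpClient")
                .GET()
                .build();

        // Send the request and print the body (your external IP address)
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Your IP address: " + response.body());
    }
}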

Building a proxy rotator

Routing traffic via a proxy server isn’t enough on its own. To avoid detection, you should use a new IP for every request. That’s why you need a proxy rotator. Here is an example of the code you can use:


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.*;
import java.util.Arrays;
import java.util.List;


public class ProxyRotator {


    // List of proxies to rotate through
    private static final List<Proxy> proxies = Arrays.asList(
            new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10000)),
            new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10001)),
            new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10002))
    );


    public static void main(String[] args) {
        // URL to request
        String url = "https://api.ipify.org";


        for (Proxy proxy : proxies) {
            System.out.println("Using proxy: " + proxy.address());
            try {
                String response = sendRequestWithProxy(url, proxy);
                System.out.println("Request succeeded with proxy " + proxy.address());
                System.out.println("IP address from API website: " + response);
            } catch (IOException e) {
                System.err.println("Request failed with proxy " + proxy.address() + ": " + e.getMessage());
            }
            System.out.println("------------------------------------------------");
        }
    }


    // Function to send request using a specific proxy
    private static String sendRequestWithProxy(String url, Proxy proxy) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection(proxy);
        connection.setRequestMethod("GET");
        connection.setRequestProperty("User-Agent", "JavaHttpURLConnection");


        int responseCode = connection.getResponseCode();
        if (responseCode == 200) {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
                StringBuilder response = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    response.append(line);
                }
                return response.toString();
            }
        } else {
            throw new IOException("HTTP error code: " + responseCode + " " + connection.getResponseMessage());
        }
    }
}


The result should look like this:

[Screenshot: console output showing a different IP address for each proxy]
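The example above walks through the list once to demonstrate rotation. In a real scraper, you would typically pick a proxy for each outgoing request, for example at random. A small helper you could add for that:

import java.net.Proxy;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class RandomProxyPicker {
    // Returns a random proxy from the pool so each request can leave from a fresh IP
    public static Proxy pick(List<Proxy> proxies) {
        return proxies.get(ThreadLocalRandom.current().nextInt(proxies.size()));
    }
}

In ProxyRotator, you would then call RandomProxyPicker.pick(proxies) before each sendRequestWithProxy call.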

Creating a proxy checker

It’s time to make sure the proxies work with your target website. For this, create a proxy checker app and give it a URL to check. Here is code you can try:


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.*;
import java.util.Arrays;
import java.util.List;


public class ProxyChecker {


    // List of proxies to rotate through
    private static final List<Proxy> proxies = Arrays.asList(
            new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10000)),
            new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10001)),
            new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10002))
    );


    // Website to check through proxies
    private static final String websiteUrl = "https://api.ipify.org";


    public static void main(String[] args) {
        System.out.println("Starting Proxy Checker...");


        for (Proxy proxy : proxies) {
            System.out.println("\nChecking proxy: " + proxy.address());
            try {
                boolean isWorking = checkProxy(proxy, websiteUrl);
                if (isWorking) {
                    System.out.println("✅ Proxy " + proxy.address() + " works with " + websiteUrl);
                } else {
                    System.out.println("❌ Proxy " + proxy.address() + " does not work with " + websiteUrl);
                }
            } catch (IOException e) {
                System.out.println("❌ Proxy " + proxy.address() + " failed: " + e.getMessage());
            }
            System.out.println("------------------------------------------------");
        }
    }


    // Function to check if proxy works with the given website
    private static boolean checkProxy(Proxy proxy, String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection(proxy);
        connection.setRequestMethod("GET");
        connection.setRequestProperty("User-Agent", "JavaHttpURLConnection");
        connection.setConnectTimeout(5000); // 5 seconds timeout
        connection.setReadTimeout(5000);    // 5 seconds timeout


        int responseCode = connection.getResponseCode();
        String responseMessage = connection.getResponseMessage();


        System.out.println("Response: " + responseCode + " " + responseMessage);


        if (responseCode == 200) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream()))) {
                StringBuilder response = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    response.append(line);
                }
                System.out.println("IP Returned: " + response);
            }
            return true;
        } else {
            // Read error details only if the server provided an error stream
            if (connection.getErrorStream() != null) {
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(connection.getErrorStream()))) {
                    StringBuilder errorResponse = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        errorResponse.append(line);
                    }
                    System.out.println("Error details: " + errorResponse);
                }
            }
            return false;
        }
    }
}


The results should look like this:

[Screenshot: console output of the proxy checker with the status and returned IP for each proxy]
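Checking proxies one at a time gets slow as the pool grows. Here is a sketch of checking them in parallel with an ExecutorService; it assumes you change checkProxy in the ProxyChecker class above from private to public:

import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelProxyChecker {
    public static void main(String[] args) throws InterruptedException {
        List<Proxy> proxies = Arrays.asList(
                new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10000)),
                new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10001)),
                new Proxy(Proxy.Type.HTTP, new InetSocketAddress("gw.dataimpulse.com", 10002))
        );

        // One thread per proxy so all checks run at the same time
        ExecutorService pool = Executors.newFixedThreadPool(proxies.size());
        for (Proxy proxy : proxies) {
            pool.submit(() -> {
                try {
                    // checkProxy is the method from ProxyChecker above, made public
                    boolean ok = ProxyChecker.checkProxy(proxy, "https://api.ipify.org");
                    System.out.println(proxy.address() + " -> " + (ok ? "works" : "does not work"));
                } catch (Exception e) {
                    System.out.println(proxy.address() + " -> failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
    }
}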

Getting to web scraping itself

Finally, it’s time to build the web scraping app itself. It extracts the data you need, so you must provide the target URL and specify what data the program should retrieve. In our case, we will scrape the links from our homepage. You can use this code as a reference:


package com.dataimpulse;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.*;
import java.nio.charset.StandardCharsets;


public class ProxyWebScraper {


    // Proxy configuration
    private static final String PROXY_HOST = "gw.dataimpulse.com";
    private static final int PROXY_PORT = 823;


    // Target website to scrape
    private static final String TARGET_URL = "https://dataimpulse.com/";


    public static void main(String[] args) {
        System.out.println("Starting Proxy Web Scraper...");


        try {
            // Perform the web scraping through the proxy
            String htmlContent = getHtmlContent(TARGET_URL);
            if (htmlContent != null) {
                // Parse and extract links from the HTML content
                parseHtml(htmlContent);
            } else {
                System.out.println("Failed to retrieve HTML content.");
            }
        } catch (IOException e) {
            System.err.println("Error during scraping: " + e.getMessage());
        }
    }


    // Method to get HTML content through the proxy
    private static String getHtmlContent(String url) throws IOException {
        // Configure proxy
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(PROXY_HOST, PROXY_PORT));


        // Create and configure the connection
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection(proxy);
        connection.setRequestMethod("GET");
        connection.setRequestProperty("User-Agent", "JavaWebScraper/1.0");
        connection.setConnectTimeout(10000);
        connection.setReadTimeout(10000);


        int responseCode = connection.getResponseCode();
        if (responseCode == 200) {
            // Read and return the HTML content
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder content = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    content.append(line);
                }
                return content.toString();
            }
        } else {
            System.err.println("Failed to fetch HTML. HTTP error code: " + responseCode);
            return null;
        }
    }


    // Method to parse HTML and extract links using jsoup
    private static void parseHtml(String htmlContent) {
        // Pass TARGET_URL as the base URI so "abs:href" can resolve relative links
        Document doc = Jsoup.parse(htmlContent, TARGET_URL);
        Elements links = doc.select("a[href]");


        if (links.isEmpty()) {
            System.out.println("No links found on the page.");
        } else {
            System.out.println("Links found:");
            for (Element link : links) {
                String href = link.attr("abs:href");
                String title = link.text();
                System.out.println("Title: " + title + ", Link: " + href);
            }
        }
    }
}

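As a side note, recent versions of jsoup can fetch a page themselves and route the request through a proxy, which saves you the manual HttpURLConnection plumbing. A minimal sketch:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupProxyFetch {
    public static void main(String[] args) throws Exception {
        // Let jsoup perform the HTTP request through the proxy gateway
        Document doc = Jsoup.connect("https://dataimpulse.com/")
                .proxy("gw.dataimpulse.com", 823)
                .userAgent("JavaWebScraper/1.0")
                .timeout(10000)
                .get();
        System.out.println("Page title: " + doc.title());
    }
}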

However, there is one more step. To finalize the process and make your scraping app work correctly, you need to install the JSoup library using Maven. Here is a step-by-step explanation of how to do it:

  1. Create a Java project by using the corresponding button.
[Screenshot: creating a new Java project]
  2. Select Maven as the project type.
[Screenshot: selecting Maven as the project type]
  3. Open the “Maven” tab and press “+.”
[Screenshot: the “Maven” tab]
  4. From the list that appears, choose “maven-archetype-quickstart”:
[Screenshot: choosing the maven-archetype-quickstart archetype]
  5. Enter “com.dataimpulse” as the group ID.
[Screenshot: entering the group ID]
  6. Enter “proxy-scraper” in the artifact ID field.
[Screenshot: entering the artifact ID]
  7. Choose a folder to store the library and the ProxyWebScraper app itself. After that, the proxy-scraper folder should appear. Now move the code to proxy-scraper/src/main/java/com/dataimpulse:
[Screenshot: the proxy-scraper project structure]
  8. Add the JSoup library as a dependency in the “pom.xml” file (see the snippet after this list). Save the file using CTRL+S on Windows or CMD+S on Mac:
[Screenshot: the jsoup dependency added to pom.xml]
  9. Your code should now work just fine. To execute it, use the button in the top-right corner of the screen:
[Screenshot: running the project in the IDE]
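For reference, the jsoup dependency is declared inside the <dependencies> section of pom.xml. The version below is an example; check the jsoup site for the latest release:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>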

Here is the final result of our scraping project:

[Screenshot: console output listing the titles and links scraped from the homepage]
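The selectors are easy to adapt. For instance, here is a hypothetical variant of parseHtml you could drop into the ProxyWebScraper class to collect page headings instead of links:

    // Hypothetical variant: extract page headings instead of links
    private static void parseHeadings(String htmlContent) {
        Document doc = Jsoup.parse(htmlContent, TARGET_URL);
        for (Element heading : doc.select("h1, h2, h3")) {
            System.out.println("Heading: " + heading.text());
        }
    }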

We now have all the data we wanted, and with small modifications like the one above, you can extract any HTML data you need. Also, choose your proxies carefully: such results are impossible without legally sourced IPs. At DataImpulse, you get 15 million ethically obtained addresses that are not associated with law-breaking activities like spreading malware, and our human support team is at your service 24/7 if you run into difficulties. You can also adjust targeting to get around geo-based limitations. At the same time, our proxies won’t eat up your budget: we operate on a pay-as-you-go pricing model at an affordable $1 per 1GB. Start with us at [email protected] or use the “Try now” button in the top-right corner of the screen.

Jennifer R.

Content Editor

Content Manager at DataImpulse. Jennifer's degree in philology and translation and several years of experience in content writing help her create easy-to-understand copy, even on tangled tech topics. In every text, her goal is to take an in-depth look at the topic and answer all possible questions. Subscribe to our newsletter to stay updated on the best technologies for your business.