Integrating DataImpulse Proxies with Diffbot
Need data but don’t want to spend resources building a scraper? Diffbot has your back: it is a multipurpose tool that can replace several solutions at once. Popular services such as DuckDuckGo already trust it, so you can build your own solution on top of it with confidence. Its main advantage is that Diffbot doesn’t rely on site-specific rules. Instead, it uses natural language processing and AI to extract and structure data, so you can stop writing separate rules for each website and use one tool for whatever you need to scrape. Proxies, in turn, help you avoid IP rate limits, bans, geo-based restrictions, and privacy threats. Together, they let you get data while saving the time and money you would otherwise spend drafting and maintaining a web scraper.
Diffbot: advantages
Diffbot is an AI-powered tool that scrapes publicly available data in more than 100 human languages and presents it in a structured, easy-to-understand form. You can also combine different Diffbot products to get better results.
Diffbot’s main advantages include:
– offers an intuitive UI that non-technical specialists can use
– reads HTML code and returns clean, structured data
– provides solutions for many different use cases
– offers a free basic version, three different plans, and a demo for a custom subscription, so you can choose the best option and avoid paying for features you don’t need
– comes with detailed documentation
– lets you create custom rules and APIs
Diffbot offers features such as Extract, Custom API, Bulk, Crawl, Enhance, Natural Language, and Knowledge Graph. Knowledge Graph is a massive database of pre-scraped, publicly available data. You can search it directly, or use Enhance if you already have a piece of data about something and want to know more. Bulk and Crawl are included only in paid plans, while Extract and Custom API are available even on the free plan. Keep in mind that Diffbot uses a credit system: your plan determines how many credits you get (10k credits on the free plan), and every action consumes credits. For example, extracting a page costs one credit, or two if you use proxies.
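To see what those credit costs mean in practice, here is a minimal sketch of the arithmetic. It assumes only the figures mentioned above (one credit per extracted page, two with a proxy, 10k credits on the free plan); the function and constant names are ours for illustration, and actual pricing may differ by plan or API.

```python
# Credit costs as described in this article; check Diffbot's pricing for current values.
CREDITS_PER_PAGE = 1             # plain page extraction
CREDITS_PER_PAGE_WITH_PROXY = 2  # extraction routed through a proxy
FREE_PLAN_CREDITS = 10_000


def pages_within_budget(credits: int, use_proxy: bool) -> int:
    """Return how many page extractions fit into the given credit budget."""
    cost = CREDITS_PER_PAGE_WITH_PROXY if use_proxy else CREDITS_PER_PAGE
    return credits // cost


print(pages_within_budget(FREE_PLAN_CREDITS, use_proxy=False))  # 10000
print(pages_within_budget(FREE_PLAN_CREDITS, use_proxy=True))   # 5000
```

In other words, routing every request through a proxy halves the number of pages you can extract on the same budget.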
Getting started
First, you need to create an account. The process is straightforward and takes only a few minutes. Diffbot will ask you several questions. At the end, read the ToS and Privacy Policy and tick the boxes to confirm that you accept them. You will also need to open the confirmation email and follow the link in it to activate your account.
Once you’ve completed registration and activation, you will be able to access your dashboard. In the top right corner, you will see your API token, which you will need for every API request. Remember not to share it.
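One common way to avoid sharing the token by accident is to keep it out of your source code. The sketch below reads it from an environment variable; the variable name DIFFBOT_TOKEN and the helper function are our own assumptions, not something Diffbot requires.

```python
import os


def load_diffbot_token(env_var: str = "DIFFBOT_TOKEN") -> str:
    """Read the API token from an environment variable (the name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Diffbot API token"
        )
    return token
```

You would then set the variable in your shell or deployment configuration and call load_diffbot_token() wherever a request needs the token.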
How to use proxies with Diffbot
Whether you use pre-built APIs or create a custom one, you need to include proxies in your request. Diffbot offers its own proxy pool, but it also supports third-party proxies and even recommends using them.
Now, let’s get to requests. In our example, we will use the pre-built Article API to extract data from DataImpulse’s blog post Building a Custom Proxy Rotator with Python. To add a proxy, include two parameters in your request: proxy, which specifies the IP address and port of the proxy you want to use, and proxyAuth, which specifies the authentication details.
A cURL request would look like this:
curl --request GET \
--url 'https://api.diffbot.com/v3/article?url=https%3A%2F%2Fdataimpulse.com%2Fblog%2Fbuilding-a-custom-proxy-rotator-with-python-a-step-by-step-tutorial%2F&token=your_token&proxy=111.222.333.444:8080&proxyAuth=youruser:yourpass' \
--header 'accept: application/json'
Python request example:
import requests
url = (
"https://api.diffbot.com/v3/article?"
"url=https%3A%2F%2Fdataimpulse.com%2Fblog%2Fbuilding-a-custom-proxy-rotator-with-python-a-step-by-step-tutorial%2F"
"&token=your_token"
"&proxy=111.222.333.444:8080"
"&proxyAuth=youruser:yourpass"
)
headers = {"accept": "application/json"}
response = requests.get(url, headers=headers)
print(response.text)
JavaScript:
const options = { method: 'GET', headers: { accept: 'application/json' } };
const apiUrl =
"https://api.diffbot.com/v3/article?" +
"url=https%3A%2F%2Fdataimpulse.com%2Fblog%2Fbuilding-a-custom-proxy-rotator-with-python-a-step-by-step-tutorial%2F" +
"&token=your_token" +
"&proxy=111.222.333.444:8080" +
"&proxyAuth=youruser:yourpass";
fetch(apiUrl, options)
.then(res => res.json())
.then(res => console.log(res))
.catch(err => console.error(err));
In the API Reference, you will find more request examples in different languages for various APIs. Do not forget to replace the placeholder token, proxy address, and credentials with your actual values. You can copy your proxy credentials from the corresponding plan’s page on your DataImpulse dashboard.
That’s all you need to do to start using Diffbot with DataImpulse proxies!
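Once you swap in real values, characters such as ":" or "@" in your credentials have to be percent-encoded to survive in a query string. As a hedged sketch, the helper below builds the same Article API URL with Python's standard urllib.parse.urlencode, which handles that encoding. The function name and the sample values are hypothetical; only the endpoint and the url, token, proxy, and proxyAuth parameters come from the examples above.

```python
from urllib.parse import urlencode

ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_url(page_url: str, token: str, proxy: str, proxy_auth: str) -> str:
    """Build an Article API request URL with every parameter percent-encoded."""
    params = {
        "url": page_url,          # page to extract
        "token": token,           # your Diffbot API token
        "proxy": proxy,           # proxy address and port, e.g. "host:port"
        "proxyAuth": proxy_auth,  # proxy credentials in "user:password" form
    }
    return f"{ARTICLE_ENDPOINT}?{urlencode(params)}"


# Placeholder values only; replace them with your own before sending a request.
request_url = build_article_url(
    page_url="https://dataimpulse.com/blog/building-a-custom-proxy-rotator-with-python-a-step-by-step-tutorial/",
    token="your_token",
    proxy="proxy.example.com:8080",
    proxy_auth="youruser:yourpass",
)
print(request_url)
```

The resulting string can be passed to requests.get exactly like the hand-written URL in the Python example.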