Proxy Integration with Scrapy

Scrapy is a powerful web crawling framework that lets you extract data from websites efficiently. It offers reusable spider code, extensive community support, and tight integration with the wider Python ecosystem.

In this tutorial, we will walk you through the process of installing Scrapy and setting up DataImpulse proxies for your web scraping needs. We will cover two methods: using proxies as a request and using proxies as middleware. Additionally, we will show you how to implement rotating proxies for enhanced scraping capabilities. Let’s get started!

How to set up a Scrapy project?

Setting up a Scrapy project and building a basic scraper is a simple process. Let’s get started with the installation.

To install Scrapy and create your initial project, open your Python command terminal and run the following pip command:

pip install scrapy

The next step is to create a Scrapy project. Let’s assume you want to create a new project called “scrapyproject”. To accomplish this, run the following command:

scrapy startproject scrapyproject

That’s all there is to it! You have successfully created your first Scrapy project.

Generating spiders in a Scrapy project

After navigating into the “scrapyproject” folder, you can proceed to generate a spider, also known as a crawler. When generating a spider, you will need to provide a name for the spider and the target URL. Here is an example:

scrapy genspider <spider_name> <url_domain>

Assuming the target URL for scraping is “https://books.toscrape.com/”, you can generate a Scrapy spider named “books” by running the following command:

scrapy genspider books books.toscrape.com

This command will generate the basic code for the “books” spider in the Scrapy project (i.e., the “scrapyproject” in this case). Let’s take a look at the outcome of the commands executed so far:
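At this point the project directory should look roughly like this (the exact files can vary slightly between Scrapy versions):

scrapyproject/
    scrapy.cfg            # deploy configuration file
    scrapyproject/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            books.py      # the generated "books" spider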

Note: It’s important to remember that a Scrapy project can have multiple spiders. In this case, you have a single spider named “books,” and its basic code is written in the “books.py” file.
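For reference, a freshly generated “books.py” looks roughly like this (the exact template depends on your Scrapy version):

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        pass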

How to modify a spider to scrape desired data?

To scrape the titles and prices of all the books from the target website “Books to Scrape,” you can modify the existing “books.py” file to customize the spider’s scraping behavior. Here is an example of how you can modify the code:


import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        start_urls = ['http://books.toscrape.com/']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for article in response.css('article.product_pod'):
            yield {
                'title': article.css("h3 > a::attr(title)").extract_first(),
                'price': article.css(".price_color::text").extract_first()
            }

The start_requests method shown above yields a request for the target URL, and Scrapy passes the downloaded response to the parse method. Inside parse, the raw page contents are available in the response variable. The method uses CSS selectors to extract the title and price of each book on the page; these details are located within the article elements with the product_pod class. Each parsed item is then yielded for further processing.
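When the spider runs (see the next section), each yielded item is a small dictionary. Against the demo site, the first few items look roughly like this (titles and prices will reflect whatever the page currently lists):

{'title': 'A Light in the Attic', 'price': '£51.77'}
{'title': 'Tipping the Velvet', 'price': '£53.74'}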

Executing a Scrapy spider

To execute the spider and start scraping, run the scrapy crawl command followed by the spider’s name:

scrapy crawl books

The command above runs the books spider and prints the scraped items to the console. To save the output to a file instead of displaying it, add the -o option:

scrapy crawl -o FileName.csv books
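Scrapy infers the export format from the file extension, so the same approach works for JSON or JSON Lines output; recent Scrapy versions also support an uppercase -O flag that overwrites the file instead of appending to it:

scrapy crawl books -o books.json
scrapy crawl books -O books.csv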

Setting up proxies with Scrapy

Using proxies in web scraping offers several advantages, most notably concealing your actual IP address from the target website’s server. This safeguards your privacy and reduces the risk of bans, which automated scraping attracts far more readily than manual data extraction such as copy-pasting.

Method 1: Proxy as a request parameter

To use a proxy as a request parameter, pass the proxy endpoint to the scrapy.Request method through its meta parameter. The meta dictionary carries the ProxyUrl together with your proxy plan credentials for that request. Here is the general format of a proxy endpoint:

Protocol://YourProxyPlanUsername:YourProxyPlanPassword@ProxyUrl:Port

Proxy type            Protocol   ProxyUrl             Port   User Credentials
Residential Proxies   HTTP       gw.dataimpulse.com   823    DataImpulse Proxy Plan username and password

Note: For more customization options with DataImpulse proxies, such as country-specific entry points, you can refer to the documentation for Residential Proxies.
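Before wiring the proxy into Scrapy, it can help to confirm that the endpoint and credentials actually work. One quick sanity check (assuming you have curl installed; api.ipify.org is just a public IP echo service and not part of DataImpulse) is:

curl -x "http://YourProxyPlanUsername:[email protected]:823" "https://api.ipify.org"

If the credentials are accepted, the command prints the proxy’s exit IP address rather than your own.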

Here’s an example code that showcases how to integrate a residential proxy with the books spider using the proxy as a parameter in the request method:


import scrapy
 

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        start_urls = ['http://books.toscrape.com/']
        for url in start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={"proxy": "http://YourProxyPlanUsername:[email protected]:823"},
            )

    def parse(self, response):
        for article in response.css('article.product_pod'):
            yield {
                'title': article.css("h3 > a::attr(title)").extract_first(),
                'price': article.css(".price_color::text").extract_first()
            }
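To double-check that requests are really routed through the proxy, you can point a small throwaway spider at an IP echo endpoint and log what it reports. The following is a minimal sketch, not part of the original project; it assumes https://httpbin.org/ip is reachable:

import json

import scrapy


class ProxyCheckSpider(scrapy.Spider):
    name = 'proxycheck'

    def start_requests(self):
        yield scrapy.Request(
            url='https://httpbin.org/ip',
            callback=self.parse,
            meta={"proxy": "http://YourProxyPlanUsername:[email protected]:823"},
        )

    def parse(self, response):
        # httpbin.org/ip responds with a JSON body like {"origin": "<ip address>"}
        origin = json.loads(response.text).get('origin')
        self.logger.info('Request left from IP: %s', origin)

If the logged IP belongs to the proxy network rather than your own connection, the integration is working.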
Method 2: Creating custom Scrapy proxy middleware

Scrapy proxy middleware acts as a bridge between the spiders and the proxy server, allowing requests to be routed through the proxy. By creating custom Scrapy proxy middleware, you can easily manage proxy settings for multiple spiders without modifying their code directly.

To use a proxy middleware in Scrapy, you need to add it to the list of middleware in the settings.py file. Before registering the middleware, let’s create it by following these steps:

Open the middlewares.py file.
Write the custom proxy middleware code as shown below:


class BookProxyMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        # Read the proxy endpoint details defined in settings.py
        self.username = settings.get('PROXY_USER')
        self.password = settings.get('PROXY_PASSWORD')
        self.url = settings.get('PROXY_URL')
        self.port = settings.get('PROXY_PORT')

    def process_request(self, request, spider):
        # Attach the assembled proxy endpoint to every outgoing request
        host = f'http://{self.username}:{self.password}@{self.url}:{self.port}'
        request.meta['proxy'] = host

The BookProxyMiddleware class reads the proxy endpoint details from the project settings and attaches the assembled proxy URL to each outgoing request’s meta parameter.

To use this middleware, you need to register it in the settings.py file. Follow these steps to update the settings.py file:


PROXY_USER = 'YourProxyPlanUsername'
PROXY_PASSWORD = 'YourProxyPlanPassword'
PROXY_URL = 'gw.dataimpulse.com'
PROXY_PORT = '823'

DOWNLOADER_MIDDLEWARES = {
    'scrapyproject.middlewares.BookProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

The BookProxyMiddleware is registered in the DOWNLOADER_MIDDLEWARES dictionary in settings.py with a priority of 100. Downloader middlewares process each request in ascending order of these priority numbers, so BookProxyMiddleware sets the proxy before Scrapy’s built-in HttpProxyMiddleware (priority 110) runs.

Note that all the proxy endpoint credentials are defined in the settings.py file. If the endpoint ever changes, you only need to update settings.py rather than editing the code of every spider individually.

Once the middleware code is complete and the proxy parameter values are set, you can run the books spider using the same scrapy crawl books command.

How to implement rotating proxies in Scrapy?

Rotating proxies means distributing your requests across a list of different proxy IP addresses instead of sending everything through one. This technique helps prevent detection and blocking by website administrators: even if one proxy IP is blocked, your scraper or spider can keep working by switching to another.

To enable proxy rotation in Scrapy, you need to install the “scrapy-rotating-proxies” package. You can install this package by running the following command:

pip install scrapy-rotating-proxies

After installing the “scrapy-rotating-proxies” package, you’ll need to create a list of available proxies in the “settings.py” file. This list should contain the necessary proxy information, including user proxy plan credentials and specific IPs. Here’s an example of a sample proxy list that you can use as a starting point:


ROTATING_PROXY_LIST = [
    'http://YourProxyPlanUsername:[email protected]:823',
    'http://YourProxyPlanUsername:[email protected]:10000',
    'http://YourProxyPlanUsername:YourProxyPlanPassword@Specific_IP_1:10000',
    'http://YourProxyPlanUsername:YourProxyPlanPassword@Specific_IP_2:10000',
    # ...
    'http://YourProxyPlanUsername:YourProxyPlanPassword@Specific_IP_N:10000',
]

If you prefer to load available proxies from a file instead of manually writing them in a list, you can do so by setting the “ROTATING_PROXY_LIST_PATH” variable in the “settings.py” file. This variable should be set to the path of your desired file containing the proxy information.

ROTATING_PROXY_LIST_PATH = '/path/to/file/proxieslist.txt'
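The file is expected to contain one proxy endpoint per line, in the same format used in ROTATING_PROXY_LIST above. A hypothetical “proxieslist.txt” might look like this:

http://YourProxyPlanUsername:[email protected]:823
http://YourProxyPlanUsername:YourProxyPlanPassword@Specific_IP_1:10000
http://YourProxyPlanUsername:YourProxyPlanPassword@Specific_IP_2:10000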

To enable the rotating proxies functionality, you need to add the “RotatingProxyMiddleware” and “BanDetectionMiddleware” to the “DOWNLOADER_MIDDLEWARES” list in the “settings.py” file. This can be done by including the following code snippet:

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

Once you’ve completed the setup, you’re ready to go! Simply run your spider using the scrapy crawl command. The RotatingProxyMiddleware will pick a proxy from your list for each request, while the BanDetectionMiddleware detects proxies that appear banned and temporarily takes them out of rotation.
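The package also exposes a few optional settings for tuning rotation behaviour in settings.py. The names and defaults below reflect the scrapy-rotating-proxies documentation at the time of writing, so verify them against the version you have installed:

# How many times to retry a page with a different proxy before giving up (package default: 5)
ROTATING_PROXY_PAGE_RETRY_TIMES = 5

# Initial and maximum backoff, in seconds, applied to proxies that look dead
ROTATING_PROXY_BACKOFF_BASE = 300
ROTATING_PROXY_BACKOFF_CAP = 3600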