Using Scrapy with Selenium for Advanced Techniques


January 03, 2026 · 23 min read · Tool Comparison

How to use Scrapy with Selenium for Advanced Web Scraping

Web scraping has become an indispensable tool for extracting information from websites, enabling businesses, researchers, and developers to gather valuable insights from the vast pool of online information. With the right tools, web scraping can be both efficient and powerful, allowing you to automate data collection and analysis processes.

Overview

Web Scraping with Scrapy and Selenium

  • Selenium automates browsers to interact with dynamic websites, while Scrapy is a Python framework designed for efficient web scraping.
  • Combining Scrapy and Selenium enables scraping of JavaScript-heavy websites and handling of complex web interactions, such as logging in or navigating dynamic content.

Combining Scrapy and Selenium

  • When to use Scrapy alone: Best for static websites with structured data.
  • When to use Selenium alone: Essential for dynamic sites requiring JavaScript execution, form submissions, or session handling.
  • Using both: Leverage Scrapy's speed for large-scale data collection while using Selenium to handle dynamic elements.

This article will guide you through combining Scrapy and Selenium to perform advanced web scraping, enabling you to handle complex web pages and extract data that would otherwise be challenging to gather with a standard scraper.

What is Web Scraping?

Web scraping is the process of extracting data from websites. It involves making HTTP requests to a site, fetching the HTML content, and then parsing and analyzing the data to gather useful information. Web scraping can be applied to various tasks, such as:

  • Collecting product data from e-commerce websites.
  • Extracting job listings or news articles.
  • Gathering real-time financial or sports data.
  • Mining academic or research content.

While many websites offer data through APIs, web scraping remains essential when APIs are unavailable, or when users need to gather information from multiple sources or navigate complex web pages. Web scraping allows you to automate and streamline information extraction, providing a quick and efficient way to collect large amounts of information from the web.
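The parsing step described above can be sketched with Python's standard library alone. This is a minimal illustration, not the approach used later in this article: the SAMPLE_HTML markup, the class names in it, and the ProductParser and parse_products names are all made up for the example. In a real scraper the HTML would come from an HTTP response rather than a string.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a fetched page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Phone A</span><span class="price">$199</span></li>
  <li class="product"><span class="name">Phone B</span><span class="price">$299</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})  # start a new product record
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_products(html):
    """Return a list of {"name": ..., "price": ...} dicts found in the HTML."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products
```

Calling parse_products(SAMPLE_HTML) yields one dictionary per product. Libraries such as Scrapy and BeautifulSoup do the same job with far less code, which is why the rest of this guide uses them.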

Web Scraping with Scrapy and Selenium

Selenium is a widely used tool for controlling web browsers automatically. It can click buttons, fill out forms, and navigate websites, making it useful for testing web applications and automating browser tasks.

Scrapy is a Python framework built for web scraping. It helps collect information from websites by sending requests, extracting information, and storing it efficiently. Its design allows for fast, large-scale information extraction from multiple web pages.


The choice between Scrapy alone or Scrapy with Selenium depends on the site's complexity and the scraping requirements.

When to use Scrapy alone?

Scrapy is well suited to static websites with structured data.

It works well in these cases:

  • Fast and efficient: Handles multiple pages quickly with minimal resources.
  • Large-scale scraping: Can crawl entire sites and follow links automatically.
  • Low memory usage: Uses significantly less memory than Selenium.
  • Customizable: Supports proxies, retries, and headers for advanced scraping.

When to use Selenium alone?

Selenium is essential for JavaScript-heavy websites that require interaction.

Use Selenium when:

  • Content loads dynamically (JavaScript, AJAX).
  • Actions like clicking, scrolling, or filling in forms are required.
  • Logging in and maintaining sessions is necessary.
  • Handling CAPTCHAs is part of the process.
  • Multiple programming languages are needed (Selenium supports Java, C#, JavaScript, etc.).

Combining Scrapy and Selenium

For projects that involve both structured information extraction and complex web interaction, combining Scrapy and Selenium can provide a powerful solution. This approach allows developers to leverage Scrapy's efficiency for large-scale data processing while using Selenium to handle JavaScript rendering and user interactions when needed.

Feature                            | Scrapy Only | Scrapy + Selenium
Static HTML scraping               | Yes         | Not needed
JavaScript-loaded content          | No          | Required
Handling logins & forms            | No          | Required
Pagination (static links)          | Yes         | Yes
Infinite scroll / AJAX             | No          | Required
High-speed data extraction         | Yes         | Slower due to browser rendering
Interacting with buttons/dropdowns | No          | Required

 


Setting Up Your Environment

Know how to set up the environment

Installation Guide

To set up Scrapy with Selenium, follow these steps:

Step 1. Install Scrapy

pip install scrapy

Step 2. Install Selenium

pip install selenium

Step 3. Install scrapy-selenium

pip install scrapy-selenium

Step 4. Download the latest WebDriver for the browser you wish to use, or install webdriver-manager by running the command:

pip install webdriver-manager

Also, install BeautifulSoup by running the command below:

pip install beautifulsoup4
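To confirm the installation, a small check like the following can report which packages are still missing. It is only a sketch: the missing_packages helper and the REQUIRED_MODULES mapping are names invented for this example, and the mapping simply pairs each import name with the pip package installed above.

```python
from importlib.util import find_spec

# Import name -> pip package name, matching the install steps above.
REQUIRED_MODULES = {
    "scrapy": "scrapy",
    "selenium": "selenium",
    "scrapy_selenium": "scrapy-selenium",
    "webdriver_manager": "webdriver-manager",
    "bs4": "beautifulsoup4",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for module_name, pip_name in modules.items()
        if find_spec(module_name) is None
    ]
```

Running print(missing_packages()) in the same Python environment prints an empty list when everything is installed, or the names to pass to pip install otherwise.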

Scraping Product Data Using Scrapy and Selenium

Scenario:

Extract product names and prices from the bstackdemo website, which dynamically loads content using JavaScript. Selenium ensures all products are visible before Scrapy and BeautifulSoup extract the data.

Example Code:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

class BStackDemoSpider(scrapy.Spider):
    name = "bstack_demo"
    start_urls = ["https://bstackdemo.com/"]  # Target website

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Set up Selenium WebDriver
        service = Service("chromedriver.exe")  # Ensure ChromeDriver is installed
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Run without opening a browser
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)

        # Ensure JavaScript-loaded content is fully visible by scrolling
        self.driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
        time.sleep(2)  # Allow time for content to load

        # Parse page source with BeautifulSoup
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        # Extract product details
        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()  # Close browser session

How it works:

Selenium opens the bstackdemo website and ensures all products are visible by scrolling down. Once the content is fully loaded, BeautifulSoup extracts product names and prices from the page source. Scrapy then processes and stores the data, while Selenium closes to free system resources.

Running Selenium in Headless Mode

Running Selenium in headless mode allows web scraping without opening a browser window, making it faster and more efficient.

Scenario:

Extract brand names and product availability from the bstackdemo website while running Selenium in headless mode for a faster and more resource-efficient scraping process.

Example Code:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class BStackDemoHeadlessSpider(scrapy.Spider):
    name = "bstack_headless"
    start_urls = ["https://bstackdemo.com/"]  # Target website

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Set up Selenium WebDriver in headless mode
        service = Service("chromedriver.exe")  # Ensure ChromeDriver is installed
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Run browser in headless mode
        options.add_argument("--disable-gpu")  # Improve performance on some systems
        options.add_argument("--no-sandbox")  # Bypass OS security restrictions
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(2)  # Allow JavaScript to load content

        # Get page source and parse with BeautifulSoup
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        # Extract brand names and availability status
        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "brand": product.find("div", class_="shelf-item__brand").text.strip(),
                "availability": product.find("p", class_="shelf-item__buy-btn").text.strip(),
            }

        self.driver.quit()  # Close browser session

How it works:

Selenium runs in headless mode, loading the bstackdemo site without opening a browser window. After waiting for JavaScript to load, BeautifulSoup extracts brand names and availability status from the page source. Scrapy processes and stores the data while Selenium closes, ensuring an efficient and lightweight scraping process.


Pro Tip: Running Selenium tests on BrowserStack allows automated testing on real cloud-based browsers without local setup. This ensures faster performance, better scalability, and access to multiple browser environments. Cloud-based testing reduces system load and improves reliability for large-scale web scraping and automation projects.

Basic Web Scraping with Scrapy and Selenium

Scrapy and Selenium together make it easy to extract JavaScript-rendered content, handle pagination, and automate logins for scraping data from websites that rely on dynamic elements.

Extracting JavaScript-Rendered Content

Some websites load content dynamically using JavaScript, meaning Scrapy alone cannot extract the information. Selenium loads the page completely before Scrapy processes it.

Example: Extracting Product Names and Prices

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class JavaScriptContentSpider(scrapy.Spider):
    name = "js_content"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        service = Service("chromedriver.exe")  # Ensure ChromeDriver is installed
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(2)  # Wait for JavaScript to load

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How it works:

Selenium loads the page, waits for JavaScript to render products, and BeautifulSoup extracts the data before Scrapy processes it.

Handling Basic Pagination with Selenium

Some sites have multiple pages for listings. Selenium can click the "Next" button to navigate through pages while Scrapy extracts data.

Example: Scraping Products from Multiple Pages

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class PaginationSpider(scrapy.Spider):
    name = "pagination_spider"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            time.sleep(2)  # Wait for page to load
            soup = BeautifulSoup(self.driver.page_source, "html.parser")

            for product in soup.find_all("div", class_="shelf-item"):
                yield {
                    "name": product.find("p", class_="shelf-item__title").text.strip(),
                    "price": product.find("div", class_="val").text.strip(),
                }

            try:
                next_button = self.driver.find_element(By.CLASS_NAME, "pagination-next")
                if "disabled" in next_button.get_attribute("class"):
                    break  # Stop if no next page
                next_button.click()
            except NoSuchElementException:
                break  # No "Next" button on this page

        self.driver.quit()

How it works:

Selenium clicks the "Next" button, loads the next set of products, and Scrapy extracts the data until all pages are scraped.


Automating Login for Authenticated Pages

Some websites require login before accessing data. Selenium can enter credentials, submit the form, and maintain the session while Scrapy pulls data.

Example: Logging in and Scraping User-Specific Data

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class LoginSpider(scrapy.Spider):
    name = "login_spider"
    start_urls = ["https://bstackdemo.com/signin"]  # Login page URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)

        # Enter username and password
        self.driver.find_element(By.ID, "react-select-2-input").send_keys("demouser")
        self.driver.find_element(By.ID, "react-select-3-input").send_keys("testingpassword")
        self.driver.find_element(By.CLASS_NAME, "login-button").click()
        time.sleep(2)

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        # Extract user-specific data
        user_info = soup.find("div", class_="user-info")
        if user_info:
            yield {"user": user_info.text.strip()}

        self.driver.quit()

How it works:

Selenium fills in login credentials, submits the form, and Scrapy extracts user-specific data after login.


Advanced Scrapy + Selenium Techniques

Here are some of the advanced techniques for using Scrapy along with Selenium:

Handling Dynamic and Interactive Elements

Here are examples of how to handle dynamic and interactive elements:

Scraping Infinite Scrolling Pages

Many modern websites use infinite scrolling and AJAX to load content dynamically instead of traditional pagination. Selenium can scroll the page and wait for new elements to appear before Scrapy extracts data.

Scenario: Some sites, like social media feeds, load new content as the user scrolls down. Selenium's execute_script() function can simulate scrolling to trigger content loading.

Code Example:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class InfiniteScrollSpider(scrapy.Spider):
    name = "infinite_scroll"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        scroll_pause_time = 2

        last_height = self.driver.execute_script("return document.body.scrollHeight")

        while True:
            # Scroll to the bottom, then give new content time to load
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(scroll_pause_time)

            new_height = self.driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # Page height stopped growing, so nothing new was loaded
            last_height = new_height

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How It Works:

Selenium scrolls down to load more products. It waits briefly and checks whether new content has appeared. Once all products are loaded, Scrapy extracts the data.
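On a feed that never ends, a scroll loop of this kind would run forever. One possible safeguard, sketched below, is to cap the number of scrolls. The scroll_until_stable helper and its max_scrolls and sleep parameters are assumptions made for this example; the only Selenium call it relies on is driver.execute_script().

```python
import time


def scroll_until_stable(driver, pause=2.0, max_scrolls=20, sleep=time.sleep):
    """Scroll to the bottom until the page height stops growing.

    Stops after max_scrolls attempts even if the page keeps growing.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for scrolls in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give newly requested content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return scrolls  # nothing new was loaded
        last_height = new_height
    return max_scrolls
```

Inside a spider this could be called as scroll_until_stable(self.driver) before reading page_source. The sleep parameter exists only so the pause can be replaced in tests.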

Extracting AJAX-Loaded Content

Some sites fetch information dynamically using AJAX requests, meaning content appears after the page initially loads. Selenium waits until the new data is available before Scrapy processes it.

Scenario: Waiting for AJAX to Load Products on bstackdemo

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

class AjaxContentSpider(scrapy.Spider):
    name = "ajax_content"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)

        # Wait (up to 10 seconds) until products are loaded
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "shelf-item"))
        )

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How It Works:

Selenium waits for AJAX-loaded content to appear using WebDriverWait(). Once products are available, Scrapy extracts the data. This ensures no data is missed due to slow-loading elements.
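To make the idea behind an explicit wait concrete, here is a rough, library-free sketch of a polling loop. It is not Selenium's implementation: wait_until and its parameters are names chosen for this illustration, and it raises Python's built-in TimeoutError rather than Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(poll_interval)  # pause before checking again
```

Compared with a fixed time.sleep(2), a polling wait returns as soon as the content is ready and fails loudly when it never arrives, which is why explicit waits are usually the safer choice for AJAX-driven pages.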

Bypassing Anti-Scraping Mechanisms

Websites often implement anti-scraping techniques to block automated bots. To avoid detection and ensure smooth data extraction, various strategies can be used.

Avoiding Detection with Headless Browsers & Randomized User-Agents

Using a headless browser allows scraping without opening a visible window. Randomizing user-agents helps mimic different browsers to reduce blocking.

Scenario: Using Headless Mode and Randomized User-Agents

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

class AntiDetectionSpider(scrapy.Spider):
    name = "anti_detection"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        ua = UserAgent()
        user_agent = ua.random  # Randomize user-agent

        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument(f"--user-agent={user_agent}")

        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

How It Works: Headless mode runs the browser without a visible window. Randomized user-agents make requests look like they come from different real browsers, which can reduce the chance of being blocked.
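If adding the fake_useragent dependency is not desirable, the same idea can be sketched with a small hand-maintained pool. Everything named here is an assumption for illustration: the USER_AGENTS list contains example strings only, and build_chrome_arguments is a hypothetical helper that returns the flags to pass to options.add_argument().

```python
import random

# Hypothetical pool of user-agent strings; replace with values you maintain.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def build_chrome_arguments(user_agents=USER_AGENTS, headless=True, rng=random):
    """Return Chrome command-line flags with a randomly chosen user-agent."""
    arguments = []
    if headless:
        arguments.append("--headless")
    arguments.append(f"--user-agent={rng.choice(user_agents)}")
    return arguments
```

In a spider's __init__, each returned flag would be applied with options.add_argument(flag) before the driver is created. The user-agent is fixed for the lifetime of that browser, so rotating it means starting a new driver.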

Managing Cookies & Sessions for Continuity

Websites use cookies and sessions to track users. Managing them helps maintain login states and avoid detection.

Scenario: Preserving Cookies Across Requests

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pickle
import time

class CookieManagementSpider(scrapy.Spider):
    name = "cookie_management"
    start_urls = ["https://bstackdemo.com/signin"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        service = Service("chromedriver.exe")
        self.driver = webdriver.Chrome(service=service)

    def parse(self, response):
        self.driver.get(response.url)

        # Login process
        self.driver.find_element(By.ID, "react-select-2-input").send_keys("demouser")
        self.driver.find_element(By.ID, "react-select-3-input").send_keys("testingpassword")
        self.driver.find_element(By.CLASS_NAME, "login-button").click()
        time.sleep(2)

        # Save cookies after login
        with open("cookies.pkl", "wb") as cookie_file:
            pickle.dump(self.driver.get_cookies(), cookie_file)

        # Load cookies for future requests
        self.driver.get("https://bstackdemo.com/")
        with open("cookies.pkl", "rb") as cookie_file:
            for cookie in pickle.load(cookie_file):
                self.driver.add_cookie(cookie)

        self.driver.refresh()
        time.sleep(2)

        # Extract data after restoring the session
        yield {"message": "Logged in and session maintained"}

        self.driver.quit()

How It Works: Saves cookies after login for reuse in subsequent requests. Restores the session to avoid repeated logins and detection.
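Because unpickling a file can execute arbitrary code, JSON is a reasonable alternative when the cookie file might be shared or edited. The sketch below assumes the cookies returned by driver.get_cookies() are plain dictionaries of JSON-compatible values; save_cookies and load_cookies are hypothetical helper names.

```python
import json


def save_cookies(driver, path):
    """Write the browser's current cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as cookie_file:
        json.dump(driver.get_cookies(), cookie_file)


def load_cookies(driver, path):
    """Add cookies from a JSON file to the browser; return how many were added.

    The driver should already be on a page of the cookies' domain,
    as in the example above.
    """
    with open(path, encoding="utf-8") as cookie_file:
        cookies = json.load(cookie_file)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

These helpers would replace the pickle.dump() and pickle.load() calls, for example save_cookies(self.driver, "cookies.json") after logging in.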

Solving CAPTCHAs Using Third-Party Services

Some websites use CAPTCHAs to block bots. Third-party services like 2Captcha or Anti-Captcha can solve them automatically.

Scenario: Sending CAPTCHA to 2Captcha for Solving

import requests
import time

API_KEY = "your_2captcha_api_key"
captcha_site_key = "site-key-from-bstackdemo"
url = "https://bstackdemo.com/"

# Step 1: Request CAPTCHA solving
response = requests.post(
    "http://2captcha.com/in.php",
    data={"key": API_KEY, "method": "userrecaptcha", "googlekey": captcha_site_key, "pageurl": url},
)

# The reply has the form "OK|<captcha id>"
captcha_id = response.text.split("|")[-1]

# Step 2: Wait for CAPTCHA solution
time.sleep(15)
solution_response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")

while "CAPCHA_NOT_READY" in solution_response.text:
    time.sleep(5)
    solution_response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")

captcha_solution = solution_response.text.split("|")[-1]

# Step 3: Use CAPTCHA solution in web scraping
print("CAPTCHA Solved:", captcha_solution)

How It Works: Sends the CAPTCHA to 2Captcha for solving. Waits for the solution and applies it to bypass restrictions.
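The script above treats any reply other than CAPCHA_NOT_READY as a solution, so an error reply would be printed as if it were a token. A slightly more defensive polling loop might look like the sketch below. It assumes the plain-text reply format used above ("OK|value" on success, a bare status code otherwise); parse_2captcha_reply, poll_for_solution, and the fetch_result callable are names invented for this example.

```python
import time


def parse_2captcha_reply(text):
    """Split a plain-text reply such as 'OK|12345' into (ok, value)."""
    status, _, value = text.strip().partition("|")
    if status == "OK":
        return True, value
    return False, status  # e.g. "CAPCHA_NOT_READY" or an error code


def poll_for_solution(fetch_result, attempts=12, delay=5.0, sleep=time.sleep):
    """Call fetch_result() until it returns a solved reply.

    fetch_result is any callable returning the service's reply text,
    for example a function that wraps the requests.get() call above.
    """
    for _ in range(attempts):
        ok, value = parse_2captcha_reply(fetch_result())
        if ok:
            return value
        if value != "CAPCHA_NOT_READY":
            raise RuntimeError(f"CAPTCHA service returned an error: {value}")
        sleep(delay)
    raise TimeoutError("CAPTCHA was not solved in time")
```

Limiting the number of attempts also keeps the scraper from waiting indefinitely if the service never responds with a solution.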

Pro Tip: Websites track IPs and browser behavior to detect bots. Running Scrapy + Selenium on BrowserStack allows scrapers to:

  • Use multiple real browsers to mimic organic traffic.
  • Rotate IP addresses to reduce detection risk.
  • Run tests in the cloud for better efficiency and scalability.

Large-Scale Scraping and Performance Optimization

When scraping large volumes of data from websites, performance and scalability are key factors. Combining Scrapy with Selenium can be a powerful solution, but it also requires careful optimization to handle multiple requests efficiently and avoid being blocked.

Efficiently Managing Browser Sessions to Reduce Memory Usage

Running Selenium in headless mode can help reduce memory usage. Additionally, managing browser sessions effectively, such as closing unused sessions, is crucial to avoid memory overload during long scraping operations.

Scenario: Closing Browser Sessions After Use

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

class EfficientScraperSpider(scrapy.Spider):
    name = "efficient_scraper"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")  # Use headless mode
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(2)

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        # Extract data
        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()  # Properly close the browser session

How It Works: Headless mode reduces memory usage by not rendering the browser UI. Closing the browser session with self.driver.quit() ensures no memory is wasted after use.
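Calling quit() at the end of parse() works when a spider visits a single page. For spiders that handle several responses, one option is to quit the browser in the spider's closed() method, which Scrapy calls once when the spider finishes. The mixin below is a sketch of that idea; SeleniumCleanupMixin is a name made up for this example.

```python
class SeleniumCleanupMixin:
    """Quits self.driver once, when the spider finishes."""

    driver = None  # subclasses assign a WebDriver instance in __init__

    def closed(self, reason):
        # Scrapy calls closed(reason) when the spider is closed.
        if self.driver is not None:
            self.driver.quit()
            self.driver = None  # guard against quitting twice
```

A spider would inherit from it before scrapy.Spider, for example class EfficientScraperSpider(SeleniumCleanupMixin, scrapy.Spider), and drop the quit() call from parse().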

Running Multiple Selenium Instances with Scrapy's CrawlSpider

Scrapy's CrawlSpider allows for structured crawling, and by running multiple Selenium instances, you can scrape multiple pages at once. This parallelism helps speed up data collection.

Scenario: Using CrawlSpider with Multiple Selenium Instances

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup

class MultiSeleniumSpider(CrawlSpider):
    name = "multi_selenium"
    allowed_domains = ["bstackdemo.com"]
    start_urls = ["https://bstackdemo.com/"]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # CrawlSpider compiles its rules here
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse_item(self, response):
        self.driver.get(response.url)
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

    def closed(self, reason):
        # Quit once the crawl ends; quitting inside parse_item would
        # break every page after the first one.
        self.driver.quit()

How It Works: CrawlSpider automatically follows links to scrape multiple pages. In this example every followed page is loaded through one shared Selenium driver running in headless mode; handling pages in parallel would require several driver instances.
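One way to manage several browser instances is a small pool that hands out a driver, takes it back when the caller is done, and quits them all at the end. The sketch below uses only the standard library. DriverPool, borrow, and create_driver are hypothetical names, and create_driver stands for any function that returns a new WebDriver.

```python
import queue
from contextlib import contextmanager


class DriverPool:
    """Keeps a fixed number of drivers and lends them out one at a time."""

    def __init__(self, create_driver, size=2):
        self._drivers = queue.Queue()
        for _ in range(size):
            self._drivers.put(create_driver())

    @contextmanager
    def borrow(self, timeout=None):
        """Yield a free driver, blocking until one is available."""
        driver = self._drivers.get(timeout=timeout)
        try:
            yield driver
        finally:
            self._drivers.put(driver)  # always return the driver to the pool

    def close(self):
        """Quit every driver currently in the pool."""
        while True:
            try:
                driver = self._drivers.get_nowait()
            except queue.Empty:
                break
            driver.quit()
```

A callback would then use with pool.borrow() as driver: around its Selenium calls, and the spider's closed() method would call pool.close(). The pool only pays off when pages are rendered from several threads, since a single thread can drive only one browser at a time.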

Using Proxies and Rotating IPs to Avoid Blocks

Websites often block scrapers after too many requests from a single IP. Using proxies or rotating IPs helps avoid detection and bans.

Scenario: Routing Selenium Through a Randomly Chosen Proxy

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import random

class ProxyRotatorSpider(scrapy.Spider):
    name = "proxy_rotator"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.proxies = [
            "http://proxy1.com", "http://proxy2.com", "http://proxy3.com"
        ]
        proxy = random.choice(self.proxies)  # Pick a proxy for this browser session
        service = Service("chromedriver.exe")
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument(f"--proxy-server={proxy}")  # Chrome reads its proxy at launch
        self.driver = webdriver.Chrome(service=service, options=options)

    def parse(self, response):
        # Disable the browser cache so every page is fetched through the proxy
        self.driver.execute_cdp_cmd("Network.enable", {})
        self.driver.execute_cdp_cmd("Network.setCacheDisabled", {"cacheDisabled": True})
        self.driver.get(response.url)

        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

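How It Works: A proxy is chosen at random when the browser starts, because Chrome applies its proxy setting at launch; rotating to another proxy means starting a new browser session. For requests that Scrapy downloads itself, a downloader middleware can assign a different proxy to each request to distribute traffic and avoid being blocked.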
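For the requests Scrapy sends itself, per-request rotation can be sketched as a downloader middleware that sets request.meta["proxy"], the key Scrapy's built-in proxy support reads. The class name RandomProxyMiddleware and the PROXY_LIST setting are assumptions made for this example, not part of Scrapy.

```python
import random


class RandomProxyMiddleware:
    """Assigns a randomly chosen proxy to every outgoing Scrapy request."""

    def __init__(self, proxies, rng=random):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self.proxies = list(proxies)
        self.rng = rng

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting holding proxy URLs.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        request.meta["proxy"] = self.rng.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

The middleware would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py. It affects only Scrapy's own downloads; pages loaded with self.driver.get() still use whatever proxy the browser was started with.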

Asynchronous Execution: Combining Scrapy's Concurrency with Selenium's Actions

Scrapy is built for asynchronous operations, allowing multiple requests to run concurrently. By combining Scrapy's concurrency with Selenium's page interactions, large-scale scraping tasks can be performed much faster.

Scenario: Combining Asynchronous Scrapy with Selenium for Faster Scraping

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from scrapy import Request
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import asyncio

class AsyncScraperSpider(scrapy.Spider):
    name = "async_scraper"
    start_urls = ["https://bstackdemo.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        service = Service("chromedriver.exe")
        options = Options()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(service=service, options=options)

    async def parse(self, response):
        self.driver.get(response.url)
        soup = BeautifulSoup(self.driver.page_source, "html.parser")

        for product in soup.find_all("div", class_="shelf-item"):
            yield {
                "name": product.find("p", class_="shelf-item__title").text.strip(),
                "price": product.find("div", class_="val").text.strip(),
            }

        self.driver.quit()

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

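How It Works: Scrapy's asynchronous model runs requests concurrently. Selenium interacts with web pages while Scrapy handles multiple requests simultaneously for faster scraping. Note that Selenium calls block while they run, so pages rendered through a single driver are still processed one at a time.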
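To keep blocking browser work from stalling asynchronous code, the blocking call can be moved onto worker threads. The sketch below shows the idea with asyncio.to_thread (Python 3.9+). fetch_pages and render_page are hypothetical names: render_page stands for any blocking function that loads a URL and returns its HTML. Using this inside a Scrapy callback additionally assumes Scrapy is running on its asyncio reactor.

```python
import asyncio


async def fetch_pages(urls, render_page, max_workers=2):
    """Render several pages concurrently using worker threads.

    render_page(url) is a blocking callable, for example a function that
    drives a Selenium browser and returns the page source.
    """
    limit = asyncio.Semaphore(max_workers)  # cap simultaneous renders

    async def render(url):
        async with limit:
            return await asyncio.to_thread(render_page, url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(render(url) for url in urls))
```

Outside Scrapy it can be tried with asyncio.run(fetch_pages(urls, render_page)). A single WebDriver should not be shared between threads, so in practice each worker needs its own browser instance.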

Why choose BrowserStack for Selenium Testing?

BrowserStack is a cloud-based platform designed for automated cross-browser testing of web applications. It enables teams to run Selenium, Appium, and other automation frameworks on real devices and browsers in the cloud, without the need for local setups. Key features include:

  • Real Device & Browser Testing: Access 3,500+ real devices and browsers for accurate testing.
  • Scalability: Run parallel tests to speed up and scale your testing.
  • No Setup Required: Instantly test without the need for physical devices or complex setups.
  • Cross-Platform Support: Test on both mobile and desktop environments.
  • CI/CD Integration: Easily integrate with Jenkins, CircleCI, and GitHub Actions.
  • Real-Time Debugging: Access logs, screenshots, and videos for quick issue identification.
  • Wide Browser & OS Coverage: Supports multiple browser and OS combinations.
  • Global Availability: Test from any location with fast, reliable access.

Conclusion

Scrapy is excellent for extracting data from simple websites, while Selenium is necessary for handling JavaScript-heavy websites that require interaction. Basic scraping tasks involve straightforward data extraction, while advanced techniques tackle issues like infinite scrolling, dynamic content, and bot detection.

The future of web scraping will see AI-powered tools that adapt to complex sites and bypass anti-scraping measures. Scraping tools will become more automated, offering features like self-adjusting schedules and automatic data cleaning. Ethical and compliant scraping will also grow in importance, ensuring data privacy and legal adherence. Lastly, scraping tools will integrate with big data platforms, enabling real-time decision-making.

Running Selenium tests on a real device cloud yields more accurate results because it simulates real user conditions. BrowserStack offers access to over 3,500 real device-browser combinations, allowing for thorough testing of web applications to ensure a smooth and consistent user experience.

 

 

 

 
