FlareSolverr Guide: Bypass Cloudflare While Scraping (2024)

Cloudflare is a popular antibot shield that blocks automated requests such as web scrapers. It's used across various global websites like Glassdoor, Indeed and G2. So, bypassing Cloudflare opens the door for a wide set of web scraping opportunities.

In this article, we'll explore the FlareSolverr tool and how to use it to get around Cloudflare while scraping. We'll start by explaining what FlareSolverr is, how it works and how to install and use it. Let's get started!

Legal Disclaimer and Precautions

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect and here's a good summary of what not to do:

  • Do not scrape at rates that could damage the website.
  • Do not scrape data that's not available publicly.
  • Do not store PII of EU citizens who are protected by GDPR.
  • Do not repurpose entire public datasets, which can be illegal in some countries.

Scrapfly does not offer legal advice but these are good general rules to follow in web scraping
and for more you should consult a lawyer.

What is FlareSolverr?

FlareSolverr is an open-source proxy server for solving Cloudflare anti-bot challenges.
It bypasses Cloudflare and creates a session with Headers and Cookies
that are reused to authorize future requests against the Cloudflare challenge.

FlareSolverr can be used with both GET and POST requests. It also supports integration with Prometheus for generating metrics and statistics about the bypass performance. FlareSolverr doesn't bypass Cloudflare by solving its challenges. Instead, it mimics a normal browser's configuration. Let's have a closer look!

How Does FlareSolverr Work?

FlareSolverr is built on top of Selenium and Undetected ChromeDriver, which applies different techniques to bypass Cloudflare, such as changing Selenium variable names and adding randomized delays and mouse movements.

When a request is sent to the FlareSolverr server, it spins up a headless Selenium browser with the Undetected ChromeDriver configuration and requests the page URL. Then, it waits for the Cloudflare challenge to be solved automatically or to time out. Finally, it preserves the session values of the successful requests and reuses them for future requests.

That being said, FlareSolverr can't bypass Cloudflare challenges with explicit CAPTCHAs that require clicks.

Related article: How to bypass Cloudflare when web scraping in 2024. Learn how to detect Cloudflare blocking, how it identifies web scrapers and tips for bypassing it while scraping.

How to Install FlareSolverr?

FlareSolverr can be installed using source code or executable binaries. However, the most stable method is using Docker. If you don't have Docker installed, you can follow the official Docker installation page.

Create a docker-compose.yml file and add the following code:

version: "2.1"
services:
  flaresolverr:
    # DockerHub mirror flaresolverr/flaresolverr:latest
    image: ghcr.io/flaresolverr/flaresolverr:latest
    container_name: flaresolverr
    environment:
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - LOG_HTML=${LOG_HTML:-false}
      - TZ=Europe/London
    ports:
      - "${PORT:-8191}:8191"
    restart: unless-stopped

Here, we add the basic FlareSolverr configuration found on the official GitHub repository. Let's break down the parameters used:

Parameter | Description
LOG_LEVEL | The logging verbosity. Setting it to debug includes more logging details.
LOG_HTML | Logs the HTML response of each request in the console.
TZ | Configures the headless browser timezone.
PORT | The port where the server will run.

FlareSolverr also includes additional configuration parameters.

Parameter | Description
LANG | Changes the web browser language. It comes in handy when scraping pages in a different language.
HEADLESS | Controls whether the browser runs in headless mode (without a GUI) or in headful mode.
BROWSER_TIMEOUT | Configures the browser timeout. The default is 40 seconds, but it can be increased for slow internet connections.

For more details on FlareSolverr's parameters, refer to the official configuration docs.

Now that our configuration file is ready, let's spin up the FlareSolverr server:

docker-compose up --build

To verify your installation, head over to the FlareSolverr port at 127.0.0.1:8191 and you will get a response similar to this:

{"msg": "FlareSolverr is ready!", "version": "3.3.13", "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
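The same readiness check can be scripted instead of opening the address in a browser. Below is a minimal sketch using only the Python standard library. The function names `is_ready` and `check_server` are illustrative, and the sketch assumes the server listens on 127.0.0.1:8191 and returns the `msg` field shown above:

```python
import json
from urllib.request import urlopen


def is_ready(index_body: dict) -> bool:
    """Return True if the index response reports that FlareSolverr is ready."""
    return index_body.get("msg") == "FlareSolverr is ready!"


def check_server(base_url: str = "http://127.0.0.1:8191") -> bool:
    """Fetch the index page of a running FlareSolverr server and check its message."""
    # urlopen raises URLError if the server is not reachable
    with urlopen(base_url, timeout=10) as resp:
        return is_ready(json.loads(resp.read().decode("utf-8")))


# usage (requires the Docker container to be running):
# print(check_server())
```

Comparing the exact message string is a simplification; a looser check, such as only verifying that the response parses as JSON and has a `version` field, may be more robust across releases.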

To web scrape using FlareSolverr, we need to interact with its server using HTTP requests. For this, we'll be using httpx. It can be installed using pip:

pip install httpx

How to Use FlareSolverr?

The main functionality of FlareSolverr is fairly straightforward. We route requests to the FlareSolverr server, which executes them using Selenium and the Undetected ChromeDriver. It also allows for sending POST requests, managing sessions and adding proxies. Let's start with simple GET requests.

Sending GET Requests

To send requests using FlareSolverr, we need to send a POST request to the FlareSolverr URL and pass the request instructions through the request body:

curl -L -X POST 'http://localhost:8191/v1' \
-H 'Content-Type: application/json' \
--data-raw '{
  "cmd": "request.get",
  "url": "http://www.google.com/",
  "maxTimeout": 60000
}'

The above request body is the minimal payload that FlareSolverr accepts. We specify the URL, the request timeout in milliseconds (maxTimeout) and the request type, GET or POST (cmd).

Let's replicate the above request using httpx and observe the result:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds: keep it slightly above maxTimeout (60000 ms)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://google.com")
print(response.text)

We define a send_get_request() function. It requests the FlareSolverr URL with the target website URL and the request payload. Here is the result we got:

{
  "status": "ok",
  "message": "Challenge not detected!",
  "solution": {
    "url": "https://www.google.com/",
    "status": 200,
    "cookies": [
      {
        "domain": ".google.com",
        "expiry": 1721920212,
        "httpOnly": true,
        "name": "NID",
        "path": "/",
        "sameSite": "None",
        "secure": true,
        "value": "511=k58ibgnzvwsZ5YdKvKFbBipVVUYc0XLGbFrNiu_nNTk3dsUR24-xZ6H3XmGiP-1dWXH15MyynGY-z1CIt3HddjzwC5YD5ZQb8g5eU9CwQp993tUypxby2VSGPTEXjG-fnlpvi199oEu0AH7kR_FbWRHECbzEdT6xc8Zu4gHiinjb6zNWkQ31vnU"
      },
      ....
    ],
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "headers": {},
    "response": "<html> .... </html>"
  },
  "startTimestamp": 1706109011335,
  "endTimestamp": 1706109014793,
  "version": "3.3.13"
}

From the response, we can see all the cookie values assigned to the request. The headers field is also returned, but it's empty since no headers were set. Finally, we got the page HTML.
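To make these fields easier to inspect, the parsed JSON can be reduced to a short summary. The sketch below is one possible helper, assuming the response shape shown above; the name `summarize` and the keys of the returned dictionary are illustrative and not part of FlareSolverr:

```python
def summarize(result: dict) -> dict:
    """Condense a parsed FlareSolverr response into a few key facts."""
    solution = result.get("solution", {})  # absent on error responses
    return {
        "ok": result.get("status") == "ok",
        "message": result.get("message", ""),
        "http_status": solution.get("status"),
        "cookie_names": [c["name"] for c in solution.get("cookies", [])],
        # both timestamps are epoch milliseconds
        "elapsed_ms": result["endTimestamp"] - result["startTimestamp"],
        "html_length": len(solution.get("response", "")),
    }


# usage with the httpx response from the example above:
# print(summarize(response.json()))
```

The elapsed time is handy for comparing how long a plain page takes against one where a challenge has to be solved.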

We have requested a Google page without a Cloudflare challenge. Let's put FlareSolverr into action by requesting nowsecure.nl. It is a simple page that uses the Cloudflare shield:

Let's attempt to bypass this page's Cloudflare challenge using FlareSolverr:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds: keep it slightly above maxTimeout (60000 ms)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://nowsecure.nl/")
print(response.text)

If we take a look at the response, we'll find the Cloudflare challenge bypassed!

{
  "status": "ok",
  "message": "Challenge solved!",
  "solution": {
    "url": "https://nowsecure.nl/",
    "status": 200,
    "cookies": [
      {
        "domain": ".nowsecure.nl",
        "expiry": 1737648300,
        "httpOnly": true,
        "name": "cf_clearance",
        "path": "/",
        "sameSite": "None",
        "secure": true,
        "value": "iDVfZ0_So4n_2d7w9q8RRBl8tUktOzdT9g9NL7JrUiM-1706112302-1-ASeuvc/28aIp0ZLlSbMmwDBW9A0rGbi/APO9w90KWdh1OI0QfsjmSr/gSVVjHb8NPL8VQKsxiO5xhLxb8o206Yw="
      }
    ],
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "headers": {},
    "response": "<html> .... </html>"
  },
  "startTimestamp": 1706112295097,
  "endTimestamp": 1706112301786,
  "version": "3.3.13"
}

FlareSolverr has successfully bypassed the Cloudflare challenge and returned the session values. Let's have a look at how we can reuse this session.

Managing Sessions

We can grab the session values from FlareSolverr's responses and re-apply them with any HTTP client. However, FlareSolverr provides built-in methods for managing and reusing session values.
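For the first approach, re-applying the values with a plain HTTP client, a minimal sketch could look like the following. The helper name `client_kwargs_from_solution` is illustrative. It assumes the `solution` object has the shape shown in the earlier responses, and that the cookies should be sent together with the same User-Agent the headless browser used:

```python
def client_kwargs_from_solution(solution: dict) -> dict:
    """Turn a FlareSolverr `solution` object into cookies and headers for an HTTP client."""
    # flatten the cookie objects into a simple name -> value mapping
    cookies = {c["name"]: c["value"] for c in solution.get("cookies", [])}
    # reuse the browser's User-Agent alongside the cookies
    headers = {"User-Agent": solution["userAgent"]}
    return {"cookies": cookies, "headers": headers}


# usage with httpx, given a parsed FlareSolverr response `result`:
# import httpx
# with httpx.Client(**client_kwargs_from_solution(result["solution"])) as client:
#     page = client.get("https://nowsecure.nl/")
```

This mapping drops cookie attributes such as domain and path, which is usually acceptable when the client only talks to the one site the cookies came from.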

First, we have to create a session. This can be achieved using the sessions.create command, which launches a browser instance that retains its cookies until the session is destroyed:

import httpx

def send_get_request(url: str):
    """create a FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    # note: per the FlareSolverr README, sessions.create only takes the optional
    # "session" and "proxy" parameters, so "url" is most likely ignored here
    payload = {
        "cmd": "sessions.create",
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx (timeout is in seconds)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

for i in range(3):
    # create 3 sessions
    response = send_get_request(url="https://nowsecure.nl/")
    print(response.text)

In the above code, we change the FlareSolverr command to sessions.create to create a session and return its ID. Then, we call the function three times to create different sessions. The response would look like this:

{
  "status": "ok",
  "message": "Session created successfully.",
  "session": "e403c98c-bad5-11ee-8830-0242ac150002",
  "startTimestamp": 1706113831129,
  "endTimestamp": 1706113831774,
  "version": "3.3.13"
}

The next step is retrieving the stored sessions. This can be done using the sessions.list command:

import httpx

def retrieve_sessions():
    """retrieve FlareSolverr sessions"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.list"
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=60000)
    return response

response = retrieve_sessions()
print(response.text)

The response contains a list of the saved session IDs:

{
  "status": "ok",
  "message": "",
  "sessions": [
    "e3004b0a-bad5-11ee-aec0-0242ac150002",
    "e37000e4-bad5-11ee-b73b-0242ac150002",
    "e403c98c-bad5-11ee-8830-0242ac150002"
  ],
  "startTimestamp": 1706114460036,
  "endTimestamp": 1706114460036,
  "version": "3.3.13"
}

Now let's reuse one of these sessions by declaring the session ID in the request payload:

import httpx
import json

def retrieve_session():
    """retrieve FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.list"
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload)
    # pick the first stored session ID
    session = json.loads(response.text)["sessions"][0]
    return session

def request_with_session(url: str):
    """send a GET request with a FlareSolverr session"""
    session_id = retrieve_session()
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "session": session_id,
        "url": url,
        "maxTimeout": 60000
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds: keep it slightly above maxTimeout (60000 ms)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = request_with_session(url="https://nowsecure.nl/")
print(response.text)

Here, we define two functions, let's break them down:

  • retrieve_session for retrieving a session ID from FlareSolverr stored sessions.
  • request_with_session for requesting a page URL while reusing a session ID. It's the same as the previous code except for declaring the session ID in the session parameter of the request payload.

Reusing sessions comes in handy while scaling web scrapers. We can send a request to bypass Cloudflare and save the session, then reuse the session ID for future requests. This notably speeds up the requests' execution, as we don't have to bypass the challenge with each request.

The last feature we can use to manipulate FlareSolverr's sessions is deleting sessions using the sessions.destroy command:

import httpx

def delete_session(session_id: str):
    """destroy a FlareSolverr session"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "sessions.destroy",
        "session": session_id
    }
    # send the POST request using httpx
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload)
    return response

response = delete_session(session_id="e3004b0a-bad5-11ee-aec0-0242ac150002")
print(response.text)
'{"status": "ok", "message": "The session has been removed.", "startTimestamp": 1706118541437, "endTimestamp": 1706118541558, "version": "3.3.13"}'
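Since every created session keeps a browser instance alive until it is destroyed, it can help to tie the create and destroy commands together. The sketch below is one way to do this with a context manager. The names `flaresolverr_session`, `PostFn` and `httpx_post` are illustrative, and the local server URL is an assumption carried over from the examples above:

```python
from contextlib import contextmanager
from typing import Callable, Iterator

# a function that sends a command payload and returns the parsed JSON reply
PostFn = Callable[[dict], dict]


@contextmanager
def flaresolverr_session(post: PostFn) -> Iterator[str]:
    """Create a session, yield its ID, and always destroy it afterwards."""
    session_id = post({"cmd": "sessions.create"})["session"]
    try:
        yield session_id
    finally:
        # runs even if the scraping code inside the `with` block raises
        post({"cmd": "sessions.destroy", "session": session_id})


def httpx_post(payload: dict) -> dict:
    """One possible PostFn, using httpx against a local FlareSolverr server."""
    import httpx  # imported lazily so the helper above works without it

    response = httpx.post("http://localhost:8191/v1", json=payload, timeout=70)
    return response.json()


# usage (requires a running FlareSolverr server):
# with flaresolverr_session(httpx_post) as sid:
#     page = httpx_post({"cmd": "request.get", "session": sid,
#                        "url": "https://nowsecure.nl/", "maxTimeout": 60000})
```

Passing the sending function in as an argument keeps the lifecycle logic independent of the HTTP client, so it can be exercised with a stub before pointing it at a real server.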

Sending POST Requests

Sending POST requests with FlareSolverr is the same as sending GET requests. All we have to do is change the command to request.post and pass the request payload in the postData parameter as a URL-encoded string (application/x-www-form-urlencoded):

import httpx

def send_post_request(url: str, request_payload: str):
    """send a POST request using FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "url": url,
        "maxTimeout": 60000,
        "cmd": "request.post",
        "postData": request_payload
    }
    # send the POST request using httpx
    # httpx defaults to a 5 second timeout, which a browser-driven request can exceed
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_post_request(url="https://httpbin.dev/anything", request_payload="key1=value1&key2=value2")
print(response.text)
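Writing the postData string by hand gets error-prone once values contain spaces or special characters. As a small sketch, the standard library's `urlencode` can build it from a dictionary; the helper name `build_post_payload` is illustrative:

```python
from urllib.parse import urlencode


def build_post_payload(url: str, form: dict, max_timeout_ms: int = 60000) -> dict:
    """Build a request.post command with the form fields URL-encoded into postData."""
    return {
        "cmd": "request.post",
        "url": url,
        "maxTimeout": max_timeout_ms,
        # urlencode escapes reserved characters, e.g. "&" inside a value
        "postData": urlencode(form),
    }


payload = build_post_payload(
    "https://httpbin.dev/anything",
    {"key1": "value1", "key2": "value2"},
)
print(payload["postData"])
```

The resulting dictionary can be sent as the JSON body of the POST request to the FlareSolverr server, exactly like the hand-written payload above.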

Adding Proxies

Proxies in FlareSolverr can be added for all commands through the proxy parameter:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000,
        "proxy": {"url": "proxy_url", "username": "proxy_username", "password": "proxy_password"}
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds: keep it slightly above maxTimeout (60000 ms)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://nowsecure.nl")
print(response.text)

Using proxies in FlareSolverr for web scraping allows us to distribute our requests' traffic across different IP addresses. This will make it harder for Cloudflare to track the IP address origin, preventing IP address throttling and blocking. Refer to our previous article on IP address blocking for more details.

Related article: How to Avoid Web Scraper IP Blocking? Learn what Internet Protocol addresses are and how IP tracking technologies are used to block web scrapers.
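To actually spread requests across several IP addresses, the proxy parameter has to change between requests. The sketch below cycles through a proxy pool in round-robin order while building request.get payloads. The helper name `rotating_payloads` and the proxy URLs are hypothetical placeholders:

```python
from itertools import cycle
from typing import Iterable, Iterator, Sequence


def rotating_payloads(
    urls: Iterable[str],
    proxy_urls: Sequence[str],
    max_timeout_ms: int = 60000,
) -> Iterator[dict]:
    """Yield one request.get payload per URL, rotating through the proxy pool."""
    if not proxy_urls:
        raise ValueError("proxy_urls must contain at least one proxy")
    proxies = cycle(proxy_urls)  # endless round-robin over the pool
    for url in urls:
        yield {
            "cmd": "request.get",
            "url": url,
            "maxTimeout": max_timeout_ms,
            "proxy": {"url": next(proxies)},
        }


# hypothetical proxy pool, replace with real proxy addresses
pool = ["http://proxy-1.example:8080", "http://proxy-2.example:8080"]
for payload in rotating_payloads(["https://nowsecure.nl/", "https://nowsecure.nl/"], pool):
    print(payload["proxy"]["url"], payload["url"])
```

Round-robin is the simplest policy. Keep in mind that a clearance cookie obtained through one proxy may not be accepted when reused from another IP address, as noted in the FAQ below.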

FlareSolverr Limitations

We have successfully bypassed Cloudflare with FlareSolverr. However, it can fail to bypass Cloudflare with highly protected websites. For example, let's try to scrape a page on Zoominfo:

import httpx

def send_get_request(url: str):
    """send a GET request with FlareSolverr"""
    flaresolverr_url = "http://localhost:8191/v1"
    # JSON content type header
    r_headers = {"Content-Type": "application/json"}
    # request payload
    payload = {
        "url": url,
        "maxTimeout": 60000,
        "cmd": "request.get"
    }
    # send the POST request using httpx
    # httpx timeouts are in seconds: keep it slightly above maxTimeout (60000 ms)
    response = httpx.post(url=flaresolverr_url, headers=r_headers, json=payload, timeout=70)
    return response

response = send_get_request(url="https://www.zoominfo.com/c/tesla-inc/104333869")
print(response.text)

Unfortunately, FlareSolverr couldn't bypass the Cloudflare challenge and timed out:

{
  "status": "error",
  "message": "Error: Error solving the challenge. Timeout after 60.0 seconds.",
  "startTimestamp": 1706175412072,
  "endTimestamp": 1706175472889,
  "version": "3.3.13"
}

Let's take a look at a better alternative for getting around Cloudflare!

ScrapFly: FlareSolverr Alternative

ScrapFly is a web scraping API that provides an anti-scraping protection bypass to avoid any website blocking.

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system and we achieve this by:

  • Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.
  • Millions of self-healing proxies of the highest possible trust score.
  • Constantly evolving and adapting to new anti-bot systems.
  • We've been doing this publicly since 2020 with the best bypass on the market!

Here is how we can bypass Cloudflare protection on the previously failed example. All we have to do is replace our HTTP client with the ScrapFly client and enable the asp argument:

# standard web scraping code
import httpx
from parsel import Selector

response = httpx.get("some target website URL")
selector = Selector(response.text)

# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient

# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response = scrapfly.scrape(ScrapeConfig(
    url="https://www.zoominfo.com/c/tesla-inc/104333869",
    asp=True, # enable the anti scraping protection to bypass Cloudflare
    country="US", # set the proxy location to a specific country
    render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))
print(response.status_code)
"200"

# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']

FAQ

To wrap up this guide, let's have a look at frequently asked questions and common errors about bypassing Cloudflare with FlareSolverr.

Can FlareSolverr bypass Cloudflare?

Yes, you can use FlareSolverr to get around Cloudflare protection. However, FlareSolverr is limited to specific Cloudflare versions, and it can fail with highly protected websites.

The cookies provided by FlareSolverr are not valid.

This is a common issue encountered when the request or consumer doesn't use the same IP address as the one used by FlareSolverr. To resolve this issue, you can disable IPv6 for the FlareSolverr and consumer Docker containers. You can also disable the VPNs or proxies if used, as they can cause networking conflicts.

Error solving the challenge. Timeout after X seconds.

This error suggests a failure in bypassing Cloudflare. This might be due to an unsolvable challenge or a short timeout window in the requests. To resolve this error, you can attempt to increase the FlareSolverr timeout.
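Increasing the timeout can also be automated. The sketch below retries a request with progressively larger maxTimeout values, but only when the error message mentions a timeout. The function name `get_with_growing_timeout`, the timeout steps and the injected `post` callable are assumptions made for illustration:

```python
from typing import Callable, Sequence


def get_with_growing_timeout(
    post: Callable[[dict], dict],
    url: str,
    timeouts_ms: Sequence[int] = (60000, 120000, 180000),
) -> dict:
    """Retry request.get with larger maxTimeout values while FlareSolverr times out."""
    result: dict = {}
    for max_timeout in timeouts_ms:
        result = post({"cmd": "request.get", "url": url, "maxTimeout": max_timeout})
        timed_out = result.get("status") == "error" and "Timeout" in result.get("message", "")
        if not timed_out:
            # success, or an error that a longer timeout would not fix
            return result
    return result  # still timing out after the largest timeout


# usage, where `post` sends the payload to http://localhost:8191/v1 and returns parsed JSON;
# the HTTP client's own timeout must exceed the largest maxTimeout:
# result = get_with_growing_timeout(post, "https://nowsecure.nl/")
```

This only helps with slow challenges. If the challenge is unsolvable for FlareSolverr, every attempt times out and the last error response is returned.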

Summary

In this article, we explored the FlareSolverr tool. It bypasses Cloudflare by requesting web pages through a Selenium-driven browser with the Undetected ChromeDriver configuration.

We went through a step-by-step guide on installing FlareSolverr using Docker. We also explained how to web scrape using FlareSolverr by managing sessions, adding proxies and sending GET and POST requests.

Author: Foster Heidenreich CPA