Web Scraping without Getting Blocked
Web scraping publicly available data is generally legal, although some consider it a morally grey area. Many webmasters dislike web crawlers because of the extra load they place on servers, and because they don't want to make it easy for competitors to lift data straight off their websites, so they use various detection methods to stop crawlers in their tracks.
In this article, we’re going to give you some helpful tips on how to avoid getting blocked while web scraping.
A web scraping API with built-in proxy handling, such as Proxy API, can fetch the HTML of any page with a single API call.
Use Random Intervals between your Requests
It’s pretty obvious when a web scraper sends a request every second, because no human would browse a website in that manner. To avoid your scraper following a recognizable pattern, set random intervals between requests. This has several benefits aside from not being blocked.
For starters, sending too many requests too quickly can crash a website, much like a DoS (denial-of-service) attack, and this is especially true for smaller websites with limited resources. Slowing your requests down helps keep the web server from being overloaded, and it also helps keep platforms that offer web scraping tools from being banned, which would affect their other customers.
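As a minimal sketch, assuming Python with the requests library and placeholder example.com URLs, randomizing the pause between requests takes only a couple of lines:

```python
import random
import time

import requests

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)

    # Wait a random 2-10 seconds so request timing doesn't follow
    # the fixed, machine-like rhythm that detection systems look for.
    time.sleep(random.uniform(2, 10))
```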
You can also look at a website's robots.txt file, for example Reddit.com/robots.txt, to see whether any specific crawler bots are banned, or whether the site owner specifies a delay for bots, usually under a "Crawl-delay" directive.
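Python's standard library can read those rules for you. Here is a rough sketch using urllib.robotparser, with Reddit's robots.txt from the example above and a made-up user-agent string:

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://www.reddit.com/robots.txt")
robots.read()

user_agent = "my-scraper"  # hypothetical user agent

# Is this path allowed for our user agent at all?
allowed = robots.can_fetch(user_agent, "https://www.reddit.com/r/Python/")

# crawl_delay() returns the site's Crawl-delay in seconds for this
# user agent, or None if no delay is declared.
delay = robots.crawl_delay(user_agent)

print(f"allowed: {allowed}, crawl delay: {delay}")
```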
Avoid Hidden Link Traps
Many websites detect web scrapers by planting invisible links that only a bot would follow. One way to check whether a website is using hidden link traps is to look for links with "display: none" or "visibility: hidden" in their CSS properties, and avoid following those links, or you will be banned quite easily.
Another trick webmasters use is to set link text to the same color as the page background, effectively making it invisible to the human eye. You can check for properties like "color: #fff" on a white page, and when inspecting a page manually, select all of its text to make any invisible links visible.
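A rough sketch of those checks with requests and BeautifulSoup might look like the following; note that it only inspects inline style attributes, so traps hidden through external stylesheets would need a headless browser that can read computed styles:

```python
import requests
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package

HIDDEN_MARKERS = ("display:none", "visibility:hidden")
INVISIBLE_COLOURS = ("color:#fff", "color:#ffffff", "color:white")

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

safe_links = []
for link in soup.find_all("a", href=True):
    style = (link.get("style") or "").lower()
    # Normalise each inline declaration, e.g. "display: none" -> "display:none"
    declarations = {d.replace(" ", "") for d in style.split(";") if d.strip()}

    # Skip links hidden with inline CSS; a human would never click
    # these, so following them is a strong bot signal.
    if declarations & set(HIDDEN_MARKERS):
        continue

    # Naive check for white link text on a (presumed) white background.
    if declarations & set(INVISIBLE_COLOURS):
        continue

    safe_links.append(link["href"])

print(safe_links)
```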
Use different IP Addresses on Rotation
Because every request arrives tagged with the IP address it came from, examining IP addresses is the main method websites use to filter out web scrapers. What you want to do is avoid sending all of your requests from a single IP address by using IP rotation, which spreads your requests across a pool of different IP addresses. In practice this usually means routing traffic through proxies, and many providers sell rotating proxy services (often backed by residential IPs) for exactly this purpose.
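As an illustration only (the proxy addresses and credentials below are made up), rotating through a small pool of proxies with the requests library can be as simple as cycling over a list:

```python
import itertools

import requests

# Hypothetical proxy pool; in practice these would come from a
# rotating or residential proxy provider.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    proxy = next(proxy_cycle)  # round-robin through the pool
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )

response = fetch("https://example.com/products")
print(response.status_code)
```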
Use an Automated CAPTCHA solver
Many websites use CAPTCHAs to slow down web crawlers. CAPTCHAs come in different forms, such as solving a simple math problem, selecting matching photos out of a grid, or even dragging a slider bar. Which one you run into depends on how determined the website is to keep out automated traffic.
Fortunately, there are also services for defeating CAPTCHAs automatically, such as 2Captcha, Anticaptcha, Image Typerz, and EndCaptcha, to name a few. Most of the paid services charge per batch of solved CAPTCHAs, for example around $1.50 USD per 1,000 CAPTCHAs solved (a ballpark figure that varies by service).
Because CAPTCHA-solving services can be slow and quickly become expensive, you'll need to consider whether it's really worth scraping e-commerce websites that throw up a lot of CAPTCHA puzzles.
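To show how these services typically plug into a scraper, here is a hedged sketch against 2Captcha's classic in.php/res.php HTTP API for a reCAPTCHA; the endpoints and parameter names should be double-checked against the provider's current documentation, and the API key is a placeholder:

```python
import time

import requests

API_KEY = "YOUR_2CAPTCHA_KEY"  # placeholder

def solve_recaptcha(site_key, page_url):
    """Submit a reCAPTCHA to 2Captcha and poll until it is solved."""
    # Submit the task.
    submit = requests.post("http://2captcha.com/in.php", data={
        "key": API_KEY,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
        "json": 1,
    }).json()
    if submit["status"] != 1:
        raise RuntimeError(f"2Captcha error: {submit['request']}")
    task_id = submit["request"]

    # Poll every few seconds until a token (or an error) comes back.
    while True:
        time.sleep(5)
        result = requests.get("http://2captcha.com/res.php", params={
            "key": API_KEY,
            "action": "get",
            "id": task_id,
            "json": 1,
        }).json()
        if result["status"] == 1:
            return result["request"]  # the solved token
        if result["request"] != "CAPCHA_NOT_READY":
            raise RuntimeError(f"2Captcha error: {result['request']}")
```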
Scrape from the Google cache
If you want to scrape data that doesn't change very often, you might be better off scraping the cached version of a website directly from Google. Prepend "http://webcache.googleusercontent.com/search?q=cache:" to any website URL, and it should show you Google's latest cached version of that page.
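A minimal sketch with the requests library, using example.com as a stand-in for the page you actually want:

```python
import requests

CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

def fetch_cached(url):
    """Fetch Google's cached copy of a page instead of hitting the live site."""
    return requests.get(CACHE_PREFIX + url, timeout=10)

response = fetch_cached("https://example.com/some-page")
print(response.status_code)
print(response.text[:500])  # first 500 characters of the cached HTML
```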
This isn’t entirely foolproof, as some websites may instruct Google not to cache their data, and also lower-ranked sites may be farther out of date as Google doesn’t crawl them as often.