I discovered that several illicit websites have been scraping, reprocessing & re-serving copyrighted web content from CarComplaints.com in real time.
It’s an assholish way to do business.
Here’s how the scam works:
- An unsuspecting visitor to one of these illicit websites requests a web page.
- The web server passes the request to the content scraper bot.
- The scraper bot script makes a web request to the legitimate website & reprocesses (steals) the content.
- The scraper bot transmits the stolen content back to the illicit web server.
- The web server serves the stolen content back to the site visitor.
This content scraping happens in real time, in the background, over the few seconds the visitor’s browser sits there waiting.
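In Apache terms, the scam site is little more than a reverse proxy with a content-rewriting step bolted on. As a rough illustration only (the real operations run the scraper bot as a separate script on separate hosting), the proxying half boils down to something like this mod_proxy config in a virtual host:

# Illustration only: every incoming request is fetched live
# from the legitimate site & passed back to the visitor.
ProxyPass        / https://www.carcomplaints.com/
ProxyPassReverse / https://www.carcomplaints.com/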
The first content scraper site I discovered replaced “CarComplaints.com” everywhere it appeared in the HTML with the name of the illicit website, & also replaced the advertising so that the scammers earned the ad revenue instead of my company. Evil!
The largest offender so far was the website carcomplaints.xyz, which has since been shut down after I filed complaints with their ISP. They had managed to get ~9,150 pages indexed by Google, which are (hopefully) now in the process of being removed. Their entire site was a duplicate of mine, with every page scraped from my site & returned to their visitors in real time. The scam website was hosted on a different IP & service from the content scraper, but it was easy to track down: I requested a bogus page on the scam website, then watched the corresponding scrape request hit my site while tailing the Apache access log.
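Here’s that bogus-page trick as a pair of shell commands (the scam domain & Apache log path below are placeholders; adjust for your setup):

# Terminal 1: watch your own access log
tail -f /var/log/apache2/access.log
# Terminal 2: request a made-up page on the scam website
curl -s 'https://scam-site.example/bogus-page-xyz123/' > /dev/null
# Within a few seconds, a request for /bogus-page-xyz123/ appears
# in your log, revealing the IP the content scraper runs from.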
Once I identified these content scrape requests, I reviewed my access log & found many similar requests coming from other IPs, but I couldn’t find the corresponding scam websites. There’s no way to tell from the log which website a scrape request is serving, but you can still go after the ISP that’s hosting the content scrapers.
For now the scraper bots are using the useragent “Go-http-client/1.1”, which is the default user agent sent by Go’s standard HTTP library.
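That makes the scraper traffic easy to pull out of the log. Assuming the default Apache combined log format (client IP in the first field) & a typical log path, this lists the scraper IPs by request count:

grep 'Go-http-client' /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn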
Many of the scraper bots use Amazon AWS as the host. To file a complaint, email details including log files to abuse@amazonaws.com — generally AWS is pretty good about taking care of it, but you will need to prove there’s been an AWS Acceptable Use Policy violation or else AWS simply passes your complaint on to their customer.
To establish an AUP violation, ban the Go-http-client useragent in your robots.txt file; AWS requires any clients operating web crawlers to follow the robots.txt convention. I couldn’t find any sign that these IPs had ever requested robots.txt, but I added the ban anyway so AWS could take further steps against their client.
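The robots.txt ban is just two lines:

User-agent: Go-http-client
Disallow: /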
Until the scammers change the useragent, you can also ban that traffic by returning a 403 Forbidden response using a RewriteRule in .htaccess:
RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^robots\.txt - [F]
Or have a bit more fun with the scammers & redirect their content scraper requests to a copyright violation notice page:
RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^(robots\.txt|copyright_violation\.html) /copyright_violation.html [R,L]
Or the FBI Cyber Crime page:
RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^robots\.txt https://www.fbi.gov/investigate/cyber/ [R,L]
NOTE: These examples assume you already have mod_rewrite enabled & “RewriteEngine On”.