A windsurfing, CSS-grudging, IE-hating, web-developing, gigantic-machine-puzzling blog

Category: abuse

SCAM: Web Content Scraping in Realtime

I discovered that several illicit websites have been scraping, reprocessing & re-serving copyrighted web content from my site in real-time.

It’s an assholish way to do business.

Here’s how the scam works:

  1. An unsuspecting visitor to one of these illicit websites requests a web page.
  2. The web server passes the request to the content scraper bot.
  3. The scraper bot script makes a web request to the legitimate website & reprocesses (steals) the content.
  4. The scraper bot transmits the stolen content back to the illicit web server.
  5. The web server serves the stolen content back to the site visitor.

This content-scraping happens in realtime, in the background over a few seconds as the visitor’s browser sits there waiting.

The first content scraper site I discovered was replacing “” anywhere it appeared in the HTML code with the name of the illicit website, & also replacing advertising so that the scammers earned the ad revenue instead of my company. Evil!

The largest offender so far was the website, which has since been shut down after I filed complaints with their ISP. They had managed to get ~9,150 pages indexed by Google, which are (hopefully) in the process of being removed sometime soon. Their entire site was a duplicate of mine with all pages scraped from my site & returned to their visitors in realtime. The scam website was hosted on a different IP & service from the content scraper, but it was easy to track down by requesting a bogus page on the scam website & then watching the content scrape request hit my site by tailing the Apache access log.
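That tracking trick can be sketched in a few lines of Python, assuming a standard Apache combined log format (the helper name and sample path below are my own illustration, not from the original post): request a unique bogus page on the scam site, then search your own access log for the same path and note which client IP made it.

```python
import re

def scrape_hits(log_lines, path):
    """Return the client IPs of combined-format access-log lines
    that requested exactly `path` via GET or HEAD."""
    hits = []
    for line in log_lines:
        # combined log format: client IP first, then "METHOD path ..." in quotes
        m = re.match(r'^(\S+) .*"(?:GET|HEAD) (\S+)', line)
        if m and m.group(2) == path:
            hits.append(m.group(1))
    return hits
```

Request something like /bogus-page-xyz123.html on the scam site with a browser or curl, then feed your access log lines to scrape_hits with that path; any IP it returns is the content scraper.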

Once I identified these content scrape requests, I reviewed my access log & found many similar requests being made from other IPs, but I couldn’t find the corresponding scam websites. It’s impossible to track down which website these requests were originating from, but you can still go after the ISP that’s hosting the content scrapers.
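For that log-review step, here's a hedged sketch of my own (not the author's tooling) that counts requests per client IP where the user agent matches the scrapers' Go-http-client signature, again assuming a standard Apache combined log:

```python
import re
from collections import Counter

UA_PATTERN = re.compile(r'Go-http-client', re.IGNORECASE)
# combined log format: client IP is the first field,
# the user agent is the last quoted field
LINE_PATTERN = re.compile(r'^(\S+) .*"([^"]*)"$')

def scraper_ips(log_lines):
    """Count requests per client IP whose user agent matches the scraper."""
    counts = Counter()
    for line in log_lines:
        m = LINE_PATTERN.match(line.rstrip())
        if m and UA_PATTERN.search(m.group(2)):
            counts[m.group(1)] += 1
    return counts
```

Sorting the resulting Counter by count gives a quick shortlist of scraper IPs to report to their hosting providers.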

For now the scraper bots are using the useragent “Go-http-client/1.1”.

Many of the scraper bots use Amazon AWS as the host. To file a complaint, email details including log files to — generally AWS is pretty good about taking care of it, but you will need to prove there’s been an AWS Acceptable Use Policy violation or else AWS simply passes your complaint on to their customer.

To establish an AUP violation, ban the Go-http-client useragent in your robots.txt file. AWS requires any clients operating web crawlers to follow the robots.txt convention. I couldn’t find any sign that these IPs had ever requested robots.txt, but I added the ban anyway so that AWS could take further steps against their client.
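Assuming the standard robots.txt convention, that ban would look like this (whether the scrapers honor it is another matter; mine never even requested the file):

```
User-agent: Go-http-client
Disallow: /
```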

Until the scammers change the useragent, you can also ban that traffic by returning a 403 Forbidden response using a RewriteRule in .htaccess:

RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^robots\.txt - [F]

Or have a bit more fun with the scammers & redirect their content scraper requests to a copyright violation notice page:

RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^(robots\.txt|copyright_violation\.html) /copyright_violation.html [R,L]

Or the FBI Cyber Crime page:

RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^robots\.txt https://www.fbi.gov/investigate/cyber [R,L]

NOTE: These examples assume you already have mod_rewrite enabled & “RewriteEngine On”.

Bad Crawler Bots:

Found this bot accessing the site via lots of different 202.46.* IPs. Reverse DNS points to

The range of IPs from 202.46.32.0 to 202.46.63.255 is associated with ShenZhen Sunrise Technology Co., Ltd.

This is how to ban them via .htaccess RewriteRule:

## ban
RewriteCond %{REMOTE_ADDR} ^202\.46\.(3[2-9]|[4-5][0-9]|6[0-3])\.
RewriteRule !^robots\.txt - [F]
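As a sanity check, the IP pattern above can be exercised in Python; it should cover 202.46.32.* through 202.46.63.* and nothing outside that range (this little harness is my own, not part of the original post):

```python
import re

# same pattern as the RewriteCond above
BAN = re.compile(r'^202\.46\.(3[2-9]|[4-5][0-9]|6[0-3])\.')

def banned(ip):
    """True if the IP falls in the 202.46.32.0 - 202.46.63.255 ban range."""
    return bool(BAN.match(ip))
```

The alternation reads as: third octet 32-39, 40-59, or 60-63, with the trailing `\.` ensuring octet boundaries so e.g. 202.46.320.x can't sneak through.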

Optionally, you can add this RewriteCond for the useragent they happen to be using at the moment:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(Windows\ NT\ 5\.1;\ rv:6\.0\.2\)\ Gecko/20100101\ Firefox/6\.0\.2

However, the IP ban is specific to the range owned by the company, so personally I wouldn’t bother using that useragent criteria. They could just change it at any time.

I did see they made several requests to robots.txt, but without a proper user agent identifying this bot as a crawler, your guess is as good as mine how to ban it in robots.txt, perhaps:

Disallow: /

Bad Crawler Bots: Proximic, CrystalSemantics, Grapeshot, Gigavenue

Every so often I go through the error logs & watch for server abuse. The latest review found a few new players: Crystal Semantics, Grapeshot, Gigavenue & Mangoway.


Crystal Semantics does after-the-fact contextual advertising. They crawl your pages after an ad is shown. Risky Internet covers this topic well:

Since we do not need a whole series of Ad crawlers making a business out of stealing bandwidth and each on their own reloading pages, the ONLY valid solution is that the seller of the ad-space (whether they are Google Ads or other) deliver the valid classification, since they are the first to crawl the page. No need to have a whole series of separate companies scrape off the same page, and adding more load to all sites, just to make their own business out of it.

Amen to that. Normally I wouldn’t mind so much, but in all their HTTP requests they’ve been accessing the path portion of the URL in all-lowercase. We use mixed case, so they’ve been getting a gazillion 404 Page Not Found errors. Probably sloppy coding somewhere between their ad agency partner & their service — but after months of 404 errors, they’ve had plenty of opportunity to discover the problem through self-monitoring & fix it.


Grapeshot is basically a repeat of the above, except they apparently & rather arrogantly don’t comply with robots.txt. Not quite as many 404 errors as Crystal Semantics had, but I don’t agree with the whole post-ad-serving contextual value-added crawl business model.


Gigavenue is evil. They’re crawling the site like crazy from multiple IPs but don’t use a unique user-agent. Zero information about their crawler. Emails to all three email addresses listed on their site bounce. I tried contacting Adam D. Binder via LinkedIn & we’ll see how it goes.

So the changes to robots.txt:

User-agent: crystalsemantics
Disallow: /
User-agent: grapeshot
Disallow: /

Gigavenue doesn’t publish robots.txt info so your guess is as good as mine what robots.txt useragent to use for them.

For good measure, ban them in .htaccess too:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Ruby|proximic|CrystalSemanticsBot|GrapeshotCrawler|Mangoway) [NC,OR]
RewriteCond %{REMOTE_ADDR} ^(208\.78\.85|208\.66\.97|208\.66\.100)
RewriteRule !^robots\.txt - [F]

This bans the better-behaved crawlers by user agent and the evil services that lack one by IP, returning a 403 Forbidden response for everything except robots.txt, which they can still access to learn the nice way that they’re disallowed from crawling the site.

NOTE: These IPs in the example code are now several years old & probably aren’t correct anymore. They are only meant to serve as an example of how to ban these & similar services, if you choose to do that.

404 Not Found Abuse: oggiPlayerLoader.htm

In a refreshingly proactive turn of events, one Amazon AWS abuser replied to me directly. The oggiPlayerLoader.htm 404 errors detailed in my previous web abuse post were courtesy of Oggifinogi, a rich media provider based out of Bellevue, WA.

Director of Technology Paul Grinchenko emailed me back with a friendly explanation:

We are just looking for our IFRAME buster. You were running at least 1 of our ads in an IFRAME.

No surprise there. We have no prior relationship with Oggifinogi, so I figured their ads had been served through one of the 3rd party ad networks we use (turns out it was ValueClick).

Luckily the issue is simpler than that — Amazon AWS prohibits them from 404-bombing our servers at “an excessive or disruptive rate”. My reply to Paul:

As you probably saw from the “comments” I provided, my complaint was your service’s excessive HEAD requests to the same 6 non-existent files. Judging from the excessively long-term & repetitive 404 errors, it seems your service does nothing useful with the “not found” status code returned by our servers each time. Oggifinogi would be better off using a more responsible system: monitor HTTP response codes to your iframe buster requests, & use that information to limit requests when the files clearly don’t exist. By the way, I somewhat appreciate your service’s HEAD requests versus a full GET, but it’s a bandaid.
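The “more responsible system” suggested in that reply amounts to simple negative caching. This is my own illustration (the class and function names are hypothetical and the one-hour TTL arbitrary), not Oggifinogi’s actual implementation:

```python
import time
import urllib.request
import urllib.error

class NotFoundCache:
    """Remember which URLs recently 404'd so we don't re-request them."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.missing = {}  # url -> timestamp of the last 404

    def should_request(self, url):
        seen = self.missing.get(url)
        return seen is None or (time.time() - seen) > self.ttl

    def record_miss(self, url):
        self.missing[url] = time.time()

def head_exists(url, cache):
    """HEAD-request url unless a recent 404 is cached; cache new misses."""
    if not cache.should_request(url):
        return False
    try:
        urllib.request.urlopen(urllib.request.Request(url, method="HEAD"),
                               timeout=10)
        return True
    except urllib.error.HTTPError:
        cache.record_miss(url)
        return False
```

With something like this in front of the iframe-buster probe, a file that clearly doesn’t exist gets at most one request per TTL window instead of an endless stream of 404s.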

Also I urge you to consider Amazon’s advice: We would strongly recommend that you customize your UserAgent string to include a contact email address so as to allow target sites being crawled to be able to contact you directly. (…although most responsible web services I’ve come across put a URL in their user agent, rather than an email address…)

A few hours later, Paul replied that Oggifinogi does indeed cache iframe buster file presence for a short period, so their requests should not exceed 75 per hour. That fits the profile I saw — no real strain on the web server, but very annoying when tailing error logs.

The good news is Paul agreed to start using an Oggifinogi user agent — hopefully with a help/explanation page URL too.

Paul also sent me the oggiPlayerLoader.htm instructions. Now Oggifinogi can bust our iframes at will, rather than continuing the 404 war. In case anyone else out there wants to join the peace process:

Instructions for Publishers:

  1. Please download and unpack – External link for Pubs to Download
  2. Make sure that unpacked version is called oggiPlayerLoader.htm
  3. Copy oggiPlayerLoader.htm to just one of the following locations – single location is enough:

Please make sure that the resulting location is accessible from outside. The location shouldn’t be protected. For example, you should be able to open the URL in a browser without entering any credentials.

… Working to improve the Interweb, one 404 error at a time.

Web crawl abuse from Amazon AWS-hosted projects

I’ve been keeping an eye on the error log lately, watching for phishing attempts, misbehaving bots/scripts, & other random stupidity. Turns out the major offenders have something in common — they’re hosted on Amazon’s AWS platform.

One Amazon AWS customer was crawling pages in bursts at up to 100 per minute, but referencing our mixed-case URLs in all lowercase — racking up several hundred thousand 404 errors over several weeks. Luckily they had a “Ruby” user agent (Ruby script’s HTTP request?) … bye bye Ruby, at least until you change user agents.

Another Amazon AWS customer was requesting oggiPlayerLoader.htm in various locations. Anyone know what this “Frame Booster” is part of? (UPDATE: see my followup about Oggifinogi). Luckily they use a HEAD request, so those got banned too along with some other esoteric request methods suggested by Perishable Press.

RewriteCond %{HTTP_USER_AGENT} "Ruby" [NC,OR]
RewriteCond %{REQUEST_METHOD} ^(delete|head|trace|track) [NC]
RewriteRule ^(.*)$ - [F,L]

I cheerily reported both cases of AWS abuse to Amazon via their web abuse form. Turns out the abuse form is there only to mess with your head. Some form data has to be space-separated while other data must be comma-separated. Fields where you list IPs & URLs barely fit a single entry, much less multiple items. And good luck cutting your access log snippet down to their 2000 character limit. Amazon just launched their Cloud Drive — zillions of decaquintillobytes of storage space — but can they handle processing a few hundred lines of server logs? Nope.

The kicker is if they do accept, verify, & pass on your complaint to their AWS customer, Amazon won’t provide any details about the offender so that you could, oh I don’t know, blog mean things about them. You’ll need a subpoena for that.

Moving on to abuse not related to AWS — people are referencing themes/default/style.css all over the place. The requests look legitimate, from various random IPs & user agents, so I’m guessing it’s a misbehaving browser plugin. Searching Google indicates it could be something called OpenScape, which I didn’t have time to research. Anyone know what that’s all about? Those got forbidden…

RewriteRule themes/default/style.css$ - [F,L]

And finally there’s Microsoft. For about a year, MSNBot has managed to take legitimate page URLs & tack Javascript onto the end, as in “/Kia/Sephia/2001/engine/this.options[this.selectedIndex].value;”. Only Microsoft could manage that.
