Netscraps

A windsurfing, CSS-grudging, IE-hating, web-developing, gigantic-machine-puzzling blog


Bad Crawler Bots: ptr.cnsat.com.cn

Found this bot accessing the site via lots of different 202.46.* IPs. Reverse DNS points to ptr.cnsat.com.cn.

The range of IPs from 202.46.32.0 to 202.46.63.255 (202.46.32.0/19 in CIDR notation) is associated with ShenZhen Sunrise Technology Co., Ltd.

Here's how to ban that range via an .htaccess RewriteRule:

## ban ptr.cnsat.com.cn
RewriteCond %{REMOTE_ADDR} ^202\.46\.(3[2-9]|[4-5][0-9]|6[0-3])\.
RewriteRule !^robots\.txt - [F]
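
Since that range is exactly 202.46.32.0/19, on Apache 2.4+ you could get the same effect with mod_authz_core instead of a regex. A minimal sketch, assuming your AllowOverride settings permit Require in .htaccess; note this variant blocks robots.txt too, unlike the RewriteRule above:

## ban 202.46.32.0/19 via mod_authz_core (Apache 2.4+)
<RequireAll>
    Require all granted
    Require not ip 202.46.32.0/19
</RequireAll>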

Optionally, you can add this RewriteCond for the useragent they happen to be using at the moment:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(Windows\ NT\ 5\.1;\ rv:6\.0\.2\)\ Gecko/20100101\ Firefox/6\.0\.2

However, the IP ban already covers the entire range owned by the company, so personally I wouldn't bother with that useragent criterion; they could just change it at any time.
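
For what it's worth, if you did want both checks, consecutive RewriteCond lines AND together by default, so the combined ban would look like this (a sketch reusing the two conditions above):

## ban ptr.cnsat.com.cn only when both the IP range & the useragent match
RewriteCond %{REMOTE_ADDR} ^202\.46\.(3[2-9]|[4-5][0-9]|6[0-3])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(Windows\ NT\ 5\.1;\ rv:6\.0\.2\)
RewriteRule !^robots\.txt - [F]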

I did see they made several requests to robots.txt, but without a proper user agent identifying this bot as a crawler, your guess is as good as mine as to how to ban it in robots.txt; perhaps:

User-Agent: ptr.cnsat.com.cn
Disallow: /

Bad Crawler Bots: Proximic, CrystalSemantics, Grapeshot, Gigavenue

Every so often I go through the CarComplaints.com error logs & watch for server abuse. The latest review found a few new players: Crystal Semantics, Grapeshot, Gigavenue & Mangoway.

CrystalSemantics

Crystal Semantics does after-the-fact contextual advertising. They crawl your pages after an ad is shown. Risky Internet covers this topic well:

Since we do not need a whole series of Ad crawlers making a business out of stealing bandwidth and each on their own reloading pages, the ONLY valid solution is that the seller of the ad-space (whether they are Google Ads or other) deliver the valid classification, since they are the first to crawl the page. No need to have a whole series of separate companies scrape off the same page, and adding more load to all sites, just to make their own business out of it.

Amen to that. Normally I wouldn't mind so much, but in all their HTTP requests they've been accessing the path portion of the URL in all-lowercase. We use mixed case, so they've been getting a gazillion 404 Page Not Found errors. Probably sloppy coding somewhere between their ad agency partner & their service, but after months of 404 errors, they've had plenty of opportunity to discover the problem through self-monitoring & fix it.
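
As an aside, if you'd rather serve those case-mangled requests than 404 them, Apache's mod_speling can redirect case-only mismatches to the correct URL. A minimal sketch, assuming the module is loaded & your Apache is recent enough to have CheckCaseOnly:

## redirect case-only URL mismatches instead of returning 404 (needs mod_speling)
CheckSpelling On
CheckCaseOnly On

That trades the 404 noise for an extra redirect on every bad request, so whether it's worth enabling site-wide is debatable.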

Grapeshot

Basically a repeat of the above, except they apparently & rather arrogantly don't comply with robots.txt. Not quite as many 404 errors as Crystal Semantics had, but I don't agree with the whole post-ad-serving contextual value-added crawl business model.

Gigavenue

Evil. They’re crawling the site like crazy from multiple IPs but don’t use a unique user-agent. Zero information about their crawler. Emails to all three email addresses listed on gigavenue.com bounce (info@gigavenue.com, info@arcscale.com, noc@arcscale.com). I tried contacting Adam D. Binder via LinkedIn & we’ll see how it goes.

So the changes to robots.txt:

User-agent: crystalsemantics
Disallow: /
User-agent: grapeshot
Disallow: /

Gigavenue doesn’t publish robots.txt info, so your guess is as good as mine as to what robots.txt useragent to use for them.

For good measure, ban them in .htaccess too:

RewriteEngine On
## ban bad ad crawlers by useragent or by IP
RewriteCond %{HTTP_USER_AGENT} (Ruby|proximic|CrystalSemanticsBot|GrapeshotCrawler|Mangoway) [NC,OR]
RewriteCond %{REMOTE_ADDR} ^(208\.78\.85|208\.66\.97|208\.66\.100)\.
RewriteRule !^robots\.txt - [F]

This bans the better-behaved crawlers (the ones that have a UserAgent) by UserAgent, and the evil services that don't by IP, sending them all a 403 Forbidden response; they can still access robots.txt to find out the nice way that they're disallowed from crawling the site.

NOTE: These IPs in the example code are now several years old & probably aren’t correct anymore. They are only meant to serve as an example of how to ban these & similar services, if you choose to do that.
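
One way around stale IPs is to match on the reverse-DNS hostname instead; mod_authz_host can do that, at the cost of a double-reverse DNS lookup on every request. A sketch for Apache 2.4+, using the first post's cnsat.com.cn as the example domain:

## ban by reverse-DNS hostname instead of hard-coded IPs (adds DNS lookups per request)
<RequireAll>
    Require all granted
    Require not host cnsat.com.cn
</RequireAll>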

404 Not Found Abuse: oggiPlayerLoader.htm

In a refreshingly proactive turn of events, one Amazon AWS abuser replied to me directly. The oggiPlayerLoader.htm 404 errors detailed in my previous web abuse post were courtesy of Oggifinogi, a rich media provider based out of Bellevue, WA.

Director of Technology Paul Grinchenko emailed me back with a friendly explanation:

We are just looking for our IFRAME buster. You were running at least 1 of our ads in an IFRAME.

No surprise there. We have no prior relationship with Oggifinogi, so I figured their ads had been served through one of the 3rd party ad networks we use (turns out it was ValueClick).

Luckily the issue is simpler than that — Amazon AWS prohibits them from 404-bombing our servers at “an excessive or disruptive rate”. My reply to Paul:

As you probably saw from the “comments” I provided, my complaint was your service’s excessive HEAD requests to the same 6 non-existent files. Judging from the excessively long-term & repetitive 404 errors, it seems your service does nothing useful with the “not found” status code returned by our servers each time. Oggifinogi would be better off using a more responsible system: monitor HTTP response codes to your iframe buster requests, & use that information to limit requests when the files clearly don’t exist. By the way, I somewhat appreciate your service’s HEAD requests versus a full GET, but it’s a bandaid.

Also I urge you to consider Amazon’s advice: We would strongly recommend that you customize your UserAgent string to include a contact email address so as to allow target sites being crawled to be able to contact you directly. (…although most responsible web services I’ve come across put a URL in their user agent, rather than an email address…)

A few hours later, Paul replied that Oggifinogi does indeed cache iframe buster file presence for a short period, so their requests should not exceed 75 per hour. That fits the profile I saw — no real strain on the web server, but very annoying when tailing error logs.

The good news is Paul agreed to start using an Oggifinogi user agent — hopefully with a help/explanation page URL too.

Paul also sent me the oggiPlayerLoader.htm instructions. Now Oggifinogi can bust our iframes at will, rather than continuing the 404 war. In case anyone else out there wants to join the peace process:

Instructions for Publishers:

  1. Please download and unpack oggiPlayerLoader.zip – External link for Pubs to Download oggiPlayerLoader.zip
  2. Make sure that unpacked version is called oggiPlayerLoader.htm
  3. Copy oggiPlayerLoader.htm to just one of the following locations – single location is enough:

Please make sure that resulting location is accessible from outside. Location shouldn’t be protected. For example you should be able to open in the browser URL http://www.yoursite.com/oggiPlayerLoader.htm without entering any credentials.

… Working to improve the Interweb, one 404 error at a time.

Web crawl abuse from Amazon AWS-hosted projects

I’ve been keeping an eye on the CarComplaints.com error log lately, watching for phishing attempts, misbehaving bots/scripts, & other random stupidity. Turns out the major offenders have something in common — they’re hosted on Amazon’s AWS platform.

One Amazon AWS customer was crawling pages in bursts at up to 100 per minute, but referencing our mixed-case URLs in all lowercase — racking up several hundred thousand 404 errors over several weeks. Luckily they had a “Ruby” user agent (Ruby script’s HTTP request?) … bye bye Ruby, at least until you change user agents.

Another Amazon AWS customer was requesting oggiPlayerLoader.htm in various locations. Anyone know what this “Frame Booster” is part of? (UPDATE: see my followup about Oggifinogi). Luckily they use a HEAD request, so those got banned too along with some other esoteric request methods suggested by Perishable Press.

## ban the "Ruby" useragent & some esoteric request methods
RewriteCond %{HTTP_USER_AGENT} Ruby [NC,OR]
RewriteCond %{REQUEST_METHOD} ^(delete|head|trace|track) [NC]
RewriteRule ^(.*)$ - [F,L]

I cheerily reported both cases of AWS abuse to Amazon via their web abuse form. Turns out the abuse form is there only to mess with your head. Some form data has to be space-separated while other data must be comma-separated. Fields where you list IPs & URLs barely fit a single entry, much less multiple items. And good luck cutting your access log snippet down to their 2000 character limit. Amazon just launched their Cloud Drive — zillions of decaquintillobytes of storage space — but can they handle processing a few hundred lines of server logs? Nope.

The kicker is that if they do accept, verify, & pass on your complaint to their AWS customer, Amazon won't provide any details about the offender so that you could, oh I don't know, blog mean things about them. You'll need a subpoena for that.

Moving on to abuse not related to AWS — people are referencing themes/default/style.css all over the place. The requests look legitimate, from various random IPs & user agents, so I’m guessing it’s a misbehaving browser plugin. Searching Google indicates it could be something called OpenScape, which I didn’t have time to research. Anyone know what that’s all about? Those got forbidden…

## forbid the mystery stylesheet requests
RewriteRule themes/default/style\.css$ - [F,L]

And finally there's Microsoft. For about a year, MSNBot has managed to take legitimate page URLs & tack JavaScript onto the end, as in /Kia/Sephia/2001/engine/this.options[this.selectedIndex].value; Only Microsoft could manage that.
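
If the mangled requests ever get annoying enough to ban, here's a hedged sketch along the lines of the other rules here (the URL pattern is only an assumption based on the example above):

## return 410 Gone for URLs with MSNBot's tacked-on JavaScript
## pattern assumed from the example above; adjust as needed
RewriteCond %{REQUEST_URI} this\.options\[this\.selectedIndex\]\.value;?$
RewriteRule .* - [G]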
