Bad Web Crawlers

Every so often I go through the CarComplaints.com error logs & watch for server abuse. The latest review found a few new players: Crystal Semantics, Grapeshot, Gigavenue & Mangoway.

Crystal Semantics

Crystal Semantics does after-the-fact contextual advertising. They crawl your pages after an ad is shown. Risky Internet covers this topic well:

Since we do not need a whole series of Ad crawlers making a business out of stealing bandwidth and each on their own reloading pages, the ONLY valid solution is that the seller of the ad-space (whether they are Google Ads or other) deliver the valid classification, since they are the first to crawl the page. No need to have a whole series of separate companies scrape off the same page, and adding more load to all sites, just to make their own business out of it.

Amen to that. Normally I wouldn’t mind so much, but in all their HTTP requests they’ve been accessing the path portion of the URL in all-lowercase. We use mixed case, so they’ve been getting a gazillion 404 Page Not Found errors. Probably sloppy coding somewhere between their ad agency partner & their service, but after months of 404 errors, they’ve had plenty of opportunity to discover the problem through self-monitoring & fix it.
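
If you want to see whether something similar is chewing on your own site, here’s a rough sketch of the kind of log check that surfaces it: counting 404s per user-agent in an Apache access log. It’s Python, & the log path and combined log format are assumptions, so adjust for your own setup.

import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"  # placeholder, point at your own access log

# Combined log format: IP ident user [date] "METHOD /path PROTO" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

notfound_by_agent = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        m = LINE_RE.match(line)
        if m and m.group(4) == "404":           # HTTP status code
            notfound_by_agent[m.group(5)] += 1  # user-agent string

for agent, hits in notfound_by_agent.most_common(20):
    print(f"{hits:6d}  {agent}")

A crawler that lowercases your mixed-case paths shows up right away as a huge 404 count next to its user-agent string.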

Grapeshot

Basically a repeat of the above, except they apparently & rather arrogantly don’t comply with robots.txt. Not quite as many 404 errors as Crystal Semantics had, but I don’t agree with the whole post-ad-serving, contextual value-added crawl business model.

Gigavenue

Evil. They’re crawling the site like crazy from multiple IPs but don’t use a unique user-agent. Zero information about their crawler. Emails to all three email addresses listed on gigavenue.com bounce (info@gigavenue.com, info@arcscale.com, noc@arcscale.com). I tried contacting Adam D. Binder via LinkedIn & we’ll see how it goes.
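
Since there’s no user-agent to grep for, these only show up in per-IP request counts. Here’s a rough sketch of that, again with the log path as a placeholder; reverse DNS on the heaviest IPs at least tells you whose network they’re coming from, when it resolves at all.

import socket
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"  # placeholder, point at your own access log

# Tally requests per client IP (the first field in the combined log format)
hits_per_ip = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        hits_per_ip[line.split(" ", 1)[0]] += 1

# Reverse-resolve the heaviest hitters to see whose network block they sit in
for ip, hits in hits_per_ip.most_common(10):
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        host = "(no reverse DNS)"
    print(f"{hits:7d}  {ip:15s}  {host}")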

So the changes to robots.txt:

User-agent: crystalsemantics
Disallow: /

User-agent: grapeshot
Disallow: /

Gigavenue doesn’t publish any robots.txt info, so your guess is as good as mine as to what robots.txt user-agent to use for them.

For good measure, ban them in .htaccess too:

RewriteEngine On
# Catch the better-behaved crawlers by user-agent...
RewriteCond %{HTTP_USER_AGENT} (Ruby|proximic|CrystalSemanticsBot|GrapeshotCrawler|Mangoway) [NC,OR]
# ...or the anonymous ones by IP block
RewriteCond %{REMOTE_ADDR} ^(208\.78\.85|208\.66\.97|208\.66\.100)
# Send everything except robots.txt a 403 Forbidden
RewriteRule !^robots\.txt - [F]

This bans the better-behaved crawlers that have a user-agent by user-agent, and the evil services that don’t by IP, & sends them all a 403 Forbidden response. The one exception is robots.txt, which stays accessible so they can find out the nice way that they’re disallowed from crawling the site.
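
If you want to sanity-check rules like these once they’re live, a couple of test requests will do it: a page fetched with a banned user-agent should come back 403, while robots.txt fetched the same way should still be 200. A minimal sketch, with the hostname as a placeholder:

import urllib.request
import urllib.error

SITE = "https://www.example.com"  # placeholder, use your own hostname
BANNED_UA = "GrapeshotCrawler"    # one of the user-agents in the RewriteCond above

def fetch_status(url, user_agent):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

print("page:      ", fetch_status(SITE + "/", BANNED_UA))            # expect 403
print("robots.txt:", fetch_status(SITE + "/robots.txt", BANNED_UA))  # expect 200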

NOTE: These IPs in the example code are now several years old & probably aren’t correct anymore. They are only meant to serve as an example of how to ban these & similar services, if you choose to do that.