Every so often I go through the CarComplaints.com error logs & watch for server abuse. The latest review found a few new players: Crystal Semantics, Grapeshot, Gigavenue & Mangoway.
Crystal Semantics
Crystal Semantics does after-the-fact contextual advertising. They crawl your pages after an ad is shown. Risky Internet covers this topic well:
Since we do not need a whole series of Ad crawlers making a business out of stealing bandwidth and each on their own reloading pages, the ONLY valid solution is that the seller of the ad-space (whether they are Google Ads or other) deliver the valid classification, since they are the first to crawl the page. No need to have a whole series of separate companies scrape off the same page, and adding more load to all sites, just to make their own business out of it.
Amen to that. Normally I wouldn’t mind so much, but in all their HTTP requests they’ve been accessing the path portion of the URL in all-lowercase. We use mixed case, so they’ve been getting a gazillion 404 Page Not Found errors. It’s probably sloppy coding somewhere between their ad agency partner & their service, but after months of 404 errors they’ve had plenty of opportunity to discover the problem through self-monitoring & fix it.
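If you want to spot this pattern in your own logs, here’s a minimal Python sketch of the idea: flag 404s whose all-lowercase path matches a real mixed-case URL on the site. The log file name, the log format (standard Apache combined) & the sample paths are all assumptions for illustration, not CarComplaints.com specifics.

import re

# Pull the request path & status code out of a combined-format log line,
# e.g.: 1.2.3.4 - - [date] "GET /Some/Path/ HTTP/1.1" 404 512
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) \S+" (\d{3})')

# Hypothetical list of the site's real, mixed-case URL paths
known_paths = {"/Toyota/Camry/", "/Subaru/Outback/"}
lowered = {p.lower(): p for p in known_paths}

with open("access.log") as log:
    for line in log:
        m = LOG_LINE.search(line)
        if not m:
            continue
        path, status = m.groups()
        # A 404 on an all-lowercase path that *would* exist in mixed
        # case is the fingerprint of a crawler lowercasing URLs.
        if status == "404" and path == path.lower() and path in lowered:
            print(f"case-mangled request: {path} -> {lowered[path]}")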
Grapeshot
Basically a repeat of the above, except they apparently & rather arrogantly don’t comply with robots.txt. Not quite as many 404 errors as Crystal Semantics had, but I don’t agree with the whole post-ad-serving, contextual-value-added crawl business model.
Gigavenue
Evil. They’re crawling the site like crazy from multiple IPs but don’t use a unique user-agent, & there’s zero information about their crawler anywhere. Emails to all three addresses listed on gigavenue.com bounce (info@gigavenue.com, info@arcscale.com, noc@arcscale.com). I tried contacting Adam D. Binder via LinkedIn; we’ll see how it goes.
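For what it’s worth, this kind of anonymous crawler is easy to spot by volume alone. Here’s a rough Python sketch that tallies requests per IP from the access log & prints the heaviest hitters; the log file name & format (Apache combined, IP first on each line) are assumptions.

from collections import Counter

hits = Counter()
with open("access.log") as log:
    for line in log:
        ip = line.split(" ", 1)[0]  # combined log format: IP comes first
        hits[ip] += 1

# The top of this list is where crawlers without a user-agent show up.
for ip, count in hits.most_common(10):
    print(f"{count:8d}  {ip}")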
So the changes to robots.txt:
User-agent: crystalsemantics
Disallow: /

User-agent: grapeshot
Disallow: /
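If you want to sanity-check rules like these before deploying, Python’s built-in robotparser will tell you what a compliant crawler should conclude from them — assuming, of course, that the crawler actually honors its User-agent token:

from urllib import robotparser

rules = """\
User-agent: crystalsemantics
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("crystalsemantics", "/Toyota/Camry/"))  # False: banned
print(rp.can_fetch("Googlebot", "/Toyota/Camry/"))         # True: no rule for it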
Gigavenue doesn’t publish any robots.txt info, so your guess is as good as mine as to which robots.txt User-agent token to use for them.
For good measure, ban them in .htaccess too:
RewriteEngine On

# Better-behaved crawlers that identify themselves: ban by User-Agent
RewriteCond %{HTTP_USER_AGENT} (Ruby|proximic|CrystalSemanticsBot|GrapeshotCrawler|Mangoway) [NC,OR]

# Anonymous crawlers (Gigavenue): ban by IP range
RewriteCond %{REMOTE_ADDR} ^(208\.78\.85|208\.66\.97|208\.66\.100)

# 403 Forbidden for everything except robots.txt
RewriteRule !^robots\.txt - [F]
This bans the better-behaved crawlers that send a User-Agent by name, and the evil services that don’t by IP, & sends them all a 403 Forbidden response. The one exception is robots.txt, which stays accessible so they can discover the nice way that they’re disallowed from crawling the site.
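To verify the ban works, you can spoof one of the banned user-agents & check the response codes yourself. A quick Python sketch, with example.com standing in for your own domain:

import urllib.request
import urllib.error

def fetch(url, ua):
    """Return the HTTP status code for url when sent with user-agent ua."""
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    try:
        return urllib.request.urlopen(req).status
    except urllib.error.HTTPError as e:
        return e.code

print(fetch("https://example.com/", "GrapeshotCrawler"))            # expect 403
print(fetch("https://example.com/robots.txt", "GrapeshotCrawler"))  # expect 200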
NOTE: The IPs in the example code are now several years old & probably aren’t correct anymore. They’re only meant as an example of how to ban these & similar services, if you choose to do that.