A windsurfing, CSS-grudging, IE-hating, web-developing, gigantic-machine-puzzling blog

Category: abuse

Ruben Harris TrekkSoft Spam

Back in September 2017, an email address at my small scuba diving company Hero Divers began receiving unsolicited marketing emails from Ruben Harris, ruben.h@trekksoft.io, Product Marketing Manager at TrekkSoft AG.

As far as I can tell, he’s a legitimate employee & has a ZoomInfo profile.

Most of Ruben’s emails have a customized subject (i.e. “About Hero Divers”) & he mentions Hero Divers throughout the email, so it looks like a personalized email marketing campaign. The links in Ruben’s emails redirect through a link-tracking service & end up here:


Here’s the first email I received from Ruben Harris, dated 9/20/17:


I’m writing you in the hopes of finding the person that is responsible for online marketing at Hero Divers. How are things going at Hero Divers? Do you find you are stretched this season?

I work with TrekkSoft, a booking solution for companies like yours. We provide you an integrated booking solution that automates your web, front desk and partner bookings. In plain English, we cut down on phone calls and emails while driving you more bookings through our distribution channels.

Have a look at what we can offer Hero Divers. Now is a great time of year to understand where you can make improvements. I’d be delighted to review your website and your operations to see if we can come up with some improvements together.

If there’s anyone else at Hero Divers that would be better to talk to, I would greatly appreciate it if you could point me in the right direction.

Kind regards,
Ruben Harris
Product Marketing Manager at TrekkSoft

PS: We’ve got a large amount of clients worldwide, which would be ideal to cross sell with through the partner network we’ve set up.

There’s no remove link on any of his emails. I continued to receive similar TrekkSoft emails from Ruben at about one per week: on 9/25, 10/1, 10/6, & 10/12.

I just assume Ruben will get bored after a while & stop emailing. Sure enough, the weekly barrage stopped after 10/12. But get an axe, it was a trick.

March 9th, what pops up? Another fucking email from Ruben Harris at TrekkSoft. He has the gall to pretend we don’t have any prior history: “Hi, I’m Ruben from TrekkSoft. I came across Hero Divers online and wanted to get in touch to see if your team is thinking about options for an all-in-one booking system …”

March 13th, it’s Ruben again: “Hi, just checking in to see if you wanted to ask any questions about TrekkSoft booking software …”

I reply that I’m not interested.

March 18th, more from Ruben: “Just checking in to see if you had any questions about TrekkSoft for Hero Divers …”

I reply again asking to be removed.

March 22, Ruben Harris at it again: “Quick question: as scuba diving company, what’s the biggest challenge for Hero Divers right now? …”

March 26, yet another email from Ruben: “It’s Ruben from TrekkSoft. I’m writing to see if you’d seen my last message and to quickly pass on our Travel Trends Report 2018 …”

March 30, yep: “Hi, I hope you found my last few emails useful. I also wanted to offer you a time slot to discuss Hero Divers properly and find out whether TrekkSoft’s all-in-one booking software could be a good fit for you …”

Replied yet again asking to be removed.

April 3, “Hello, just wondering if you found the time to read through my emails over the past few weeks. If you are not interested in what our software can offer Hero Divers then I won’t mail you again …”

Yes Ruben, I read your fucking emails. And if it’s true what you say that you won’t email me again, thank fucking god. Although by now, I’m mad.

I would never do business with TrekkSoft, even if I was interested. Ruben, either run a legitimate mass-marketing campaign & include remove links, or run a personal one & remove people when they ask.

I do a little research. Apparently I’m not alone. I find this from “The Stranger” on Twitter from 2016:


April 3, I post on the TrekkSoft Facebook page complaining about all these Ruben Harris emails & asking if he really worked as Marketing Manager for the company.

Next day I go back to the TrekkSoft Facebook page & my post is gone. Looks like they’ve hidden my post with no reply. Now I’m the angry non-customer.

April 4, I post again on their Facebook page:

April 5, no reply from TrekkSoft. At least my comment is still up on their Facebook page. Facebook indicates TrekkSoft “usually replies within an hour”, but it’s been 2 days & counting with no reply to my Ruben Harris spam complaint.

I did some more digging & found that Quickmail.io is the link-tracking service Ruben is using. They have a very helpful abuse-reporting page, so I sent in a spam complaint about TrekkSoft with copies of all the spam emails & my replies asking to be removed. Really hoping they shut Ruben down.

TrekkSoft is a marketing company. It’s ironic how badly they are screwing up their own marketing & reputation with this shitty Ruben Harris email campaign. I also sent a direct message to the VP of Sales at TrekkSoft with my complaint & a link to this blog. Maybe he’ll care.

Stay tuned.

SCAM: Web Content Scraping in Realtime

I discovered several illicit websites have been scraping, reprocessing & re-serving copyrighted web content from CarComplaints.com in real-time.

It’s an assholish way to do business.

Here’s how the scam works:

  1. An unsuspecting visitor to one of these illicit websites requests a web page.
  2. The web server passes the request to the content scraper bot.
  3. The scraper bot script makes a web request to the legitimate website & reprocesses (steals) the content.
  4. The scraper bot transmits the stolen content back to the illicit web server.
  5. The web server serves the stolen content back to the site visitor.

This content-scraping happens in realtime, in the background over a few seconds as the visitor’s browser sits there waiting.

The first content scraper site I discovered was replacing “CarComplaints.com” anywhere it appeared in the HTML code with the name of the illicit website, & also replaced advertising so that the scammers earned the ad revenue instead of my company. Evil!
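That name-swapping step amounts to a simple string substitution over the stolen HTML before it’s served back to the visitor. A minimal sketch of the idea — the page snippet is invented, and `rewrite_scraped_page` is my name for it, not anything from the actual scraper:

```python
def rewrite_scraped_page(html: str, victim: str, imposter: str) -> str:
    """Sketch of what the scraper bot appears to do: swap the legitimate
    site's name (and, in practice, its ad tags) for the scam site's."""
    return html.replace(victim, imposter)


page = "<title>Engine Problems | CarComplaints.com</title>"
print(rewrite_scraped_page(page, "CarComplaints.com", "carcomplaints.xyz"))
# → <title>Engine Problems | carcomplaints.xyz</title>
```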

The largest offender so far was the website carcomplaints.xyz, which has since been shut down after I filed complaints with their ISP. They had managed to get ~9,150 pages indexed by Google, which are (hopefully) in the process of being removed sometime soon. Their entire site was a duplicate of mine with all pages scraped from my site & returned to their visitors in realtime. The scam website was hosted on a different IP & service from the content scraper, but it was easy to track down by requesting a bogus page on the scam website & then watching the content scrape request hit my site by tailing the Apache access log.
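The tracing trick above can be reproduced with standard tools: request a made-up path on the scam site, then watch for the same path arriving in your own access log a moment later. A sketch using invented sample log lines standing in for a real Apache access_log (the IPs & path are made up):

```shell
# Two sample Apache combined-log lines standing in for a real access_log
cat > access_log <<'EOF'
203.0.113.7 - - [12/Apr/2018:10:01:22 -0400] "GET /Toyota/Camry/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
198.51.100.9 - - [12/Apr/2018:10:01:45 -0400] "GET /bogus-trace-12345/ HTTP/1.1" 404 512 "-" "Go-http-client/1.1"
EOF

# After requesting http://scam-site.example/bogus-trace-12345/ in a browser,
# the IP that immediately fetches the same bogus path from your server
# is the content scraper:
grep 'bogus-trace-12345' access_log | awk '{print $1}'
# → 198.51.100.9
```

In real life you’d run `tail -f` on the live log instead of grepping a static file, but the principle is the same.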

Once I identified these content scrape requests, I reviewed my access log & found many similar requests being made from other IPs, but I couldn’t find the corresponding scam websites. It’s impossible to track down which website these requests were originating from, but you can still go after the ISP that’s hosting the content scrapers.

For now the scraper bots are using the useragent “Go-http-client/1.1”.

Many of the scraper bots use Amazon AWS as the host. To file a complaint, email details including log files to abuse@amazonaws.com — generally AWS is pretty good about taking care of it, but you will need to prove there’s been an AWS Acceptable Use Policy violation or else AWS simply passes your complaint on to their customer.

To establish an AUP violation, ban the Go-http-client useragent in your robots.txt file. AWS requires any client operating a web crawler to follow the robots.txt convention. I couldn’t find any request from these IPs for robots.txt, but I added the ban anyway so AWS could take further steps against their client.
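The robots.txt ban is just a user-agent token match. Assuming the bot honors the convention at all, this should disallow it everywhere:

```
User-agent: Go-http-client
Disallow: /
```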

Until the scammers change the useragent, you can also ban that traffic by returning a 403 Forbidden response using a RewriteRule in .htaccess:

RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^robots\.txt - [F]

Or have a bit more fun with the scammers & redirect their content scraper requests to a copyright violation notice page:

RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^(robots\.txt|copyright_violation\.html) /copyright_violation.html [R,L]

Or the FBI Cyber Crime page:

RewriteCond %{HTTP_USER_AGENT} (Go-http-client) [NC]
RewriteRule !^robots\.txt https://www.fbi.gov/investigate/cyber/ [R,L]

NOTE: These examples assume you already have mod_rewrite enabled & “RewriteEngine On”.

Bad Crawler Bots: ptr.cnsat.com.cn

Found this bot accessing the site from lots of different 202.46.* IPs. Reverse DNS points to ptr.cnsat.com.cn.

The range of IPs from 202.46.32.0 to 202.46.63.255 is associated with ShenZhen Sunrise Technology Co., Ltd.

This is how to ban via .htaccess RewriteRule:

## ban ptr.cnsat.com.cn
RewriteCond %{REMOTE_ADDR} ^202\.46\.(3[2-9]|[4-5][0-9]|6[0-3])\.
RewriteRule !^robots\.txt - [F]
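To sanity-check that range pattern: it should match exactly 202.46.32.* through 202.46.63.* and nothing outside that. A quick check with Python’s re module (sample IPs invented):

```python
import re

# The same range pattern used in the RewriteCond above
pattern = re.compile(r"^202\.46\.(3[2-9]|[4-5][0-9]|6[0-3])\.")

for ip in ("202.46.32.1", "202.46.63.254", "202.46.31.10", "202.46.64.10"):
    print(ip, bool(pattern.match(ip)))
# → 202.46.32.1 True
# → 202.46.63.254 True
# → 202.46.31.10 False
# → 202.46.64.10 False
```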

Optionally, you can add this RewriteCond for the useragent they happen to be using at the moment:

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(Windows\ NT\ 5\.1;\ rv:6\.0\.2\)\ Gecko/20100101\ Firefox/6\.0\.2
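Consecutive RewriteCond lines are ANDed by default, so combining the two criteria blocks a request only when both the IP range and the useragent match. A sketch of the combined rule:

```
## ban ptr.cnsat.com.cn only when both IP range AND useragent match
## (consecutive RewriteConds are ANDed by default)
RewriteCond %{REMOTE_ADDR} ^202\.46\.(3[2-9]|[4-5][0-9]|6[0-3])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(Windows\ NT\ 5\.1;\ rv:6\.0\.2\)
RewriteRule !^robots\.txt - [F]
```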

However, the IP ban is specific to the range owned by the company, so personally I wouldn’t bother with the useragent criterion. They could change it at any time.

I did see they made several requests to robots.txt, but without a proper user agent identifying this bot as a crawler, your guess is as good as mine how to ban it in robots.txt, perhaps:

User-Agent: ptr.cnsat.com.cn
Disallow: /

Bad Crawler Bots: Proximic, CrystalSemantics, Grapeshot, Gigavenue

Every so often I go through the CarComplaints.com error logs & watch for server abuse. The latest review found a few new players: Crystal Semantics, Grapeshot, Gigavenue & Mangoway.


Crystal Semantics does after-the-fact contextual advertising. They crawl your pages after an ad is shown. Risky Internet covers this topic well:

Since we do not need a whole series of Ad crawlers making a business out of stealing bandwidth and each on their own reloading pages, the ONLY valid solution is that the seller of the ad-space (whether they are Google Ads or other) deliver the valid classification, since they are the first to crawl the page. No need to have a whole series of separate companies scrape off the same page, and adding more load to all sites, just to make their own business out of it.

Amen to that. Normally I wouldn’t mind so much, but in all their HTTP requests they’ve been accessing the path portion of the URL in all-lowercase. We use mixed case, so they’ve been getting a gazillion 404 Page Not Found errors. Probably sloppy coding somewhere between their ad agency partner & their service — but after months of 404 errors, they’ve had plenty of opportunity to discover the problem through self-monitoring & fix it.


Basically a repeat of the above, except they apparently & rather arrogantly don’t comply with robots.txt. Not quite as many 404 errors as Crystal Semantics had, but I don’t agree with the whole post-ad-serving, contextual-value-added crawl business model.


Evil. They’re crawling the site like crazy from multiple IPs but don’t use a unique user-agent. Zero information about their crawler. Emails to all three email addresses listed on gigavenue.com bounce (info@gigavenue.com, info@arcscale.com, noc@arcscale.com). I tried contacting Adam D. Binder via LinkedIn & we’ll see how it goes.

So the changes to robots.txt:

User-agent: crystalsemantics
Disallow: /
User-agent: grapeshot
Disallow: /

Gigavenue doesn’t publish robots.txt info, so your guess is as good as mine which robots.txt useragent to use for them.

For good measure, ban them in .htaccess too:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Ruby|proximic|CrystalSemanticsBot|GrapeshotCrawler|Mangoway) [NC,OR]
RewriteCond %{REMOTE_ADDR} ^(208\.78\.85|208\.66\.97|208\.66\.100)
RewriteRule !^robots\.txt - [F]

This bans the better-behaved crawlers by useragent and the evil services that lack one by IP, returning a 403 Forbidden response for everything except robots.txt, which they can still fetch to discover the polite way they’re disallowed from crawling the site.

NOTE: The IPs in the example code are now several years old & probably aren’t correct anymore. They’re only meant as an example of how to ban these & similar services, if you choose to do that.

404 Not Found Abuse: oggiPlayerLoader.htm

In a refreshingly proactive turn of events, one Amazon AWS abuser replied to me directly. The oggiPlayerLoader.htm 404 errors detailed in my previous web abuse post were courtesy of Oggifinogi, a rich media provider based out of Bellevue, WA.

Director of Technology Paul Grinchenko emailed me back with a friendly explanation:

We are just looking for our IFRAME buster. You were running at least 1 of our ads in an IFRAME.

No surprise there. We have no prior relationship with Oggifinogi, so I figured their ads had been served through one of the 3rd party ad networks we use (turns out it was ValueClick).

Luckily the issue is simpler than that — Amazon AWS prohibits them from 404-bombing our servers at “an excessive or disruptive rate”. My reply to Paul:

As you probably saw from the “comments” I provided, my complaint was your service’s excessive HEAD requests to the same 6 non-existent files. Judging from the excessively long-term & repetitive 404 errors, it seems your service does nothing useful with the “not found” status code returned by our servers each time. Oggifinogi would be better off using a more responsible system: monitor HTTP response codes to your iframe buster requests, & use that information to limit requests when the files clearly don’t exist. By the way, I somewhat appreciate your service’s HEAD requests versus a full GET, but it’s a bandaid.

Also I urge you to consider Amazon’s advice: We would strongly recommend that you customize your UserAgent string to include a contact email address so as to allow target sites being crawled to be able to contact you directly. (…although most responsible web services I’ve come across put a URL in their user agent, rather than an email address…)

A few hours later, Paul replied that Oggifinogi does indeed cache iframe buster file presence for a short period, so their requests should not exceed 75 per hour. That fits the profile I saw — no real strain on the web server, but very annoying when tailing error logs.
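The scheme Paul describes (remember the result of a presence check for a while, so a missing file isn’t re-probed on every ad impression) can be sketched like this. The class & names are hypothetical, not Oggifinogi’s actual code:

```python
import time


class PresenceCache:
    """Negative-cache HEAD-check results so a file known to be
    missing isn't re-requested until the TTL expires."""

    def __init__(self, probe, ttl_seconds=3600):
        self.probe = probe       # callable url -> bool (the real HEAD request)
        self.ttl = ttl_seconds
        self.cache = {}          # url -> (exists, checked_at)
        self.probes_made = 0

    def exists(self, url, now=None):
        now = time.time() if now is None else now
        hit = self.cache.get(url)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]        # cached answer, no network request
        self.probes_made += 1
        result = self.probe(url)
        self.cache[url] = (result, now)
        return result


# With a 1-hour TTL, repeated checks for the same missing file
# cost only one probe:
cache = PresenceCache(probe=lambda url: False, ttl_seconds=3600)
cache.exists("http://www.example.com/oggiPlayerLoader.htm", now=0)
cache.exists("http://www.example.com/oggiPlayerLoader.htm", now=60)
print(cache.probes_made)
# → 1
```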

The good news is Paul agreed to start using an Oggifinogi user agent — hopefully with a help/explanation page URL too.

Paul also sent me the oggiPlayerLoader.htm instructions. Now Oggifinogi can bust our iframes at will, rather than continuing the 404 war. In case anyone else out there wants to join the peace process:

Instructions for Publishers:

  1. Please download and unpack oggiPlayerLoader.zip – External link for Pubs to Download oggiPlayerLoader.zip
  2. Make sure that unpacked version is called oggiPlayerLoader.htm
  3. Copy oggiPlayerLoader.htm to just one of the following locations – single location is enough:

Please make sure that resulting location is accessible from outside. Location shouldn’t be protected. For example you should be able to open in the browser URL http://www.yoursite.com/oggiPlayerLoader.htm without entering any credentials.

… Working to improve the Interweb, one 404 error at a time.

Web crawl abuse from Amazon AWS-hosted projects

I’ve been keeping an eye on the CarComplaints.com error log lately, watching for phishing attempts, misbehaving bots/scripts, & other random stupidity. Turns out the major offenders have something in common — they’re hosted on Amazon’s AWS platform.

One Amazon AWS customer was crawling pages in bursts at up to 100 per minute, but referencing our mixed-case URLs in all lowercase — racking up several hundred thousand 404 errors over several weeks. Luckily they had a “Ruby” user agent (Ruby script’s HTTP request?) … bye bye Ruby, at least until you change user agents.

Another Amazon AWS customer was requesting oggiPlayerLoader.htm in various locations. Anyone know what this “Frame Booster” is part of? (UPDATE: see my followup about Oggifinogi). Luckily they use a HEAD request, so those got banned too along with some other esoteric request methods suggested by Perishable Press.

RewriteCond %{HTTP_USER_AGENT} "Ruby" [NC,OR]
RewriteCond %{REQUEST_METHOD} ^(delete|head|trace|track) [NC]
RewriteRule ^(.*)$ - [F,L]

I cheerily reported both cases of AWS abuse to Amazon via their web abuse form. Turns out the abuse form is there only to mess with your head. Some form data has to be space-separated while other data must be comma-separated. Fields where you list IPs & URLs barely fit a single entry, much less multiple items. And good luck cutting your access log snippet down to their 2000 character limit. Amazon just launched their Cloud Drive — zillions of decaquintillobytes of storage space — but can they handle processing a few hundred lines of server logs? Nope.

The kicker is if they do accept, verify, & pass on your complaint to their AWS customer, Amazon won’t provide any details about the offender so that you could, oh I don’t know, blog mean things about them. You’ll need a subpoena for that.

Moving on to abuse not related to AWS — people are referencing themes/default/style.css all over the place. The requests look legitimate, from various random IPs & user agents, so I’m guessing it’s a misbehaving browser plugin. Searching Google indicates it could be something called OpenScape, which I didn’t have time to research. Anyone know what that’s all about? Those got forbidden…

RewriteRule themes/default/style\.css$ - [F,L]

And finally there’s Microsoft. For about a year, MSNBot has managed to take legitimate page URLs & tack Javascript onto the end, as in “/Kia/Sephia/2001/engine/this.options[this.selectedIndex].value;”. Only Microsoft could manage that.
