Hosting crapping itself with Google Bot hits

Maybe giving it a sitemap would alleviate the stress somewhat?

All the sites have a sitemap.xml specifically for this; however, a sitemap only holds 50k or so links, and the sites have well over 500k products each.
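
For what it's worth, the usual way around the 50k-per-file limit is to split the URLs across several sitemap files and point a sitemap index at them. A rough sketch, assuming the product URLs are already in a list; the file names, `build_sitemaps` and the example.com URLs are made up for illustration and are not from any of the sites discussed here:

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit in the sitemaps.org protocol
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>'


def build_sitemaps(urls, base_url, chunk_size=MAX_URLS_PER_SITEMAP):
    """Return {filename: xml_text}: one child sitemap per chunk plus an index."""
    files = {}
    child_names = []
    for start in range(0, len(urls), chunk_size):
        # Hypothetical naming scheme: sitemap-1.xml, sitemap-2.xml, ...
        name = f"sitemap-{start // chunk_size + 1}.xml"
        chunk = urls[start:start + chunk_size]
        entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in chunk)
        files[name] = f'{XML_HEADER}<urlset xmlns="{NS}">{entries}</urlset>'
        child_names.append(name)

    # The index lists the absolute URL of every child sitemap.
    root = base_url.rstrip("/")
    index_entries = "".join(
        f"<sitemap><loc>{escape(root + '/' + name)}</loc></sitemap>"
        for name in child_names
    )
    files["sitemap-index.xml"] = (
        f'{XML_HEADER}<sitemapindex xmlns="{NS}">{index_entries}</sitemapindex>'
    )
    return files


if __name__ == "__main__":
    demo_urls = [f"https://example.com/product/{i}" for i in range(25)]
    for filename in build_sitemaps(demo_urls, "https://example.com", chunk_size=10):
        print(filename)
```

With 500k products and the default chunk size that would come to ten child files plus the index. Whether it eases the load is a separate question, since a sitemap tells the crawler what exists rather than how fast to fetch it.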

Google and most of the other legitimate crawlers respect the robots.txt file. As asked previously - what IPs are these bots coming from? Are you sure they're genuinely Google and not some hacker busy DoSing you?

I do an RDNS lookup to make sure of it, and all the sites are behind Cloudflare. I actively block anyone who accesses them maliciously, plus I block all of China, Russia and Ukraine.

First thing I checked was whether it was really Google or not.
 
AcidRaZor said:
I do an RDNS to make sure of it and all the sites are behind Cloudflare, I actively block anyone that accesses them maliciously. Plus I block the entire China/Russia and Ukraine

First thing I checked was if it was really Google or not
While an RDNS check can help, it's not the best check. I can add a "jnb01s02-in-f18.1e100.net" reverse DNS record to any IP on my network where I control the authoritative zone for RDNS.

Much better to do a whois on the IP itself.

Code:
:~# whois 74.125.233.81 | grep Name
NetName:        GOOGLE
OrgName:        Google Inc.
OrgTechName:   Google Inc
OrgAbuseName:   Google Inc
 
How efficient is it though? Especially in a web app that gets hundreds of hits a minute.

Also, wouldn't anyone with Google's new Fiber ISP have the whois of Google?

I'm just following the best practices outlined by Google on how to detect their bot.
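
As far as I understand that guidance, it's a reverse lookup followed by a forward lookup on the returned hostname, and the forward step is what catches a made-up PTR record, because the owner of the IP doesn't control Google's forward zone. A minimal sketch of that two-step check, assuming IPv4 only; `is_verified_googlebot` and the injectable lookup parameters are my own names, added so the logic can be exercised without touching real DNS:

```python
import socket

# Hostname suffixes treated as Google's crawler domains in this sketch.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_dns(ip):
    # socket.gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host):
    # socket.gethostbyname_ex returns (hostname, aliaslist, ipaddrlist); IPv4 only
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(ip, reverse_lookup=_reverse_dns, forward_lookup=_forward_dns):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # no PTR record or resolver failure
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # A spoofed PTR fails here: the name will not resolve back to this IP.
        return ip in forward_lookup(host)
    except OSError:
        return False


if __name__ == "__main__":
    # Needs working DNS; the address is the one used in the whois example above.
    print(is_verified_googlebot("74.125.233.81"))
```

On the efficiency question: the answer for a given IP rarely changes, so caching the result per IP (and only checking requests whose user agent claims to be Googlebot) should keep the lookups well away from the hot path.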
 
So how many RPS do you get? The only time a bot can kill your site is if you have performance issues to start with. Could be a number of reasons. Check out PageSpeed and http://www.webpagetest.org/.

You want bots to crawl your site.

PageSpeed only evaluates your design really, not the code running the design. I've already pinpointed what I think is the cause and have done some tests with loadimpact.com.

With the old code, increasing load pushes up CPU and response time on the server, whereas with my new code load time stays stable and CPU doesn't fluctuate as much either.

Goes from 5s load time to 50s load time with old code under load. New code is rock solid at 0.8s. Think the largest movement in time I got under load with the new code was MAYBE 1s.
 
The reason I mentioned PageSpeed (or WebPageTest.org) is that it will test your site from overseas PoPs and will also show you any obvious issues, such as misconfigured HTTP servers (missing caching, compression, etc.), all of which add load to a server and can easily be solved.
 
Yep, and for the most part that's why all the sites are behind Cloudflare; it also lets me ban IPs and IP ranges at the DNS level, handles caching, etc. I'm well versed in those things, so hitting a scaling problem like this is a first for me, because usually my code scales pretty well (this is my first OO PHP project).

I can easily handle 200k bot hits a day, but the bot traffic increased to about 200k every hour when I launched more sites running off the same code. It became increasingly apparent that something was wrong, since load time increased exponentially and Apache sat around 100% forever :p
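
To put that in the RPS terms asked about earlier, here's the quick arithmetic as a throwaway snippet. It assumes the hits are spread evenly, which bot traffic usually isn't, so real peaks would be higher:

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def average_rps(hits, window_seconds):
    """Average requests per second for `hits` spread evenly over the window."""
    return hits / window_seconds


before = average_rps(200_000, SECONDS_PER_DAY)   # 200k bot hits per day
after = average_rps(200_000, SECONDS_PER_HOUR)   # 200k bot hits per hour

print(f"200k/day  ~ {before:.1f} req/s")
print(f"200k/hour ~ {after:.1f} req/s ({after / before:.0f}x)")
```

That works out to roughly 2.3 req/s before and about 55.6 req/s after, a 24x jump in average bot load.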
 