Managed Rules block Google Crawls

We've seen a significant increase in Google encountering 403 permission denied errors. They've climbed into the thousands in our Google Search Console. I investigated this in detail and found that it's Cloudflare's Managed Rules, in particular the "Fake Google User Agent" rule, that is blocking genuine Google crawls. Has anyone else encountered this? Is there any solution apart from turning off that managed rule?
42 Replies
Thamis
ThamisOP•3mo ago
Right, but it's definitely the real request from Google. The timestamp matches exactly. How can you tell that it's a Worker subrequest?
Laudian
Laudian•3mo ago
If you're sure it's matching actual Google requests, you're probably using Signed Exchanges or something like that. Though those shouldn't be blocked.
Thamis
ThamisOP•3mo ago
We do have TollBit, which runs Workers, so it could be that. But why would a subrequest affect the final result that Google sees? BTW, thanks for jumping in to help on this so quickly. 😄
Laudian
Laudian•3mo ago
Do you use Automatic Signed Exchanges? You can find it in the CF dashboard under Speed -> Optimization -> Other
Thamis
ThamisOP•3mo ago
Yes, that is turned on.
Laudian
Laudian•3mo ago
Whenever Google makes a request, Cloudflare will make a Worker subrequest for the Signed Exchange. This subrequest should be exempt from blocks if I remember right.
Thamis
ThamisOP•3mo ago
Yes, it would make sense for it to be exempted. But we're definitely returning a 403 to Google. And yet, based on our log explorer, no 403 was ever sent...
Laudian
Laudian•3mo ago
I don't think Signed Exchanges would show there.
Thamis
ThamisOP•3mo ago
Would you recommend turning off SXG? Or turning off the fake agent rule?
Laudian
Laudian•3mo ago
You can disable the fake user agent rule until this has been fixed, unless you're having specific issues with fake Googlebots.
Thamis
ThamisOP•3mo ago
Right, we've done that. How would you go about trying to fix this issue?
Laudian
Laudian•3mo ago
Can you also create a support ticket and share the number? And can you see in your logs when this started?
Thamis
ThamisOP•3mo ago
I can see in my Google Search Console when it started, yes.
Laudian
Laudian•3mo ago
When was that?
Thamis
ThamisOP•3mo ago
22nd of April
Laudian
Laudian•3mo ago
Wow, it's been a while :NotLikeThis:
Thamis
ThamisOP•3mo ago
Yup. We've been struggling to figure out why for a long time. I only now managed to get to the bottom of it.
Thamis
ThamisOP•3mo ago
Here's the blocking event.
Thamis
ThamisOP•3mo ago
Well, one of them. Right, I've submitted a support ticket. Case ID: 01659769. The start of the issue does coincide with us deploying TollBit, I think.
Laudian
Laudian•3mo ago
How exactly is TollBit deployed?
Thamis
ThamisOP•3mo ago
It's a Cloudflare worker.
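For anyone unfamiliar with the setup being discussed: a metering Worker like this sits between the visitor and the origin, so every proxied response involves a `fetch()` subrequest. The sketch below is hypothetical (not TollBit's actual code), with the platform fetch injected as a parameter so it runs outside the Workers runtime:

```javascript
// Hypothetical sketch of a proxying Worker (not TollBit's actual code).
// In a real Worker, fetch(request) issues a subrequest that re-enters
// the zone's WAF before reaching the origin. The platform fetch is
// injected here so the sketch is self-contained and runnable anywhere.
function makeProxyWorker(fetchImpl) {
  return {
    async fetch(request) {
      // A metering Worker would inspect the request here (user agent,
      // path, tokens) before deciding whether and how to proxy it.
      return fetchImpl(request); // the WAF-visible subrequest
    },
  };
}

// Stand-in origin so the example needs no network access.
const fakeOrigin = async (req) => ({ status: 200, url: req.url });

const worker = makeProxyWorker(fakeOrigin);
worker.fetch({ url: "https://example.com/article" })
  .then((res) => console.log(res.status)); // prints 200
```

The relevant point for this thread is that the subrequest, not just the visitor's original request, gets evaluated by the WAF.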
Thamis
ThamisOP•3mo ago
TollBit Documentation
TollBit Documentation and API reference
Learn how to onboard and implement TollBit for your application.
Laudian
Laudian•3mo ago
Can you try a WAF skip rule for something like "cf.worker.upstream_zone equals yourdomain" -> skip everything?
Thamis
ThamisOP•3mo ago
Sure. What does that do?
Laudian
Laudian•3mo ago
It allows Workers from your own account to bypass any other rules.
Thamis
ThamisOP•3mo ago
With or without www?
Laudian
Laudian•3mo ago
(cf.worker.upstream_zone eq "example.com")
The name of the Cloudflare zone, so without www
Thamis
ThamisOP•3mo ago
Ok. Done!
Thamis
ThamisOP•3mo ago
Like this?
Laudian
Laudian•3mo ago
yes.
Thamis
ThamisOP•3mo ago
Right, I've set that up! Question: If all traffic triggers that worker, would this basically whitelist everything? I'm not entirely sure how workers work. Do they run in parallel to the main request, or are they in between the request and the server?
Laudian
Laudian•3mo ago
The WAF sits in front of every request. The initial request makes it through the WAF, but then the Worker subrequest will need to pass the WAF again. If the initial traffic made it through the WAF, you don't want to block the Worker subrequest.
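Laudian's point can be illustrated with a toy model of the check (this is an assumption about how a "fake Googlebot" rule behaves, not Cloudflare's actual implementation):

```javascript
// Toy model of a "Fake Google User Agent" check (an assumed behavior,
// not Cloudflare's real rule): block requests that claim a Googlebot
// UA but don't arrive from Google's verified network.
function fakeGooglebotBlocks(req) {
  const claimsGoogle = /Googlebot/i.test(req.userAgent);
  return claimsGoogle && !req.fromVerifiedGoogle; // true => 403
}

// Initial request: real Googlebot from Google's IP range => passes.
const initial = { userAgent: "Googlebot/2.1", fromVerifiedGoogle: true };

// Worker subrequest for the same crawl: same UA, but it now arrives
// from Cloudflare's own infrastructure, so it looks "fake" and gets
// a 403, which is the status Google ultimately sees.
const subrequest = { userAgent: "Googlebot/2.1", fromVerifiedGoogle: false };

console.log(fakeGooglebotBlocks(initial));    // false (allowed)
console.log(fakeGooglebotBlocks(subrequest)); // true  (blocked)
```

Hence the skip rule on `cf.worker.upstream_zone`: it exempts the subrequest leg, where the source check is guaranteed to fail, while the initial request still gets the full WAF treatment.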
Thamis
ThamisOP•3mo ago
Ah that makes sense! So if the WAF blocks it in the first place, the worker won't get triggered. But otherwise the worker triggers another WAF event which might cause a block. Got it.
Laudian
Laudian•3mo ago
Do you now still see issues if you enable the fake Google bot rule again?
Thamis
ThamisOP•3mo ago
I'll have to wait for another crawl. Google always takes a few days before it sends results.
Thamis
ThamisOP•3mo ago
The URL inspection tool always works; it never results in a 403 error. It also uses a different user agent than the normal Google crawl.
Laudian
Laudian•3mo ago
Oh, OK. I would've expected the Fake Googlebot rule to apply to most Google user agents.
Thamis
ThamisOP•3mo ago
True, but it doesn't trigger on every Google crawl either. It's not like Google has deindexed our site. But it's thousands of pages, and it's growing. So I don't know what the logic is: sometimes it blocks pages, other times it doesn't, and the count seems to keep growing.
Laudian
Laudian•3mo ago
Definitely weird. Please update whether you still see requests blocked or not.
Thamis
ThamisOP•3mo ago
Will do! I've just requested a recrawl on the affected pages. Thanks a lot for your help with this! I've also let TollBit know what we've found so far. The rule whitelisting Worker subrequests doesn't seem to get triggered at all.
tom.sowerby
tom.sowerby•4w ago
Sorry for reviving this old thread. We have a very similar issue of 403s for Googlebot, deemed fake even when the times match real crawls. Did the new WAF skip rule work for you? I think we've (eventually) got to the bottom of this: "Automatic Signed Exchanges (SXGs)" uses a Worker and masks the ASN and IP, causing Cloudflare's "Fake Googlebot" rule to 403 it. We tried turning off SXG, and it looked like it worked, but there was no change to the 403s. Going back to the SXG setting, it was turned on again!?! So there are some issues with turning off SXG. Disabling the Fake Googlebot rule has worked as a temporary solution in the meantime.
