CrowdSec4mo ago
Sich

Whitelist user agent from file

Hi, I am trying to write a whitelist parser (stored in parsers/s02-enrich) that whitelists user agents listed in a file. This is the parser I wrote:
name: si/si_wl_useragent_ai
description: "Whitelist UA AI"
whitelist:
  reason: "Whitelist UA AI"
  expression:
    - "any(File('useragent_ai.txt'), { evt.Parsed.http_user_agent contains # })"
data:
  - source_url: https://data.srvsi.com/useragent_ai.txt
    dest_file: useragent_ai.txt
    type: string
And here is an extract from the file:
GoogleOther
GoogleOther-Image
GoogleOther-Video
GPTBot
iaskspider/2.0
ICC-Crawler
ImagesiftBot
The parser is loaded correctly:
time="2025-06-10T05:10:07+02:00" level=info msg="Loaded 1 parser nodes" file=/etc/crowdsec/parsers/s02-enrich/si-useragent-ai.yaml stage=s02-enrich
I have probably missed something. Any idea how I can fix this?
7 Replies
iiamloz
iiamloz4mo ago
And does the file useragent_ai.txt exist in /var/lib/crowdsec/data?
Sich
SichOP4mo ago
Hi, yes, the file exists.
iiamloz
iiamloz4mo ago
If you add debug: true to the YAML of the whitelist, do you get any usable feedback in the log?
Sich
SichOP4mo ago
I will test that. OK, tested: so much spam in the log! Apparently it reads the file correctly, but bots still get banned even with a whitelisted user agent.
time="2025-06-10T10:45:30+02:00" level=debug msg=" [File('useragent_ai.txt'), {] File(\"useragent_ai.txt\") = [AI2Bot Ai2Bot-Dolma aiHitBot Amazonbot Andibot anthropic-ai Applebot Applebot-Extended bedrockbot Brightbot 1.0 Bytespider CCBot ChatGPT-User Claude-SearchBot Claude-User Claude-Web ClaudeBot cohere-ai cohere-training-data-crawler Cotoyogi Crawlspace Diffbot DuckAssistBot FacebookBot Factset_spyderbot FirecrawlAgent FriendlyCrawler Google-CloudVertexBot Google-Extended GoogleOther GoogleOther-Image GoogleOther-Video GPTBot iaskspider/2.0 ICC-Crawler ImagesiftBot img2dataset ISSCyberRiskCrawler Kangaroo Bot meta-externalagent Meta-ExternalAgent meta-externalfetcher Meta-ExternalFetcher MistralAI-User/1.0 NovaAct OAI-SearchBot omgili omgilibot Operator PanguBot Panscient panscient.com Perplexity-User PerplexityBot PetalBot PhindBot QualifiedBot QuillBot quillbot.com SBIntuitionsBot Scrapy SemrushBot-OCOB SemrushBot-SWA Sidetrade indexer bot TikTokSpider Timpibot VelenPublicWebCrawler Webzio-Extended wpbot YandexAdditional YandexAdditionalBot YouBot]" id=ancient-shape name=si/si_wl_useragent_ai stage=s02-enrich
time="2025-06-10T10:45:30+02:00" level=debug msg=" BLOCK_START [any(File('useragent_ai.txt'), {]" id=ancient-shape name=si/si_wl_useragent_ai stage=s02-enrich

time="2025-06-10T10:45:30+02:00" level=debug msg=" [any(File('useragent_ai.txt'), { evt.Parsed.http_user_agent contains # })] " id=ancient-shape name=si/si_wl_useragent_ai stage=s02-enrich
time="2025-06-10T10:45:30+02:00" level=debug msg=" BLOCK_END [any(File('useragent_ai.txt'), { evt.Parsed.http_user_agent contains # })] -> false" id=ancient-shape name=si/si_wl_useragent_ai stage=s02-enrich
time="2025-06-10T10:45:30+02:00" level=debug msg="Event leaving node : ok" id=ancient-shape name=si/si_wl_useragent_ai stage=s02-enric
If you tell me that everything is fine, I will just run cscli decisions delete --all, watch my logs, and see if anything interesting happens...
iiamloz
iiamloz4mo ago
It looks good to me. The only thing is that it's a contains, so the match is case-sensitive.
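The case-sensitivity point can be checked outside CrowdSec with a rough shell approximation. This is only a sketch: it uses grep's literal substring matching, not CrowdSec's expression engine, and the list file, the ua_matches helper, and the sample user agents are all hypothetical stand-ins.

```shell
# Hypothetical stand-in for useragent_ai.txt, using three entries quoted in the thread.
list=$(mktemp)
printf '%s\n' GPTBot ICC-Crawler ImagesiftBot > "$list"

# Succeeds when any line of file $2 occurs in string $1 as a literal substring.
# Extra arguments (for example -i) are passed through to grep.
# Caveat: an empty line in the file would match every input.
ua_matches() {
  ua=$1
  file=$2
  shift 2
  printf '%s\n' "$ua" | grep -q -F "$@" -f "$file"
}

# Illustrative user agents, not taken from real traffic.
if ua_matches 'Mozilla/5.0 (compatible; GPTBot/1.2)' "$list"; then exact=match; else exact=no-match; fi
if ua_matches 'Mozilla/5.0 (compatible; gptbot/1.2)' "$list"; then lower=match; else lower=no-match; fi
if ua_matches 'Mozilla/5.0 (compatible; gptbot/1.2)' "$list" -i; then folded=match; else folded=no-match; fi

echo "exact=$exact lower=$lower folded=$folded"
rm -f "$list"
```

The second call fails because the casing differs from the list entry; the third succeeds only because -i asks grep to ignore case.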
Sich
SichOP4mo ago
ChatGPT is currently scanning the site, but no scenarios have been triggered. I will continue to watch. Thanks for your help.
OK, I think I understand: those IPs are banned through CAPI. For example, 20.171.207.115 is in 20.171.207.0/24, which is an OpenAI range: https://openai.com/gptbot-ranges.txt
But the IP is blocked by CAPI: https://app.crowdsec.net/cti/20.171.207.115
And on my side, it is blocked through CAPI:
cscli decisions list --all |grep 20.171.207.115
| 116513424 | CAPI | Ip:20.171.207.115 | http:scan | ban | | | 0 | 166h32m53s | 1376141 |
Maybe you should take care of that, so that CAPI does not ban those IPs?
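The range membership claimed above (20.171.207.115 inside 20.171.207.0/24) can be verified with a small shell sketch. The helper names ip_to_int and in_cidr are hypothetical, the code handles IPv4 only, and it assumes octets are written without leading zeros.

```shell
# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeeds when address $1 lies inside the CIDR block $2 (for example 10.0.0.0/8).
in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # Build the netmask from the prefix length, clamped to 32 bits.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

if in_cidr 20.171.207.115 20.171.207.0/24; then
  verdict=inside
else
  verdict=outside
fi
echo "20.171.207.115 is $verdict 20.171.207.0/24"
```

The same helper could be looped over the entries of a published range list to check any address that shows up in cscli decisions list.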
