CrowdSec2mo ago
alukas

A few questions about metrics

Hi there! A while ago we set up metrics collection using Prometheus and visualization using Grafana. We set the prometheus level to "full", and we've only just noticed - using a network monitoring tool - that some of our machines (the ones running ispconfig, and configured to read each and every log of all the websites) are basically constantly sending 5-10Mbps of traffic to our Prometheus server. Because we run well over a hundred machines, this means that the gigabit connection on our monitoring server gets overwhelmed. For example, on a machine that's been running for a while, the metrics are ~300MB.
[~]$ du -h metrics
288M metrics
We filter some of these metrics out on the prometheus side (e.g. all the go_ metrics, and also the whitelist hits). As far as I could tell, there is currently no way for us to filter these on the client side. Currently, we are also experimenting with just using the aggregated level of metrics, and trying to find out if that's enough data for us. Using aggregated metrics, on another long-running server, the metrics size, and thus the network traffic are looking much better:
[~]$ du -h metrics
916K metrics
But, this too, has some data that we don't really need. Is there any possibility that in the future there would be functionality to control exactly what data gets pushed to metrics, so that we don't need to filter it on the Prometheus side, after already pushing all the data over the network? Sorry if my explanation is a bit fumbled.
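For reference, the metrics level being discussed is set in CrowdSec's main configuration; a minimal sketch, assuming a default install (the address and port below are the usual defaults, adjust to your setup):

```yaml
# /etc/crowdsec/config.yaml (excerpt)
prometheus:
  enabled: true
  level: aggregated   # "full" exposes per-file series; "aggregated" collapses them
  listen_addr: 127.0.0.1
  listen_port: 6060
```

Switching `level` from "full" to "aggregated" is what produced the ~288M vs ~916K difference shown above.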
17 Replies
alukas
alukasOP2mo ago
Just to illustrate, this is how it looks on the Prometheus/Grafana server while all the machines are being scraped for their metrics.
(screenshot: network traffic graph on the monitoring server)
alukas
alukasOP2mo ago
Also, we use the official Grafana dashboards (some parts of which don't work because of our current, unfortunately necessary, filtering).
blotus
blotus2mo ago
Which metrics would you want to get exactly (or rather, which ones do you not)?

I don't really like the idea of exposing configuration to choose exactly which metrics are exposed, as it would likely not be very user friendly. At best, I could add something where you can enable/disable a group of metrics (acquisition, parsing, ...), and even then I'm not sure it's something we would expose for all the metric groups (for example, we are working on a more advanced healthcheck based on the Prometheus metrics, so we need those enabled). And if you are talking about the basic go metrics, I'm not sure that's something we can disable 😦

Another potential workaround for now is to reduce the scrape frequency (I don't know which interval you are using, but fetching the metrics every minute is probably enough).

Also, which datasources are you using? I remember the S3 datasource having extremely high cardinality in some situations, but I don't remember if we fixed it.
_KaszpiR_
_KaszpiR_2mo ago
use metric_relabel_configs on the Prometheus target with action: drop for matching metrics (or the other way around, a regex with keep) https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
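A minimal sketch of that approach; the job name, target, and drop regex below are illustrative. Note this filters after the scrape, so it saves storage but not network bandwidth:

```yaml
# prometheus.yml (excerpt) - drops unwanted series at ingestion time
scrape_configs:
  - job_name: crowdsec              # illustrative job name
    static_configs:
      - targets: ["agent1.example.com:6060"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*|cs_filesource_hits_total"   # illustrative drop list
        action: drop
```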
_KaszpiR_
_KaszpiR_2mo ago
this way you can just expose all the metrics, but the scraping instance will only process specific ones further
blotus
blotus2mo ago
relabel_config will only apply after Prometheus has fetched the metrics, so it won't fix the bandwidth issue
_KaszpiR_
_KaszpiR_2mo ago
then deploy scraper closer to the source (same host?), and use remote write to push it further
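A sketch of that pattern, assuming Prometheus running in agent mode on each monitored host (hostnames and the drop regex are placeholders; the central server must be started with --web.enable-remote-write-receiver to accept the pushed samples):

```yaml
# prometheus.yml on the monitored host, run with --enable-feature=agent:
# scrape locally, drop the heavy series before they cross the network,
# forward only the remainder via remote_write
scrape_configs:
  - job_name: crowdsec
    static_configs:
      - targets: ["127.0.0.1:6060"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*|cs_filesource_hits_total"   # drop before the network hop
        action: drop
remote_write:
  - url: "http://monitoring.example.com:9090/api/v1/write"
```

This does add a moving part per host, which is the trade-off the thread goes on to discuss.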
alukas
alukasOP2mo ago
The data source is basically Apache logs, but we are using ISPConfig and these are shared webhosting servers, where we also monitor each and every website's logs made by the users. And that creates a LOT of lines (in full mode).
[~]$ grep ispconfig metrics | wc -l
2197630
An example of a metric that we don't really want to read is cs_filesource_hits_total, which basically creates as many lines as there are log files for each and every website, even in aggregated mode, still leading to many thousands of lines. The go metrics were just an example; they don't really amount to much, so they don't really contribute to the problem. Thanks for the scrape interval idea, we will consider that. We are using relabel_config and drop currently, and that's what we'd like to avoid. And we'd like to keep the data collection as light as possible, with as few moving parts as possible, so that it is easier to maintain and so that more resources are available to the clients.
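For context, the acquisition being described (one glob over every site's logs) would look something like this in CrowdSec; the paths and label here are hypothetical:

```yaml
# /etc/crowdsec/acquis.yaml (excerpt) - hypothetical glob over per-site logs
filenames:
  - /var/log/ispconfig/httpd/*/*-access.log
  - /var/log/ispconfig/httpd/*/*-error.log
labels:
  type: apache2
```

Every file matched by the glob gets its own source label value, which is where the cardinality comes from.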
_KaszpiR_
_KaszpiR_2mo ago
so I guess you have a label values explosion?
blotus
blotus2mo ago
I assume ISPConfig is creating a log file per day ? (like access.log.date instead of just having access.log for the current day)
alukas
alukasOP2mo ago
yeah, it creates both an access log and an error log (20250721-access.log, 20250721-error.log), and for each site it keeps a week's worth of these. It seems to me that cs_parser_hits_ko_total summarises these properly based on the filename.
cs_parser_hits_ko_total{source="20250720-access.log",type="file"} 187
cs_parser_hits_ko_total{source="20250720-error.log",type="file"} 3139
cs_parser_hits_ko_total{source="20250721-access.log",type="file"} 82
cs_parser_hits_ko_total{source="20250721-error.log",type="file"} 1428
But cs_filesource_hits_total creates a line for all of them (even in aggregated mode), e.g.
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250714-access.log"} 1217
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250714-error.log"} 119
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250715-access.log"} 2440
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250715-error.log"} 129
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250716-access.log"} 3440
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250716-error.log"} 164
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250717-access.log"} 1820
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250717-error.log"} 159
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250718-access.log"} 2775
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250718-error.log"} 178
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250719-access.log"} 2094
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250719-error.log"} 114
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250720-access.log"} 1293
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250720-error.log"} 113
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250721-access.log"} 791
cs_filesource_hits_total{source="/var/log/ispconfig/httpd/<domain>/20250721-error.log"} 57
And I've just found out that our Prometheus can't even scrape the machines whose metrics page is too large, because the scrape takes over 10 seconds and exceeds the context deadline.
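If raising the deadline is acceptable as a stopgap, the scrape timeout can be bumped per job; the values below are illustrative (the 10-second deadline being hit matches Prometheus's default scrape_timeout, and scrape_timeout must not exceed scrape_interval):

```yaml
# prometheus.yml (excerpt) - stopgap for a slow /metrics endpoint
scrape_configs:
  - job_name: crowdsec
    scrape_interval: 60s
    scrape_timeout: 30s   # default is 10s
    static_configs:
      - targets: ["agent1.example.com:6060"]
```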
blotus
blotus2mo ago
I need to think about this.

The easiest way I see would be to set the source label to a static value (like the datasource name, or unknown if not set in the config): that would fix the issue (you would then just have one time series per instance if no name is set in the acquis config, or one per file datasource if set).

But it's (likely) going to break a few things:
- in cscli metrics, you won't have the detail per file
- we are working on a config troubleshooting feature that relies on the prometheus metrics, so it's probably not going to work in this case
alukas
alukasOP2mo ago
Alright, thank you very much for your help! For now, we'll probably just go with switching to aggregated metrics and getting rid of the data we don't need on the Prometheus side. One last curveball of a question, because I can't find any documentation on it: what is cs_regexp_cache_size measuring? Is it anything that could provide us with valuable insight? I can sort of guess what it does, but I'd rather ask for a bit of clarification to be sure.
blotus
blotus2mo ago
it tracks the number of cached regexp match results from the RegexpInFile helper (we keep a cache so we don't run the same regexp on the same input multiple times in a row in a short time frame)
alukas
alukasOP2mo ago
Ah, alright, I see. Thank you very much for your help!
CrowdSec
CrowdSec2mo ago
Resolving A few questions about metrics This has now been resolved. If you think this is a mistake please run /unresolve
