A few questions about metrics
Hi there!
A while ago we set up metrics collection using Prometheus and visualization using Grafana.
We set the prometheus level to "full", and we've only just noticed - using a network monitoring tool - that some of our machines (the ones running ispconfig, and configured to read each and every log of all the websites) are basically constantly sending 5-10Mbps of traffic to our Prometheus server. Because we run well over a hundred machines, this means that the gigabit connection on our monitoring server gets overwhelmed.
For example, on a machine that's been running for a while, the metrics are ~300MB.
We filter some of these metrics out on the prometheus side (e.g. all the go_ metrics, and also the whitelist hits).
As far as I could tell, there is currently no way for us to filter these on the client side.
Currently, we are also experimenting with just using the aggregated level of metrics, and trying to find out if that's enough data for us.
Using aggregated metrics on another long-running server, the metrics size, and thus the network traffic, are looking much better.
But this, too, has some data that we don't really need.
Is there any possibility that, in the future, there would be functionality to control exactly which data gets pushed as metrics, so that we don't need to filter it on the Prometheus side after already pushing all the data over the network?
Sorry if my explanation is a bit fumbled.
Just to illustrate, this is how it looks on the Prometheus/Grafana server while all the machines are being scraped for their metrics.

Also, we use the official Grafana dashboards (some parts of which don't work because of our current, unfortunately necessary, filtering).
Which metrics would you want to get exactly (or rather, which ones do you not)?
I don't really like the idea of exposing some configuration to choose exactly which metrics are exposed, as it would likely not be very user-friendly.
At best, I could do something where you can enable/disable a group of metrics (acquisition, parsing, ...), and even then I'm not sure it's something we would expose for all the metric groups (for example, we are working on more advanced healthchecks based on the Prometheus metrics, so we need to have those enabled).
And if you are talking about the basic go metrics, I'm not sure it's something we can disable 😦
Another potential workaround for now is to reduce the scrape frequency (I don't know which interval you are using, but getting the metrics every minute is probably enough)
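For reference, the scrape cadence is set per job in the central Prometheus config. A minimal sketch (the job name and target hostname are hypothetical) that also raises the scrape timeout, since the default timeout is only 10s:

```yaml
scrape_configs:
  - job_name: crowdsec
    scrape_interval: 60s   # scrape once a minute instead of the global default
    scrape_timeout: 30s    # default is 10s; must stay <= scrape_interval
    static_configs:
      - targets: ['web01.example.internal:6060']  # hypothetical target
```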
also, which datasources are you using ?
I remember the S3 datasource having extremely high cardinality in some situations, but I don't remember if we fixed it
use relabel_config in the Prometheus target with action: drop for matching metrics (or the other way around, keep with a regexp): https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
this way you can just expose all the metrics, but the scraping instance will process further only specific ones
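A sketch of that approach: dropping series by metric name is done under metric_relabel_configs (relabel_configs itself only rewrites target labels before the scrape). The target hostname below is hypothetical:

```yaml
scrape_configs:
  - job_name: crowdsec
    static_configs:
      - targets: ['web01.example.internal:6060']  # hypothetical target
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_.*|cs_filesource_hits_total'
        action: drop   # matching series are discarded after the scrape
```

Note that the full payload still crosses the network; the filtering only happens once the central server has fetched the page.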
relabel_config will only apply after Prometheus has fetched the metrics, so it won't fix the bandwidth issue
then deploy a scraper closer to the source (on the same host?) and use remote write to push the filtered metrics further
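A minimal sketch of that layout, assuming a Prometheus instance in agent mode running on each web server (hostnames, ports, and the job name are hypothetical; CrowdSec exposes its metrics on 127.0.0.1:6060 by default):

```yaml
# Run with agent mode enabled (e.g. --enable-feature=agent) so this
# instance only scrapes, filters, and forwards.
scrape_configs:
  - job_name: crowdsec_local
    static_configs:
      - targets: ['127.0.0.1:6060']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'cs_filesource_hits_total'
        action: drop   # dropped here, before anything leaves the host
remote_write:
  - url: 'http://monitoring.example.internal:9090/api/v1/write'
    # the receiving server needs --web.enable-remote-write-receiver
```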
The data source is basically Apache logs, but we are using ISPConfig and these are shared webhosting servers, where we also monitor each and every website's logs made by the users. And that creates a LOT of lines (in full mode).
An example of a metric that we don't really want to read is cs_filesource_hits_total, which basically creates as many lines as there are log files for each and every website, even in aggregated mode, still leading to many thousands of lines.
The go metrics were just an example, they don't really amount to much, so that doesn't really contribute to the problem.
Thanks for the scrape interval idea, we will consider that.
We are using relabel_config and drop currently; that's what we'd like to avoid. We'd like to keep the data collection as light as possible, with as few moving parts as possible, so that it is easier to maintain and more resources are available to the clients.
so I guess you have a label values explosion?
I assume ISPConfig is creating a log file per day? (like access.log.date instead of just having access.log for the current day)
yeah, it creates both an access log and an error log:
20250721-access.log
20250721-error.log
And for each site it keeps a week's worth of these.
It seems to me that cs_parser_hits_ko_total summarises these properly based on the filename.
But cs_filesource_hits_total creates a line for all of them (even in aggregated).
e.g.
And I've just found out that our Prometheus can't even scrape the machines that have too large of a metrics page, because it takes over 10 seconds and exceeds the context deadline.
I need to think about this
Easiest way I see would be to set the source label to a static value (like the datasource name, or unknown if not set in the config): that would fix the issue (you would then just have one time series per instance if no name is set in the acquis config, or one per file datasource if set)
But it's going to (likely) break a few things:
- in cscli metrics
, you won't have the detail per file
- we are working on a config troubleshooting feature that relies on the prometheus metrics, so it's probably not going to work in this case
Alright, thank you very much for your help!
For now, we'll probably just go with switching to aggregated metrics, and getting rid of data we don't need on the Prometheus side.
One last curveball of a question, because I can't find any documentation on it.
What is cs_regexp_cache_size measuring, and is it anything that could provide us with valuable insight?
I can sort of guess what it does, but I'd rather ask for a bit of clarification to be sure.
it tracks the amount of regexp match results done by the RegexpInFile helper (we have a cache so we don't run the same regexp on the same input multiple times in a row in a short time frame)
Ah, alright, I see.
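To illustrate the idea, here is a toy sketch in Python (not CrowdSec's actual Go implementation; the class and method names are made up): a bounded cache keyed on (pattern, input) that returns the stored result instead of re-running the regexp. Its entry count is roughly what a gauge like cs_regexp_cache_size would report.

```python
import re
from collections import OrderedDict

class RegexpCache:
    """Toy sketch of a bounded regexp-result cache (hypothetical names)."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self._cache: OrderedDict = OrderedDict()

    def match(self, pattern: str, text: str) -> bool:
        key = (pattern, text)
        if key in self._cache:
            self._cache.move_to_end(key)     # cache hit: regexp not re-evaluated
            return self._cache[key]
        result = re.search(pattern, text) is not None
        self._cache[key] = result
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict the oldest entry
        return result

    def size(self) -> int:
        # roughly what cs_regexp_cache_size would report
        return len(self._cache)
```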
Thank you very much for your help!
Resolving A few questions about metrics
This has now been resolved. If you think this is a mistake please run
/unresolve