Export RAS events to Prometheus


Various hardware errors are handled by the RAS module of the Linux kernel and are optionally made available to userspace as trace events. The utility that processes these events is rasdaemon, which can also write them to a sqlite3 database. There are several possibilities for exporting such events to Prometheus:

Exporting the Data

Prometheus metrics have the concept of staleness: a metric is marked as stale if no value for it has been scraped for a configurable time. As a consequence, exporting metrics to Prometheus is a stateful process, because all metrics and their values, from the start of the system until now, need to be exported if the values are meant to remain available. If we want the counter of RAM ECC errors to be available all the time, we need to keep all the events so that all of them can be exported to Prometheus.

Enhance rasdaemon With Prometheus Interface

This would require adding an HTTP server component to rasdaemon, which potentially makes it an attack target. rasdaemon is implemented in C, there is no official Prometheus client library for C, and writing safe and secure HTTP servers in C is a challenge.

We also need to keep all the events in memory, or read them back from the database, to prevent them from going stale.

node_exporter's Textfile

In addition to saving the events in a sqlite3 database, rasdaemon could export them to a text file in the node_exporter textfile format and let node_exporter handle the exposition of the data to Prometheus. This works, but suffers from the ephemeral nature of the text file: if it is deleted, it must be completely rewritten, and therefore rasdaemon needs to keep all events in memory or re-read them from the database.

Export the Database Itself

Because we need to keep the events available, and they are stored in the sqlite3 database anyway, why not simply use the database as the source of the metrics? This approach is taken by the rasexporter utility.

Rasexporter

The rasexporter utility exports the sqlite3 database written by rasdaemon as Prometheus metrics. Currently it only exports memory check events. It exports all the attributes of the event as Prometheus labels, which is not optimal, because the label values are potentially unbounded (the address label, for example, can have many different values).

A simplified mode will be added, which does not add those labels, but instead offers a REST interface to retrieve the details of the events.
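To make the label problem concrete, the following Go sketch sums events per error type and DIMM label and deliberately leaves the address out of the label set, so that the number of time series stays bounded no matter how many distinct addresses are affected. It is an illustration only and not how rasexporter is implemented; the mcEvent fields and the metric name ras_mc_errors_total are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mcEvent is a hypothetical, reduced view of one memory event.
type mcEvent struct {
	ErrType string // for example "Corrected" or "Uncorrected"
	Label   string // DIMM label
	Address uint64 // many possible values, so not exported as a label
	Count   uint64 // number of errors reported by this event
}

// seriesKey holds only the label values that have a small, bounded range.
type seriesKey struct{ ErrType, Label string }

// aggregate sums the error counts per bounded label set and drops the address.
func aggregate(events []mcEvent) map[seriesKey]uint64 {
	totals := make(map[seriesKey]uint64)
	for _, e := range events {
		totals[seriesKey{e.ErrType, e.Label}] += e.Count
	}
	return totals
}

// render prints the totals in the Prometheus text format, sorted so the
// output is stable. %q quoting is adequate for plain ASCII label values.
func render(totals map[seriesKey]uint64) string {
	keys := make([]seriesKey, 0, len(totals))
	for k := range totals {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].ErrType != keys[j].ErrType {
			return keys[i].ErrType < keys[j].ErrType
		}
		return keys[i].Label < keys[j].Label
	})

	var b strings.Builder
	b.WriteString("# TYPE ras_mc_errors_total counter\n")
	for _, k := range keys {
		fmt.Fprintf(&b, "ras_mc_errors_total{err_type=%q,label=%q} %d\n",
			k.ErrType, k.Label, totals[k])
	}
	return b.String()
}

func main() {
	events := []mcEvent{
		{"Corrected", "DIMM_A1", 0x1000, 1},
		{"Corrected", "DIMM_A1", 0x2000, 2}, // same DIMM, different address
		{"Uncorrected", "DIMM_B1", 0x3000, 1},
	}
	fmt.Print(render(aggregate(events)))
}
```

With the address as a label the three sample events would produce three series; without it they collapse into two, and further errors on the same DIMM only increase an existing counter. The per-event details would then have to come from somewhere else, such as the planned REST interface.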

Building and Running rasexporter

The repository for rasexporter is at https://git.bofh.at/mla/rasexporter

go get git.bofh.at/mla/rasexporter

and

go build git.bofh.at/mla/rasexporter

is good enough to compile rasexporter. Run it like this:

rasexporter --db <path to rasdaemon sqlite database> --listen <listen address>

The db option defaults to rasdaemon.db and the listen option to localhost:7010. The metrics are exposed under the metrics path.