From 8cd0869feba72cf29400f106fdeffb39a92a9aaf Mon Sep 17 00:00:00 2001
From: Anil Panta
Date: Tue, 11 Nov 2025 09:59:59 -0500
Subject: [PATCH 01/11] Reorganize monitoring documentation #617

- Add figures to illustrate the monitoring setup
- Add configuration examples for common monitoring tools
- Add a separate section for each monitoring tool
- Add a note that the dashboards might be outdated
---
 docs/operator/monitoring.md | 816 ++++++++++++++++++------------------
 1 file changed, 415 insertions(+), 401 deletions(-)

diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md
index 78b9d2f24b9..6b38f84dffc 100644
--- a/docs/operator/monitoring.md
+++ b/docs/operator/monitoring.md
@@ -3,429 +3,443 @@ title: Monitoring
 sidebar_label: Monitoring
 ---
 
-There are three different monitoring components:
+# Rucio Monitoring Guide
+
+Rucio provides multiple monitoring components to observe its internal operations, data transfers, file access, and database state. These components include:
+
+- **Internal Monitoring:** Observing Rucio server and daemon performance.
+- **Transfers, Deletion, and More Monitoring:** Tracking transfers, deletions, and other Rucio events.
+- **File/Dataset Access Monitoring:** Using traces to monitor client interactions.
+- **Database Dump and Visualization:** Extracting database-level metrics for visualization.
 
-- Rucio internal monitoring using Graphite/Grafana
-- Transfer monitoring using the messages sent by Hermes
-- File/Dataset Access monitoring using the traces
 
 ## Internal Monitoring
 
 This is to monitor the internals of Rucio servers and daemons, e.g., submission
 rate of the conveyor, state of conveyor queues, reaper deletion rate, server
-response times, server active session, etc. We use Graphite[^1] for this. It's
-easy to set up and then you have to point your Rucio instance to the Graphite
-server using the "carbon_server" options in the "monitor" section in
-etc/rucio.cfg. 
- -The different Rucio components will then send metrics using those "record" -functions you will find all over the code. Graphite has a built-in web interface -to show graphs but more comfortable to use is the Grafana[^2] tool. - -The internal monitoring functions are defined in core/monitor.py, it includes: - -1) record_counter. This is to send the StatsD counter metrics. Counters are the -most basic and default type. They are treated as a count of a type of event per -second, and are, in Graphite, typically averaged over one minute. That is, when -looking at a graph, you are usually seeing the average number of events per -second during a one-minute period. - -2) record_timer. Timers are meant to track how long something took. They are an -invaluable tool for tracking application performance. The statsd server collects -all timers under the stats.timers prefix, and will calculate the lower bound, -mean, 90th percentile, upper bound, and count of each timer for each period (by -the time you see it in Graphite, that’s usually per minute). - -3) record_timer_block. This is the same to record_timer, just for simple using, -to calculate timer of a certain code block. - -4) record_gauge. Gauges are a constant data type. They are not subject to -averaging, and they don’t change unless you change them. That is, once you set a -gauge value, it will be a flat line on the graph until you change it again. - -### Set up the Rucio internal monitoring dashboard - -Set up a Rucio server for development - -```bash -git clone https://github.com/rucio/rucio.git -docker-compose --file etc/docker/dev/docker-compose.yml up --detach -``` - -The command will fire up various containers such as `dev-rucio-1`, `dev-graphite-1`, and -`dev-activemq-1`. `dev-graphite-1` is the one collecting internal -metrics from Rucio. The configurations of Rucio internal metrics sender are -defined under the [monitor] section of rucio.cfg. 
Change the carbon_server and -carbon_port according to your setting - -```toml -[monitor] -carbon_server = graphite -carbon_port = 8125 -user_scope = docker -``` - -The Graphite builtin web page is on port 80 of the host. To use Grafana, setup -Grafana and enable the graphite data source - -```bash -docker pull grafana/grafana -docker run --detach --name grafana --publish 3000:3000 grafana/grafana -``` - -The Grafana web-portal is on port 3000 of the host. Add one data source of the -type Graphite, choose access method to "Browser" and set URL to [http://ip:80](http://ip:80), -where ip is the address of the server hosting the Graphite container -`dev-graphite-1`. - -A set of pre-defined Grafana Rucio internal plots is provided -[here](https://github.com/rucio/rucio/blob/master/tools/monitoring/visualization/rucio-internal.json). -Users could import them directly into Grafana. - -### The list of Rucio internal metrics - -1) Core - -```text -credential.signswift, credential.signs3 (timer) -trace.nongrid_trace -core.request.* (counter) -core.request.archive_request.* (timer) -rule.add_rule, rule.add_rule.*, rule.delete_rule, rule.evaluate_did_detach, rule.evaluate_did_attach.(timer) -trace.trace (counter) -``` - -2) Transfertool - -```text -transfertool.fts3.delegate_proxy.success.*, \ - transfertool.fts3.delegate_proxy.fail.* (timer) -transfertool.fts3.submit_transfer.[externalhost] (timer) -transfertool.fts3.[externalhost].submission.success/failure (counter) -transfertool.fts3.[externalhost].cancel.success/failure (counter) -transfertool.fts3.[externalhost].update_priority.success/failure (counter) -transfertool.fts3.[externalhost].query.success/failure (counter) -transfertool.fts3.[externalhost].whoami.failure (counter) -transfertool.fts3.[externalhost].Version.failure (counter) -transfertool.fts3.[externalhost].query_details.failure (counter) -transfertool.fts3.[externalhost].bulk_query.failure (counter) -transfertool.fts3.[externalhost].query_latest.failure 
(counter) -transfertool.fts3myproxy.[externalhost].submission.success/failure (counter) -``` - -3) Judge - -```text -rule.judge.exceptions.* -``` - -4) Transmogrified - -```text -transmogrifier.addnewrule.errortype.* (counter) -transmogrifier.addnewrule.activity.* (counter) -transmogrifier.did.*.processed (counter) -``` - -5) Tracer - -```text -daemons.tracer.kronos.* (counter) -``` - -6) Reaper - -```text -reaper.list_unlocked_replicas, reaper.delete_replicas (timer) -reaper.deletion.being_deleted, reaper.deletion.done (counter) -daemons.reaper.delete.[scheme].[rse] (timer) -``` - -7) Undertaker - -```text -undertaker.delete_dids, undertaker.delete_dids.exceptions.LocksDetected (counter) -undertaker.rules, undertaker.parent_content, undertaker.content, \ - undertaker.dids (timer) -undertaker.content.rowcount (counter) -``` - -8) Replicarecover - -```text -replica.recoverer.exceptions.* (counter) -``` - -9) Hermes - -```text -daemons.hermes.reconnect.* (counter) +response times, server active session, etc. Metrics are typically categorized as: + +- Counters – measure the number of events (e.g., requests processed). +- Timers/Histograms – measure durations of operations (e.g., rule evaluation time, transfer submission time). +- Gauges – measure values that can go up and down (e.g., number of active sessions, queue sizes). 
+ +```mermaid +%%{init: {'theme': 'base', 'themeVariables': { + 'primaryColor': '#d8e3e7', + 'edgeLabelBackground': '#ffffff', + 'tertiaryColor': '#cdd5d9', + 'fontFamily': 'monospace', + 'primaryBorderColor': '#90a4ae', + 'lineColor': '#90a4ae' +}}}%% +flowchart TB + subgraph RucioTransfer["**Internal Monitoring**"] + A1["Rucio Servers & Daemons"] + + G1["Graphite"] + P1["Prometheus"] + GF["Grafana"] + end + + %% Edges + A1 -- "push" --> G1 + A1 -- "pull or push" --> P1 + G1 --> GF + P1 --> GF + + %% Style definitions + classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; + classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef graphite fill:#555555,stroke:#333333,color:#fff,font-weight:bold; %% Dark gray for Graphite + classDef prometheus fill:#009688,stroke:#00695C,color:#fff,font-weight:bold; %% Teal, distinct + + %% Apply styles + class A1 mono; + class G1 graphite; + class P1 prometheus; + class GF grafana; ``` -10) Conveyor - -```text -daemons.conveyor.[submitter].submit_bulk_transfer.per_file, \ - daemons.conveyor.[submitter].submit_bulk_transfer.files (timer) -daemons.conveyor.[submitter].submit_bulk_transfer (counter) -daemons.conveyor.finisher.000-get_next (timer) -daemons.conveyor.finisher.handle_requests (timer & counter) -daemons.conveyor.common.update_request_state.request-requeue_and_archive (timer) -daemons.conveyor.poller.000-get_next (timer) -daemons.conveyor.poller.bulk_query_transfers (timer) -daemons.conveyor.poller.transfer_lost (counter) -daemons.conveyor.poller.query_transfer_exception (counter) -daemons.conveyor.poller.update_request_state.* (counter) -daemons.conveyor.receiver.error -daemons.conveyor.receiver.message_all -daemons.conveyor.receiver.message_rucio -daemons.conveyor.receiver.update_request_state.* -daemons.conveyor.receiver.set_transfer_update_time -daemons.messaging.fts3.reconnect.* -daemons.conveyor.stager.get_stagein_transfers.per_transfer, \ - 
daemons.conveyor.stager.get_stagein_transfers.transfer (timer)
-daemons.conveyor.stager.get_stagein_transfers (count)
-daemons.conveyor.stager.bulk_group_transfer (timer)
-daemons.conveyor.submitter.get_stagein_transfers.per_transfer, \
-    daemons.conveyor.submitter.get_stagein_transfers.transfer (timer)
-daemons.conveyor.submitter.get_stagein_transfers (count)
-daemons.conveyor.submitter.bulk_group_transfer (timer)
-daemons.conveyor.throttler.set_rse_transfer_limits.\
-    [rse].max_transfers/transfers/waitings (gauge)
-daemons.conveyor.throttler.delete_rse_transfer_limits.[rse] (counter)
-daemons.conveyor.throttler.delete_rse_transfer_limits.[activity].[rse] (counter)
-daemons.conveyor.throttler.set_rse_transfer_limits.[activity].[rse] (gauge)
-daemons.conveyor.throttler.release_waiting_requests.[activity].[rse].[account] (counter)
-```
+There are two options:
+
+1. Graphite
+
+   Metrics are pushed to a Graphite server.
+
+   ```cfg
+   [monitor]
+   # specify the hostname for carbon server
+   carbon_server =
+   carbon_port = 8125
+   user_scope = rucio
+   ```
+
+2. Prometheus
+
+   Metrics can be scraped by Prometheus or, for short-lived processes, optionally pushed to a Prometheus Pushgateway. Metrics are exposed over an HTTP endpoint (`/metrics`) for scraping. Multiprocess-safe metrics, for deployments using multiple threads or Apache MPM subprocesses, are supported via the PROMETHEUS_MULTIPROC_DIR environment variable.
+
+   ```cfg
+   [monitor]
+   # Enable Prometheus metrics
+   enable_metrics = True
+   # Port for Prometheus HTTP server
+   metrics_port = 8080
+   ```
+
+The metrics currently in use can be found via the following code-search links:
+
+- [Counter](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.Counter&type=code)
+- [Gauge](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.gauge&type=code)
+- [Timer](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.timer&type=code)
+
+A [Grafana Dashboard JSON](https://github.com/rucio/rucio/blob/master/tools/monitoring/visualization/rucio-internal.json) for Graphite is given here.
+A [Grafana Dashboard JSON](https://github.com/rucio/monitoring-templates/blob/main/prometheus-monitoring/Dashboards/Rucio-Internal.json) for Prometheus is given here.
+
+Note: The dashboard examples are only meant to give an idea; they may need to be tweaked according to your setup and needs.
+
+## Transfers, Deletion and Other Monitoring
+Rucio generates a large volume of operational events for activities such as transfers, deletions, rule evaluations, and replication tasks, originating from daemons like the conveyor, reaper, and judge.
+
+These events are collected and delivered by the Hermes daemon, which can forward them to message queues or storage backends for further processing, analysis, and storage. Depending on your storage backend, you can use visualization software such as Grafana or Kibana.
+ +```mermaid +%%{init: {'theme': 'base', 'themeVariables': { + 'primaryColor': '#d8e3e7', + 'edgeLabelBackground': '#ffffff', + 'tertiaryColor': '#cdd5d9', + 'fontFamily': 'monospace', + 'primaryBorderColor': '#90a4ae', + 'lineColor': '#90a4ae' +}}}%% +flowchart TB + subgraph RucioTransfer["**Transfer, Deletion Traces & Other Monitoring**"] + A2["Hermes Daemon"] + Q1["ActiveMQ"] + ETL["ETL / Data Pipeline"] + OS1["OpenSearch / Elasticsearch / InfluxDB"] + KB["Grafana / Kibana"] + end + + A2 -- direct write --> OS1 + A2 -- publish --> Q1 + Q1 -- consume --> ETL + ETL --> OS1 + OS1 --> KB + + classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; + classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef opensearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; + classDef mq fill:#f69f03,stroke:#b35c00,color:#fff,font-weight:bold; + classDef etl fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold; + + + class A2,A3 mono; + class Q1 mq; + class OS1 opensearch; + class KB grafana; + class ETL etl; ``` -11) Necromancer - -```text -necromancer.badfiles.lostfile, necromancer.badfiles.recovering (counter) -``` - -## Transfer monitoring - -If a transfer is submitted, queued, waiting, done or failed messages are sent to -ActiveMQ via Hermes and are also archived in the messages_history table. Same is -true for deletions. In the case of ATLAS we have a dedicated monitoring -infrastructure that reads the messages from -[ActiveMQ](https://activemq.apache.org), aggregates them and then writes the -aggregated data into ElasticSearch/InfluxDB from where it then can be visualised -using Kibana/Grafana. - -### Set up the Rucio internal monitoring dashboard - -1) Configure Rucio - -Rucio need to be configured to enable the message broker. In Rucio, message are -sent by the Hermes daemon. 
Settings are defined in the rucio.cfg under the
-[messaging-hermes] section
-
-```toml
-[messaging-hermes]
-username =
-password =
-port = 61613
-nonssl_port = 61613
-use_ssl = False
-ssl_key_file = /etc/grid-security/hostkey.pem
-ssl_cert_file = /etc/grid-security/hostcert.pem
-destination = /topic/rucio.events
-brokers = activemq
-voname = atlas
-email_from =
-email_test =
-```
+The different options are shown in the figure above and described below.
+
+1. Queue-Based Pipelines
+
+   Hermes publishes events to a queue or topic in a message queue such as ActiveMQ. Multiple consumers can process events independently, which enables real-time, decoupled processing pipelines. The events in ActiveMQ can be consumed by ETL pipelines, which allow aggregation, transformation, enrichment, and forwarding to the storage backend of your choice.
+
+   Example pipeline: ActiveMQ -> Logstash -> OpenSearch
+
+   The configuration for this option is described below.
+
+   ```cfg
+   [hermes]
+   # List of services Hermes should send messages to.
+   services_list = activemq
+
+   # Toggle query behavior:
+   #   True  -> fetch bulk messages for each service individually
+   #   False -> fetch bulk messages across all services together
+   query_by_service = True
+
+   # Bulk retrieval size for each call to the database
+   bulk = 1000
+
+   [messaging-hermes]
+   # ActiveMQ options
+   # List of broker hostnames or DNS aliases
+   brokers = amq1.example.com, amq2.example.com
+   # Destination queue or topic
+   destination = /queue/rucio
+   # Use SSL for ActiveMQ connection
+   use_ssl = True
+   # SSL certificate files (if using SSL)
+   ssl_cert_file = /etc/rucio/certs/hermes-client-cert.pem
+   ssl_key_file = /etc/rucio/certs/hermes-client-key.pem
+   # Virtual host, optional
+   broker_virtual_host = /
+   # Non-SSL port (used if use_ssl=False)
+   nonssl_port = 61613
+   # SSL port (used if use_ssl=True)
+   port = 61614
+   # ActiveMQ username/password (used if use_ssl=False)
+   username =
+   password =
+   ```
+2.
Direct Delivery
+
+   These options send events directly to storage or alerting systems, bypassing queues.
+   Hermes can write events straight to Elasticsearch, OpenSearch, or InfluxDB. In addition, it can also deliver events via email, with support for custom SMTP servers, credentials, and SSL/TLS.
+
+   The configuration options for each type are described below.
+
+   ```cfg
+   # rucio.cfg
+   # =========================
+   # Hermes Daemon Configuration
+   # =========================
+
+   [hermes]
+   # List of services Hermes should send messages to.
+   # Supported values: influx, elastic, email, activemq
+   services_list = elastic, influx, email, activemq
+
+   # Toggle query behavior:
+   #   True  -> fetch bulk messages for each service individually
+   #   False -> fetch bulk messages across all services together
+   query_by_service = True
+
+   # Bulk retrieval size for each call to the database
+   bulk = 1000
+
+   # InfluxDB endpoint for sending aggregated metrics
+   influxdb_endpoint = https://influxdb-host:8086/api/v2/write?org=my-org&bucket=my-bucket&precision=ns
+   # Token for authenticating to InfluxDB
+   influxdb_token = my-secret-influxdb-token
+
+   # Elasticsearch endpoint for sending events
+   elastic_endpoint = https://elasticsearch-host:9200/rucio-eic-event/_bulk
+   # Optional credentials if Elasticsearch is secured
+   elastic_username = admin
+   elastic_password = password
+
+   # Email sending options
+   send_email = True
+   email_from = rucio@cern.ch
+   smtp_host = smtp.cern.ch
+   smtp_port = 587
+   smtp_username = my-smtp-user
+   smtp_password = my-smtp-pass
+   smtp_usessl = False
+   smtp_usetls = True
+   smtp_certfile =
+   smtp_keyfile =
+   ```
+### Event Types
+1.
Transfer Events
+   ```
+   {
+     created_at: when the message was created (yyyy-MM-ddTHH:mm:ss.SSSSSS)
+     event_type: type of this event (transfer-submitted, \
+       transfer-submission_failed, transfer-queued, transfer-failed, \
+       transfer-done)
+     payload: {
+       account: account submitting the request
+       activity: activity of the request
+       bytes: size of the transferred file (bytes)
+       checksum-adler: checksum using the adler32 algorithm
+       checksum-md5: checksum using the md5 algorithm
+       created_at: time when the message was created (yyyy-MM-dd HH:mm:ss.SSSSSS)
+       dst-rse: destination rse
+       dst-type: type of destination rse (disk, tape)
+       dst-url: destination url of the transferred file
+       duration: duration of the transfer (seconds)
+       event_type: type of this event (transfer-submitted, \
+         transfer-submission_failed, transfer-queued, \
+         transfer-failed, transfer-done)
+       file-size: same as bytes
+       guid: guid of the transfer
+       name: name of the transferred file
+       previous-request-id: id of the previous request
+       protocol: transfer protocol
+       reason: reason of the failure
+       request-id: id of this request
+       scope: scope of the transferred data
+       src-rse: source rse
+       src-type: type of source rse (disk, tape)
+       src-url: source file url
+       started_at: start time of the transfer (yyyy-MM-dd HH:mm:ss.SSSSSS)
+       submitted_at: submission time of the transfer (yyyy-MM-dd HH:mm:ss.SSSSSS)
+       tool-id: id of the transfer tool in rucio (rucio-conveyor)
+       transfer-endpoint: endpoint holder of the transfer (fts)
+       transfer-id: uuid of this transfer
+       transfer-link: link of this transfer (in the form of an fts url)
+       transferred_at: completion time of this transfer
+     }
+   }
+   ```
+2.
Deletion Event
+   ```
+   {
+     created_at: when the message was created (yyyy-MM-ddTHH:mm:ss.SSSSSS)
+     event_type: type of this event (deletion-done, deletion-failed, deletion-not-found)
+     payload: {
+       scope: scope of the deleted replica
+       name: name of the deleted replica
+       rse: rse holding the deleted replica
+       file-size: size of the file
+       bytes: size of the file
+       url: url of the file
+       duration: duration of the deletion
+       protocol: protocol used in the deletion
+       reason: reason of the failure
+     }
+   }
+   ```
+3. Rule Event
+   ```
+   {
+     created_at: when the message was created (yyyy-MM-ddTHH:mm:ss.SSSSSS)
+     event_type: 'RULE_OK' or 'RULE_PROGRESS'
+     payload: {
+       'scope': scope.external,
+       'name': name,
+       'rule_id': rule_id,              # only for RULE_OK and RULE_PROGRESS
+       'vo': vo,                        # only if not default
+       'progress': int,                 # replication progress, only for RULE_PROGRESS
+       'dataset_name': dataset_name,    # only for LOST
+       'dataset_scope': dataset_scope   # only for LOST
+     }
+   }
+   ```
+4. Dataset Lock Event
+   ```
+   {
+     created_at: when the message was created (yyyy-MM-ddTHH:mm:ss.SSSSSS)
+     event_type: 'DATASETLOCK_OK'
+     payload: {
+       'scope': did_scope,
+       'name': did_name,
+       'rse': rse,
+       'rse_id': rse_id,
+       'rule_id': rule_id,
+       'vo': vo   # only if not default
+     }
+   }
+   ```
+
+There are other events for replicas, DIDs, etc. that are not listed here.
+
+### Dashboard
+A [Kibana dashboard](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization) example is given here.
+A [Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/message-monitoring/Dashboards/Rucio-Transfer.json) example for transfers with Elasticsearch/OpenSearch is given here.
+
+Note: The dashboard examples are only meant to give an idea; they may need to be tweaked according to your setup and needs, and they may also be based on old versions.
+
+## Traces
+The traces are sent by the pilots or the Rucio clients whenever a file is downloaded/uploaded.
These trace events are sent to the Rucio server via the /traces endpoint using HTTPS POST, where they are forwarded to messaging backends such as ActiveMQ. ActiveMQ acts as the message broker, delivering trace events to the Kronos daemon. Any consumer, such as Logstash, can then be used to relay traces to data pipelines for further processing if needed. Either directly or after processing, the traces can be sent to storage backends such as OpenSearch, Elasticsearch, or InfluxDB, which allow querying, aggregation, and analytics. Finally, visualization tools like Grafana and Kibana can be used on top.
+
+This is shown in the figure below. Schemas of the traces, which can be used for building dashboards, can be found in [trace.py](https://github.com/rucio/rucio/blob/master/lib/rucio/core/trace.py).
+
+
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': {
+  'primaryColor': '#d8e3e7',
+  'edgeLabelBackground': '#ffffff',
+  'tertiaryColor': '#cdd5d9',
+  'fontFamily': 'monospace',
+  'primaryBorderColor': '#90a4ae',
+  'lineColor': '#90a4ae'
+}}}%%
+flowchart TB
+    subgraph RucioTraceFlow["**Rucio Trace Flow**"]
+        C1["Clients / Pilots"]
+        RS["Rucio Server (/traces endpoint)"]
+        Q1["ActiveMQ"]
+        KR["Rucio Daemon: Kronos"]
+        ETL["ETL / Data Pipeline"]
+        OS1["OpenSearch / Elasticsearch / InfluxDB"]
+        KB["Grafana / Kibana"]
+    end
+
+    C1 -- traces (HTTPS POST) --> RS
+    RS -- publish --> Q1
+    Q1 -- consume --> KR
+    Q1 -- consume --> ETL
+    ETL --> OS1
+    OS1 --> KB
+
+    classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px;
+    classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold;
+    classDef opensearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold;
+    classDef mq fill:#f69f03,stroke:#b35c00,color:#fff,font-weight:bold;
+    classDef etl fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold;
+
+    class C1,RS,KR mono;
+    class Q1 mq;
+    class OS1 opensearch;
+    class KB grafana;
+    class ETL etl;
+```
-
-The default settings are listed above. 
If ssl is not used, set use_ssl to False -and define username and password. They should be "admin", "admin" for the -default activemq settings. If you are not using the containers created by the -docker-compose command, change the brokers and port to the server hosting the -message queue. - -2) Setup Elasticsearch and Kibana - -Next is to setup and configure Elasticsearch and Kibana for storing and -visualising the messages. This is an example of creating them in containers - -```bash -docker run --detach \ - --publish 9200:9200 \ - --publish 9300:9300 \ - --env "discovery.type=single-node" \ - --name elasticsearch \ - docker.elastic.co/elasticsearch/elasticsearch:7.8.1 - -docker run --detach \ - --link elasticsearch \ - --publish 5601:5601 \ - --name kibana \ - docker.elastic.co/kibana/kibana:7.8.1 +## Rucio database dumping +Database-level monitoring extracts different information directly from the Rucio database. This includes insights such as RSE usage statistics, account quotas, and other metadata relevant to experiments. These data are periodically queried and exported to external storage backends for visualization and long-term monitoring. + +Some Logstash pipeline definitions are given [here](https://github.com/rucio/rucio/tree/master/tools/monitoring/logstash-pipeline). These example pipelines use the Logstash JDBC input plugin to connect to the Rucio PostgreSQL database, execute SQL queries, and extract structured data periodically. The retrieved records are then sent to Elasticsearch but can be changed to other storage backends such as OpenSearch. 
+
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': {
+  'primaryColor': '#d8e3e7',
+  'edgeLabelBackground': '#ffffff',
+  'tertiaryColor': '#cdd5d9',
+  'fontFamily': 'monospace',
+  'primaryBorderColor': '#90a4ae',
+  'lineColor': '#90a4ae'
+}}}%%
+flowchart TB
+    subgraph DB["**Database Level Accounting/Monitoring**"]
+        DB1[("Rucio DB")]
+        LS["Logstash JDBC Input"]
+        OS2["OpenSearch / Elasticsearch"]
+        GD["Grafana / Kibana"]
+    end
+
+    DB1 --> LS
+    LS --> OS2
+    OS2 --> GD
+
+    classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px;
+    classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold;
+    classDef opensearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold;
+    classDef logstash fill:#b34700,stroke:#b34700,color:#fff,font-weight:bold;
+
+
+    class DB1,LS mono;
+    class LS logstash;
+    class OS2 opensearch;
+    class GD grafana;
+```
-
-3) Import Elasticsearch indices
-
-Before transferring messages from the message queue to Elasticsearch, indices
-need to be defined in Elasticsearch. This is a list of the message formats of
-Rucio. 
- -### Transfer events - -```jsi -{ - created_at: when the message was created (yyyy-MM-dd HH:mm:ss.SSSSSS) - event_type: type of this event (transfer-submitted, \ - transfer-submittion_failed, transfer-queued, transfer-failed, \ - transfer-done) - payload: { - account: account submitting the request - activity: activity of the request - bytes: size of the transferred file (byte) - checksum-adler: checksum using adler algorithm - checksum-md5: checksum using md5 algorithm - created_at: Time when the message was created (yyyy-MM-dd HH:mm:ss.SSSSSS) - dst-rse: destination rse - dst-type: type of destination rse (disk, tape) - dst-url: destination url of the transferred file - duration: duration of the transfer (second) - event_type: type of this event (transfer-submitted, \ - transfer-submittion_failed, transfer-queued, \ - transfer-failed, transfer-done) - file-size: same as bytes - guid: guid of the transfer - name: name of transferred file - previous-request-id: id of previous request - protocol: transfer protocol - reason: reason of the failure - request-id: id of this request - scope: scope of the transferred data - src-rse: source rse - src-type: type of source rse (disk, tape) - src-url: source file url - started_at: start time of the transfer - submitted_at: submission time of the transfer - tool-id: id of the transfer tool in rucio (rucio-conveyor) - transfer-endpoint: endpoint holder of the transfer (fts) - transfer-id: uuid of this transfer - transfer-link: link of this transfer (in form of fts url) - transferred_at: done time of this transfer - } -} +A typical Logstash configuration consists of three sections — input, filter, and output. 
For example, the input section defines the PostgreSQL connection and SQL query to fetch data: ``` - -### Deletion events - -```json -{ - created_at: when the message was created (yyyy-MM-dd HH:mm:ss.SSSSSS) - event_type: type of this event (deletion-done,deletion-failed) - payload: { - scope: scope of the deleted replica - name: name of the deleted replica - rse: rse holding the deleted replica - file-size: size of the file - bytes: size of the file - url: url of the file - duration: duration of the deletion - protocol: prococol used in the deletion - reason: reason of the failure - } +input { + jdbc { + jdbc_connection_string => "" + jdbc_user => "" + jdbc_password => "" + jdbc_driver_library => "/usr/share/logstash/java/postgresql-42.2.6.jar" + jdbc_driver_class => "org.postgresql.Driver" + statement => "SELECT rses.rse, rse_usage.source, rse_usage.used, rse_usage.free, rse_usage.files FROM rse_usage INNER JOIN rses ON rse_usage.rse_id=rses.id WHERE rse_usage.files IS NOT NULL AND rse_usage.files!=0;" + schedule => "0 0 * * *" + } } ``` - -The formats of them are defined in [`rucio-transfer.json`](https://github.com/rucio/rucio/blob/master/tools/monitoring/rucio-transfer.json) -and [`rucio_deletion.json`](https://github.com/rucio/rucio/blob/master/tools/monitoring/rucio-deletion.json) -which could be imported into Kibana. - -Rucio also sends messages when adding/deleting rules/DIDs and for file/dataset -access. So the monitoring is not limited to data transferring. - -4) Transmit messages from message queue to Elastisearch - -This could be done via Logstash. Please refer to [Elastic's documentation.](https://www.elastic.co/blog/integrating-jms-with-elasticsearch-service-using-logstash). 
- -Alternatively you could use a simple Python script such as [`extract.py`](https://github.com/rucio/rucio/blob/master/tools/monitoring/extract.py) for -this after installing the required tools - -```bash -pip install --upgrade pip -pip install elasticsearch -wget https://files.pythonhosted.org/packages/52/7e/22ca617f61e0d5904e06c1ebd5d453adf30099526c0b64dca8d74fff0cad/stomp.py-4.1.22.tar.gz -tar --extract --gzip --verbose --file stomp.py-4.1.22.tar.gz -cd stomp.py-4.1.22 -python setup.py install -``` - -Change the configurations (message broker and elastisearch cluster) in -exporter.py and start it. It could be made as a systemd service for convenience. - -5) Create Kibana dashboards based on the imported messages. - -A set of pre-defined dashboards can be found -[here](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization) in -json format which could be imported to Kibana directly. But you may have to -resolve different UUIDs in Kibana. - -## Access monitoring - -The traces are sent by the pilots or the Rucio clients whenever a file is -downloaded/uploaded. This is similar with the data transferring monitoring. - -## Rucio database dumping - -Besides the internal, data transferring/deletion/accessing monitoring, it's also -possible to dump the Rucio internal database directly to Elasticsearch. Then -information like data location, accounting, RSE summary could be visualised -using Kibana or Grafana. - -We provide several examples of dumping Rucio DB tables using the logstash jdbc -plugin and making plots based on them. - -To start a logstash pipeline, run - -```bash -logstash -f rse.conf +The output section defines where the extracted data are delivered. 
In most deployments, these are indexed into OpenSearch or Elasticsearch for analytics dashboards in Grafana or Kibana:
+```
+output {
+  elasticsearch {
+    hosts => ["http://elasticsearch:9200"]
+    action => "index"
+    index => "rucio_account"
+    user => "elastic"
+    password => "password"
+  }
+}
+```
-
-The rse pipeline dumps data like how large is the total space, how large is the
-used space, how many files are saved on each RSE etc. Please fill in the jdbc
-connection details and Elastisearch connection details in the config file.
-
-More pipeline definitions can be found [here](https://github.com/rucio/rucio/tree/master/tools/monitoring/logstash-pipeline),
-and users could design their own DB queries for their specific monitoring
-needs. Also users could directly import the Elasticsearch indices and Kibana
-dashboard from [these](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization/db_dump).
-json files.
-
-## Footnotes
-
-[^1]: [https://graphiteapp.org/]
-[^2]: [https://grafana.com/]
+Some [Kibana dashboard](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization/db_dump) examples are given here.
+A [Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) example for RSE usage is given here.
+
+Note: The dashboard examples are only meant to give an idea; they may need to be tweaked according to your setup and needs, and they may also be based on old versions.
From eaf01ebbbf2abe4f502ad49a4356114e256756cd Mon Sep 17 00:00:00 2001 From: Anil Panta Date: Wed, 12 Nov 2025 09:29:09 -0500 Subject: [PATCH 02/11] some language fixes #617 --- docs/operator/monitoring.md | 44 +++++++++++++++++++++++++------------ 1 file changed, 30 insertions(+), 14 deletions(-) diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md index 6b38f84dffc..fd882fc79d2 100644 --- a/docs/operator/monitoring.md +++ b/docs/operator/monitoring.md @@ -7,10 +7,10 @@ sidebar_label: Monitoring Rucio provides multiple monitoring components to observe its internal operations, data transfers, file access, and database state. These components include: -- **Internal Monitoring:** Observing Rucio server and daemon performance. -- **Transfers, Deletion, and More Monitoring:** Tracking transfers, deletions, and other Rucio events. -- **File/Dataset Access Monitoring:** Using traces to monitor client interactions. -- **Database Dump and Visualization:** Extracting database-level metrics for visualization. +- [**Internal Monitoring**](#internal-monitoring): Observing Rucio server and daemon performance. +- [**Transfers, Deletion, and Other Monitoring**](#transfers-deletion-and-other-monitoring): Tracking transfers, deletions, and other Rucio events. +- [**File/Dataset Access Monitoring**](#traces):Using traces to monitor client interactions. +- [**Database Dump and Visualization**](#rucio-database-dump) Extracting database-level metrics for visualization. 
## Internal Monitoring @@ -110,8 +110,8 @@ These events are collected and delivered by the Hermes daemon, which can forward 'lineColor': '#90a4ae' }}}%% flowchart TB - subgraph RucioTransfer["**Transfer, Deletion Traces & Other Monitoring**"] - A2["Hermes Daemon"] + subgraph RucioTransfer["**Transfer, Deletion & Other Monitoring**"] + A2["Rucio Daemon: Hermes"] Q1["ActiveMQ"] ETL["ETL / Data Pipeline"] OS1["OpenSearch / Elasticsearch / InfluxDB"] @@ -232,6 +232,7 @@ Different options are shown in figure and described below. smtp_keyfile = ``` ### Event Types +Different event types are listed below with their payload structure. 1. Transfer Events ``` { @@ -375,10 +376,10 @@ flowchart TB class ETL etl; ``` -## Rucio database dumping +## Rucio database dump Database-level monitoring extracts different information directly from the Rucio database. This includes insights such as RSE usage statistics, account quotas, and other metadata relevant to experiments. These data are periodically queried and exported to external storage backends for visualization and long-term monitoring. -Some Logstash pipeline definitions are given [here](https://github.com/rucio/rucio/tree/master/tools/monitoring/logstash-pipeline). These example pipelines use the Logstash JDBC input plugin to connect to the Rucio PostgreSQL database, execute SQL queries, and extract structured data periodically. The retrieved records are then sent to Elasticsearch but can be changed to other storage backends such as OpenSearch. +Some Logstash pipeline definitions are given [here](https://github.com/rucio/rucio/tree/master/tools/monitoring/logstash-pipeline). These example pipelines use the Logstash JDBC input plugin to connect to the Rucio PostgreSQL database, execute SQL queries, and extract structured data periodically. The retrieved records are then sent to Elasticsearch but can be changed to other storage backends such as OpenSearch. 
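Pipelines like these often enrich records before indexing them. The following hedged Python sketch shows one such enrichment, converting byte counts to gigabytes; in a real deployment this would be done inside the Logstash pipeline itself (e.g. in a filter), and the `_gb` field names are invented for illustration.

```python
def enrich(record):
    """Add gigabyte-valued fields to a usage record whose sizes are in bytes.

    Illustrative only: field names follow the SQL query of the example
    pipeline; the *_gb fields are hypothetical additions.
    """
    gib = 1024 ** 3
    enriched = dict(record)  # keep the original fields untouched
    for field in ("used", "free"):
        if field in record:
            enriched[field + "_gb"] = record[field] / gib
    return enriched

print(enrich({"rse": "SITE_DISK", "used": 2 * 1024 ** 3})["used_gb"])  # 2.0
```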
The following diagram shows the high-level flow for database-level monitoring using Logstash.
 ```mermaid
 %%{init: {'theme': 'base', 'themeVariables': {
@@ -417,29 +418,44 @@ A typical Logstash configuration consists of three sections — input, filter, a
 ```
 input {
   jdbc {
-    jdbc_connection_string => ""
+    jdbc_connection_string => "jdbc:postgresql://host:5432/"
     jdbc_user => ""
     jdbc_password => ""
-    jdbc_driver_library => "/usr/share/logstash/java/postgresql-42.2.6.jar"
+    jdbc_driver_library => "/usr/share/logstash/java/postgresql-<version>.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT rses.rse, rse_usage.source, rse_usage.used, rse_usage.free, rse_usage.files FROM rse_usage INNER JOIN rses ON rse_usage.rse_id=rses.id WHERE rse_usage.files IS NOT NULL AND rse_usage.files!=0;"
    schedule => "0 0 * * *"
  }
}
-```
-The output section defines where the extracted data are delivered. In most deployments, these are indexed into OpenSearch or Elasticsearch for analytics dashboards in Grafana or Kibana:
-```
+
+filter {
+  # Placeholder for transformations or enrichments
+  # Examples:
+  # - Add computed fields
+  # - Rename fields
+  # - Convert units (e.g., bytes to GB)
+  # - Drop unwanted fields
+}
+
+
 output { elasticsearch {
    hosts => ["http://elasticsearch:9200"]
     action => "index"
-    index => "rucio_account"
+    index => "rucio_rse"
     user => "elastic"
     password => "password"
   }
 }
 ```
+A few points:
+- jdbc_driver_library: Can be downloaded from [jdbc.postgresql.org](https://jdbc.postgresql.org/); choose the version you want to use and make it available to Logstash.
+- schedule: Defines how often the query runs (cron-like syntax).
+- output: Defines where the extracted data are delivered. In most deployments, these are indexed into OpenSearch or Elasticsearch for analytics dashboards in Grafana or Kibana.
+- filter: This is optional.
 It helps in preprocessing your data before indexing.
+
 An example [Kibana dashboard](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization/db_dump) is given.
 An example [Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) for RSE storage is given.
+
 Note: The dashboard examples are only meant to give an idea; they might need to be tweaked according to your setup and needs, and they might be based on older versions.

From 6680047bd3fde897c4c9fb4859f1dbb5663c4d98 Mon Sep 17 00:00:00 2001
From: Anil Panta
Date: Thu, 13 Nov 2025 14:35:49 -0500
Subject: [PATCH 03/11] Add Probes and its related info #617

---
 docs/operator/monitoring.md | 119 ++++++++++++++++++++++++++++++------
 1 file changed, 102 insertions(+), 17 deletions(-)

diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md
index fd882fc79d2..893a4303c85 100644
--- a/docs/operator/monitoring.md
+++ b/docs/operator/monitoring.md
@@ -10,7 +10,8 @@ Rucio provides multiple monitoring components to observe its internal operations

 - [**Internal Monitoring**](#internal-monitoring): Observing Rucio server and daemon performance.
 - [**Transfers, Deletion, and Other Monitoring**](#transfers-deletion-and-other-monitoring): Tracking transfers, deletions, and other Rucio events.
 - [**File/Dataset Access Monitoring**](#traces):Using traces to monitor client interactions.
-- [**Database Dump and Visualization**](#rucio-database-dump) Extracting database-level metrics for visualization.
+- [**Database Dump and Visualization**](#rucio-database-dump): Extracting database-level metrics for visualization.
+- [**Probes**](#rucio-monitoring-probes): Automated checks and using Nagios or Prometheus Pushgateway.
## Internal Monitoring @@ -119,21 +120,21 @@ flowchart TB end A2 -- direct write --> OS1 - A2 -- publish --> Q1 + A2 -- publish(STOMP) --> Q1 Q1 -- consume --> ETL ETL --> OS1 OS1 --> KB classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; - classDef opensearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; + classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; classDef mq fill:#f69f03,stroke:#b35c00,color:#fff,font-weight:bold; classDef etl fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold; class A2,A3 mono; class Q1 mq; - class OS1 opensearch; + class OS1 OpenSearch; class KB grafana; class ETL etl; ``` @@ -142,7 +143,7 @@ Different options are shown in figure and described below. 1. Queue-Based Pipelines - Hermes publishes events to a queue or topic in message queue (ActiveMQ). Multiple consumers can process events independently. Enables real-time, decoupled processing pipelines. These events from ActiveMQ can be consumed by ETL pipelines. These Pipelines allow aggregation, transformation, enrichment, and forwarding to different storage backends of your choice. + Hermes publishes events to a queue or topic in message queue (like ActiveMQ) via STOMP. Multiple consumers can process events independently. Enables real-time, decoupled processing pipelines. These events from ActiveMQ can be consumed by ETL pipelines. These Pipelines allow aggregation, transformation, enrichment, and forwarding to different storage backends of your choice. Example pipeline : ActiveMQ -> Logstash -> OpenSearch @@ -185,7 +186,7 @@ Different options are shown in figure and described below. 2. Direct Delivery These options send events directly to storage or alerting systems, bypassing queues. - Hermes can write events straight to Elasticsearch, OpenSearch, or InfluxDB. 
In addtion can also deliver events via email which supports custom SMTP servers, credentials, and SSL/TLS. + Hermes can write events straight to Elasticsearch, OpenSearch, or InfluxDB. In addition can also deliver events via email which supports custom SMTP servers, credentials, and SSL/TLS. Configuration option for each type is described below. @@ -214,7 +215,7 @@ Different options are shown in figure and described below. influxdb_token = my-secret-influxdb-token # Elasticsearch endpoint for sending events - elastic_endpoint = https://elasticsearch-host:9200/rucio-eic-event/_bulk + elastic_endpoint = https://Elasticsearch-host:9200/rucio-eic-event/_bulk # Optional credentials if Elasticsearch is secured elastic_username = admin elastic_password = password @@ -252,7 +253,7 @@ Different event types are listed below with their payload structure. dst-url: destination url of the transferred file duration: duration of the transfer (second) event_type: type of this event (transfer-submitted, \ - transfer-submittion_failed, transfer-queued, \ + transfer-submission_failed, transfer-queued, \ transfer-failed, transfer-done) file-size: same as bytes guid: guid of the transfer @@ -325,13 +326,13 @@ Different event types are listed below with their payload structure. There are other event for replicas, dids etc not stated here. ### Dashboard -[Kibana Dashbaord](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization) example was given. -[Grafana Dashboard](https://github.com/rucio/monitoring-templates/blob/main/message-monitoring/Dashboards/Rucio-Transfer.json) for transfer for elaticsearch/opensearch example given. +[Kibana Dashboard](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization) example was given. +[Grafana Dashboard](https://github.com/rucio/monitoring-templates/blob/main/message-monitoring/Dashboards/Rucio-Transfer.json) for transfer for Elasticsearch/OpenSearch example given. 
Note: Dashboard example is just for giving some idea, they might need to be tweaked according to your setup and needs. They might be also be on old versions. ## Traces -The traces are sent by the pilots or the rucio clients whenever a file is downloaded/uploaded. These trace events are sent to the Rucio server via the /traces endpoint using HTTPS POST, where they are forwarded to messaging backends such as ActiveMQ. ActiveMQ acts as the messaging broker, delivering trace events to Kronos daemon. Any consumer like logstash can the be used for relaying traces to data pipelines for further processing if needed. And then directly or after processing be sent to storage backends such as OpenSearch, Elasticsearch, or InfluxDB, which allow querying, aggregation, and analytics. Finally, visualization tools like Grafana and Kibana can be used. +The traces are sent by the pilots or the rucio clients whenever a file is downloaded/uploaded. These trace events are sent to the Rucio server via the /traces endpoint using HTTPS POST, where they are forwarded to messaging backends such as ActiveMQ via STOMP. ActiveMQ acts as the messaging broker, delivering trace events to Kronos daemon. Any consumer like logstash can the be used for relaying traces to data pipelines for further processing if needed. And then directly or after processing be sent to storage backends such as OpenSearch, Elasticsearch, or InfluxDB, which allow querying, aggregation, and analytics. Finally, visualization tools like Grafana and Kibana can be used. This is shown in figure below. Schemas of the traces can be found in [trace.py](https://github.com/rucio/rucio/blob/master/lib/rucio/core/trace.py) which can be used for dashboards. 
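As a hedged sketch of what such a client-side trace submission could look like: the field names below are illustrative only (the authoritative schema is in trace.py, linked above), and the server URL is a placeholder.

```python
import json
import urllib.request

def build_trace(scope, name, event_type, client_state):
    """Assemble a minimal trace payload.

    Field names are illustrative; consult lib/rucio/core/trace.py for the
    schema the server actually validates.
    """
    return {
        "eventType": event_type,      # e.g. "download" or "upload"
        "clientState": client_state,  # e.g. "DONE" or "FAILED"
        "scope": scope,
        "filename": name,
    }

def send_trace(server, trace):
    """POST the trace to the server's /traces endpoint over HTTPS."""
    req = urllib.request.Request(
        server.rstrip("/") + "/traces/",
        data=json.dumps(trace).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)  # network call, not executed here

trace = build_trace("user.jdoe", "file_1.root", "download", "DONE")
print(trace["eventType"])  # download
```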
@@ -357,21 +358,21 @@ flowchart TB end C1 -- traces (HTTPS POST) --> RS - RS -- publish --> Q1 - Q1 -- consume --> KR + RS -- publish(STOMP) --> Q1 + Q1 -- consume(STOMP) --> KR Q1 -- consume --> ETL ETL --> OS1 OS1 --> KB classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; - classDef opensearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; + classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; classDef mq fill:#f69f03,stroke:#b35c00,color:#fff,font-weight:bold; classDef etl fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold; class C1,RS,KR mono; class Q1 mq; - class OS1 opensearch; + class OS1 OpenSearch; class KB grafana; class ETL etl; ``` @@ -381,6 +382,8 @@ Database-level monitoring extracts different information directly from the Rucio Some Logstash pipeline definitions are given [here](https://github.com/rucio/rucio/tree/master/tools/monitoring/logstash-pipeline). These example pipelines use the Logstash JDBC input plugin to connect to the Rucio PostgreSQL database, execute SQL queries, and extract structured data periodically. The retrieved records are then sent to Elasticsearch but can be changed to other storage backends such as OpenSearch. The following diagram shows the high-level flow for database-level monitoring using Logstash. +Note: While the examples above use Logstash for database-level monitoring, you can replace Logstash with other data ingestion options depending on your requirements. 
+ ```mermaid %%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#d8e3e7', @@ -404,13 +407,13 @@ flowchart TB classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; - classDef opensearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; + classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; classDef logstash fill:#b34700,stroke:#b34700,color:#fff,font-weight:bold; class DB1,LS mono; class LS logstash; - class OS2 opensearch; + class OS2 OpenSearch; class GD grafana; ``` @@ -459,3 +462,85 @@ Some [Kibana dashboard](https://github.com/rucio/rucio/tree/master/tools/monitor [Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) example for rse given. Note: Dashboard example is just for giving some idea, they might need to be tweaked according to your setup and needs. They might be also be on old versions. + +## Rucio Monitoring Probes + +Rucio provides a collection of **monitoring probes** that check the different status metrics of the Rucio. +The list of probes is available [here](https://github.com/rucio/probes/tree/master). +There are [common](https://github.com/rucio/probes/tree/master/common) probes shared across experiments, and you can also create your own experiment-specific probes for custom monitoring. + +Rucio provides a prebuilt container on [Docker Hub](https://hub.docker.com/r/rucio/probes) that includes: + +- All dependencies for running the probes. +- A lightweight **Jobber** daemon for scheduling probe execution. +- The full Rucio probe repository. You can add extra probes as well. + +The container can push results either to a **Prometheus Pushgateway** or export data for **Nagios** alerting. + +Probe Execution Workflow is: + +- **Probes** are Python scripts under `rucio/probes/`. +- **Jobber** acts as a cron-like scheduler inside the container. 
+- **Output options:** + - **Prometheus Pushgateway:** for time-series metrics. Alert in prometheus Alert manager or Grafana Alert manager. + - **Nagios:** for exit-code–based alerting. + +Make sure you can your rucio.cfg file mounted to `/opt/rucio/etc/rucio.cfg` inside the container with db options and extra section for prometheus (if choosen) as: +```cfg +[monitor] +prometheus_servers = "https://prometheuserver:port" +prometheus_prefix = "" # default empty +prometheus_labels = "" # default empty +``` + +For adding cron-like scheduling fo each probe in jobber, make sure you have added needed config in [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). Minimal config needed is defined jobs +``` +Example snippet from `.jobber`: +```yaml +version: 1.4 +jobs: + - name: CheckExpiredDIDs + cmd: ./check_expired_dids + time: '*/5 * * * *' # every 5 minutes + onError: Continue + - name: CheckStuckRules + cmd: ./check_stuck_rules + time: '0 * * * *' # hourly + onError: Continue +``` + +```mermaid +%%{init: {'theme': 'base', 'themeVariables': { + 'primaryColor': '#d8e3e7', + 'edgeLabelBackground': '#ffffff', + 'tertiaryColor': '#cdd5d9', + 'fontFamily': 'monospace', + 'primaryBorderColor': '#90a4ae', + 'lineColor': '#90a4ae' +}}}%% +flowchart TB + Probe["Rucio Probes"] + Nagios["Nagios Monitoring Server"] + Alert["Alerting (Email / Teams / Slack)"] + PromPush["Prometheus Pushgateway"] + Prometheus["Prometheus server"] + Grafana["Grafana Dashboards"] + + Probe -- Exit code + stdout --> Nagios + Nagios -- Alert triggered if CRITICAL/WARNING --> Alert + Probe -- Gauge metrics --> PromPush + PromPush --> Prometheus + Prometheus --> Grafana + + classDef probe fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; + classDef nagios fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef alert fill:#f44336,stroke:#b71c1c,color:#fff,font-weight:bold; + classDef prom fill:#009688,stroke:#00695C,color:#fff,font-weight:bold; + classDef graf 
fill:#FF9800,stroke:#E65100,color:#fff,font-weight:bold;
+
+    class Probe probe;
+    class Nagios nagios;
+    class Alert alert;
+    class PromPush,prometheus prom;
+    class Grafana graf;
+```
\ No newline at end of file

From 3baf2481d572dd7c1b005b0d9a6a9da52e006a5d Mon Sep 17 00:00:00 2001
From: Anil Panta
Date: Thu, 13 Nov 2025 15:16:29 -0500
Subject: [PATCH 04/11] minor edits on Probes mermaid #617

---
 docs/operator/monitoring.md | 69 +++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 38 deletions(-)

diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md
index 893a4303c85..3a89730cd99 100644
--- a/docs/operator/monitoring.md
+++ b/docs/operator/monitoring.md
@@ -477,6 +477,36 @@ Rucio provides a prebuilt container on [Docker Hub](https://hub.docker.com/r/ruc
 
 The container can push results either to a **Prometheus Pushgateway** or export data for **Nagios** alerting.
 
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': {
+  'primaryColor': '#d8e3e7',
+  'edgeLabelBackground': '#ffffff',
+  'tertiaryColor': '#cdd5d9',
+  'fontFamily': 'monospace',
+  'primaryBorderColor': '#90a4ae',
+  'lineColor': '#90a4ae'
+}}}%%
+flowchart LR
+    Probe["Rucio Probes (scheduled via Jobber or others)"]
+    Nagios["Nagios"]
+    Prometheus["Prometheus"]
+    Grafana["Grafana Dashboards"]
+
+    Probe -- Exit code + stdout --> Nagios
+    Probe -- Gauge metrics via Pushgateway --> Prometheus
+    Prometheus --> Grafana
+
+    classDef probe fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px;
+    classDef nagios fill:#E53935,stroke:#B71C1C,color:#fff,font-weight:bold;
+    classDef prom fill:#009688,stroke:#00695C,color:#fff,font-weight:bold;
+    classDef graf fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold;
+
+    class Probe probe;
+    class Nagios nagios;
+    class Prometheus prom;
+    class Grafana graf;
+```
+
 Probe Execution Workflow is:
 
 - **Probes** are Python scripts under `rucio/probes/`.
@@ -494,8 +524,7 @@ prometheus_labels = "" # default empty ``` For adding cron-like scheduling fo each probe in jobber, make sure you have added needed config in [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). Minimal config needed is defined jobs -``` -Example snippet from `.jobber`: + ```yaml version: 1.4 jobs: @@ -508,39 +537,3 @@ jobs: time: '0 * * * *' # hourly onError: Continue ``` - -```mermaid -%%{init: {'theme': 'base', 'themeVariables': { - 'primaryColor': '#d8e3e7', - 'edgeLabelBackground': '#ffffff', - 'tertiaryColor': '#cdd5d9', - 'fontFamily': 'monospace', - 'primaryBorderColor': '#90a4ae', - 'lineColor': '#90a4ae' -}}}%% -flowchart TB - Probe["Rucio Probes"] - Nagios["Nagios Monitoring Server"] - Alert["Alerting (Email / Teams / Slack)"] - PromPush["Prometheus Pushgateway"] - Prometheus["Prometheus server"] - Grafana["Grafana Dashboards"] - - Probe -- Exit code + stdout --> Nagios - Nagios -- Alert triggered if CRITICAL/WARNING --> Alert - Probe -- Gauge metrics --> PromPush - PromPush --> Prometheus - Prometheus --> Grafana - - classDef probe fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; - classDef nagios fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; - classDef alert fill:#f44336,stroke:#b71c1c,color:#fff,font-weight:bold; - classDef prom fill:#009688,stroke:#00695C,color:#fff,font-weight:bold; - classDef graf fill:#FF9800,stroke:#E65100,color:#fff,font-weight:bold; - - class Probe probe; - class Nagios nagios; - class Alert alert; - class PromPush,prometheus prom; - class Grafana graf; -``` \ No newline at end of file From 5d01eec198b49b67f9a264831d34f3a47d51c88f Mon Sep 17 00:00:00 2001 From: Anil Panta Date: Mon, 17 Nov 2025 15:20:56 -0500 Subject: [PATCH 05/11] address some comments #617 - some spacing edit - added differet event type list - condense some Traces description. 
--- docs/operator/monitoring.md | 54 ++++++++++++++++++++++++++++--------- 1 file changed, 42 insertions(+), 12 deletions(-) diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md index 3a89730cd99..08e75e1a31d 100644 --- a/docs/operator/monitoring.md +++ b/docs/operator/monitoring.md @@ -9,7 +9,7 @@ Rucio provides multiple monitoring components to observe its internal operations - [**Internal Monitoring**](#internal-monitoring): Observing Rucio server and daemon performance. - [**Transfers, Deletion, and Other Monitoring**](#transfers-deletion-and-other-monitoring): Tracking transfers, deletions, and other Rucio events. -- [**File/Dataset Access Monitoring**](#traces):Using traces to monitor client interactions. +- [**File/Dataset Access Monitoring**](#traces): Using traces to monitor client interactions. - [**Database Dump and Visualization**](#rucio-database-dump): Extracting database-level metrics for visualization. - [**Probes**](#rucio-monitoring-probes): Automated checks and using Nagios or Prometheus Pushgateway. @@ -94,7 +94,7 @@ The used metrics can be found in following links (code search) [Grafana Dashboard JSON](https://github.com/rucio/rucio/blob/master/tools/monitoring/visualization/rucio-internal.json) for Graphite is given here. [Grafana Dashboard JSON](https://github.com/rucio/monitoring-templates/blob/main/prometheus-monitoring/Dashboards/Rucio-Internal.json) for prometheus is given here. -Note: Dashboard example is just for giving some idea, they might need to be tweaked according to your setup and needs. +Note: This example is given as a suggestion, it might need to be tweaked according to your setup and needs. ## Transfers, Deletion and Other Monitoring Rucio generates a large volume of operational events for activities such as: transfers, deletions, rule evaluations, replication tasks, etc., originating from daemons like conveyor, reaper, judge, and others. 
@@ -143,7 +143,7 @@ Different options are shown in figure and described below. 1. Queue-Based Pipelines - Hermes publishes events to a queue or topic in message queue (like ActiveMQ) via STOMP. Multiple consumers can process events independently. Enables real-time, decoupled processing pipelines. These events from ActiveMQ can be consumed by ETL pipelines. These Pipelines allow aggregation, transformation, enrichment, and forwarding to different storage backends of your choice. + Hermes publishes events to a queue or topic in message queue (like ActiveMQ) via STOMP. Multiple consumers can process events independently, which enables real-time, decoupled processing pipelines. These events from ActiveMQ can be consumed by ETL pipelines. These Pipelines allow aggregation, transformation, enrichment, and forwarding to different storage backends of your choice. Example pipeline : ActiveMQ -> Logstash -> OpenSearch @@ -233,7 +233,37 @@ Different options are shown in figure and described below. smtp_keyfile = ``` ### Event Types -Different event types are listed below with their payload structure. +Different event types are created + - Transfers: `transfer-submitted`, `transfer-submission_failed`, `transfer-queued`, `transfer-failed`, `transfer-done` + - Deletions: `deletion-done`, `deletion-not-found`, `deletion-failed` + - Rules: `RULE_OK`, and `RULE_PROGRESS` + - Locks: `DATASETLOCK_OK` + - DIDs: `CREATE_CNT` and `CREATE_DTS` + - Replicas: `INCOMPLETE` and `ERASE` + +:::warning +Above list might not be complete list. +::: + +The structure of events is: +```json +{ + "id": "UUID4", + "services": "", + "event_type": "", + "created_at": "yyyy-MM-dd HH:mm:ss.SSSSSS", + "payload": {}, + "payload_nolimit": {}, +} +``` +where: +- id: UUID string +- event_type: string describing the event_type listed before +- payload: small JSON object (max 4000 chars), structure varies by event type +- payload_nolimit: optional large JSON object. 
Only if payload larger than 4000 characters +- services: optional comma string identifying the service. (elastic, activemq, influx) +- created_at: When the message was created. ISO 8601 timestamps + 1. Transfer Events ``` { @@ -332,7 +362,7 @@ There are other event for replicas, dids etc not stated here. Note: Dashboard example is just for giving some idea, they might need to be tweaked according to your setup and needs. They might be also be on old versions. ## Traces -The traces are sent by the pilots or the rucio clients whenever a file is downloaded/uploaded. These trace events are sent to the Rucio server via the /traces endpoint using HTTPS POST, where they are forwarded to messaging backends such as ActiveMQ via STOMP. ActiveMQ acts as the messaging broker, delivering trace events to Kronos daemon. Any consumer like logstash can the be used for relaying traces to data pipelines for further processing if needed. And then directly or after processing be sent to storage backends such as OpenSearch, Elasticsearch, or InfluxDB, which allow querying, aggregation, and analytics. Finally, visualization tools like Grafana and Kibana can be used. +Rucio clients can send trace events on every file upload or download. These are posted to the /traces endpoint and forwarded to a message broker such as ActiveMQ via STOMP. Messages are consumed by Rucio’s Kronos daemon or by external consumers. This is shown in figure below. Schemas of the traces can be found in [trace.py](https://github.com/rucio/rucio/blob/master/lib/rucio/core/trace.py) which can be used for dashboards. @@ -466,14 +496,13 @@ Note: Dashboard example is just for giving some idea, they might need to be twea ## Rucio Monitoring Probes Rucio provides a collection of **monitoring probes** that check the different status metrics of the Rucio. -The list of probes is available [here](https://github.com/rucio/probes/tree/master). 
-There are [common](https://github.com/rucio/probes/tree/master/common) probes shared across experiments, and you can also create your own experiment-specific probes for custom monitoring.
+A list of [common](https://github.com/rucio/probes/tree/master/common) probes shared across experiments is available. Experiment-specific probes can also be created for custom monitoring, as done by [ATLAS](https://github.com/rucio/probes/tree/master/atlas) and [CMS](https://github.com/rucio/probes/tree/master/cms).
 
 Rucio provides a prebuilt container on [Docker Hub](https://hub.docker.com/r/rucio/probes) that includes:
 
 - All dependencies for running the probes.
 - A lightweight **Jobber** daemon for scheduling probe execution.
-- The full Rucio probe repository. You can add extra probes as well.
+- The full Rucio probe repository. Custom probes can be added by introducing them to your own Rucio instance.
 
 The container can push results either to a **Prometheus Pushgateway** or export data for **Nagios** alerting.
 
@@ -512,10 +541,11 @@ Probe Execution Workflow is:
 - **Probes** are Python scripts under `rucio/probes/`.
 - **Jobber** acts as a cron-like scheduler inside the container.
 - **Output options:**
-  - **Prometheus Pushgateway:** for time-series metrics. Alert in prometheus Alert manager or Grafana Alert manager.
-  - **Nagios:** for exit-code–based alerting.
+  - **Prometheus Pushgateway:** for time-series metrics. Alerts can be added with [Prometheus](https://prometheus.io/docs/alerting/latest/alertmanager/) and [Grafana](https://grafana.com/docs/grafana/latest/alerting/set-up/configure-alertmanager/) alert management.
+  - **Nagios:** Used mainly as a cron-style runner where exit codes trigger Nagios alerts, while probe metrics are sent to Prometheus.
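Since the Nagios integration is driven purely by exit codes, a minimal hypothetical probe (not one of the probes in the repository; thresholds and the measured quantity are made up) only has to map its measurement to the conventional Nagios codes:

```python
# Hypothetical probe sketch: map a measured value to Nagios exit codes.
# Real probes live in the rucio/probes repository and query the database.
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios exit codes

def evaluate(expired_dids, warn_at=1000, crit_at=10000):
    """Choose an exit code from a count of expired DIDs (invented thresholds)."""
    if expired_dids >= crit_at:
        return CRITICAL
    if expired_dids >= warn_at:
        return WARNING
    return OK

count = 42  # placeholder for the real database query result
print(f"expired_dids={count} status={evaluate(count)}")
# A real probe would finish with sys.exit(evaluate(count)) so that
# Jobber/Nagios can react to the exit code.
```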
+
+To make use of the Prometheus functionality, make sure the `rucio.cfg` inside the container with the probes has the extra section and options:
 
-Make sure you can your rucio.cfg file mounted to `/opt/rucio/etc/rucio.cfg` inside the container with db options and extra section for prometheus (if choosen) as:
 ```cfg
 [monitor]
 prometheus_servers = "https://prometheuserver:port"
@@ -523,7 +553,7 @@ prometheus_prefix = "" # default empty
 prometheus_labels = "" # default empty
 ```
 
-For adding cron-like scheduling fo each probe in jobber, make sure you have added needed config in [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). Minimal config needed is defined jobs
+To add cron-like scheduling for each probe in Jobber, make sure you have added the needed config in [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). An example config is given below, running the probes `check_expired_dids` and `check_stuck_rules`. It assumes your probes are in the top-level directory of the container.
 
 ```yaml
 version: 1.4

From d23715ad47c9bdb72b11cb937ba789b87a16c0ac Mon Sep 17 00:00:00 2001
From: Anil Panta
Date: Tue, 9 Dec 2025 13:52:57 -0500
Subject: [PATCH 06/11] remove deprecated links and address comments #617

remove the link to https://github.com/rucio/rucio/tree/master/tools/monitoring
as its being removed https://github.com/rucio/rucio/issues/7375 .
Remove each events json and explain how to inspect them from the db.
Added Hermes delivery format for each options.
--- docs/operator/monitoring.md | 133 ++++++++---------------------------- 1 file changed, 27 insertions(+), 106 deletions(-) diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md index 08e75e1a31d..017e3ead017 100644 --- a/docs/operator/monitoring.md +++ b/docs/operator/monitoring.md @@ -91,10 +91,8 @@ The used metrics can be found in following links (code search) - [Gauge](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.gauge&type=code) - [Timer](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.timer&type=code) -[Grafana Dashboard JSON](https://github.com/rucio/rucio/blob/master/tools/monitoring/visualization/rucio-internal.json) for Graphite is given here. [Grafana Dashboard JSON](https://github.com/rucio/monitoring-templates/blob/main/prometheus-monitoring/Dashboards/Rucio-Internal.json) for prometheus is given here. -Note: This example is given as a suggestion, it might need to be tweaked according to your setup and needs. ## Transfers, Deletion and Other Monitoring Rucio generates a large volume of operational events for activities such as: transfers, deletions, rule evaluations, replication tasks, etc., originating from daemons like conveyor, reaper, judge, and others. @@ -183,6 +181,7 @@ Different options are shown in figure and described below. username = password = ``` + 2. Direct Delivery These options send events directly to storage or alerting systems, bypassing queues. @@ -241,11 +240,8 @@ Different event types are created - DIDs: `CREATE_CNT` and `CREATE_DTS` - Replicas: `INCOMPLETE` and `ERASE` -:::warning -Above list might not be complete list. -::: -The structure of events is: +The structure of messages table which is extracted by Hermes is: ```json { "id": "UUID4", @@ -264,102 +260,30 @@ where: - services: optional comma string identifying the service. (elastic, activemq, influx) - created_at: When the message was created. ISO 8601 timestamps -1. 
Transfer Events - ``` - { - created_at: when the message was created (yyyy-MM-ddTHH:mm:ss.SSSSSS) - event_type: type of this event (transfer-submitted, \ - transfer-submission_failed, transfer-queued, transfer-failed, \ - transfer-done) - payload: { - account: account submitting the request - activity: activity of the request - bytes: size of the transferred file (byte) - checksum-adler: checksum using adler algorithm - checksum-md5: checksum using md5 algorithm - created_at: Time when the message was created (yyyy-MM-dd HH:mm:ss.SSSSSS) - dst-rse: destination rse - dst-type: type of destination rse (disk, tape) - dst-url: destination url of the transferred file - duration: duration of the transfer (second) - event_type: type of this event (transfer-submitted, \ - transfer-submission_failed, transfer-queued, \ - transfer-failed, transfer-done) - file-size: same as bytes - guid: guid of the transfer - name: name of transferred file - previous-request-id: id of previous request - protocol: transfer protocol - reason: reason of the failure - request-id: id of this request - scope: scope of the transferred data - src-rse: source rse - src-type: type of source rse (disk, tape) - src-url: source file url - started_at: start time of the transfer (yyyy-MM-dd HH:mm:ss.SSSSSS) - submitted_at: submission time of the transfer (yyyy-MM-dd HH:mm:ss.SSSSSS) - tool-id: id of the transfer tool in rucio (rucio-conveyor) - transfer-endpoint: endpoint holder of the transfer (fts) - transfer-id: uuid of this transfer - transfer-link: link of this transfer (in form of fts url) - transferred_at: done time of this transfer - } - } - ``` -2. 
Deletion Event - ``` - { - created_at: when the message was created (yyyy-MM-ddTHH:mm:ss.SSSSSS) - event_type: type of this event (deletion-done,deletion-failed, deletion-not-found) - payload: { - scope: scope of the deleted replica - name: name of the deleted replica - rse: rse holding the deleted replica - file-size: size of the file - bytes: size of the file - url: url of the file - duration: duration of the deletion - protocol: prococol used in the deletion - reason: reason of the failure - } - } - ``` -3. Rule Event - ``` - created_at: when the message was created (yyyy-MM-ddTHH:mm:ss.SSSSSS) - event_type: 'RULE_OK' or 'RULE_PROGRESS' - payload:{ - 'scope': scope.external, - 'name': name, - 'rule_id': rule_id, # only for RULE_OK and RULE_PROGRESS - 'vo': vo # only if not default - 'progress': int #replication progress # only for RULE_PROGRESS - 'dataset_name': dataset_name, # only for LOST - 'dataset_scope': dataset_scope # only for LOST - } - ``` -4. Dataset Lock Event - ``` - { - created_at: when the message was created (yyyy-MM-ddTHH:mm:ss.SSSSSS) - event_type: 'DATASETLOCK_OK' - payload: { - 'scope': did_scope, - 'name': did_name, - 'rse': rse, - 'rse_id': rse_id, - 'rule_id': rule_id - 'vo': vo if not default - } - } - ``` -There are other event for replicas, dids etc not stated here. -### Dashboard -[Kibana Dashboard](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization) example was given. -[Grafana Dashboard](https://github.com/rucio/monitoring-templates/blob/main/message-monitoring/Dashboards/Rucio-Transfer.json) for transfer for Elasticsearch/OpenSearch example given. +To quickly inspect the payloads of these event types: +```sql +SELECT id, created_at, payload +FROM messages +WHERE event_type = '' +ORDER BY created_at DESC +LIMIT 2; +``` +replace event_type with actual name that you want to inspect. We can also check `messages_history` table. 
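Once a row is fetched, the payload is plain JSON and can be decoded directly. Below is a minimal Python sketch; the sample values are invented for illustration, while the field names `bytes`, `duration`, `src-rse`, and `dst-rse` follow the transfer-done payload:

```python
import json

# A hypothetical row payload as the query above might return it; values are
# invented, field names follow the transfer-done payload.
row = '{"event_type": "transfer-done", "payload": {"bytes": 1073741824, "duration": 512, "src-rse": "SITE_A_DISK", "dst-rse": "SITE_B_DISK"}}'

msg = json.loads(row)
p = msg["payload"]
# Derive a rough throughput figure from the documented fields.
rate_mbps = p["bytes"] / p["duration"] / 1e6  # MB/s
print(f'{p["src-rse"]} -> {p["dst-rse"]}: {rate_mbps:.1f} MB/s')
```

The same pattern applies to any of the event types listed above.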
+ +### Format of Messages Delivered by Hermes +The final format of the message is determined by the destination service, as Hermes transforms the raw database message into the required wire protocol for external systems. -Note: Dashboard example is just for giving some idea, they might need to be tweaked according to your setup and needs. They might be also be on old versions. +- ActiveMQ (STOMP Message): The body is a streamlined JSON object containing only event_type, payload, and created_at. The message uses STOMP headers to set the event_type and flag the message as persistent. + +- Elasticsearch / OpenSearch (Bulk API): Hermes sends the raw database JSON message (including id and services) as a document, wrapped in the two-line Elasticsearch Bulk API format (i.e., {"index":{}} followed by the source JSON). + +- InfluxDB (Line Protocol): Hermes performs on-the-fly aggregation of transfers and deletions, counting successes/failures and bytes. It does not send the raw event JSON. The final format is the InfluxDB Line Protocol, which consists of a single text line combining the measurement, tags (e.g., RSE, activity), fields (e.g., nb_done=10), and a timestamp. + + +Example Grafana dashboard for transfer is provided [here](https://github.com/rucio/monitoring-templates/blob/main/message-monitoring/Dashboards/Rucio-Transfer.json) + +> **Note**: Please make changes to dashboard according to your setup and needs. ## Traces Rucio clients can send trace events on every file upload or download. These are posted to the /traces endpoint and forwarded to a message broker such as ActiveMQ via STOMP. Messages are consumed by Rucio’s Kronos daemon or by external consumers. @@ -410,9 +334,9 @@ flowchart TB ## Rucio database dump Database-level monitoring extracts different information directly from the Rucio database. This includes insights such as RSE usage statistics, account quotas, and other metadata relevant to experiments. 
These data are periodically queried and exported to external storage backends for visualization and long-term monitoring.

-Some Logstash pipeline definitions are given [here](https://github.com/rucio/rucio/tree/master/tools/monitoring/logstash-pipeline). These example pipelines use the Logstash JDBC input plugin to connect to the Rucio PostgreSQL database, execute SQL queries, and extract structured data periodically. The retrieved records are then sent to Elasticsearch but can be changed to other storage backends such as OpenSearch. The following diagram shows the high-level flow for database-level monitoring using Logstash.
+Some example Logstash pipeline definitions are given [here](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Pipelines/pipelines.yml). These example pipelines use the Logstash JDBC input plugin to connect to the Rucio PostgreSQL database, execute SQL queries, and extract structured data periodically. The retrieved records are then sent to Elasticsearch but can be changed to other storage backends such as OpenSearch. The following diagram shows the high-level flow for database-level monitoring using Logstash.

-Note: While the examples above use Logstash, you can replace Logstash with other data ingestion options depending on your requirements.
+> **Note**: While this example uses Logstash, you can use other data collectors such as [fluentd](https://www.fluentd.org/) with its [SQL plugin](https://github.com/fluent/fluent-plugin-sql), depending on your requirements.

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {
@@ -488,10 +412,7 @@ Few points:

- filter: This is optional. It helps in preprocessing your data before indexing

-Some [Kibana dashboard](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization/db_dump) example given.
-[Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) example for rse given.
-
-Note: Dashboard example is just for giving some idea, they might need to be tweaked according to your setup and needs. They might be also be on old versions.
+[Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) example for rse given.

## Rucio Monitoring Probes

From 5fc08e36c20ed4cec3f52aba9699cbb434c74846 Mon Sep 17 00:00:00 2001
From: Anil Panta
Date: Tue, 9 Dec 2025 14:10:32 -0500
Subject: [PATCH 07/11] Address cron expression errors in monit docs #617

---
 docs/operator/monitoring.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md
index 017e3ead017..ec0c7562404 100644
--- a/docs/operator/monitoring.md
+++ b/docs/operator/monitoring.md
@@ -276,7 +276,7 @@ The final format of the message is determined by the destination service, as Her

- ActiveMQ (STOMP Message): The body is a streamlined JSON object containing only event_type, payload, and created_at. The message uses STOMP headers to set the event_type and flag the message as persistent.

-- Elasticsearch / OpenSearch (Bulk API): Hermes sends the raw database JSON message (including id and services) as a document, wrapped in the two-line Elasticsearch Bulk API format (i.e., {"index":{}} followed by the source JSON).
+- Elasticsearch / OpenSearch (Bulk API): Hermes sends the raw database JSON message (including id and services) as a document using Bulk API format (via a POST request).

- InfluxDB (Line Protocol): Hermes performs on-the-fly aggregation of transfers and deletions, counting successes/failures and bytes. It does not send the raw event JSON. 
The final format is the InfluxDB Line Protocol, which consists of a single text line combining the measurement, tags (e.g., RSE, activity), fields (e.g., nb_done=10), and a timestamp. From c33950d45992235da380f26246fda09171d6efff Mon Sep 17 00:00:00 2001 From: Anil Panta Date: Fri, 12 Dec 2025 13:54:34 -0500 Subject: [PATCH 08/11] address comments on monitoring documentation #617 Capitalize some words everywhere. Clarify some sentences. Adding backticks to some references to code or commands or types. --- docs/operator/monitoring.md | 70 ++++++++++++++++++------------------- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md index ec0c7562404..a65c216c3cb 100644 --- a/docs/operator/monitoring.md +++ b/docs/operator/monitoring.md @@ -50,15 +50,15 @@ flowchart TB %% Style definitions classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; - classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; - classDef graphite fill:#555555,stroke:#333333,color:#fff,font-weight:bold; %% Dark gray for Graphite - classDef prometheus fill:#009688,stroke:#00695C,color:#fff,font-weight:bold; %% Teal, distinct + classDef Grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef Graphite fill:#555555,stroke:#333333,color:#fff,font-weight:bold; %% Dark gray for Graphite + classDef Prometheus fill:#009688,stroke:#00695C,color:#fff,font-weight:bold; %% Teal, distinct %% Apply styles class A1 mono; - class G1 graphite; - class P1 prometheus; - class GF grafana; + class G1 Graphite; + class P1 Prometheus; + class GF Grafana; ``` There are two options: @@ -91,7 +91,7 @@ The used metrics can be found in following links (code search) - [Gauge](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.gauge&type=code) - [Timer](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.timer&type=code) -[Grafana Dashboard 
JSON](https://github.com/rucio/monitoring-templates/blob/main/prometheus-monitoring/Dashboards/Rucio-Internal.json) for prometheus is given here. +[Grafana Dashboard JSON](https://github.com/rucio/monitoring-templates/blob/main/prometheus-monitoring/Dashboards/Rucio-Internal.json) for Prometheus is given here. ## Transfers, Deletion and Other Monitoring @@ -124,7 +124,7 @@ flowchart TB OS1 --> KB classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; - classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef Grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; classDef mq fill:#f69f03,stroke:#b35c00,color:#fff,font-weight:bold; classDef etl fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold; @@ -133,7 +133,7 @@ flowchart TB class A2,A3 mono; class Q1 mq; class OS1 OpenSearch; - class KB grafana; + class KB Grafana; class ETL etl; ``` @@ -253,12 +253,12 @@ The structure of messages table which is extracted by Hermes is: } ``` where: -- id: UUID string -- event_type: string describing the event_type listed before -- payload: small JSON object (max 4000 chars), structure varies by event type -- payload_nolimit: optional large JSON object. Only if payload larger than 4000 characters -- services: optional comma string identifying the service. (elastic, activemq, influx) -- created_at: When the message was created. ISO 8601 timestamps +- `id`: UUID string +- `event_type`: string describing the event_type listed before +- `payload`: small JSON object (max 4000 chars), structure varies by event type +- `payload_nolimit`: optional large JSON object. Only if payload larger than 4000 characters +- `services`: string identifying the service. (elastic, activemq, influx) +- `created_at`: When the message was created. 
ISO 8601 timestamps To quickly inspect the payloads of these event types: @@ -269,26 +269,26 @@ WHERE event_type = '' ORDER BY created_at DESC LIMIT 2; ``` -replace event_type with actual name that you want to inspect. We can also check `messages_history` table. +replace `event_type` with actual name that you want to inspect. We can also check `messages_history` table. ### Format of Messages Delivered by Hermes The final format of the message is determined by the destination service, as Hermes transforms the raw database message into the required wire protocol for external systems. -- ActiveMQ (STOMP Message): The body is a streamlined JSON object containing only event_type, payload, and created_at. The message uses STOMP headers to set the event_type and flag the message as persistent. +- ActiveMQ (STOMP Message): The body is a streamlined JSON object containing only `event_type`, `payload`, and `created_at`. The message uses STOMP headers to set the event_type and flag the message as persistent. -- Elasticsearch / OpenSearch (Bulk API): Hermes sends the raw database JSON message (including id and services) as a document using Bulk API format (via a POST request). +- Elasticsearch / OpenSearch (Bulk API): Hermes sends the raw database JSON message (including `id` and `services`) as a document using Bulk API format (via a POST request). -- InfluxDB (Line Protocol): Hermes performs on-the-fly aggregation of transfers and deletions, counting successes/failures and bytes. It does not send the raw event JSON. The final format is the InfluxDB Line Protocol, which consists of a single text line combining the measurement, tags (e.g., RSE, activity), fields (e.g., nb_done=10), and a timestamp. +- InfluxDB (Line Protocol): Hermes performs on-the-fly aggregation of transfers and deletions, counting successes/failures and bytes. It does not send the raw event JSON. 
The final format is the InfluxDB Line Protocol, which consists of a single text line combining the measurement, tags (e.g., RSE, activity), fields (e.g., `nb_done=10`), and a timestamp.


An example Grafana dashboard for transfers is provided [here](https://github.com/rucio/monitoring-templates/blob/main/message-monitoring/Dashboards/Rucio-Transfer.json).

-> **Note**: Please make changes to dashboard according to your setup and needs.
+> **Note**: Please make changes to the dashboard according to your setup and needs.

## Traces
-Rucio clients can send trace events on every file upload or download. These are posted to the /traces endpoint and forwarded to a message broker such as ActiveMQ via STOMP. Messages are consumed by Rucio’s Kronos daemon or by external consumers.
+Rucio clients can send trace events on every file upload or download. These are posted to the `/traces` endpoint and forwarded to a message broker such as ActiveMQ via STOMP. Messages are consumed by Rucio’s Kronos daemon or by external consumers.

-This is shown in figure below. Schemas of the traces can be found in [trace.py](https://github.com/rucio/rucio/blob/master/lib/rucio/core/trace.py) which can be used for dashboards.
+This is shown in the figure below. The schemas of the traces can be found in [`trace.py`](https://github.com/rucio/rucio/blob/master/lib/rucio/core/trace.py) and can be used when building dashboards.
```mermaid @@ -319,7 +319,7 @@ flowchart TB OS1 --> KB classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; - classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef Grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; classDef mq fill:#f69f03,stroke:#b35c00,color:#fff,font-weight:bold; classDef etl fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold; @@ -327,7 +327,7 @@ flowchart TB class C1,RS,KR mono; class Q1 mq; class OS1 OpenSearch; - class KB grafana; + class KB Grafana; class ETL etl; ``` @@ -360,15 +360,15 @@ flowchart TB OS2 --> GD classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; - classDef grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef Grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; - classDef logstash fill:#b34700,stroke:#b34700,color:#fff,font-weight:bold; + classDef Logstash fill:#b34700,stroke:#b34700,color:#fff,font-weight:bold; class DB1,LS mono; - class LS logstash; + class LS Logstash; class OS2 OpenSearch; - class GD grafana; + class GD Grafana; ``` A typical Logstash configuration consists of three sections — input, filter, and output. For example, the input section defines the PostgreSQL connection and SQL query to fetch data: @@ -406,13 +406,13 @@ output { } ``` Few points: -- jdbc_driver_library: Can downloaded from [jdbc.postgresql.org](https://jdbc.postgresql.org/), choose the version that you want to use and enable that in logstash. -- schedule: Defines how often the query runs (Cron-like syntax). -- output: Defines where the extracted data are delivered. In most deployments, these are indexed into OpenSearch or Elasticsearch for analytics dashboards in Grafana or Kibana: -- filter: This is optional. 
It helps in preprocessing your data before indexing +- `jdbc_driver_library`: Can be downloaded from [jdbc.postgresql.org](https://jdbc.postgresql.org/), choose the version that you want to use and enable that in Logstash. +- `schedule`: Defines how often the query runs (Cron-like syntax). +- `output`: Defines where the extracted data are delivered. In most deployments, these are indexed into OpenSearch or Elasticsearch for analytics dashboards in Grafana or Kibana. +- `filter`: This is optional. It helps in preprocessing your data before indexing -[Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) example for rse given. +[Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) example for RSE given. ## Rucio Monitoring Probes @@ -437,7 +437,7 @@ The container can push results either to a **Prometheus Pushgateway** or export 'lineColor': '#90a4ae' }}}%% flowchart LR - Probe["Rucio Probes (Schedule via Jabber or others)"] + Probe["Rucio Probes (Schedule via Jobber or others)"] Nagios["Nagios"] Prometheus["Prometheus"] Grafana["Grafana Dashboards"] @@ -465,7 +465,7 @@ Probe Execution Workflow is: - **Prometheus Pushgateway:** for time-series metrics. Alerts can be added with [Prometheus](https://prometheus.io/docs/alerting/latest/alertmanager/) and [Grafana](https://grafana.com/docs/grafana/latest/alerting/set-up/configure-alertmanager/) alert management. - **Nagios:** Used mainly as a cron-style runner where exit codes trigger Nagios alerts, while probe metrics are sent to Prometheus. 
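At the wire level, a probe push to the Pushgateway is simply an HTTP `PUT` of metrics in the Prometheus text exposition format. The following stdlib-only sketch builds such a request; the gateway host, job name, and metric name are placeholders, and a real probe would more likely use the `prometheus_client` library:

```python
from urllib.request import Request

def build_push(gateway: str, job: str, metrics: dict) -> Request:
    """Build the HTTP request a probe would send to a Prometheus Pushgateway.

    The Pushgateway accepts text-exposition-format metrics via PUT or POST
    to /metrics/job/<job>. Host, job, and metric names here are placeholders.
    """
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return Request(f"http://{gateway}/metrics/job/{job}",
                   data=body.encode(),
                   headers={"Content-Type": "text/plain"},
                   method="PUT")

req = build_push("pushgateway.example.org:9091", "check_expired_dids",
                 {"rucio_probe_expired_dids": 42})
print(req.full_url)
print(req.data.decode(), end="")
```

Sending the request is then a single `urllib.request.urlopen(req)` call once a real Pushgateway is reachable.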
-To make use of prometheus functionality, make sure your `rucio.cfg` inside the container with the probes has the extra sections and options: +To make use of Prometheus functionality, make sure your `rucio.cfg` inside the container with the probes has the extra sections and options: ```cfg [monitor] From 7e998bc2afc3ea87cec7d2ff327f28717226928c Mon Sep 17 00:00:00 2001 From: Anil Panta Date: Mon, 2 Feb 2026 08:53:26 -0500 Subject: [PATCH 09/11] fix some typo in monitoring documentation #617 --- docs/operator/monitoring.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md index a65c216c3cb..84741374657 100644 --- a/docs/operator/monitoring.md +++ b/docs/operator/monitoring.md @@ -397,7 +397,7 @@ filter { output { elasticsearch { - hosts => ["http://elasticsearh:9200"] + hosts => ["http://elasticsearch:9200"] action => "index" index => "rucio_rse" user => "elastic" @@ -474,7 +474,7 @@ prometheus_prefix = "" # default empty prometheus_labels = "" # default empty ``` -For adding cron-like scheduling fo each probe in jobber, make sure you have added needed config in [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). An example config is given below, running the probes `check_expired_dids` and `check_stuck_rules`. This config assumes your probes are in the top level directory of the container. +For adding cron-like scheduling for each probe in jobber, make sure you have added needed config in [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). An example config is given below, running the probes `check_expired_dids` and `check_stuck_rules`. This config assumes your probes are in the top level directory of the container. 
```yaml
version: 1.4

From 6e43ba08865d31b365ff1d1c89dd95fd23506b1b Mon Sep 17 00:00:00 2001
From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date: Mon, 2 Feb 2026 13:56:04 +0000
Subject: [PATCH 10/11] [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---
 docs/operator/monitoring.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md
index 84741374657..7761cf58b90 100644
--- a/docs/operator/monitoring.md
+++ b/docs/operator/monitoring.md
@@ -64,9 +64,9 @@ flowchart TB
There are two options:

1. Graphite
-
+
Metrics are pushed to a Graphite server.
-
+
```cfg
[monitor]
# specify the hostname for carbon server
@@ -91,7 +91,7 @@ The used metrics can be found in following links (code search)
- [Gauge](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.gauge&type=code)
- [Timer](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.timer&type=code)

-[Grafana Dashboard JSON](https://github.com/rucio/monitoring-templates/blob/main/prometheus-monitoring/Dashboards/Rucio-Internal.json) for Prometheus is given here. 
+[Grafana Dashboard JSON](https://github.com/rucio/monitoring-templates/blob/main/prometheus-monitoring/Dashboards/Rucio-Internal.json) for Prometheus is given here.

## Transfers, Deletion and Other Monitoring

@@ -178,12 +178,12 @@ Different options are shown in figure and described below.
# SSL port (used if use_ssl=True)
port = 61614
# ActiveMQ username/password (used if use_ssl=False)
-  username = 
+  username =
 password =
```

2. Direct Delivery
-
+
These options send events directly to storage or alerting systems, bypassing queues. Hermes can write events straight to Elasticsearch, OpenSearch, or InfluxDB. In addition, it can also deliver events via email, which supports custom SMTP servers, credentials, and SSL/TLS.
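To make the InfluxDB case concrete, the aggregated counters are serialized as Line Protocol. Below is a simplified sketch of that encoding; the measurement, tag, and field names are illustrative, not necessarily what Hermes emits:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    # InfluxDB Line Protocol: measurement,tag=val,... field=val,... timestamp
    # Simplified: the real protocol also marks integer fields with an "i"
    # suffix and escapes special characters in tag/field values.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"

line = to_line_protocol(
    "transfers",                                       # measurement (illustrative)
    {"rse": "SITE_B_DISK", "activity": "Production"},  # tags
    {"nb_done": 10, "bytes_done": 1073741824},         # fields
    1700000000000000000,                               # nanosecond timestamp
)
print(line)
# transfers,activity=Production,rse=SITE_B_DISK bytes_done=1073741824,nb_done=10 1700000000000000000
```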
@@ -228,8 +228,8 @@ Different options are shown in figure and described below.
  smtp_password = my-smtp-pass
  smtp_usessl = False
  smtp_usetls = True
-  smtp_certfile = 
-  smtp_keyfile = 
+  smtp_certfile =
+  smtp_keyfile =
  ```
### Event Types
Different event types are created
@@ -271,7 +271,7 @@ LIMIT 2;
```
Replace the empty `event_type` string with the actual event type you want to inspect. The `messages_history` table can be queried in the same way.

-### Format of Messages Delivered by Hermes 
+### Format of Messages Delivered by Hermes
The final format of the message is determined by the destination service, as Hermes transforms the raw database message into the required wire protocol for external systems.

- ActiveMQ (STOMP Message): The body is a streamlined JSON object containing only `event_type`, `payload`, and `created_at`. The message uses STOMP headers to set the `event_type` and flag the message as persistent.

@@ -412,7 +412,7 @@ Few points:
- `filter`: This is optional. It helps in preprocessing your data before indexing.

-[Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) example for RSE given. 
+An example [Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) for RSE monitoring is given.

## Rucio Monitoring Probes

@@ -474,7 +474,7 @@ prometheus_prefix = "" # default empty
prometheus_labels = "" # default empty
```

-For adding cron-like scheduling for each probe in jobber, make sure you have added needed config in [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). An example config is given below, running the probes `check_expired_dids` and `check_stuck_rules`. This config assumes your probes are in the top level directory of the container.
+For adding cron-like scheduling for each probe in Jobber, make sure you have added the needed config in [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). 
An example config is given below, running the probes `check_expired_dids` and `check_stuck_rules`. This config assumes your probes are in the top level directory of the container. ```yaml version: 1.4 From 52c6d0d01ab45be81966ba6ed4e1e5d75afd4f6c Mon Sep 17 00:00:00 2001 From: Anil Panta Date: Mon, 2 Feb 2026 09:12:41 -0500 Subject: [PATCH 11/11] fix codespell on monitoring docs #617 --- docs/operator/monitoring.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md index 7761cf58b90..c86b343576c 100644 --- a/docs/operator/monitoring.md +++ b/docs/operator/monitoring.md @@ -348,7 +348,7 @@ Some example Logstash pipeline definitions are given [here](https://github.com/r 'lineColor': '#90a4ae' }}}%% flowchart TB - subgraph DB["**Database Level Accouting/Monitoring**"] + subgraph DB["**Database Level Accounting/Monitoring**"] DB1[("Rucio DB")] LS["Logstash JDBC Input"] OS2["OpenSearch / Elasticsearch"]