diff --git a/docs/operator/monitoring.md b/docs/operator/monitoring.md
index 78b9d2f24b9..c86b343576c 100644
--- a/docs/operator/monitoring.md
+++ b/docs/operator/monitoring.md
@@ -3,429 +3,488 @@ title: Monitoring
 sidebar_label: Monitoring
 ---
-There are three different monitoring components:
+# Rucio Monitoring Guide
+
+Rucio provides multiple monitoring components to observe its internal operations, data transfers, file access, and database state. These components include:
+
+- [**Internal Monitoring**](#internal-monitoring): Observing Rucio server and daemon performance.
+- [**Transfers, Deletion, and Other Monitoring**](#transfers-deletion-and-other-monitoring): Tracking transfers, deletions, and other Rucio events.
+- [**File/Dataset Access Monitoring**](#traces): Using traces to monitor client interactions.
+- [**Database Dump and Visualization**](#rucio-database-dump): Extracting database-level metrics for visualization.
+- [**Probes**](#rucio-monitoring-probes): Automated checks reporting to Nagios or a Prometheus Pushgateway.
-
-- Rucio internal monitoring using Graphite/Grafana
-- Transfer monitoring using the messages sent by Hermes
-- File/Dataset Access monitoring using the traces


## Internal Monitoring

This is to monitor the internals of Rucio servers and daemons, e.g., submission
rate of the conveyor, state of conveyor queues, reaper deletion rate, server
-response times, server active session, etc. We use Graphite[^1] for this. It's
-easy to setup and then you have to point your Rucio instance to the Graphite
-server using the "carbon_server" options in the "monitor" section in
-etc/rucio.cfg.
-
-The different Rucio components will then send metrics using those "record"
-functions you will find all over the code. Graphite has a built-in web interface
-to show graphs but more comfortable to use is the Grafana[^2] tool.
-
-The internal monitoring functions are defined in core/monitor.py, it includes:
-
-1) record_counter. This is to send the StatsD counter metrics. Counters are the
-most basic and default type. They are treated as a count of a type of event per
-second, and are, in Graphite, typically averaged over one minute. That is, when
-looking at a graph, you are usually seeing the average number of events per
-second during a one-minute period.
-
-2) record_timer. Timers are meant to track how long something took. They are an
-invaluable tool for tracking application performance. The statsd server collects
-all timers under the stats.timers prefix, and will calculate the lower bound,
-mean, 90th percentile, upper bound, and count of each timer for each period (by
-the time you see it in Graphite, that’s usually per minute).
-
-3) record_timer_block. This is the same to record_timer, just for simple using,
-to calculate timer of a certain code block.
-
-4) record_gauge. Gauges are a constant data type. They are not subject to
-averaging, and they don’t change unless you change them. That is, once you set a
-gauge value, it will be a flat line on the graph until you change it again.
-
-### Set up the Rucio internal monitoring dashboard
-
-Set up a Rucio server for development
-
-```bash
-git clone https://github.com/rucio/rucio.git
-docker-compose --file etc/docker/dev/docker-compose.yml up --detach
+response times, active server sessions, etc. Metrics are typically categorized as:
+
+- Counters – measure the number of events (e.g., requests processed).
+- Timers/Histograms – measure durations of operations (e.g., rule evaluation time, transfer submission time).
+- Gauges – measure values that can go up and down (e.g., number of active sessions, queue sizes).
+
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': {
+    'primaryColor': '#d8e3e7',
+    'edgeLabelBackground': '#ffffff',
+    'tertiaryColor': '#cdd5d9',
+    'fontFamily': 'monospace',
+    'primaryBorderColor': '#90a4ae',
+    'lineColor': '#90a4ae'
+}}}%%
+flowchart TB
+    subgraph RucioInternal["**Internal Monitoring**"]
+        A1["Rucio Servers & Daemons"]
+
+        G1["Graphite"]
+        P1["Prometheus"]
+        GF["Grafana"]
+    end
+
+    %% Edges
+    A1 -- "push" --> G1
+    A1 -- "pull or push" --> P1
+    G1 --> GF
+    P1 --> GF
+
+    %% Style definitions
+    classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px;
+    classDef Grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold;
+    %% Dark gray for Graphite
+    classDef Graphite fill:#555555,stroke:#333333,color:#fff,font-weight:bold;
+    %% Teal, distinct
+    classDef Prometheus fill:#009688,stroke:#00695C,color:#fff,font-weight:bold;
+
+    %% Apply styles
+    class A1 mono;
+    class G1 Graphite;
+    class P1 Prometheus;
+    class GF Grafana;
```
-The command will fire up various containers such as `dev-rucio-1`, `dev-graphite-1`, and
-`dev-activemq-1`. `dev-graphite-1` is the one collecting internal
-metrics from Rucio. The configurations of Rucio internal metrics sender are
-defined under the [monitor] section of rucio.cfg. Change the carbon_server and
-carbon_port according to your setting
-
-```toml
-[monitor]
-carbon_server = graphite
-carbon_port = 8125
-user_scope = docker
-```
-
-The Graphite builtin web page is on port 80 of the host. To use Grafana, setup
-Grafana and enable the graphite data source
-
-```bash
-docker pull grafana/grafana
-docker run --detach --name grafana --publish 3000:3000 grafana/grafana
```
+There are two options:
+
+1. Graphite
+
+   Metrics are pushed to a Graphite server.
+
+   ```cfg
+   [monitor]
+   # specify the hostname of the carbon server
+   carbon_server =
+   carbon_port = 8125
+   user_scope = rucio
+   ```
+
+2. Prometheus
+
+   Metrics can be scraped by Prometheus or, for short-lived processes, pushed to a Prometheus Pushgateway. Metrics are exposed for scraping on an HTTP endpoint at `/metrics`. For deployments using multiple threads or Apache MPM subprocesses, multiprocess-safe metrics are supported via the PROMETHEUS_MULTIPROC_DIR environment variable.
+
+   ```cfg
+   [monitor]
+   # Enable Prometheus metrics
+   enable_metrics = True
+   # Port for the Prometheus HTTP server
+   metrics_port = 8080
+   ```
+
+The metrics used throughout the code can be found via the following code searches:
+
+- [Counter](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.Counter&type=code)
+- [Gauge](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.gauge&type=code)
+- [Timer](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.timer&type=code)
+
+A [Grafana dashboard JSON](https://github.com/rucio/monitoring-templates/blob/main/prometheus-monitoring/Dashboards/Rucio-Internal.json) for Prometheus is also available.
+
+
+## Transfers, Deletion and Other Monitoring
+
+Rucio generates a large volume of operational events for activities such as transfers, deletions, rule evaluations, and replication tasks, originating from daemons like the conveyor, reaper, and judge.
+
+These events are collected and delivered by the Hermes daemon, which can forward them to message queues or storage backends for further processing, analysis, and storage. Depending on your storage backend, you can use visualization software such as Grafana or Kibana.
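+For illustration, a downstream consumer can subscribe to the broker and handle Hermes events as they arrive. Below is a minimal sketch using the `stomp.py` library (8.x API); the broker address, credentials, and queue name are placeholders that must match your `[messaging-hermes]` settings.
+
+```python
+import json
+import time
+
+import stomp  # stomp.py 8.x
+
+
+class HermesListener(stomp.ConnectionListener):
+    def on_message(self, frame):
+        # Each frame body is one Hermes event: event_type, payload, created_at
+        event = json.loads(frame.body)
+        print(event["event_type"], event["created_at"])
+
+    def on_error(self, frame):
+        print("broker error:", frame.body)
+
+
+# Placeholder broker, credentials, and queue -- adjust to your deployment
+conn = stomp.Connection([("amq1.example.com", 61613)])
+conn.set_listener("", HermesListener())
+conn.connect("user", "password", wait=True)
+conn.subscribe(destination="/queue/rucio", id=1, ack="auto")
+
+while True:  # keep the consumer process alive
+    time.sleep(60)
+```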
+ +```mermaid +%%{init: {'theme': 'base', 'themeVariables': { + 'primaryColor': '#d8e3e7', + 'edgeLabelBackground': '#ffffff', + 'tertiaryColor': '#cdd5d9', + 'fontFamily': 'monospace', + 'primaryBorderColor': '#90a4ae', + 'lineColor': '#90a4ae' +}}}%% +flowchart TB + subgraph RucioTransfer["**Transfer, Deletion & Other Monitoring**"] + A2["Rucio Daemon: Hermes"] + Q1["ActiveMQ"] + ETL["ETL / Data Pipeline"] + OS1["OpenSearch / Elasticsearch / InfluxDB"] + KB["Grafana / Kibana"] + end + + A2 -- direct write --> OS1 + A2 -- publish(STOMP) --> Q1 + Q1 -- consume --> ETL + ETL --> OS1 + OS1 --> KB + + classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; + classDef Grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; + classDef mq fill:#f69f03,stroke:#b35c00,color:#fff,font-weight:bold; + classDef etl fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold; + + + class A2,A3 mono; + class Q1 mq; + class OS1 OpenSearch; + class KB Grafana; + class ETL etl; ``` -The Grafana web-portal is on port 3000 of the host. Add one data source of the -type Graphite, choose access method to "Browser" and set URL to [http://ip:80](http://ip:80), -where ip is the address of the server hosting the Graphite container -`dev-graphite-1`. - -A set of pre-defined Grafana Rucio internal plots is provided -[here](https://github.com/rucio/rucio/blob/master/tools/monitoring/visualization/rucio-internal.json). -Users could import them directly into Grafana. - -### The list of Rucio internal metrics - -1) Core - -```text -credential.signswift, credential.signs3 (timer) -trace.nongrid_trace -core.request.* (counter) -core.request.archive_request.* (timer) -rule.add_rule, rule.add_rule.*, rule.delete_rule, rule.evaluate_did_detach, rule.evaluate_did_attach.(timer) -trace.trace (counter) -``` - -2) Transfertool - -```text -transfertool.fts3.delegate_proxy.success.*, \ - transfertool.fts3.delegate_proxy.fail.* (timer) -transfertool.fts3.submit_transfer.[externalhost] (timer) -transfertool.fts3.[externalhost].submission.success/failure (counter) -transfertool.fts3.[externalhost].cancel.success/failure (counter) -transfertool.fts3.[externalhost].update_priority.success/failure (counter) -transfertool.fts3.[externalhost].query.success/failure (counter) -transfertool.fts3.[externalhost].whoami.failure (counter) -transfertool.fts3.[externalhost].Version.failure (counter) -transfertool.fts3.[externalhost].query_details.failure (counter) -transfertool.fts3.[externalhost].bulk_query.failure (counter) -transfertool.fts3.[externalhost].query_latest.failure (counter) -transfertool.fts3myproxy.[externalhost].submission.success/failure (counter) -``` - -3) Judge - -```text -rule.judge.exceptions.* -``` - -4) Transmogrified - -```text -transmogrifier.addnewrule.errortype.* (counter) -transmogrifier.addnewrule.activity.* (counter) -transmogrifier.did.*.processed (counter) -``` - -5) Tracer - -```text -daemons.tracer.kronos.* (counter) -``` - -6) Reaper - -```text -reaper.list_unlocked_replicas, reaper.delete_replicas (timer) -reaper.deletion.being_deleted, reaper.deletion.done (counter) -daemons.reaper.delete.[scheme].[rse] (timer) -``` - -7) Undertaker - -```text -undertaker.delete_dids, undertaker.delete_dids.exceptions.LocksDetected (counter) -undertaker.rules, undertaker.parent_content, undertaker.content, \ - undertaker.dids (timer) -undertaker.content.rowcount (counter) -``` - -8) Replicarecover - -```text 
-replica.recoverer.exceptions.* (counter)
-```
-
-9) Hermes
-
-```text
-daemons.hermes.reconnect.* (counter)
+The different options are shown in the figure above and described below.
+
+1. Queue-Based Pipelines
+
+   Hermes publishes events to a queue or topic on a message broker (such as ActiveMQ) via STOMP. Multiple consumers can process events independently, which enables real-time, decoupled processing pipelines. The events can be consumed from ActiveMQ by ETL pipelines, which allow aggregation, transformation, enrichment, and forwarding to a storage backend of your choice.
+
+   Example pipeline: ActiveMQ -> Logstash -> OpenSearch
+
+   The configuration for this option is described below.
+
+   ```cfg
+   [hermes]
+   # List of services Hermes should send messages to.
+   services_list = activemq
+
+   # Toggle query behavior:
+   #   True  -> fetch bulk messages for each service individually
+   #   False -> fetch bulk messages across all services together
+   query_by_service = True
+
+   # Bulk retrieval size for each call to the database
+   bulk = 1000
+
+   [messaging-hermes]
+   # ActiveMQ options
+   # List of broker hostnames or DNS aliases
+   brokers = amq1.example.com, amq2.example.com
+   # Destination queue or topic
+   destination = /queue/rucio
+   # Use SSL for the ActiveMQ connection
+   use_ssl = True
+   # SSL certificate files (if using SSL)
+   ssl_cert_file = /etc/rucio/certs/hermes-client-cert.pem
+   ssl_key_file = /etc/rucio/certs/hermes-client-key.pem
+   # Virtual host, optional
+   broker_virtual_host = /
+   # Non-SSL port (used if use_ssl=False)
+   nonssl_port = 61613
+   # SSL port (used if use_ssl=True)
+   port = 61614
+   # ActiveMQ username/password (used if use_ssl=False)
+   username =
+   password =
+   ```
+
+2. Direct Delivery
+
+   These options send events directly to storage or alerting systems, bypassing queues. Hermes can write events straight to Elasticsearch, OpenSearch, or InfluxDB. In addition, it can deliver events via email, with support for custom SMTP servers, credentials, and SSL/TLS.
+
+   The configuration options for each type are described below.
+
+   ```cfg
+   # rucio.cfg
+   # =========================
+   # Hermes Daemon Configuration
+   # =========================
+
+   [hermes]
+   # List of services Hermes should send messages to.
+   # Supported values: influx, elastic, email, activemq
+   services_list = elastic, influx, email, activemq
+
+   # Toggle query behavior:
+   #   True  -> fetch bulk messages for each service individually
+   #   False -> fetch bulk messages across all services together
+   query_by_service = True
+
+   # Bulk retrieval size for each call to the database
+   bulk = 1000
+
+   # InfluxDB endpoint for sending aggregated metrics
+   influxdb_endpoint = https://influxdb-host:8086/api/v2/write?org=my-org&bucket=my-bucket&precision=ns
+   # Token for authenticating to InfluxDB
+   influxdb_token = my-secret-influxdb-token
+
+   # Elasticsearch endpoint for sending events
+   elastic_endpoint = https://elasticsearch-host:9200/rucio-eic-event/_bulk
+   # Optional credentials if Elasticsearch is secured
+   elastic_username = admin
+   elastic_password = password
+
+   # Email sending options
+   send_email = True
+   email_from = rucio@cern.ch
+   smtp_host = smtp.cern.ch
+   smtp_port = 587
+   smtp_username = my-smtp-user
+   smtp_password = my-smtp-pass
+   smtp_usessl = False
+   smtp_usetls = True
+   smtp_certfile =
+   smtp_keyfile =
+   ```
+
+### Event Types
+
+Hermes messages carry different event types, including:
+
+- Transfers: `transfer-submitted`, `transfer-submission_failed`, `transfer-queued`, `transfer-failed`, `transfer-done`
+- Deletions: `deletion-done`, `deletion-not-found`, `deletion-failed`
+- Rules: `RULE_OK` and `RULE_PROGRESS`
+- Locks: `DATASETLOCK_OK`
+- DIDs: `CREATE_CNT` and `CREATE_DTS`
+- Replicas: `INCOMPLETE` and `ERASE`
+
+The structure of the rows in the `messages` table, which Hermes reads, is:
+
+```json
+{
+    "id": "UUID4",
+    "services": "",
+    "event_type": "",
+    "created_at": "yyyy-MM-dd HH:mm:ss.SSSSSS",
+    "payload": {},
+    "payload_nolimit": {}
+}
```
-
-10) Conveyor
-
-```text
-daemons.conveyor.[submitter].submit_bulk_transfer.per_file, \
-    daemons.conveyor.[submitter].submit_bulk_transfer.files (timer)
-daemons.conveyor.[submitter].submit_bulk_transfer (counter)
-daemons.conveyor.finisher.000-get_next (timer)
-daemons.conveyor.finisher.handle_requests (timer & counter)
-daemons.conveyor.common.update_request_state.request-requeue_and_archive (timer)
-daemons.conveyor.poller.000-get_next (timer)
-daemons.conveyor.poller.bulk_query_transfers (timer)
-daemons.conveyor.poller.transfer_lost (counter)
-daemons.conveyor.poller.query_transfer_exception (counter)
-daemons.conveyor.poller.update_request_state.* (counter)
-daemons.conveyor.receiver.error
-daemons.conveyor.receiver.message_all
-daemons.conveyor.receiver.message_rucio
-daemons.conveyor.receiver.update_request_state.*
-daemons.conveyor.receiver.set_transfer_update_time
-daemons.messaging.fts3.reconnect.*
-daemons.conveyor.stager.get_stagein_transfers.per_transfer, \
-    daemons.conveyor.stager.get_stagein_transfers.transfer (timer)
-daemons.conveyor.stager.get_stagein_transfers (count)
-daemons.conveyor.stager.bulk_group_transfer (timer)
-daemons.conveyor.submitter.get_stagein_transfers.per_transfer, \
-    daemons.conveyor.submitter.get_stagein_transfers.transfer (timer)
-daemons.conveyor.submitter.get_stagein_transfers (count)
-daemons.conveyor.submitter.bulk_group_transfer (timer)
-daemons.conveyor.throttler.set_rse_transfer_limits.\
-    [rse].max_transfers/transfers/waitings (gauge)
-daemons.conveyor.throttler.delete_rse_transfer_limits.[rse] (counter)
-daemons.conveyor.throttler.delete_rse_transfer_limits.[activity].[rse] (counter)
-daemons.conveyor.throttler.set_rse_transfer_limits.[activity].[rse] (gauge)
-daemons.conveyor.throttler.release_waiting_requests.[activity].[rse].[account]
-(counter)
-```
-
-11) Necromancer
-
-```text
-necromancer.badfiles.lostfile, necromancer.badfiles.recovering (counter)
+where:
+
+- `id`: UUID string
+- `event_type`: string with one of the event types listed above
+- `payload`: small JSON object (max 4,000 characters); its structure varies by event type
+- `payload_nolimit`: optional large JSON object, used only if the payload exceeds 4,000 characters
+- `services`: string identifying the target services (elastic, activemq, influx)
+- `created_at`: when the message was created, as an ISO 8601 timestamp
+
+To quickly inspect the payloads of these event types:
+
+```sql
+SELECT id, created_at, payload
+FROM messages
+WHERE event_type = ''
+ORDER BY created_at DESC
+LIMIT 2;
```
+
+Replace the empty `event_type` string with the actual name you want to inspect. Already-delivered messages can also be found in the `messages_history` table.
+
+### Format of Messages Delivered by Hermes
+
+The final format of the message is determined by the destination service, as Hermes transforms the raw database message into the required wire protocol for external systems.
+
+- ActiveMQ (STOMP message): The body is a streamlined JSON object containing only `event_type`, `payload`, and `created_at`. The message uses STOMP headers to set the `event_type` and flag the message as persistent.
+
+- Elasticsearch / OpenSearch (Bulk API): Hermes sends the raw database JSON message (including `id` and `services`) as a document using the Bulk API format (via a POST request).
+
+- InfluxDB (Line Protocol): Hermes performs on-the-fly aggregation of transfers and deletions, counting successes/failures and bytes. It does not send the raw event JSON. The final format is the InfluxDB Line Protocol, which consists of a single text line combining the measurement, tags (e.g., RSE, activity), fields (e.g., `nb_done=10`), and a timestamp.
+
+An example Grafana dashboard for transfers is provided [here](https://github.com/rucio/monitoring-templates/blob/main/message-monitoring/Dashboards/Rucio-Transfer.json).
+
+> **Note**: Please make changes to the dashboard according to your setup and needs.
+
+## Traces
+
+Rucio clients can send trace events on every file upload or download. These are posted to the `/traces` endpoint and forwarded to a message broker such as ActiveMQ via STOMP. Messages are consumed by Rucio’s Kronos daemon or by external consumers.
+
+This flow is shown in the figure below. The schemas of the traces, which can be used when building dashboards, are defined in [`trace.py`](https://github.com/rucio/rucio/blob/master/lib/rucio/core/trace.py).
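+Rucio clients emit these traces automatically, but when wiring up a trace pipeline it can be useful to post one by hand. Below is a minimal sketch using `requests`; the server URL is a placeholder and the field names are purely illustrative, so consult `trace.py` for the authoritative schema.
+
+```python
+import requests
+
+# Illustrative trace payload; see trace.py for the actual schema.
+trace = {
+    "eventType": "download",    # type of client operation
+    "scope": "user.jdoe",       # scope of the accessed DID
+    "filename": "file.1.root",  # name of the accessed DID
+    "clientState": "DONE",      # outcome reported by the client
+}
+
+# Placeholder server URL; the Rucio clients fill this in automatically.
+response = requests.post("https://rucio-server.example.com/traces/", json=trace)
+response.raise_for_status()
+```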
+
+
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': {
+    'primaryColor': '#d8e3e7',
+    'edgeLabelBackground': '#ffffff',
+    'tertiaryColor': '#cdd5d9',
+    'fontFamily': 'monospace',
+    'primaryBorderColor': '#90a4ae',
+    'lineColor': '#90a4ae'
+}}}%%
+flowchart TB
+    subgraph RucioTraceFlow["**Rucio Trace Flow**"]
+        C1["Clients / Pilots"]
+        RS["Rucio Server (/traces endpoint)"]
+        Q1["ActiveMQ"]
+        KR["Rucio Daemon: Kronos"]
+        ETL["ETL / Data Pipeline"]
+        OS1["OpenSearch / Elasticsearch / InfluxDB"]
+        KB["Grafana / Kibana"]
+    end
+
+    C1 -- "traces (HTTPS POST)" --> RS
+    RS -- "publish (STOMP)" --> Q1
+    Q1 -- "consume (STOMP)" --> KR
+    Q1 -- "consume" --> ETL
+    ETL --> OS1
+    OS1 --> KB
+
+    classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px;
+    classDef Grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold;
+    classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold;
+    classDef mq fill:#f69f03,stroke:#b35c00,color:#fff,font-weight:bold;
+    classDef etl fill:#4CAF50,stroke:#2E7D32,color:#fff,font-weight:bold;
+
+    class C1,RS,KR mono;
+    class Q1 mq;
+    class OS1 OpenSearch;
+    class KB Grafana;
+    class ETL etl;
```
-
-## Transfer monitoring
-
-If a transfer is submitted, queued, waiting, done or failed messages are sent to
-ActiveMQ via Hermes and are also archived in the messages_history table. Same is
-true for deletions. In the case of ATLAS we have a dedicated monitoring
-infrastructure that reads the messages from
-[ActiveMQ](https://activemq.apache.org), aggregates them and then writes the
-aggregated data into ElasticSearch/InfluxDB from where it then can be visualised
-using Kibana/Grafana.
-
-### Set up the Rucio internal monitoring dashboard
-
-1) Configure Rucio
-
-Rucio need to be configured to enable the message broker. In Rucio, message are
-sent by the Hermes daemon. Settings are defined in therucio.cfg under the
-[messaging-hermes] section
-
-```toml
-[messaging-hermes]
-username =
-password =
-port = 61613
-nonssl_port = 61613
-use_ssl = False
-ssl_key_file = /etc/grid-security/hostkey.pem
-ssl_cert_file = /etc/grid-security/hostcert.pem
-destination = /topic/rucio.events
-brokers = activemq
-voname = atlas
-email_from =
-email_test =
+## Rucio Database Dump
+
+Database-level monitoring extracts information directly from the Rucio database, such as RSE usage statistics, account quotas, and other metadata relevant to experiments. This data is periodically queried and exported to external storage backends for visualization and long-term monitoring.
+
+Some example Logstash pipeline definitions are given [here](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Pipelines/pipelines.yml). These example pipelines use the Logstash JDBC input plugin to connect to the Rucio PostgreSQL database, execute SQL queries, and periodically extract structured data. The retrieved records are then sent to Elasticsearch, but other storage backends such as OpenSearch can be used instead. The following diagram shows the high-level flow for database-level monitoring using Logstash.
+
+> **Note**: While this example uses Logstash, you can use other data collectors such as [fluentd](https://www.fluentd.org/) with its [SQL input plugin](https://github.com/fluent/fluent-plugin-sql), depending on your requirements.
+ +```mermaid +%%{init: {'theme': 'base', 'themeVariables': { + 'primaryColor': '#d8e3e7', + 'edgeLabelBackground': '#ffffff', + 'tertiaryColor': '#cdd5d9', + 'fontFamily': 'monospace', + 'primaryBorderColor': '#90a4ae', + 'lineColor': '#90a4ae' +}}}%% +flowchart TB + subgraph DB["**Database Level Accounting/Monitoring**"] + DB1[("Rucio DB")] + LS["Logstash JDBC Input"] + OS2["OpenSearch / Elasticsearch"] + GD["Grafana / Kibana"] + end + + DB1 --> LS + LS --> OS2 + OS2 --> GD + + classDef mono fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px; + classDef Grafana fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold; + classDef OpenSearch fill:#005EB8,stroke:#003E75,color:#fff,font-weight:bold; + classDef Logstash fill:#b34700,stroke:#b34700,color:#fff,font-weight:bold; + + + class DB1,LS mono; + class LS Logstash; + class OS2 OpenSearch; + class GD Grafana; ``` -The default settings are listed above. If ssl is not used, set use_ssl to False -and define username and password. They should be "admin", "admin" for the -default activemq settings. If you are not using the containers created by the -docker-compose command, change the brokers and port to the server hosting the -message queue. - -2) Setup Elasticsearch and Kibana - -Next is to setup and configure Elasticsearch and Kibana for storing and -visualising the messages. This is an example of creating them in containers - -```bash -docker run --detach \ - --publish 9200:9200 \ - --publish 9300:9300 \ - --env "discovery.type=single-node" \ - --name elasticsearch \ - docker.elastic.co/elasticsearch/elasticsearch:7.8.1 - -docker run --detach \ - --link elasticsearch \ - --publish 5601:5601 \ - --name kibana \ - docker.elastic.co/kibana/kibana:7.8.1 +A typical Logstash configuration consists of three sections — input, filter, and output. For example, the input section defines the PostgreSQL connection and SQL query to fetch data: ``` - -3) Import Elasticsearch indices - -Before transferring messages from the message queue to Elasticsearch, indices -need to be defined in Elasticsearch. This is a list of the message formats of -Rucio. 
- -### Transfer events - -```jsi -{ - created_at: when the message was created (yyyy-MM-dd HH:mm:ss.SSSSSS) - event_type: type of this event (transfer-submitted, \ - transfer-submittion_failed, transfer-queued, transfer-failed, \ - transfer-done) - payload: { - account: account submitting the request - activity: activity of the request - bytes: size of the transferred file (byte) - checksum-adler: checksum using adler algorithm - checksum-md5: checksum using md5 algorithm - created_at: Time when the message was created (yyyy-MM-dd HH:mm:ss.SSSSSS) - dst-rse: destination rse - dst-type: type of destination rse (disk, tape) - dst-url: destination url of the transferred file - duration: duration of the transfer (second) - event_type: type of this event (transfer-submitted, \ - transfer-submittion_failed, transfer-queued, \ - transfer-failed, transfer-done) - file-size: same as bytes - guid: guid of the transfer - name: name of transferred file - previous-request-id: id of previous request - protocol: transfer protocol - reason: reason of the failure - request-id: id of this request - scope: scope of the transferred data - src-rse: source rse - src-type: type of source rse (disk, tape) - src-url: source file url - started_at: start time of the transfer - submitted_at: submission time of the transfer - tool-id: id of the transfer tool in rucio (rucio-conveyor) - transfer-endpoint: endpoint holder of the transfer (fts) - transfer-id: uuid of this transfer - transfer-link: link of this transfer (in form of fts url) - transferred_at: done time of this transfer - } +input { + jdbc { + jdbc_connection_string => "jdbc:postgresql://host:5432/"" + jdbc_user => "" + jdbc_password => "" + jdbc_driver_library => "/usr/share/logstash/java/postgresql-.jar" + jdbc_driver_class => "org.postgresql.Driver" + statement => "SELECT rses.rse, rse_usage.source, rse_usage.used, rse_usage.free, rse_usage.files FROM rse_usage INNER JOIN rses ON rse_usage.rse_id=rses.id WHERE rse_usage.files IS NOT NULL AND rse_usage.files!=0;" + schedule => "0 0 * * *" + } } -``` - -### Deletion events -```json -{ - created_at: when the message was created (yyyy-MM-dd HH:mm:ss.SSSSSS) - event_type: type of this event (deletion-done,deletion-failed) - payload: { - scope: scope of the deleted replica - name: name of the deleted replica - rse: rse holding the deleted replica - file-size: size of the file - bytes: size of the file - url: url of the file - duration: duration of the deletion - protocol: prococol used in the deletion - reason: reason of the failure - } +filter { + # Placeholder for transformations or enrichments + # Examples: + # - Add computed fields + # - Rename fields + # - Convert units (e.g., bytes to GB) + # - Drop unwanted fields } -``` - -The formats of them are defined in [`rucio-transfer.json`](https://github.com/rucio/rucio/blob/master/tools/monitoring/rucio-transfer.json) -and [`rucio_deletion.json`](https://github.com/rucio/rucio/blob/master/tools/monitoring/rucio-deletion.json) -which could be imported into Kibana. - -Rucio also sends messages when adding/deleting rules/DIDs and for file/dataset -access. So the monitoring is not limited to data transferring. - -4) Transmit messages from message queue to Elastisearch - -This could be done via Logstash. Please refer to [Elastic's documentation.](https://www.elastic.co/blog/integrating-jms-with-elasticsearch-service-using-logstash). 
-
-Alternatively you could use a simple Python script such as [`extract.py`](https://github.com/rucio/rucio/blob/master/tools/monitoring/extract.py) for
-this after installing the required tools
-
-```bash
-pip install --upgrade pip
-pip install elasticsearch
-wget https://files.pythonhosted.org/packages/52/7e/22ca617f61e0d5904e06c1ebd5d453adf30099526c0b64dca8d74fff0cad/stomp.py-4.1.22.tar.gz
-tar --extract --gzip --verbose --file stomp.py-4.1.22.tar.gz
-cd stomp.py-4.1.22
-python setup.py install
-```
-
-Change the configurations (message broker and elastisearch cluster) in
-exporter.py and start it. It could be made as a systemd service for convenience.
-
-5) Create Kibana dashboards based on the imported messages.
-
-A set of pre-defined dashboards can be found
-[here](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization) in
-json format which could be imported to Kibana directly. But you may have to
-resolve different UUIDs in Kibana.
-
-## Access monitoring
-
-The traces are sent by the pilots or the Rucio clients whenever a file is
-downloaded/uploaded. This is similar with the data transferring monitoring.
-
-## Rucio database dumping
-
-Besides the internal, data transferring/deletion/accessing monitoring, it's also
-possible to dump the Rucio internal database directly to Elasticsearch. Then
-information like data location, accounting, RSE summary could be visualised
-using Kibana or Grafana.
-
-We provide several examples of dumping Rucio DB tables using the logstash jdbc
-plugin and making plots based on them.
-
-To start a logstash pipeline, run
-
-```bash
-logstash -f rse.conf
-```
-
-Where the rse.conf contains
-```json
-input {
-  jdbc {
-    jdbc_connection_string => ""
-    jdbc_user => ""
-    jdbc_password => ""
-    jdbc_driver_library => "/usr/share/logstash/java/postgresql-42.2.6.jar"
-    jdbc_driver_class => "org.postgresql.Driver"
-    statement => "SELECT rses.rse, rse_usage.source, \
-        rse_usage.used, rse_usage.free, rse_usage.files FROM rse_usage INNER JOIN rses ON \
-        rse_usage.rse_id=rses.id WHERE rse_usage.files IS NOT NULL AND \
-        rse_usage.files!=0;"
-  }
-}
+input {
+  jdbc {
+    jdbc_connection_string => "jdbc:postgresql://host:5432/"
+    jdbc_user => ""
+    jdbc_password => ""
+    jdbc_driver_library => "/usr/share/logstash/java/postgresql-.jar"
+    jdbc_driver_class => "org.postgresql.Driver"
+    statement => "SELECT rses.rse, rse_usage.source, rse_usage.used, rse_usage.free, rse_usage.files FROM rse_usage INNER JOIN rses ON rse_usage.rse_id=rses.id WHERE rse_usage.files IS NOT NULL AND rse_usage.files!=0;"
+    schedule => "0 0 * * *"
+  }
+}
+
+filter {
+  # Placeholder for transformations or enrichments
+  # Examples:
+  #   - Add computed fields
+  #   - Rename fields
+  #   - Convert units (e.g., bytes to GB)
+  #   - Drop unwanted fields
+}

 output {
   elasticsearch {
-    hosts => [""]
+    hosts => ["http://elasticsearch:9200"]
     action => "index"
     index => "rucio_rse"
-    user => ""
-    password => ""
+    user => "elastic"
+    password => "password"
   }
 }
```
+
+A few points:
+
+- `jdbc_driver_library`: The PostgreSQL JDBC driver can be downloaded from [jdbc.postgresql.org](https://jdbc.postgresql.org/); choose the version you want to use and make it available to Logstash.
+- `schedule`: Defines how often the query runs (cron-like syntax).
+- `output`: Defines where the extracted data is delivered. In most deployments, the records are indexed into OpenSearch or Elasticsearch for analytics dashboards in Grafana or Kibana.
+- `filter`: Optional; it allows preprocessing of the data before indexing.
+
+An example [Grafana dashboard](https://github.com/rucio/monitoring-templates/blob/main/logstash-monitoring/Dashboards/Rucio-Storage.json) for RSE usage is provided.
+
+## Rucio Monitoring Probes
+
+Rucio provides a collection of **monitoring probes** that check different status metrics of a Rucio instance.
+A list of probes shared across experiments is available [here](https://github.com/rucio/probes/tree/master/common). Experiments can also create their own probes for custom monitoring, as [ATLAS](https://github.com/rucio/probes/tree/master/atlas) and [CMS](https://github.com/rucio/probes/tree/master/cms) do.
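+A minimal sketch of what such a probe might look like is shown below, assuming the standard `prometheus_client` library; the metric name, threshold, Pushgateway address, and the `count_expired_dids()` helper are hypothetical stand-ins for a real database query.
+
+```python
+#!/usr/bin/env python3
+"""Hypothetical probe: report the number of expired DIDs.
+
+Exit codes follow the Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
+"""
+import sys
+
+from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
+
+
+def count_expired_dids():
+    # Placeholder: a real probe would query the Rucio database here.
+    return 42
+
+
+if __name__ == "__main__":
+    try:
+        expired = count_expired_dids()
+
+        # Push the value as a gauge to a Prometheus Pushgateway.
+        registry = CollectorRegistry()
+        gauge = Gauge("rucio_expired_dids", "Number of expired DIDs",
+                      registry=registry)
+        gauge.set(expired)
+        push_to_gateway("pushgateway.example.com:9091",
+                        job="check_expired_dids", registry=registry)
+    except Exception:
+        sys.exit(3)  # UNKNOWN: the check itself failed
+    sys.exit(0 if expired < 1000 else 2)  # OK below threshold, else CRITICAL
+```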
+
+Rucio provides a prebuilt container on [Docker Hub](https://hub.docker.com/r/rucio/probes) that includes:
+
+- All dependencies for running the probes.
+- A lightweight **Jobber** daemon for scheduling probe execution.
+- The full Rucio probe repository. Custom probes for your own Rucio instance can be added alongside it.
+
+The container can push results either to a **Prometheus Pushgateway** or export data for **Nagios** alerting.
+
+```mermaid
+%%{init: {'theme': 'base', 'themeVariables': {
+    'primaryColor': '#d8e3e7',
+    'edgeLabelBackground': '#ffffff',
+    'tertiaryColor': '#cdd5d9',
+    'fontFamily': 'monospace',
+    'primaryBorderColor': '#90a4ae',
+    'lineColor': '#90a4ae'
+}}}%%
+flowchart LR
+    Probe["Rucio Probes (Schedule via Jobber or others)"]
+    Nagios["Nagios"]
+    Prometheus["Prometheus"]
+    Grafana["Grafana Dashboards"]
+
+    Probe -- "Exit code + stdout" --> Nagios
+    Probe -- "Gauge metrics via Pushgateway" --> Prometheus
+    Prometheus --> Grafana
+
+    classDef probe fill:#d8e3e7,stroke:#607d8b,color:#000,font-size:12px;
+    classDef nagios fill:#E53935,stroke:#B71C1C,color:#fff,font-weight:bold;
+    classDef prom fill:#009688,stroke:#00695C,color:#fff,font-weight:bold;
+    classDef graf fill:#F05A28,stroke:#b03e16,color:#fff,font-weight:bold;
+
+    class Probe probe;
+    class Nagios nagios;
+    class Prometheus prom;
+    class Grafana graf;
```
-The rse pipeline dumps data like how large is the total space, how large is the
-used space, how many files are saved on each RSE etc. Please fill in the jdbc
-connection details and Elastisearch connection details in the config file.
+The probe execution workflow is:
-More pipeline definitions can be found [here](https://github.com/rucio/rucio/tree/master/tools/monitoring/logstash-pipeline),
-and users could design their own DB queries for their specific monitoring
-needs. Also users could directly import the Elasticsearch indices and Kibana
-dashboard from [these](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization/db_dump).
-json files.
+
+- **Probes** are Python scripts under `rucio/probes/`.
+- **Jobber** acts as a cron-like scheduler inside the container.
+- **Output options:**
+  - **Prometheus Pushgateway:** for time-series metrics. Alerts can be added with [Prometheus](https://prometheus.io/docs/alerting/latest/alertmanager/) and [Grafana](https://grafana.com/docs/grafana/latest/alerting/set-up/configure-alertmanager/) alert management.
+  - **Nagios:** probes return Nagios-style exit codes that can trigger Nagios alerts, while their metrics are sent to Prometheus.
-## Footnotes
+
+To make use of the Prometheus functionality, make sure the `rucio.cfg` inside the container with the probes has the extra sections and options:
-[^1]: [https://graphiteapp.org/]
-[^2]: [https://grafana.com/]
+
+```cfg
+[monitor]
+prometheus_servers = https://prometheus-server:port
+prometheus_prefix =    # default empty
+prometheus_labels =    # default empty
+```
+
+To add cron-like scheduling for each probe in Jobber, add the needed configuration to [dot-jobber](https://github.com/rucio/containers/blob/master/probes/dot-jobber). An example config is given below, running the probes `check_expired_dids` and `check_stuck_rules`. This config assumes your probes are in the top-level directory of the container.
+ +```yaml +version: 1.4 +jobs: + - name: CheckExpiredDIDs + cmd: ./check_expired_dids + time: '*/5 * * * *' # every 5 minutes + onError: Continue + - name: CheckStuckRules + cmd: ./check_stuck_rules + time: '0 * * * *' # hourly + onError: Continue +```