Add routing stats module #8455

Open

erayd wants to merge 3 commits into meshtastic:develop from erayd:routing-stats

Conversation

@erayd (Contributor) commented Oct 26, 2025

Adds some basic routing stats to assist with infrastructure planning and identifying problems. The idea is to have this enabled on infrastructure high sites. I also intend to add replay stats to this later, as part of the replay PR.

  • Transmission of stats is disabled by default, and can be configured to an arbitrary interval (seconds).
  • Received stats will be printed to the console upon receipt.
  • Local stats will be printed to the console every 60 seconds.
[Screenshot: 2025-10-26 at 19:20:07]

Tracked metrics:

  • Channel util %
  • TX util %
  • TX milliseconds
  • RX milliseconds
  • Total number of packets sent
  • Number of relayed (rebroadcast) packets
  • Number of packets originated (excl. zero-hop)
  • Number of zero-hop packets originated
  • Number of packets dropped from the TX queue
  • Maximum number of packets waiting in the TX queue
  • Total number of packets received
  • Number of receive events where the incoming data couldn't be decoded to a packet
  • Number of packets received originating from a direct neighbour (excl. zero-hop)
  • Number of zero-hop packets received
  • Number of packets received with hop_limit=0 (excl. zero-hop)

Requires meshtastic/protobufs#799
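
To make the shape of the data concrete, here's a rough C++ sketch of the windowed counters. The field names below are illustrative only; the authoritative definitions are in the protobufs PR above.

```cpp
// Hypothetical sketch only -- the real message is defined in meshtastic/protobufs#799,
// and the names below are illustrative, not the actual protobuf fields.
#include <cstdint>

struct RoutingStatsWindow {
    // Radio utilisation over the window
    float channel_util_percent = 0;
    float tx_util_percent = 0;
    uint32_t tx_ms = 0;
    uint32_t rx_ms = 0;

    // Transmit-side counters
    uint32_t packets_sent_total = 0;
    uint32_t packets_relayed = 0;
    uint32_t packets_originated = 0;          // excl. zero-hop
    uint32_t packets_originated_zero_hop = 0;
    uint32_t tx_queue_dropped = 0;
    uint32_t tx_queue_high_watermark = 0;

    // Receive-side counters
    uint32_t packets_received_total = 0;
    uint32_t rx_decode_failures = 0;
    uint32_t rx_from_direct_neighbour = 0;    // excl. zero-hop
    uint32_t rx_zero_hop = 0;
    uint32_t rx_hop_limit_zero = 0;           // excl. zero-hop

    void reset() { *this = RoutingStatsWindow{}; } // start a new window
};
```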

🤝 Attestations

  • I have tested that my proposed changes behave as described.
  • I have tested that my proposed changes do not cause any obvious regressions on the following devices:
    • Heltec (Lora32) V3
    • LilyGo T-Deck
    • LilyGo T-Beam
    • RAK WisBlock 4631
    • Seeed Studio T-1000E tracker card
    • Other (please specify below)

@GUVWAF (Member) commented Oct 28, 2025

This definitely looks useful. It has some overlap with LocalStats (which actually can also be requested remotely). Maybe it makes sense to include the new metrics from this module there as well? Then you don't need to enable this for nodes you can reach with a client app.

Also, probably good to enforce a minimum broadcast interval to avoid this becoming an airtime hog just like NeighborInfo was.

@erayd (Contributor, Author) commented Oct 29, 2025

It has some overlap with LocalStats (which actually can also be requested remotely). Maybe it makes sense to include the new metrics from this module there as well? Then you don't need to enable this for nodes you can reach with a client app.

I had no idea that LocalStats was remotely requestable. That's good to know. I do think it serves a different purpose though - this one is very much intended as a live diagnostic tool, which is why most of the metrics are windowed. The stuff in LocalStats is mostly on a since-boot basis, isn't it?

Also, probably good to enforce a minimum broadcast interval to avoid this becoming an airtime hog just like NeighborInfo was.

Fair point. IMO it should be imposed after a timeout though - because this is a diagnostic tool, I want to be able to set it to e.g. a two-minute window specifically in order to get better resolution on peaks. Would having the value automatically clamped to max(1 hour, configured interval) after e.g. 4 hours of runtime be acceptable? That way small values can still be configured while actively testing, but it ensures the device doesn't remain in that state indefinitely.
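
To make the proposal concrete, the clamp I have in mind is roughly this (a sketch only; the function name and the 4-hour / 1-hour constants are placeholders, not final values):

```cpp
#include <algorithm>
#include <cstdint>

// Placeholder constants -- not final values.
constexpr uint32_t CLAMP_AFTER_SECS = 4 * 60 * 60;       // stop honouring short intervals after 4 h of runtime
constexpr uint32_t MIN_CLAMPED_INTERVAL_SECS = 60 * 60;   // 1-hour floor once the clamp kicks in

// Returns the broadcast interval that will actually be used.
uint32_t effectiveIntervalSecs(uint32_t configuredSecs, uint32_t uptimeSecs)
{
    if (uptimeSecs < CLAMP_AFTER_SECS)
        return configuredSecs; // short intervals allowed while actively testing
    return std::max(configuredSecs, MIN_CLAMPED_INTERVAL_SECS); // afterwards, clamp to at least 1 hour
}
```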

@GUVWAF (Member) commented Oct 29, 2025

I had no idea that LocalStats was remotely requestable.

I believe it's only available using the Python CLI with --request-telemetry local_stats.

The stuff in LocalStats is mostly on a since-boot basis, isn't it?

Correct. It does indeed serve a slightly different purpose, and periodic broadcasts can be valuable.

Would having the value automatically clamped to max(1 hour, configured interval) after e.g. 4 hours of runtime be acceptable?

Sounds good to me.

github-actions bot added the Stale label (issues that will be closed if not triaged) on Dec 15, 2025
github-actions bot closed this on Dec 24, 2025
@erayd (Contributor, Author) commented Dec 30, 2025

Not actually stale... could someone reopen this please?

thebentern reopened this on Dec 30, 2025
thebentern added the triaged label and removed the Stale label on Dec 30, 2025
@erayd (Contributor, Author) commented Jan 19, 2026

@NomDeTom Is your cumulative uptime thing split out somewhere? I'd like to make use of it from this module, in order to have it self-disable if nobody has actively indicated they want it for a while.

@garthvh (Member) commented Jan 19, 2026

We probably should not have two different big packets for stats (local stats already exists and is integrated into the apps and CLI); some of these stats are also in device metrics. This diagnostic data burns us over and over by being large.

@erayd (Contributor, Author) commented Jan 19, 2026

We probably should not have two different big packets for stats (local stats already exists and is integrated into the apps and CLI); some of these stats are also in device metrics. This diagnostic data burns us over and over by being large.

This serves a different purpose (see #8455 (comment)). Local stats doesn't have the windowing stuff unfortunately, and thus can't do what this is intended to do.

I totally get where you're coming from with stats bloat on the air though - that's why I was asking @NomDeTom about his cumulative uptime thing. Ideally, this module would be default-off, manually enabled when needed, and turn itself off again after a while. It's intended as a testing / diagnostic / planning tool for remote infrastructure sites, not a thing that gets left on all the time, and especially not at a high frequency / short window. However, high frequency is a capability that it needs to have.

It doesn't need to be exposed in the apps either IMO; this could easily be a CLI-only thing. I have no issue with it being pretty seriously gated away from casual users. They aren't the target audience for this.

@garthvh (Member) commented Jan 19, 2026

We always need to consider the worst-case scenario: assume this will wind up in MeshSense and be queried every minute. The CLI calls functions in the firmware; it doesn't prevent huge airtime use. This would probably have to be an admin message, as large as it looks, and be throttled pretty heavily.

@erayd (Contributor, Author) commented Jan 19, 2026

We always need to consider the worst-case scenario: assume this will wind up in MeshSense and be queried every minute.

Every minute is actually a legitimate use-case for this. I use it at that frequency occasionally on some of our high sites here to watch what happens during traffic spikes, so I can get a better idea of what's going on.

It's definitely not intended for MeshSense. However, ultimately if someone wants to abuse the mesh, they are going to be able to do so no matter how nicely we gate things. There comes a point at which one just needs to accept that people can be dicks about stuff if they are motivated to be so, and simply hope that they do not. Meshtastic by its very nature is easy to DoS. The trick is just to make sure that people can't do it by accident.

This would probably have to be an admin message, as large as it looks...

Admin-enabled only, yes. There's zero reason to have this be solicited by random other nodes, and IMO it shouldn't be possible to request it at all; it's something that only the node owner should be able to turn on. The actual stats packets themselves don't need to be admin ones though... they don't contain confidential info, and having the payload be observable outside of an admin session is important.

...and be throttled pretty heavily.

I strongly disagree on this point. Have it self-disable, yes. And have it automatically increase the interval after a short period, yes. Those measures will ensure that it cannot be accidentally left in an obnoxious state. But throttling it somewhat defeats the purpose; it's intended to be occasionally utilised at a high rate.
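
For clarity, the kind of self-limiting guard I mean is roughly this (a sketch only; the 24-hour threshold and the "someone still wants this" hook are placeholders):

```cpp
#include <cstdint>

// Placeholder: seconds of no admin interaction after which the module switches itself off.
constexpr uint32_t SELF_DISABLE_AFTER_SECS = 24 * 60 * 60;

struct RoutingStatsGuard {
    uint32_t lastInterestSecs = 0; // uptime at the last admin action indicating interest

    // Call whenever the owner touches the module config (e.g. via an admin message).
    void noteInterest(uint32_t uptimeSecs) { lastInterestSecs = uptimeSecs; }

    // Broadcasts stop once nobody has shown interest for a while.
    bool shouldBroadcast(uint32_t uptimeSecs) const
    {
        return (uptimeSecs - lastInterestSecs) < SELF_DISABLE_AFTER_SECS;
    }
};
```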

@NomDeTom (Contributor) commented Jan 19, 2026

@NomDeTom Is your cumulative uptime thing split out somewhere? I'd like to make use of it from this module, in order to have it self-disable if nobody has actively indicated they want it for a while.

The only place it is located at the moment is in #8378. @ford-jones was going to look at it, but since the uptime accumulator is a relatively simple piece and I know you're capable of modularising PRs, it could be implemented separately from the role-retirement feature. Adding total-uptime, uptime-in-role, uptime-in-temporary-promotion, etc. would make it easier to hang the other functions off it at a later date.

My only caveat would be that the total-uptime figure was intended to be quite coarse and long-running - fine enough to capture daytime-only solar nodes, but infrequent enough not to add to the flash burden. I don't know if that aligns with the self-promotion plan.
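
For illustration, a coarse accumulator along those lines could look roughly like this (a sketch only, not the actual code in #8378; the class name, flush interval, and storage hook are placeholders):

```cpp
#include <cstdint>

// Placeholder: how often accumulated uptime is flushed to flash. Coarse on purpose,
// to keep flash wear low while still catching daytime-only solar nodes.
constexpr uint32_t FLUSH_EVERY_SECS = 6 * 60 * 60;

class UptimeAccumulator {
  public:
    // Call periodically (e.g. once a minute) with the current session uptime.
    void tick(uint32_t sessionUptimeSecs)
    {
        uint32_t delta = sessionUptimeSecs - lastSeenSecs; // lastSeenSecs starts at 0 each boot
        lastSeenSecs = sessionUptimeSecs;
        totalSecs += delta;
        pendingSecs += delta;
        if (pendingSecs >= FLUSH_EVERY_SECS) {
            persist();      // hypothetical hook -- write totalSecs to flash/NVS here
            pendingSecs = 0;
        }
    }

    uint32_t total() const { return totalSecs; }

  private:
    void persist() { /* storage deliberately left out of this sketch */ }

    uint32_t lastSeenSecs = 0;
    uint32_t totalSecs = 0; // would be loaded from flash at boot in a real implementation
    uint32_t pendingSecs = 0;
};
```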
