Fixes high memory consumption that is also part of issue pokt-network#1457. Under a high request load (1000 RPS or more), RAM usage climbed to nearly 40 GB. With the worker pool fix for pokt-network#1457, the node stays under 14 GB of RAM in my local tests.
… keep cpu in high load.
* Fixed RPC timeout handled as seconds instead of milliseconds
* Updated mesh.md to handle new cache configurations
* Updated mesh.md to list /v1/private/mesh/session as required on the whitelist endpoints/paths

* Fixed /v1/private/mesh/updatechains to properly update them in memory and on disk
* Added hot reload for servicer private key files (add & remove)
  * on add, turn on the checks and start allowing it
  * on remove, stop receiving and consume all the pending relays in the queue
* Version bump

* Enhanced log about missing sessions
* Version bump
…rivate key is removed after it has been supported by the mesh node.
* Version bump
…ral solution)
* Fixed an error that caused the process to panic when loading a servicer_url without an http/https scheme. It now reports the error properly.
* Added a manual cron to compact the relays database every hour.
* Removed a log2.Fatal that was crashing the process.
* relay_cache_background_sync_interval was not used
* relay_cache_background_compaction_interval was not used

Added:
* hot_reload_interval: set it to 0 to turn off the hot reload of chains/servicers; otherwise, it is the interval in milliseconds at which the files are checked again

Updated:
* Health check of servicers is now done every 60s (was 30s). Future: will be configurable through config.json
* Old sessions are now evaluated for removal every 30m (was 30s). Future: will be configurable through config.json
* config.json example in the docs

Removed:
* Manual relays db compaction job. We received reports that it was corrupting the relays database when it ran at the same time as the background compaction configured by relay_cache_background_compaction_interval
… from storage in any case after they are success/failed. Fixed log that was printing node instead of app public key.
…very servicer on same session.
…unning mixed servicers.
…0.9.2. Bump Mesh client to RC-0.2.6
Added different key format. Refactor connectivity checks. Refactor node/servicer internal structure of mesh to reduce amount of worker/cron instances. Refactor chains/keys reload.
Added FullNode worker dynamic resize on servicers change. Updated servicers reload to only run the modification on maps when there is something new/removed.
…e and better readability of the code without so many casts. Refactor fullNode.Servicer to be a map instead of a slice. Enhance a bit more the logs and bootstrap time information.
Added metrics config support. Refactor code to split in files. Bump pond version to 1.8.3 (patch). Clean up the code.
Update config to handle rpc timeout for different things like chains, client and pocket node calls with a different value.
…able by config file.
Ensure that http response body is read even on errored request to reuse connections.
Enhanced chains reload logs. Enhanced startup logs.
… so many edge cases and possible infinite goroutine spams. Added name property to nodes as optional key, if not set use the hostname of the node url. Added minWorker, maxWorker, maxCapacity to prometheus metrics collectors. Refactor minWorker, maxWorker and maxCapacity option in config. Bump default to a more real world value. Updated docs.
…pact on mesh code, but help node runners to keep internal track.
…/queries on prometheus. Added chains name map so those metrics could contain the chain name you wish. ChainsNameMap could work with a local file or remote endpoint (GET)
… error log. Does not affect the code but is unnecessary.
Added status_type and status_code labels to error metrics. Added internal, notify and chain error metrics.
Added metrics docs and basic geo-mesh grafana dashboard.
…ould be leaked on logs. Moved Grafana Dashboard to a file to easily compare/copy from raw github files. Updated mesh.md
Enhance docs based on blade suggestions.
…and the one dispatching to chains.
## High-Level Summary
This update includes changes that allow for faster boot time and better overall performance & maintainability.
* Mesh bootstrap time has been greatly reduced to almost 0, which improves overall performance and makes it easier to maintain your key files.
* Mesh now supports hot reloading of keys and chains efficiently by deduplicating full nodes, which allows for faster updates and better overall performance.
* Code has been split into smaller files to make it easier to navigate and modify, which helps with both development and maintenance.
* The Relay, Relay Time, and Error metrics now use labels for chain id and name to differentiate between different chains and to help with identifying and categorizing the various metrics.
* Miscellaneous bug fixes, such as connection reuse and invalid sessions, which should improve the performance of your mesh and full node.
### Breaking Changes
These changes require you to update both the Pocket Node and the Mesh Node.
1. Added `/v1/private/mesh/check` endpoint
2. Moved `/v1/mesh/health` to `/v1/private/mesh/health`
3. Removed `/v1/private/mesh/servicer`
4. Removed Prometheus metrics using chain id as part of the metric name
5. Removed `authtoken` query param from private mesh endpoints in favor of `Authorization: <token>` header.
* **_NOTE: No changes are needed on your side unless you have tooling built on top of the `/v1/private/mesh/*` endpoints_**
**_NOTE:_**
- You must add the new private endpoints (`check`) and use the same code/image on both sides or your mesh node will not start.
- You can remove the `/v1/health` endpoint from your proxy.
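For tooling built on the private endpoints, the move from the `authtoken` query parameter to the `Authorization: <token>` header could look roughly like the sketch below. The base URL, the token value, the helper name `newPrivateMeshRequest`, and the use of GET are assumptions made for illustration only; check the rpc-spec for the actual method of each endpoint.

```go
package main

import (
	"fmt"
	"net/http"
)

// newPrivateMeshRequest is a hypothetical helper. It builds a request to a
// private mesh endpoint and passes the token in the Authorization header
// instead of the removed `authtoken` query parameter.
func newPrivateMeshRequest(baseURL, path, token string) (*http.Request, error) {
	// Assumption: GET is used here only to keep the example small.
	req, err := http.NewRequest(http.MethodGet, baseURL+path, nil)
	if err != nil {
		return nil, err
	}
	// The header value is the raw token, matching `Authorization: <token>`.
	req.Header.Set("Authorization", token)
	return req, nil
}

func main() {
	// Hypothetical address and token.
	req, err := newPrivateMeshRequest("http://localhost:8081", "/v1/private/mesh/health", "my-secret-token")
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	fmt.Println("url:", req.URL.String())
	fmt.Println("authorization header set:", req.Header.Get("Authorization") != "")
}
```

Because the token no longer appears in the URL, it is not written to access logs that record the request line.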
### Configuration Changes
1. Added `node_check_interval` to configure the rate at which a full node is health-checked.
2. Added `log_chain_request` and `log_chain_response` to avoid unnecessary debug logs.
3. Added `chains_name_map` and `remote_chains_name_map` to enhance exposed metrics with the chain name.
4. Added `metrics_moniker` to help identify a mesh node instance in metrics queries.
5. Added `metrics_report_interval` to configure the rate at which worker metrics are reported.
6. Added `<servicer|chain>_rpc_max_idle_connections`
7. Added `<servicer|chain>_rpc_max_conns_per_host`
8. Added `<servicer|chain>_rpc_max_idle_conns_per_host`
9. Added `chain_drop_connections`
10. Added `client_rpc_read_timeout`
11. Added `client_rpc_read_header_timeout`
12. Added `client_rpc_write_timeout`
13. Added `chain_request_path_cleanup`
14. Removed `hot_reload_interval` in favor of separate `keys_hot_reload_interval` and `chains_hot_reload_interval` values
15. Renamed worker pool options to specify a different set for Servicer and Metrics
16. Removed `rpc_timeout` in favor of `client_rpc_timeout` and `chain_rpc_timeout`
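As a rough illustration of what the connection and timeout options above control, the sketch below maps them onto Go's standard `net/http` types. This is not the mesh's actual wiring: the struct, the helper names, and the sample values are hypothetical, the `chain_` keys are one expansion of the `<servicer|chain>_` placeholder, and the assumption that timeouts are expressed in milliseconds comes from the earlier timeout fix rather than from a confirmed spec.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// rpcConfig holds a subset of the options listed above.
// Assumption: every timeout value is expressed in milliseconds.
type rpcConfig struct {
	ChainRPCTimeout             int64 `json:"chain_rpc_timeout"`
	ChainRPCMaxIdleConnections  int   `json:"chain_rpc_max_idle_connections"`
	ChainRPCMaxConnsPerHost     int   `json:"chain_rpc_max_conns_per_host"`
	ChainRPCMaxIdleConnsPerHost int   `json:"chain_rpc_max_idle_conns_per_host"`
	ClientRPCReadTimeout        int64 `json:"client_rpc_read_timeout"`
	ClientRPCReadHeaderTimeout  int64 `json:"client_rpc_read_header_timeout"`
	ClientRPCWriteTimeout       int64 `json:"client_rpc_write_timeout"`
}

// ms converts a millisecond count into a time.Duration.
func ms(v int64) time.Duration { return time.Duration(v) * time.Millisecond }

// parseRPCConfig decodes a JSON fragment into rpcConfig.
func parseRPCConfig(raw string) (rpcConfig, error) {
	var c rpcConfig
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

// chainTransport shows which transport limits the chain_rpc_max_* options
// would correspond to in net/http.
func chainTransport(c rpcConfig) *http.Transport {
	return &http.Transport{
		MaxIdleConns:        c.ChainRPCMaxIdleConnections,
		MaxConnsPerHost:     c.ChainRPCMaxConnsPerHost,
		MaxIdleConnsPerHost: c.ChainRPCMaxIdleConnsPerHost,
	}
}

// clientServer shows which server timeouts the client_rpc_* options would
// correspond to in net/http.
func clientServer(c rpcConfig, handler http.Handler) *http.Server {
	return &http.Server{
		Handler:           handler,
		ReadTimeout:       ms(c.ClientRPCReadTimeout),
		ReadHeaderTimeout: ms(c.ClientRPCReadHeaderTimeout),
		WriteTimeout:      ms(c.ClientRPCWriteTimeout),
	}
}

func main() {
	// Hypothetical values, not recommended defaults.
	raw := `{
		"chain_rpc_timeout": 80000,
		"chain_rpc_max_idle_connections": 2500,
		"chain_rpc_max_conns_per_host": 1000,
		"chain_rpc_max_idle_conns_per_host": 1000,
		"client_rpc_read_timeout": 60000,
		"client_rpc_read_header_timeout": 50000,
		"client_rpc_write_timeout": 90000
	}`
	cfg, err := parseRPCConfig(raw)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	client := &http.Client{Timeout: ms(cfg.ChainRPCTimeout), Transport: chainTransport(cfg)}
	srv := clientServer(cfg, http.NotFoundHandler())
	fmt.Println("chain client timeout:", client.Timeout)
	fmt.Println("server read timeout:", srv.ReadTimeout)
}
```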
### Core Changes
#### New:
1. New keys structure
2. Deduplication of nodes (reduces bandwidth usage and speeds up boot time)
3. Allow controlling the RPS from the mesh to a node (using `servicer_max_workers`), given how workers are used now
4. Chains check: verifies against the Pocket node that the chain IDs in the chains file match the ones on the Pocket node.
5. Added a Worker for metrics
6. Chains name maps (local or remote) to enhance metrics
7. Added an endpoint to allow the mesh node to check against the servicer node (health, chains, addresses)
8. Added sanitization of the URL before calling the chain, due to errors observed on chains like Avax/DFK sending `\t` characters in the path.
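The path sanitization mentioned above could be sketched as follows. This is a minimal illustration, not the mesh's actual cleanup routine: the function name `sanitizePath` is hypothetical, and dropping every control character (rather than only `\t`) is an assumption. Percent-encoded characters such as `%09` are deliberately left untouched.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitizePath removes control characters (tab, carriage return, newline,
// and similar) from a request path before it is forwarded to a chain.
func sanitizePath(path string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, path)
}

func main() {
	// Hypothetical relay path with a trailing tab character.
	dirty := "/ext/bc/C/rpc\t"
	fmt.Printf("before: %q\n", dirty)
	fmt.Printf("after:  %q\n", sanitizePath(dirty))
}
```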
#### Fix:
1. Invalid sessions error
2. HTTP connection drops caused by not reading the response body (fixing this decreases CPU utilization and bandwidth)
3. Handle a few cases where execution kept going after an error instead of interrupting the flow, which invalidated a relay/session.
4. Update chains using /v1/private/mesh/updatechains endpoint
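The connection-drop fix relies on a general property of Go's `net/http` transport: a keep-alive connection is only returned to the idle pool once the response body has been read to the end and closed, even when the status code signals an error. The sketch below demonstrates that pattern with the standard library only; it is not the mesh code, and the helper names are hypothetical. It counts how many connections a local test server accepts while a client sends several requests that all fail with a 500.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// drainAndClose reads any remaining bytes and closes the body so the
// transport can put the connection back into its idle pool.
func drainAndClose(body io.ReadCloser) {
	_, _ = io.Copy(io.Discard, body)
	_ = body.Close()
}

// newConnsForErroredRequests sends n sequential requests to a test server
// that always answers 500 and returns how many connections it accepted.
func newConnsForErroredRequests(n int) (int, error) {
	var accepted int32
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "upstream failure", http.StatusInternalServerError)
	}))
	// Count every newly accepted connection on the server side.
	srv.Config.ConnState = func(_ net.Conn, state http.ConnState) {
		if state == http.StateNew {
			atomic.AddInt32(&accepted, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := srv.Client()
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			return 0, err
		}
		// The status is an error, but the body is still consumed and closed.
		drainAndClose(resp.Body)
	}
	return int(atomic.LoadInt32(&accepted)), nil
}

func main() {
	conns, err := newConnsForErroredRequests(5)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("connections opened for 5 errored requests:", conns)
}
```

Closing the body without draining it first makes the transport discard the connection, so each request pays for a new TCP handshake.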
#### Rework:
1. Relay, Relay Time, and Error metrics now use labels for chain id and name
2. Split Keys and Chains hot reload interval to allow them to work independently.
3. Moved the `authtoken` query parameter used to call private endpoints into the HTTP `Authorization` header to avoid leaking it in logs.
4. Split the code into files to improve development and readability.
5. Bootstrap time reduced to almost 0
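Splitting the hot reload interval means keys and chains can be re-read on independent schedules. A minimal sketch of that idea is shown below. It assumes that a value of 0 still disables reloading, as it did for the former `hot_reload_interval`, and the helper name `startHotReload` is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// startHotReload calls reload every interval until the returned stop
// function is called. Assumption: an interval of 0 (or less) disables it.
func startHotReload(interval time.Duration, reload func()) (stop func()) {
	if interval <= 0 {
		return func() {} // disabled: nothing to stop
	}
	ticker := time.NewTicker(interval)
	done := make(chan struct{})
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				reload()
			case <-done:
				return
			}
		}
	}()
	var once sync.Once
	// stop is safe to call more than once.
	return func() { once.Do(func() { close(done) }) }
}

func main() {
	// Hypothetical values: keys reload every 50ms, chains reload disabled.
	stopKeys := startHotReload(50*time.Millisecond, func() { fmt.Println("reloading keys") })
	stopChains := startHotReload(0, func() { fmt.Println("reloading chains") })

	time.Sleep(120 * time.Millisecond)
	stopKeys()
	stopChains()
}
```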
#### Dependencies:
1. Added [xsync](github.com/puzpuzpuz/xsync) for better performance than the native primitives for concurrent access.
2. Added [gojsonschema](github.com/xeipuuv/gojsonschema) to validate the format of the keys and chains name map.
3. Added [golang-set](github.com/deckarep/golang-set/v2) to manage sets of values faster than with arrays and in a concurrency-safe way.
4. Bump version github.com/alitto/pond from v1.8.1 to v1.8.3
5. Bump version github.com/prometheus/client_golang from v1.11.0 to v1.11.1
### Bump Version
New version: RC-0.3.0
### Docs
1. Updated mesh.md
2. Added missing links to dockerhub
3. Added Grafana dashboard to consume the new metrics.
4. Updated rpc-spec.yaml
Replaced the native Chains/Servicer HTTP client with fiber. Replaced encoding/json with a faster external library.
Enhanced the error metric so its status code reflects the code derived from the error on chain requests.
Go's default HTTP server/client does not perform as well as some custom implementations like Fiber, which in theory is 10x faster.
After rewriting it and deploying it on a live mesh, the only difference from the native implementation is that, under the same RPS, this code showed lower latency (fewer ms) than the native one.