diff --git a/docs/en/how_to/best-practices.mdx b/docs/en/how_to/best-practices.mdx
new file mode 100644
index 0000000..74aa922
--- /dev/null
+++ b/docs/en/how_to/best-practices.mdx
@@ -0,0 +1,833 @@
---
weight: 5
title: Best Practices
---

# Best Practices

## Overview

As the de facto standard for caching and key-value storage in cloud-native architectures, Redis handles core requirements for high-concurrency read/write operations and low latency. Running stateful Redis services in a Kubernetes containerized environment presents challenges distinct from traditional physical machine environments, including **persistence stability**, **dynamic network topology changes**, and **resource isolation and scheduling**.

This Best Practices document provides a standardized reference guide for Redis deployments in production environments. It covers full lifecycle management, from **architecture selection**, **resource planning**, and **client integration** to **observability and operations**. By following this guide, users can build an enterprise-class Redis data service that is **highly available (HA)**, **high-performance**, and **maintainable**.

## Architecture Selection

The Full Stack Cloud Native Open Platform offers two standard Redis management architectures based on customer business scale and SLA requirements:

### Sentinel Mode

**Positioning: Classic high-availability architecture, suitable for small to medium-scale businesses.**

Sentinel mode is based on Redis's native master-replica replication mechanism. By deploying independent Sentinel process groups to monitor the status of master and replica nodes, it automatically executes failover and notifies clients when the master node fails.
* **Pros**: Simple architecture, mature operations, lower requirements for client protocols.
* **Cons**: Write capacity is limited to a single node; storage capacity cannot scale horizontally.

### Cluster Mode

**Positioning: Distributed sharding architecture, suitable for large-scale, high-concurrency businesses.**

Cluster mode automatically shards data across multiple nodes using hash slots, enabling horizontal scaling (scale-out) of both storage capacity and read/write performance.
* **Pros**: Natively distributed, highly available storage; supports dynamic resharding.
* **Cons**: More complex client protocol; certain multi-key commands (e.g., `MGET`) are restricted by slot distribution.

### Selection Guide

When selecting a Redis architecture, consider business requirements for availability, scalability, and complexity.
| Feature | Sentinel Mode | Cluster Mode |
| :--- | :--- | :--- |
| **Scenarios** | Small/medium business, read-heavy/write-light, moderate data volume | Large business, high-concurrency reads/writes, massive data volume |
| **High Availability** | Via Sentinel monitoring and automatic failover | Via node auto-failure detection and recovery |
| **Scalability** | Vertical (scale-up); horizontal for read-only replicas | Horizontal for reads and writes; supports dynamic resharding |
| **Read/Write Separation** | Supported (client support required) | Supported (usually direct connection to shard masters; client support required) |
| **Data Sharding** | None (a single node stores the full dataset) | Yes (data is automatically sharded across multiple nodes) |
| **Ops Complexity** | Lower; simple architecture | Higher; involves sharding, hash slots, and slot migration |
| **Client Protocol** | Requires client support for the Sentinel protocol | Requires client support for the Cluster protocol |

**Recommendations:**
* If the data volume is small (fits in a single node's memory) and simplicity/stability is the priority, prefer **Sentinel Mode**.
* If the data volume is massive or write pressure is too high for a single node, choose **Cluster Mode**.

## Version Selection

Alauda Cache Service for Redis OSS currently supports the `5.0`, `6.0`, and `7.2` stable versions. All three versions have undergone complete automated testing and production verification.

**For new deployments, we strongly recommend choosing Redis `7.2`:**

1. **Lifecycle**
   * **`5.0` / `6.0`**: The community versions are End of Life (EOL) and no longer receive new features or security patches. Recommended only for compatibility with legacy applications.
   * **`7.2`**: As the current long-term support (LTS) version, it has the longest remaining lifecycle, ensuring operational stability and security updates for years to come.

2. **Compatibility**
   * Redis `7.2` maintains high compatibility with `5.0` and `6.0` data commands. Most business code can migrate smoothly without modification.
   * *Note*: The RDB persistence file format (v11) is not backward compatible (i.e., an RDB file generated by `7.2` cannot be loaded by `6.0`), but this does not affect new services.

3. **Key Features**
   * **ACL v2**: Provides granular access control (key-based permission selectors), significantly enhancing security in multi-tenant environments.
   * **Redis Functions**: Introduces a server-side scripting standard, resolving issues with Lua script loss and replication while keeping logic closer to the data.
   * **Sharded Pub/Sub**: Resolves the network storm issues caused by Pub/Sub broadcasting in Cluster mode, significantly improving messaging scalability via sharding.
   * **Performance Optimization**: Deep optimizations in data structures (especially Sorted Sets) and memory management provide higher throughput and lower latency.

> For more details on Redis 7.2 features, please refer to the official [Redis 7.2 Release Notes](https://github.com/redis/redis/blob/7.2/00-RELEASENOTES).

## Resource Planning

### Kernel Tuning

To ensure stability and high performance in production, the following kernel parameter optimizations are recommended at the Kubernetes node level (a sketch of how to apply them follows the list):

1. **Memory Allocation (`vm.overcommit_memory`)**
   * **Recommended**: `1`
   * **Explanation**: Setting this to `1` (always overcommit) ensures the kernel allows memory allocation during Redis fork operations (RDB snapshot/AOF rewrite), even if physical memory appears insufficient. This effectively prevents persistence failures due to allocation errors.

2. **Connection Queue (`net.core.somaxconn`)**
   * **Recommended**: `2048` or higher
   * **Explanation**: The Redis default `tcp-backlog` is 511. In high-concurrency scenarios, the system-level `net.core.somaxconn` should be increased to avoid dropping client connection requests.

3. **Transparent Huge Pages (THP)**
   * **Action**: **Disable** (`never`)
   * **Explanation**: THP causes significant latency spikes during memory allocation in Redis, especially during copy-on-write (CoW) after fork. It is recommended to disable it on the host or via startup scripts.
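
The snippet below is one minimal way to apply these settings on a node, assuming root access; in practice you would usually bake them into your node provisioning or node-tuning tooling, and the file name `/etc/sysctl.d/99-redis.conf` is only illustrative.

```bash
# Runtime changes (take effect immediately)
sysctl -w vm.overcommit_memory=1
sysctl -w net.core.somaxconn=2048
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Persist the sysctl values across reboots
cat <<'EOF' > /etc/sysctl.d/99-redis.conf
vm.overcommit_memory = 1
net.core.somaxconn = 2048
EOF
sysctl --system
```

Note that the THP setting is not a sysctl and resets on reboot, so it typically needs to be reapplied from a boot script or a privileged DaemonSet.
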
### Memory Specifications

Redis uses a snapshot mechanism to asynchronously persist in-memory data to disk for long-term storage. This keeps Redis fast, but data written between snapshots can be lost.

In Kubernetes containerized environments, we recommend a tiered memory management strategy:
* **✅ Standard Specs (< 8GB)**: **Strongly Recommended**. Ensures extremely low fork latency and fast failure recovery (RTO < 60s); the most robust production choice.
* **⚠️ High-Performance Specs (8GB - 16GB)**: **Acceptable**. Requires a high-performance host and **THP must be disabled**. Fork time is controllable but may cause ~100ms jitter under high load.
* **❌ High-Risk Specs (> 16GB)**: **Not Recommended**. The impact of a single point of failure is too large, and full synchronization can easily saturate network bandwidth. Recommend splitting horizontally into Cluster mode.

#### Why Limit to 8GB?

While single instances on physical machines often run 32GB+, the 8GB limit in cloud-native environments is based on the following core technical factors:

1. **Fork Blocking & Page Table Copy**
   * Redis calls `fork()` during RDB generation and AOF rewrite. Although memory pages are shared copy-on-write, the **process page tables must be fully copied**, which blocks the **main thread**.
   * *Estimation*: 10GB of memory ≈ 20MB of page tables ≈ 10~50ms of blocking (depending on virtualization overhead). Beyond 8GB the blocking risk grows sharply, impacting the SLA.

2. **Failure Recovery Efficiency (RTO)**
   * Loading an RDB file after a container restart is a **single-threaded, CPU-bound task** (object deserialization). Tests show loading 8GB of data takes **30-50s** (even with SSDs). A 32GB instance could take several minutes to start, contradicting the Kubernetes "fast self-healing" philosophy.

#### Memory Configuration Best Practices

To avoid OOM kills during persistence caused by memory expansion, strictly adhere to these principles:

1. **Set MaxMemory**: Do not set `maxmemory` to 100% of the container memory limit. Recommend setting it to **70% ~ 80%** of the limit.
2. **Reserve CoW Space**: Redis forks a child process during RDB generation and AOF rewrite. Under heavy write load, the OS copy-on-write mechanism duplicates memory pages; in extreme cases memory usage can double, e.g., from 8GB to 16GB.
3. **Overcommit Config**: Ensure the host has `vm.overcommit_memory = 1` so the kernel allows the fork without requiring equivalent physical memory up front (relying on CoW), preventing fork failures.

> [!INFO]
>
> **Resource Reservation Formula**: `Container_Memory_Limit` ≈ `Redis_MaxMemory` / 0.7
> * Example: To store 8GB of data, configure the container memory limit to 10GB ~ 12GB, leaving 2GB+ for CoW and fragmentation overhead.

### CPU Resources

Redis executes commands on a single main thread, but persistence (fork) and other background operations require child processes. Therefore, allocate **at least 2 cores** per Redis instance:
* **Core 1**: Handles main-thread requests and commands.
* **Core 2**: Handles the persistence fork, background tasks, and system overhead.

#### Multi-threading

Redis 6.0+ introduced multi-threaded I/O (disabled by default) to overcome the single-threaded network I/O bottleneck.

* **When to Enable?**
  * **Bottleneck Analysis**: When Redis CPU usage nears 100% and analysis shows the time is spent on kernel-space network I/O (system CPU) rather than user-space command execution.
  * **Traffic Profile**: Typically beneficial when single-instance QPS > 80,000 or network traffic is very high (> 1GB/s).
  * **Resource Conditions**: Ensure the node has sufficient CPU cores (at least 4).

* **Configuration Best Practices**:
  * **Thread Count**: Recommend 4~8 I/O threads. Exceeding 8 threads rarely yields significant gains.
  * **Config Example**:
    ```yaml
    io-threads 4
    io-threads-do-reads yes
    ```
  * **Note**: Multi-threaded I/O only improves network throughput; it does **NOT** improve the execution speed of a single complex command (e.g., `SORT`, `KEYS`).

### Storage Planning

#### Capacity Planning

The persistence mode directly determines the disk quota requirement. Refer to the following formulas:

| Mode | Recommended Quota Formula | Details |
| :--- | :--- | :--- |
| **Diskless (Cache)** | `0` (No PVC) | Used as a pure cache with no RDB/AOF. Logs are collected via stdout in Kubernetes, so no persistence disk is needed. |
| **RDB (Snapshot)** | `MaxMemory * 2` | RDB uses CoW. While a snapshot is being generated, both the "old snapshot" and the "new snapshot being written" exist on disk.<br />**Recommendation**: Reserve at least 2x the memory size. |
| **AOF (Append Only)** | `MaxMemory * 3` | The AOF grows with write operations. The default config (`auto-aof-rewrite-percentage 100`) triggers a rewrite when the AOF reaches **2x** the data size. The disk must hold:<br />1. The old AOF file (2x)<br />2. The new AOF file produced by the rewrite (1x)<br />**Peak total: 3x**. Recommend reserving at least 3x the memory size. |

For example, an instance with `maxmemory` of 8GB needs no PVC when diskless, at least a 16GB PVC with RDB, and at least a 24GB PVC with AOF.

#### Performance Requirements
* **With AOF**: Disk performance is critical. Insufficient IOPS or high fsync latency will directly block the main thread (when `appendfsync everysec` is used).
* **Media**: Production environments strongly recommend SSD/NVMe local disks or high-performance cloud disks.

### Parameter Configuration

Alauda Cache Service for Redis OSS parameters are specified via Custom Resource (CR) fields.

#### Built-in Templates

Alauda Cache Service for Redis OSS provides multiple parameter templates for different business scenarios. Selection depends on the trade-off between persistence (Diskless/AOF/RDB) and performance.

| Template Name | Description | Scenarios | Risks |
| :--- | :--- | :--- | :--- |
| **rdb-redis-<version>-<sentinel\|cluster>** | Enables RDB persistence; periodic snapshots to disk. | **Balanced**: Limited resources, balances performance and reliability, accepts minute-level data loss. | Data loss depends on the `save` config; usually a minute-level RPO. |
| **aof-redis-<version>-<sentinel\|cluster>** | Enables AOF persistence; logs every write operation. | **Secure**: Ample resources, high data security (second-level loss), slight performance compromise. | Frequent fsync requires high-performance storage; high I/O pressure. |
| **diskless-redis-<version>-<sentinel\|cluster>** | Disables persistence; pure in-memory. | **High-Performance Cache**: Acceleration only; data loss is acceptable or rebuildable from the source. | A restart or failure leads to **full data loss**. |

> `<version>` represents the Redis version, e.g., `6.0`, `7.2`.

Key parameter differences:

| Parameter | RDB Template | AOF Template | Diskless Template | Explanation |
| :--- | :--- | :--- | :--- | :--- |
| `appendonly` | `no` | `yes` | `no` | Enable AOF logging. |
| `save` | `60 10000 300 100 600 1` | `""` (Disabled) | `""` (Disabled) | RDB snapshot triggers. |
| `repl-diskless-sync` | `no` | `no` | `yes` | Master-replica full sync via socket without touching disk. |
| `repl-diskless-sync-delay` | `5` | `5` | `0` | Delay before diskless sync; 0 for the Diskless template to speed up sync. |

##### Persistence Selection Recommendations

1. **Pure Cache**: Choose the **Diskless template**. Data is rebuildable, there is no persistence overhead, and performance is best.
2. **General Business**: Choose the **RDB template**. Periodic snapshots provide a minute-level RPO with moderate resource usage.
3. **Financial/High-Reliability**: Choose the **AOF template** with `appendfsync everysec` for second-level protection.

> [!WARNING]
> **Should RDB + AOF be enabled simultaneously?**
>
> Redis supports running RDB and AOF together, but it is **generally not recommended** in Kubernetes:
> * **Performance**: AOF fsync already creates I/O pressure; adding RDB fork + disk writes significantly increases resource contention.
> * **Storage Doubling**: Requires space for both RDB snapshots and AOF files, complicating PVC planning.
> * **Recovery Priority**: Redis loads the AOF first on start (more complete data); the RDB acts only as a backup, offering limited benefit.
> * **Platform Backup**: Alauda Cache Service for Redis OSS provides independent automatic/manual backups, removing the reliance on RDB snapshots as extra insurance.
>
> **Recommendation**: Choose a **single persistence mode** (RDB or AOF) based on your needs, and use platform backups for disaster recovery. If mixed mode is truly necessary, ensure sufficient storage IOPS (SSD) and reserve disk space of 5x the data volume.
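
If you are unsure which persistence mode an existing instance is actually running, one quick check is to read the effective configuration from inside a data Pod. This is only a sketch: the namespace, Pod name, and password are placeholders you need to substitute (data Pods follow the platform's `rfr-`/`drc-` naming, and the Redis container is named `redis`).

```bash
# AOF enabled?  -> "appendonly" = yes/no
kubectl -n <namespace> exec <redis-pod> -c redis -- redis-cli -a <password> CONFIG GET appendonly

# RDB snapshot schedule -> an empty "save" value means RDB snapshots are disabled
kubectl -n <namespace> exec <redis-pod> -c redis -- redis-cli -a <password> CONFIG GET save
```
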
+ +#### Parameter Update + +Redis parameters are categorized by application method: + +| Category | Parameters | Behavior | +| :--- | :--- | :--- | +| **Hot Update** | Most runtime params (`maxmemory`, `loglevel`, etc.) | **Immediate effect** after modification, no restart. | +| **Restart Update** | `databases`, `rename-command`, `rdbchecksum`, `tcp-backlog`, `io-threads`, `io-threads-do-reads` | Requires **Instance Restart** to take effect. | +| **Immutable** | `bind`, `protected-mode`, `port`, `supervised`, `pidfile`, `dir`, etc. | Managed by system, modification may cause anomalies. | + +> [!TIP] +> Always assume data backup before modifying parameters requiring restart. + +#### Modification Examples + +**Update Data Node Parameters**: Configure via `spec.customConfig`. + +```bash +# Example: Modify save strategy (Hot update) +kubectl -n patch redis --type=merge --patch='{"spec": {"customConfig": {"save":"600 1"}}}' +``` + +**Update Sentinel Node Parameters**: Configure via `spec.sentinel.monitorConfig`. +> Currently supports `down-after-milliseconds`, `failover-timeout`, `parallel-syncs`. + +```bash +# Example: Modify failover timeout +kubectl -n patch redis --type=merge --patch='{"spec": {"sentinel": {"monitorConfig": {"down-after-milliseconds":"30000"}}}}' +``` + +### Resource Specs + +Deploy resources according to your actual business scenario. + +#### Sentinel Mode Specs + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Persistence | Template | Instance Spec | Replica / Sentinel | Sentinel Pod | redis-exporter | redis (Spec) | Backup Pod | Total Resources | Storage Quota | Auto Backup (Keep 7) | Manual Backup (Keep 7) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| AOF | `aof-redis-<version>-sentinel` | 2c4g | 1 / 3 | 100m128Mi | 100m200Mi | 2c4g | Unlimited (Reserve resources) | 4.5c4.8G | Evaluate based on actual write volume | Evaluate based on actual write volume | Evaluate based on actual write volume |
| AOF | `aof-redis-<version>-sentinel` | 4c8g | 1 / 3 | 100m128Mi | 100m200Mi | 4c8g | Unlimited (Reserve resources) | 8.5c8.8G | Evaluate based on actual write volume | Evaluate based on actual write volume | Evaluate based on actual write volume |
| RDB | `rdb-redis-<version>-sentinel` | 2c4g | 1 / 3 | 100m128Mi | 100m200Mi | 2c4g | Unlimited (Reserve resources) | 4.5c4.8G | 8G | 28G | 28G |
| RDB | `rdb-redis-<version>-sentinel` | 4c8g | 1 / 3 | 100m128Mi | 100m200Mi | 4c8g | Unlimited (Reserve resources) | 8.5c8.8G | 16G | 56G | 56G |
| Diskless | `diskless-redis-<version>-sentinel` | 2c4g | 1 / 3 | 100m128Mi | 100m200Mi | 2c4g | Unlimited (Reserve resources) | 4.5c4.8G | / | 28G | 28G |
| Diskless | `diskless-redis-<version>-sentinel` | 4c8g | 1 / 3 | 100m128Mi | 100m200Mi | 4c8g | Unlimited (Reserve resources) | 8.5c8.8G | / | 56G | 56G |

#### Cluster Mode Specs
| Persistence | Template | Instance Spec | Sharding / Replica | redis-exporter | redis (Spec) | Backup Pod | Total Resources | Storage Quota | Auto Backup (Keep 7) | Manual Backup (Keep 7) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| AOF | `aof-redis-<version>-cluster` | 2c4g | 3 / 1 | 100m300Mi | 2c4g | Unlimited (Reserve resources) | 12.6c25.8G | Evaluate based on actual write volume | Evaluate based on actual write volume | Evaluate based on actual write volume |
| AOF | `aof-redis-<version>-cluster` | 4c8g | 3 / 1 | 100m300Mi | 4c8g | Unlimited (Reserve resources) | 24.6c49.8G | Evaluate based on actual write volume | Evaluate based on actual write volume | Evaluate based on actual write volume |
| RDB | `rdb-redis-<version>-cluster` | 2c4g | 3 / 1 | 100m300Mi | 2c4g | Unlimited (Reserve resources) | 12.6c25.8G | 24G | 84G | 84G |
| RDB | `rdb-redis-<version>-cluster` | 4c8g | 3 / 1 | 100m300Mi | 4c8g | Unlimited (Reserve resources) | 24.6c49.8G | 48G | 168G | 168G |
| Diskless | `diskless-redis-<version>-cluster` | 2c4g | 3 / 1 | 100m300Mi | 2c4g | Unlimited (Reserve resources) | 12.6c25.8G | / | 84G | 84G |
| Diskless | `diskless-redis-<version>-cluster` | 4c8g | 3 / 1 | 100m300Mi | 4c8g | Unlimited (Reserve resources) | 24.6c49.8G | / | 168G | 168G |

> `<version>` represents the Redis version, e.g., `6.0`, `7.2`.

## Scheduling

Alauda Cache Service for Redis OSS offers flexible scheduling strategies, supporting node selection, taint toleration, and various anti-affinity configurations to meet high availability needs in different resource environments.

### Node Selection

You can use the `spec.nodeSelector` field to specify which nodes Redis Pods should be scheduled on. This is typically used with Kubernetes node labels to isolate database workloads onto dedicated node pools.

> [!WARNING]
> **Persistence Limitation**: If your Redis instance mounts **non-network storage** (e.g., Local PV) PVCs, be cautious when updating `nodeSelector`. Since local data resides on specific nodes and cannot migrate with the Pods, the updated `nodeSelector` **MUST include the node where the Pod currently resides**. If the original node is excluded, the Pod will fail to access its data or fail to start. Network storage (Ceph RBD, NFS) follows the Pod and is not subject to this restriction.

### Taint Toleration

Use `spec.tolerations` to allow Redis Pods to tolerate node taints. This allows deploying Redis on dedicated nodes with specific taints (e.g., `key=redis:NoSchedule`), preventing other non-critical workloads from preempting resources.

### Anti-Affinity

To prevent single points of failure, Alauda Cache Service for Redis OSS provides anti-affinity configuration. The configuration differs by architecture mode.

> [!CAUTION]
> **Immutable**: To ensure consistency and reliability, anti-affinity configurations (both `affinityPolicy` and `affinity`) **cannot be modified** after instance creation. Please plan ahead.

#### Cluster Mode

In Cluster mode, the system **prioritizes `spec.affinityPolicy`**. Alauda Cache Service for Redis OSS uses this enum to abstract complex topology rules, automatically generating affinity rules for each shard's StatefulSet.

* **Priority**: `spec.affinityPolicy` > `spec.affinity`.
* **If `affinityPolicy` is unset**: Alauda Cache Service for Redis OSS falls back to `spec.affinity`. If you need custom topology rules beyond the enums below, leave `affinityPolicy` empty and configure native `spec.affinity`.

| Policy Name | `affinityPolicy` Value | Behavior | Pros / Cons | Scenario |
| :--- | :--- | :--- | :--- | :--- |
| **All Pods Forced Anti-Affinity** | `AntiAffinity` | Forces ALL Pods in the cluster (including primaries/replicas of different shards) to be on different nodes. Fails if node count < total Pod count. | **Pros**: Highest disaster recovery, minimal single-node failure impact.<br />**Cons**: Extremely high resource requirement; node count must be >= total Pods. | **Cluster Mode Core Business**<br />Ample resources, strict HA requirements. |
| **Shard Primary-Replica Forced Anti-Affinity** | `AntiAffinityInSharding` | Forces the primary and replicas within the same shard to be on different nodes. Pods from different shards can coexist. | **Pros**: Guarantees physical isolation of data replicas, preventing data loss during shard migration.<br />**Cons**: Scheduling fails if live nodes < replica count. Primaries of different shards might land on the same node (single point of failure risk). | **Production Standard**<br />Balances resource usage and data safety. |
| **Shard Primary-Replica Soft Anti-Affinity** | `SoftAntiAffinity` | Prioritizes spreading a shard's primary/replicas. If impossible (e.g., insufficient nodes), allows scheduling on the same node. | **Pros**: Highest deployment success rate, runs with limited resources.<br />**Cons**: Primary/replica may share a node in extreme cases, risking data loss. | **Test/Dev Environments**<br />Or resource-constrained edge environments. |
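
To tie the scheduling options together, the fragment below sketches how the fields discussed in this section might sit in an instance's `spec`. It is illustrative only: the node label and taint key/value are placeholders for whatever convention your clusters use, and `affinityPolicy` applies to Cluster mode (remember that anti-affinity settings cannot be changed after creation).

```yaml
spec:
  # Pin Redis Pods to a dedicated node pool (label is an example).
  nodeSelector:
    node-role.kubernetes.io/redis: ""
  # Tolerate the taint that keeps other workloads off those nodes (taint is an example).
  tolerations:
    - key: redis
      operator: Equal
      value: "true"
      effect: NoSchedule
  # Recommended production default for Cluster mode: the primary and replicas of the
  # same shard are forced onto different nodes.
  affinityPolicy: AntiAffinityInSharding
```
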
+ +#### Sentinel Mode + +> **Important** +> **Sentinel Mode does not support `spec.affinityPolicy`**. + +For Sentinel mode, Redis Data Nodes and Sentinel Nodes require separate Kubernetes native Affinity rules: + +* **Redis Data Nodes**: Configured via **`spec.affinity`**. +* **Sentinel Nodes**: Configured via **`spec.sentinel.affinity`**. + +You need to manually write complete `Affinity` rules. Example for forcing anti-affinity for both Data and Sentinel nodes: + +```yaml +spec: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: app.kubernetes.io/component + operator: In + values: + - redis + - key: redisfailovers.databases.spotahome.com/name + operator: In + values: + - + topologyKey: kubernetes.io/hostname + sentinel: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: app.kubernetes.io/component + operator: In + values: + - sentinel + - key: redissentinels.databases.spotahome.com/name + operator: In + values: + - + topologyKey: kubernetes.io/hostname +``` + +To force anti-affinity across ALL nodes (Data + Sentinel), refer to: + +```yaml +spec: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: middleware.instance/type + operator: In + values: + - redis-failover + - key: middleware.instance/name + operator: In + values: + - + topologyKey: kubernetes.io/hostname + sentinel: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: middleware.instance/type + operator: In + values: + - redis-failover + - key: middleware.instance/name + operator: In + values: + - + topologyKey: kubernetes.io/hostname +``` + +## User Management + +Alauda Cache Service for Redis OSS (v6.0+) provides declarative user management via `RedisUser` CRD, supporting ACLs. + +> [!TIP] +> **Compatibility**: Redis 5.0 only supports single-user auth; Redis 6.0+ implements full ACLs for multi-user/granular control. + +### Permission Profiles + +The platform pre-defines permission profiles for common scenarios: + +| Profile | ACL Rule | Explanation | +| :--- | :--- | :--- | +| **NotDangerous** | `+@all -@dangerous ~*` | Allows all commands except dangerous ones (e.g., `FLUSHDB`). | +| **ReadWrite** | `-@all +@write +@read -@dangerous ~*` | Allows read/write, blocks dangerous ops. | +| **ReadOnly** | `-@all +@read -keys ~*` | Allows read-only operations. | +| **Administrator** | `+@all -acl ~*` | Admin privileges, allows all commands except ACL management. | + +For custom ACLs, see [Redis ACL Documentation](https://redis.io/topics/acl). + +### Security Mechanisms + +1. **ACL Force Revocation**: All `RedisUser` creation/updates undergo Webhook validation to **force remove `acl` permissions**, preventing privilege escalation. +2. **Cluster Command Injection**: For **Cluster Mode**, Alauda Cache Service for Redis OSS automatically injects topology commands: `cluster|slots`, `cluster|nodes`, `cluster|info`, `cluster|keyslot`, `cluster|getkeysinslot`, `cluster|countkeysinslot` to ensure client awareness. +3. **6.0 -> 7.2 Upgrade Compatibility**: When upgrading 6.0 -> 7.2, the operator adds `&*` (Pub/Sub Channel) permission to ensure consistency with 7.x's new Channel ACLs. + +### System Account + +Each Redis instance automatically generates a system account named `operator`. Its roles include: + +1. 
**Cluster Init**: Slot assignment, node joining. +2. **Config Simplification**: Unified system account reduces user configuration complexity. +3. **Operations**: Used for health checks, failovers, scaling. +4. **Avoid Restarts**: Password updates for business users don't affect this account, avoiding restarts. + +> [!CAUTION] +> `operator` is a **Reserved System Account**: +> * **Complexity**: Random 64-char string (alphanumeric+special). +> * **Privilege**: Highest level (includes user management). +> * **Restriction**: **No online password update** and **DO NOT manually modify/delete**, as it may cause irreversible failure. + +### Production Best Practices + +1. **App Isolation**: Create **independent user accounts** for each app/microservice. Avoid sharing accounts to enable auditing and isolation. +2. **Principle of Least Privilege**: + * **Read-Only App**: Use `ReadOnly`. + * **Read-Write App**: Use `ReadWrite`. + * **Ops Tools**: Use `NotDangerous` or custom permissions. + * **Avoid `Administrator`**: Unless absolutely necessary. +3. **Key Namespace Isolation**: Combine ACL Key patterns (e.g., `~app1:*`) to restrict apps to specific key prefixes. +4. **Password Rotation**: Establish mechanisms to regularly rotate app passwords. + +For operation steps, see [User Management Docs](../functions/20-user.mdx). + +## Client Access + +### Topology Discovery + +Both **Sentinel** and **Cluster** modes rely on clients actively discovering and connecting to data nodes, differing from traditional LB proxy modes: + +#### Sentinel Mode + +1. Client connects to **Sentinel Node**. +2. Client sends `SENTINEL get-master-addr-by-name mymaster` to get Master **IP/Port**. +3. Client **directly connects** to Master. +4. On failover, Sentinel notifies client (or client polls) to switch to new Master. + +#### Cluster Mode + +1. Client connects to any **Cluster Node**. +2. Sends `CLUSTER SLOTS` / `CLUSTER NODES` to get **Slot Distribution**. +3. Calculates hash slot for Key and **directly connects** to target node. +4. If slot migrates, node returns `MOVED`/`ASK`; client must refresh topology. + +Both protocols return **Real Node IPs**. If a reverse proxy (HAProxy/Nginx) is used, clients still get backend real IPs, which may be unreachable from outside the cluster. +Thus, **Each Redis Pod needs an independent external address** (NodePort/LoadBalancer), not a single proxy address. + +### Network Access Strategies + +Alauda Cache Service for Redis OSS supports multiple access methods: + +#### Sentinel Mode + +| Method | Recommended | Description | +| :--- | :--- | :--- | +| **ClusterIP** | ✅ **Internal Preferred** | Access Sentinel via K8s Service (port 26379). Clients auto-discover Master. Lowest latency, highest security. | +| **LoadBalancer** | ✅ **External Preferred** | Exposes Sentinel via MetalLB/Cloud LB. Stable external entry, no port management. | +| **NodePort** | ⚠️ External Backup | Exposes Sentinel via Node ports. Requires manual port management, risky, potential multi-NIC binding issues. | + +#### Cluster Mode + +| Method | Recommended | Description | +| :--- | :--- | :--- | +| **ClusterIP** | ✅ **Internal Preferred** | Access via K8s Service. Client must support Cluster protocol. | +| **LoadBalancer** | ✅ **External Preferred** | Configure LB for each shard Master. Stable external access. Client must handle MOVED/ASK. | +| **NodePort** | ⚠️ External Backup | Expose underlying Pod NodePorts. Client connects directly. Complex port management. 

> [!WARNING]
> **NodePort Notes**:
> * **Port Management**: The port range is limited (30000-32767), so conflicts are likely when running multiple instances.
> * **Security**: Increases the attack surface.
> * **Multi-NIC**: Redis binds to the default NIC; clients may fail to connect if the advertised IPs do not match.
> * **No LB Proxy**: The Sentinel/Cluster protocols require direct node connections and cannot be proxied by standard load balancers.

> [!INFO]
> **Resource Usage**: LB/NodePort access creates a **Service per Pod**.
> * **Sentinel** (1P1R + 3 Sentinels): Needs **8 NodePorts/LBs**.
> * **Cluster** (3 Shards x 1P1R): Needs **7 NodePorts/LBs**.

### Code Examples

We provide best practice examples for **go-redis**, **Jedis**, **Lettuce**, and **Redisson**:

* **Sentinel Access**: [How to Access Sentinel Instance](./access/10-sentinel.mdx)
* **Cluster Access**: [How to Access Cluster Instance](./access/20-cluster.mdx)

> [!INFO]
> **Master Group Name**: In Sentinel mode, the master name is fixed to `mymaster`.

### Client Reliability Best Practices

1. **Timeouts**
   * **Connect Timeout**: Keep it distinct from the read timeout. Recommend 1-3s.
   * **Read/Write Timeout**: Based on your SLA, usually hundreds of milliseconds.

2. **Retry Strategy**
   * **Exponential Backoff**: Do not retry immediately on failure; use backoff (100ms, 200ms, ...) to avoid retry storms.

3. **Connection Pooling**
   * **Reuse**: Always use pooling (JedisPool, the go-redis pool) to save handshake costs.
   * **Max Connections**: Set `MaxTotal` (or the equivalent pool size) reasonably to avoid hitting the Redis `maxclients` limit.

4. **Topology Refresh (Cluster)**
   * **Auto-refresh**: Ensure the client handles `MOVED`/`ASK` redirects.
   * **Periodic refresh**: In unstable or scaling environments, configure periodic topology refresh (e.g., every 60s) to proactively detect changes.
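
The sketch below pulls these recommendations together for go-redis (v9) against a Sentinel-mode instance. The addresses, password, and the specific timeout/pool/retry numbers are placeholders to be tuned against your own SLA, not platform defaults.

```go
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",                           // fixed master group name in Sentinel mode
		SentinelAddrs: []string{"<sentinel-address>:26379"}, // placeholder Sentinel endpoint
		Password:      "<password>",

		// Timeouts: keep the connect timeout separate from read/write timeouts.
		DialTimeout:  2 * time.Second,
		ReadTimeout:  500 * time.Millisecond,
		WriteTimeout: 500 * time.Millisecond,

		// Connection pool: keep total connections well below the server's maxclients.
		PoolSize:     50,
		MinIdleConns: 5,

		// Retries with exponential backoff to avoid retry storms.
		MaxRetries:      3,
		MinRetryBackoff: 100 * time.Millisecond,
		MaxRetryBackoff: 800 * time.Millisecond,
	})
	defer rdb.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := rdb.Ping(ctx).Err(); err != nil {
		panic(err)
	}
}
```

For Cluster mode, `redis.NewClusterClient` with `redis.ClusterOptions` exposes the same timeout, pool, and retry fields, and go-redis refreshes the slot topology automatically when it receives `MOVED`/`ASK` redirects.
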
+* **Unit**: Count +* **Expr**: + ```text + sum by(namespace,service) (changes((sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",role="master"}) OR (sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",}) * 0))[5m:10s])) + ``` + +##### Instance Status + +* **Desc**: Health status. 0 = Abnormal. +* **Expr**: + ```text + ((count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster"}) % count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"})) == bool 0 and count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"}) >= bool 3) or (count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"rfr-({{.name}})",redisarch="sentinel",role="master"})) > bool 0 + ``` + +##### Node Input Bandwidth + +* **Desc**: Peak ingress traffic. +* **Unit**: Bps +* **Expr**: + ```text + max by(namespace,service)(irate(redis_net_input_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) + ``` + +##### Node Output Bandwidth + +* **Desc**: Peak egress traffic. +* **Unit**: Bps +* **Expr**: + ```text + max by(namespace,service)(irate(redis_net_output_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) + ``` + +##### Node Connections + +* **Desc**: Peak client connections. Watch if near `maxclients`. +* **Unit**: Count +* **Expr**: + ```text + max by(namespace,service)(redis_connected_clients{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*"}) + ``` + +##### CPU Usage + +* **Desc**: Node CPU usage. Sustained high = perf impact. +* **Unit**: % +* **Expr**: + ```text + avg by(namespace,pod_name)(irate(container_cpu_usage_seconds_total{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}[5m]))/avg by(namespace,pod_name)(container_spec_cpu_quota{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"})*100000 + ``` + +##### Memory Usage + +* **Desc**: Node memory usage. >80% suggest scaling. +* **Unit**: % +* **Expr**: + ```text + avg by(namespace,pod_name)(container_memory_usage_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"} - container_memory_cache{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}) / avg by(namespace,pod_name)(container_spec_memory_limit_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}) + ``` + +##### Storage Usage + +* **Desc**: PVC usage. Full = persistence failure. +* **Unit**: % +* **Expr**: + ```text + avg(kubelet_volume_stats_used_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim) / avg(kubelet_volume_stats_capacity_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim) + ``` + +#### Key Metrics & Alert Recommendations + +Recommended production alerts: + +| Metric | Threshold | Note | +| :--- | :--- | :--- | +| **Memory Usage** | > 80% | Risk of eviction/OOM. | +| **CPU Usage** | > 80% (Sustained) | Latency spikes. | +| **Hit Rate** | < 80% | Strategy issue or capacity missing. | +| **Failovers** | > 0 | Check network/node health. 
| **Connections** | Near `maxclients` | New connections rejected. |
| **Storage Usage** | > 80% | Ensure space for AOF/RDB. |
| **Response Time** | > 10ms | Slow queries/bottlenecks. |

### Troubleshooting

For specific issues, search the [Customer Portal](https://cloud.alauda.cn/kb).

## References

* [High availability with Redis Sentinel](https://redis.io/docs/latest/operate/oss_and_stack/management/sentinel/)
* [Redis cluster specification](https://redis.io/docs/latest/operate/oss_and_stack/reference/cluster-spec/)
* [Redis persistence](https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/)
* [Scale with Redis Cluster](https://redis.io/docs/latest/operate/oss_and_stack/management/scaling/)
* [Optimizing Redis](https://redis.io/docs/latest/operate/oss_and_stack/management/optimization/)