From 1171f3458e1a3c2211155e9d136e105dc9e4f431 Mon Sep 17 00:00:00 2001 From: hfxia Date: Tue, 9 Dec 2025 16:14:58 +0800 Subject: [PATCH 1/9] backup --- docs/en/how_to/90-redis-references.mdx | 401 +++++++++++++++++++++++++ 1 file changed, 401 insertions(+) create mode 100644 docs/en/how_to/90-redis-references.mdx diff --git a/docs/en/how_to/90-redis-references.mdx b/docs/en/how_to/90-redis-references.mdx new file mode 100644 index 0000000..5f37e50 --- /dev/null +++ b/docs/en/how_to/90-redis-references.mdx @@ -0,0 +1,401 @@ +# Redis 最佳实践 + +# 1. 概述 (Overview) + +Redis 作为云原生架构中事实标准的缓存与键值存储组件,承载着业务高并发读写与低延迟的核心需求。在 Kubernetes 容器化环境下运行有状态(Stateful)的 Redis 服务,面临着与传统物理机环境截然不同的挑战,包括**存储持久化的稳定性**、**网络拓扑的动态变化**以及**资源隔离与调度**等问题。 + +本最佳实践文档旨在为生产环境下的 Redis 部署提供一套标准化的参考指南。涵盖从**架构选型**、**资源规划**、**客户端对接**到**可观测性运维**的全生命周期管理规范。通过遵循本指南,旨在帮助用户构建一个**高可用 (High Availability)**、**高性能 (High Performance)** 且**易维护 (Maintainability)** 的企业级 Redis 数据服务。 + +# 2. 架构设计 (Architecture Design) + +全栈云原生开放平台根据客户业务规模与 SLA 需求,提供两种标准的 Redis 管理架构: + +## 2.1 高可用架构:哨兵模式 (HA: Sentinel) +**定位:经典高可用架构,适用于中小规模业务。** + +哨兵模式基于 Redis 原生的主从复制机制。通过部署独立的 Sentinel 进程组监控主从节点状态,在主节点故障时自动执行 Failover(故障转移)并通知客户端。 +* **优点**:架构简单,运维成熟,对客户端协议要求较低。 +* **局限**:写能力受限于单节点,无法水平扩展存储容量。 + +## 2.2 分布式架构:集群模式 (Distributed: Cluster) +**定位:分布式分片架构,适用于大规模高并发业务。** + +集群模式通过 Hash Slot(哈希槽)将数据自动分片到多个节点,实现存储容量和读写性能的水平扩展(Scale-out)。 +* **优点**:真正的高可用分布式存储,支持动态扩缩容(Resharding)。 +* **局限**:客户端协议复杂,特定多 Key 命令(如 `MGET`)受限于 Slot 分布。 + +## 2.3 选型指南 (Selection Guide) + +在选择 Redis 架构时,需要考虑业务对可用性、扩展性和复杂度的要求。 + +| 特性 | 哨兵模式 (Sentinel) | 集群模式 (Cluster) | +| :--- | :--- | :--- | +| **适用场景** | 中小型业务,读多写少,数据量适中 | 大型业务,高并发读写,海量数据 | +| **高可用性** | 通过哨兵监控和自动故障转移实现 | 通过节点自动故障检测和恢复实现 | +| **扩展性** | 垂直扩展(升级规格),水平扩展仅限读能力 | 水平扩展(读写能力),支持动态扩缩容 | +| **读写分离** | 支持(需要客户端配合或中间件) | 不支持(通常直接连接分片主节点) | +| **数据分片** | 无(单节点存储全量数据) | 有(数据自动分片到多个节点) | +| **运维复杂度** | 较低,架构简单 | 较高,涉及分片、哈希槽、数据迁移等 | +| **网络限制** | 需要客户端支持 Sentinel 协议 | 需要客户端支持 Cluster 协议 | + +**建议:** +* 如果数据量较小(单节点内存可容纳),且主要追求简单稳定,**哨兵模式**是首选。 +* 如果数据量巨大或写入压力极高,单节点无法支撑,选择**集群模式**。 + +# 3. 
部署配置 (Deployment Configuration) + +## 3.1 实例部署 (Deployment) + +在 **Data Services** 页面,单击 **Redis** ,选择您的 项目 及 命名空间 后,单击 创建 **Redis** 实例 ,根据实际业务场 +景配置实例,推荐您使用最新的 Redis 6.0 版本。 + +## 3.2 参数模板选择 (Parameter Templates) + +Redis 实例的参数配置通过 Custom Resource (CR) 的字段进行指定。 + +* **数据节点参数**:通过 `spec.customConfig` 字段配置。例如设置 `save` 策略: + ```yaml + spec: + customConfig: + save: "600 1" + ``` +* **哨兵节点参数**(仅哨兵模式):通过 `spec.sentinel.monitorConfig` 字段配置。支持 `down-after-milliseconds`, `failover-timeout`, `parallel-syncs` 等关键参数。 + +选择模板时,应根据业务对持久化(AOF vs RDB)和性能的权衡来决定。 + +## 3.3 资源规格配置 (Resource Specs) + +您可以根据业务场景的实际情况,部署资源规格。 + +| 集群架构 | 持久化 | 模板 | 实例规格 | Replica | Sentinel | Sharding | Sentinel容器组 | redis-exporter | redis (实例规格) | 备份容器 (backup) | 资源总量 | 实例存储配额 | 自动备份 (保留7份) | 手动备份 (保留7份) | +|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| +| Redis Sentinel | AOF | aof-redis-6.0-sentinel | 2c4g | 1 | 3 | / | 100m128Mi | 100m200Mi | 2c4g | 不限制(需要预留资源) | 4.5c4.8g | 需要根据实际写入量评估 | | | +| | | aof-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.8g | | | | +| | RDB | rdb-redis-6.0-sentinel | 2c4g | | | | | | 2c4g | | 4.5c4.8g | 8G | 28G | 28G | +| | | rdb-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.64g | 16G | 56G | 56G | +| | Diskless | diskless-redis-6.0-sentinel | 2c4g | | | | | | 2c4g | | 4.5c4.8g | / | 28G | 28G | +| | | diskless-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.8g | | 56G | 56G | +| Redis Cluster | AOF | aof-redis-6.0-cluster | 2c4g | / | 3 | / | | 100m300Mi | 2c4g | | 12.6c25.8g | 需要根据实际写入量评估 | | | +| | | aof-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | | | | +| | RDB | rdb-redis-6.0-cluster | 2c4g | | | | | | 2c4g | | 12.6c25.8g | 24G | 84G | 84G | +| | | rdb-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | 48G | 168G | 168G | +| | Diskless | diskless-redis-6.0-cluster | 2c4g | | | | | | 2c4g | | 12.6c25.8g | / | 84G | 84G | +| | | diskless-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | | 168G | 168G | + +## 3.4 调度策略配置 (Scheduling) + +Redis 集群模式提供三种推荐反亲和策略,所有 Pod 强制反亲和、分片内主从强制反亲和以及分片内主从尽量 +反亲和。 + +**所有 Pod 强制反亲和** + +Redis Cluster 实例会将 Redis 强制分布到不同的节点上。如图所示,假设有三个节点,每个节点上可以 +调度三个 Pod。然而,在本例中,Redis Cluster 策略被设置为强制 Pod 反亲和性。由于这个设置,当尝 +试在相同节点上调度不同的 Redis 分片副本时,部署失败。 + +在可视化图中,三个主节点(master)分布在三个不同的节点上,每个节点有三个可用的 Pod 位置。同 +时,三个从节点(replica)都无法被调度,因为它们与已存在的主节点存在反亲和性约束,会导致部署 +失败。 + +**分片内主从强制反亲和** + +分片内主从强制反亲和策略会将同一个分片的 Redis 节点分布到不同的节点上,确保即使某个节点出现 +故障,数据仍然具有完整性。 + +如图所示,假设有三个节点,每个节点可以调度三个 Pod。在 Node2 和 Node3 上的 Pod 位置已经被占 +满,而 Node1 上仍有空余的 Pod 资源。然而,由于 Redis shard-2 主节点(master)位于 Node1 上,并 +且分片内主从节点被设置为强制反亲和性,Redis shard-2 从节点(replica)不能被调度到 Node1 上。这 +导致 Redis shard-2 从节点无法找到可用的节点进行调度,从而导致部署失败。 + +**分片内主从尽量反亲和** + +在 Redis Cluster 中,为了提高数据的可靠性和容错能力,我们可以采用“分片内主从尽量反亲和”的策 +略。这意味着同一个分片内的主从节点尽量被调度到不同的节点上。这样,即使某个节点出现故障,数 +据仍然具有完整性。 + +采用“分片内主从尽量反亲和”的策略时,调度器会优先尝试将主从节点部署在不同的节点上。然而,如果 +无法找到符合条件的节点,调度器仍然会允许将主从节点部署在同一个节点上。这种策略在节点资源有 +限的情况下,可以确保 Redis 集群的正常部署和运行,同时在尽量遵循反亲和原则的基础上,提高数据 +的可靠性。 + +如图,Node1 在资源不足的情况下部署了 redis shard-2 的主从节点。 + +您可以根据下表,对您的实际业务情景设置集群模式 Redis 的反亲和策略。 + +|所有 Pod 之间强制要求在
不同节点上部署,不允许在
同一节点上部署。|可以最大程度地
保证负载均衡和
系统的高可用
性。|节点资源有限时可
能导致部署失败。|保障
自愈| +|---|---|---|---| +|同一个分片内的主从节点必
须部署在不同节点上。|保证分片内主从
节点故障隔离,
提高数据可靠
性。|节点资源有限时可
能导致部署失败。|有可
能自
愈| + +在哨兵模式下,默认采用分片内主从尽量反亲和策略。这意味着同一个分片内的主从节点会尽量部署在不同的 +节点上,但在资源不足的情况下,也允许主从节点部署在同一个节点上。采用这种策略的目的是为了在保证数 +据可靠性的同时,适应有限的节点资源。 + +根据上文描述,分片内主从尽量反亲和策略在资源有限的情况下,可以在保证集群正常部署的基础上,尽可能 +提高数据可靠性。这种策略在某些情况下可以实现单点故障的自愈。然而,在同一个节点上部署主从节点时, +数据可靠性相对较低。因此,在实际应用中,我们需要根据实际情况和需求来选择合适的策略,以实现 Redis +高可用性和故障自动切换的目标。 + +# 4. 资源规划 (Resource Planning) + +## 4.1 内存规格 (Memory) + +Redis 采用快照机制将内存中的数据异步复制到磁盘上进行长期存储。这种机制使得 Redis 保持了高性能,但 +也存在在快照之间可能丢失数据的风险。 + +在 Kubernetes 容器化环境下,我们建议采用分级内存管理策略: +* **✅ 标准规格 (< 8GB)**:**强烈推荐**。能够保证极低的 Fork 延迟和快速的故障恢复(RTO < 60s),是最稳健的生产选择。 +* **⚠️ 高性能规格 (8GB - 16GB)**:**可接受**。需配合高性能宿主机且**必须关闭 THP**。在此规格下,Fork 虽然可控,但在高负载时仍可能造成 100ms 级别的抖动。 +* **❌ 高风险规格 (> 16GB)**:**不推荐**。单点故障影响面过大,且全量同步极易打满网络带宽。建议水平拆分为 Cluster 模式。 + +### 为什么限制 8GB?(技术深究) +虽然物理机时代单实例常运行 32GB+,但在云原生环境下,8GB 限制是基于以下核心技术原理的“黄金法则”: + +1. **Fork 阻塞与页表复制 (The Fork Blocking)** + * Redis 执行 RDB/AOF Rewrite 时调用 `fork()`。虽然内存页是 CoW(写时复制),但**进程页表 (Page Table) 必须全量拷贝**,此过程会**阻塞主线程**。 + * *估算公式*:10GB 内存 ≈ 20MB 页表 ≈ 10~50ms 阻塞(视虚拟化层开销而定)。一旦超过 8GB,分钟级的阻塞风险将呈指数级上升,严重影响 SLA。 + +2. **故障恢复效率 (RTO)** + * 容器重启加载 RDB 是**单线程 CPU 密集型任务**(对象反序列化)。测试表明,受限于 CPU 解析效率,加载 8GB 数据通常需 **30-50秒**(即使使用 SSD)。若维持在 32GB,重启时间可能长达数分钟,这与 Kubernetes "快速自愈" 的设计理念背道而驰。 + +### 内存配置最佳实践 (MaxMemory vs Limit) + +为了避免 Redis 在持久化期间因内存膨胀而导致 OOM(OOM Kill),需严格遵循以下配置原则: + +1. **设置 MaxMemory**: 不要将 `maxmemory` 设置为容器内存 Limit 的 100%。建议设置为 Limit 的 **70% ~ 80%**。 +2. **预留 CoW 空间**: Redis 在 RDB/AOF Rewrite 时会 Fork 子进程。若此时有大量写入更新,OS 的写时复制 (Copy-on-Write) 机制会导致内存页被复制,极端情况下内存占用可能由 8GB 翻倍至 16GB。 +3. **Overcommit 配置**: 确保宿主机开启 `vm.overcommit_memory = 1`,允许内核在 fork 时无需申请等量物理内存(依赖 CoW),防止 fork 失败。 + +> [!IMPORTANT] +> **资源预留公式**: `Container_Memory_Limit` ≈ `Redis_MaxMemory` / 0.7 +> * 示例: 如果希望 Redis 存储 8GB 数据,容器内存 Limit 建议配置为 10GB ~ 12GB,留出 2GB+ 给 CoW 和碎片开销。 + +## 4.2 计算资源 (Compute) + +### 基础建议:2 Core +Redis 核心命令执行是单线程的,但持久化(Fork)等操作需要子进程。因此,建议为每个 Redis 实例分配 **至少 2 Core** 的 CPU 资源: +* **Core 1**: 处理主线程的请求与命令。 +* **Core 2**: 应对持久化 fork、后台任务以及系统开销。 + +### 多线程 (Multi-threading) +Redis 6.0+ 引入了多线程 I/O(默认关闭)。 +* **作用**: 利用多核处理网络数据的读写(Socket I/O),优化单核 IO 限制。 +* **局限性**: 命令执行仍然是单线程的。因此,增加 CPU 核心数只能提升网络吞吐,无法提升命令处理速度(Latency)。 +* **配置建议**: 如果网络是瓶颈(如大流量场景),可开启多线程,但建议总核心数配置不超过 4~8 Core,过多的核心并不能带来线性提升。 + +## 4.3 存储规划 (Storage) + +### 容量规划 (Capacity) +持久化模式直接决定了磁盘空间的配额需求,请参考以下计算公式: + +| 模式 | 建议磁盘配额公式 | 详细说明 | +| :--- | :--- | :--- | +| **Diskless (纯缓存)** | `0` (无需 PVC) | 仅作为纯缓存使用,不开启 RDB/AOF。K8s 环境下日志通过标准输出 (stdout) 采集,无需挂载持久化数据盘。 | +| **RDB (快照)** | `MaxMemory * 2` | RDB 也是写时复制(CoW)。在生成新快照时,磁盘上同时存在“旧快照”和“正在写入的新快照”。
**建议**:至少预留 2 倍内存空间。 | +| **AOF (追加写)** | `MaxMemory * 3` | AOF 文件会随着写操作不断增长。默认配置下 (`auto-aof-rewrite-percentage 100`),AOF 文件会增长到数据量的 **2倍** 才触发重写。此时磁盘需同时容纳:
1. 旧 AOF 文件 (2x)
2. 重写生成的新 AOF 文件 (1x)
**峰值总计 3x**。建议预留至少 3 倍空间以防磁盘写满。 | + +### 性能要求 (IOPS & Latency) +* **AOF 开启时**: 磁盘性能至关重要。如果磁盘 IOPS 不足或 fsync 延迟过高,会直接阻塞主线程(当 `appendfsync everysec` 时)。 +* **推荐介质**: 生产环境强烈建议使用 SSD/NVMe 本地盘或高性能云盘。 + +## 4.4 内核参数调优 (Kernel Tuning) + +为了确保 Redis 在生产环境中的稳定性和高性能,建议在 Kubernetes 节点层面进行以下内核参数优化: + +1. **内存分配控制 (`vm.overcommit_memory`)** + * **建议值**: `1` + * **说明**: 设置为 `1` (Always) 确保内核在 Redis 执行 Fork 操作(RDB 快照/AOF 重写)时允许分配内存,即使物理内存看似不足。这能有效防止因内存分配失败导致的持久化失败。 + +2. **连接队列大小 (`net.core.somaxconn`)** + * **建议值**: `2048` 或更高 + * **说明**: Redis 默认的 tcp-backlog 为 511。在高并发场景下,应将系统的 `net.core.somaxconn` 提高,以避免客户端连接请求被丢弃。 + +3. **透明大页 (Transparent Huge Pages, THP)** + * **建议操作**: **禁用** (`never`) + * **说明**: THP 会导致 Redis 在发生内存分配时产生显著的延迟尖峰,尤其是在 Fork 后的写时复制(CoW)期间。建议在宿主机或启动脚本中禁用。 + + +# 5. 客户端接入 (Client Integration) + +## 5.1 网络访问策略 (Network Access) + +您可以根据下表,选择适合实际业务尝尽的访问方式。 + +|架构|访问方式|推荐|特性说明| +|---|---|---|---| +|哨兵
模式|集群内访
问|哨兵访问地址|用于从集群内连接哨兵(哨兵会去发现底层的数据节
点)。提供包括哨兵节点对应的内部路由的名称、IP地
址和端口号。 端口号默认为 26379。| +|集群
模式|集群外访
问—无代理|集群模式暴露 底层数据节
点(Pod)的 NodePort|暴露底层数据节点 的 NodePort,直连底层数据节点,
Pod本身不具备高可用性,但可以写多个节点地址。
应用连接时需要写多个 节点IP+端口,部分节点宕机
时,连接仍保留。| + +> [!TIP] +> **外网访问推荐**:对于生产环境的集群外访问,**强烈推荐使用 LoadBalancer 方式**。 +> * **LoadBalancer**:为 Redis 实例配置外部负载均衡器,提供稳定的访问入口。 +> * **NodePort 限制**:虽然支持 NodePort,但客户端需要直接连接到具体的节点端口。在多网卡环境或网络策略复杂的场景下,直接使用 NodePort 可能会因为网络接口绑定问题导致连接失败(Redis Sentinel/Cluster 默认绑定默认网卡)。且 NodePort 不支持通过额外的负载均衡器进行代理(因为 Sentinel/Cluster 协议要求客户端直连感知拓扑)。 + +## 5.2 代码接入示例 (Code Examples) + +以下是使用 **Go (go-redis/v9)** 客户端连接 Redis 的示例代码: + +#### 哨兵模式 (Sentinel Mode) + +哨兵模式下,客户端连接 Sentinel 节点,通过 `mymaster`(默认主节点组名)发现主节点。 + +```go +package main + +import ( + "context" + "fmt" + "time" + "github.com/redis/go-redis/v9" +) + +func main() { + client := redis.NewFailoverClient(&redis.FailoverOptions{ + SentinelAddrs: []string{":26379"}, // 替换为 Sentinel 地址 + MasterName: "mymaster", // 固定名称 + Password: "", // Redis 密码 + DialTimeout: 3 * time.Second, + ReadTimeout: 5 * time.Second, + WriteTimeout: 10 * time.Second, + PoolSize: 100, + }) + defer client.Close() + + if val, err := client.Get(context.TODO(), "test").Result(); err != nil { + panic(err) + } else { + fmt.Println(val) + } +} +``` + +#### 集群模式 (Cluster Mode) + +集群模式下,客户端连接任一集群节点获取拓扑。建议配置多个种子节点地址。 + +```go +package main + +import ( + "context" + "fmt" + "time" + "github.com/redis/go-redis/v9" +) + +func main() { + client := redis.NewClusterClient(&redis.ClusterOptions{ + Addrs: []string{":6379"}, // 替换为集群节点地址 + Password: "", + DialTimeout: 3 * time.Second, + ReadTimeout: 5 * time.Second, + WriteTimeout: 10 * time.Second, + PoolSize: 100, + }) + defer client.Close() + + if val, err := client.Get(context.TODO(), "test").Result(); err != nil { + panic(err) + } else { + fmt.Println(val) + } +} +``` + +## 5.3 客户端可靠性最佳实践 + +除了正确的连接配置,客户端的健壮性设计对于保障业务稳定性至关重要: + +1. **超时设置 (Timeouts)** + * **连接超时 (Connect Timeout)**: 建议设为 1-3秒,避免网络不可达时长时间阻塞。 + * **读写超时 (Read/Write Timeout)**: 根据业务 SLA 设置,通常只有几百毫秒。不要设置过长,防止慢查询拖垮应用线程池。 + +2. **重试策略 (Retry Strategy)** + * **指数退避 (Exponential Backoff)**: 遇到网络抖动或主从切换时,不要立即重试,而应使用指数退避算法(如 wait 100ms, 200ms, 400ms...),避免重试风暴加剧 Redis 负载。 + +3. **连接池 (Connection Pooling)** + * **预热与复用**: 务必使用连接池(如 JedisPool, go-redis Pool)。Redis 连接握手成本较高,频繁创建/销毁连接会严重影响性能。 + * **最大连接数**: 根据客户端并发量合理设置 `MaxTotal`,避免超过 Redis 的 `maxclients` 限制。 + +4. **拓扑刷新机制 (Topology Refresh) - 集群模式关键** + * **自动刷新**: 确保客户端开启了基于 `MOVED/ASK` 转向错误的自动拓扑更新功能(绝大多数 Smart Client 默认开启)。 + * **周期性刷新**: 在拓扑变化频繁(如弹性伸缩)或网络不稳定的场景下,建议配置客户端进行周期性拓扑刷新(例如每 60 秒),以主动感知节点变更,减少请求重定向带来的延迟。 + + + + + +# 6. 可观测性与运维 (Observability & Operations) + +## 6.1 数据安全与备份 + +平台备份中心为您提供便捷的数据备份和管理解决方案,您可以在各实例下进行备份,并在备份中心中统一管 +理。此外,备份中心支持外接S3存储,可实现安全备份和恢复。您还可以恢复历史备份到特定实例,以及查看 +和删除备份记录。 + +## 6.3 认证与权限管理 + +Redis Operator(6.0+ 版本)支持通过 `RedisUser` CRD 进行声明式用户管理,支持 ACL 权限控制。 + +**管理操作:** +1. **查看用户**:查看 `RedisUser` 资源。 + ```bash + kubectl -n get RedisUser + ``` +2. **修改密码**:密码存储在 Secret 中。找到 `RedisUser` 关联的 `passwordSecrets`,修改对应的 Secret 内容(Base64 编码)。修改后,Operator 会自动同步,状态暂时变为 Pending,同步完成后变回 Success。 +3. **修改权限**:直接编辑 `RedisUser` 的 `spec.aclRules` 字段。例如:`+@all ~*` 表示所有权限。 +4. **创建新用户**:创建一个 `RedisUser` yaml 文件,指定 `redisName` 关联到 Redis 实例。 + +## 6.4 变更与维护 + +### 更新注意事项 +在对 Redis 实例进行规格变更(CPU/Memory)或扩容时: +1. **资源评估**:确保集群有足够的剩余资源。 +2. **渐进式执行**:Operator 会尽量逐个 Pod 进行滚动更新,以减少服务中断。 +3. **避开高峰期**:对于大数据量实例,建议在业务低峰期进行变更。 + +> [!CAUTION] +> 减少副本数或降低规格时,务必确认当前数据量和负载可以被新规格承载,否则可能导致数据丢失或服务崩溃。 + +### 重启注意事项 +通过管理控制台或 kubectl 触发重启时: +1. **逐个重启**:Pod 会被逐个重启(Rolling Restart)。 +2. **服务中断**:虽然是滚动重启,但在主从切换或节点重连期间,客户端可能会遇到短暂的连接断开或超时,客户端应具备重试机制。 +3. 
**前置条件**:实例必须处于 `Running` 状态才能执行重启操作。 + +## 6.5 监控告警体系 + +### 关键监控指标 + +生产环境建议重点关注以下指标,并配置相应的告警规则: + +| 类别 | 指标名称 (Prometheus) | 含义 | 告警阈值建议 | +| :--- | :--- | :--- | :--- | +| **性能** | `redis_cpu_user_seconds_total` | CPU 使用率 | 持续 > 80% (单核) | +| | `redis_memory_used_bytes` | 内存使用量 | > 80% limit | +| **可靠性** | `redis_connected_clients` | 连接客户端数 | 接近 maxclients | +| | `redis_evicted_keys_total` | 驱逐 Key 数量 | > 0 (若非预期) | +| | `redis_rejected_connections_total` | 拒绝连接数 | > 0 | +| **缓存效果** | **Hit Rate** (计算得出) | 缓存命中率 | < 80% (视业务而定) | + +**缓存命中率计算 (PromQL 示例):** +```promql +rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) +``` + +## 6.6 故障排查指南 + +您可以在 [Customer Portal](https://cloud.alauda.cn/knowledges) 查询您遇到的问题。 + +查询关键词:Redis、集群模式、监控、崩溃恢复、污点容忍、部署优化…… + + + +# 7. 其他 + +## 7.1 参考资料 + +[https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) + +[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) + +[https://redis.io/docs/](https://redis.io/docs/) From ad7182dabee379e257d1333aec20f9ba1eb70f2f Mon Sep 17 00:00:00 2001 From: hfxia Date: Wed, 10 Dec 2025 17:56:55 +0800 Subject: [PATCH 2/9] feat: added redis best practices --- docs/en/how_to/90-redis-references.mdx | 401 ------------ docs/en/how_to/redis-best-practices.mdx | 833 ++++++++++++++++++++++++ 2 files changed, 833 insertions(+), 401 deletions(-) delete mode 100644 docs/en/how_to/90-redis-references.mdx create mode 100644 docs/en/how_to/redis-best-practices.mdx diff --git a/docs/en/how_to/90-redis-references.mdx b/docs/en/how_to/90-redis-references.mdx deleted file mode 100644 index 5f37e50..0000000 --- a/docs/en/how_to/90-redis-references.mdx +++ /dev/null @@ -1,401 +0,0 @@ -# Redis 最佳实践 - -# 1. 概述 (Overview) - -Redis 作为云原生架构中事实标准的缓存与键值存储组件,承载着业务高并发读写与低延迟的核心需求。在 Kubernetes 容器化环境下运行有状态(Stateful)的 Redis 服务,面临着与传统物理机环境截然不同的挑战,包括**存储持久化的稳定性**、**网络拓扑的动态变化**以及**资源隔离与调度**等问题。 - -本最佳实践文档旨在为生产环境下的 Redis 部署提供一套标准化的参考指南。涵盖从**架构选型**、**资源规划**、**客户端对接**到**可观测性运维**的全生命周期管理规范。通过遵循本指南,旨在帮助用户构建一个**高可用 (High Availability)**、**高性能 (High Performance)** 且**易维护 (Maintainability)** 的企业级 Redis 数据服务。 - -# 2. 架构设计 (Architecture Design) - -全栈云原生开放平台根据客户业务规模与 SLA 需求,提供两种标准的 Redis 管理架构: - -## 2.1 高可用架构:哨兵模式 (HA: Sentinel) -**定位:经典高可用架构,适用于中小规模业务。** - -哨兵模式基于 Redis 原生的主从复制机制。通过部署独立的 Sentinel 进程组监控主从节点状态,在主节点故障时自动执行 Failover(故障转移)并通知客户端。 -* **优点**:架构简单,运维成熟,对客户端协议要求较低。 -* **局限**:写能力受限于单节点,无法水平扩展存储容量。 - -## 2.2 分布式架构:集群模式 (Distributed: Cluster) -**定位:分布式分片架构,适用于大规模高并发业务。** - -集群模式通过 Hash Slot(哈希槽)将数据自动分片到多个节点,实现存储容量和读写性能的水平扩展(Scale-out)。 -* **优点**:真正的高可用分布式存储,支持动态扩缩容(Resharding)。 -* **局限**:客户端协议复杂,特定多 Key 命令(如 `MGET`)受限于 Slot 分布。 - -## 2.3 选型指南 (Selection Guide) - -在选择 Redis 架构时,需要考虑业务对可用性、扩展性和复杂度的要求。 - -| 特性 | 哨兵模式 (Sentinel) | 集群模式 (Cluster) | -| :--- | :--- | :--- | -| **适用场景** | 中小型业务,读多写少,数据量适中 | 大型业务,高并发读写,海量数据 | -| **高可用性** | 通过哨兵监控和自动故障转移实现 | 通过节点自动故障检测和恢复实现 | -| **扩展性** | 垂直扩展(升级规格),水平扩展仅限读能力 | 水平扩展(读写能力),支持动态扩缩容 | -| **读写分离** | 支持(需要客户端配合或中间件) | 不支持(通常直接连接分片主节点) | -| **数据分片** | 无(单节点存储全量数据) | 有(数据自动分片到多个节点) | -| **运维复杂度** | 较低,架构简单 | 较高,涉及分片、哈希槽、数据迁移等 | -| **网络限制** | 需要客户端支持 Sentinel 协议 | 需要客户端支持 Cluster 协议 | - -**建议:** -* 如果数据量较小(单节点内存可容纳),且主要追求简单稳定,**哨兵模式**是首选。 -* 如果数据量巨大或写入压力极高,单节点无法支撑,选择**集群模式**。 - -# 3. 
部署配置 (Deployment Configuration) - -## 3.1 实例部署 (Deployment) - -在 **Data Services** 页面,单击 **Redis** ,选择您的 项目 及 命名空间 后,单击 创建 **Redis** 实例 ,根据实际业务场 -景配置实例,推荐您使用最新的 Redis 6.0 版本。 - -## 3.2 参数模板选择 (Parameter Templates) - -Redis 实例的参数配置通过 Custom Resource (CR) 的字段进行指定。 - -* **数据节点参数**:通过 `spec.customConfig` 字段配置。例如设置 `save` 策略: - ```yaml - spec: - customConfig: - save: "600 1" - ``` -* **哨兵节点参数**(仅哨兵模式):通过 `spec.sentinel.monitorConfig` 字段配置。支持 `down-after-milliseconds`, `failover-timeout`, `parallel-syncs` 等关键参数。 - -选择模板时,应根据业务对持久化(AOF vs RDB)和性能的权衡来决定。 - -## 3.3 资源规格配置 (Resource Specs) - -您可以根据业务场景的实际情况,部署资源规格。 - -| 集群架构 | 持久化 | 模板 | 实例规格 | Replica | Sentinel | Sharding | Sentinel容器组 | redis-exporter | redis (实例规格) | 备份容器 (backup) | 资源总量 | 实例存储配额 | 自动备份 (保留7份) | 手动备份 (保留7份) | -|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| -| Redis Sentinel | AOF | aof-redis-6.0-sentinel | 2c4g | 1 | 3 | / | 100m128Mi | 100m200Mi | 2c4g | 不限制(需要预留资源) | 4.5c4.8g | 需要根据实际写入量评估 | | | -| | | aof-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.8g | | | | -| | RDB | rdb-redis-6.0-sentinel | 2c4g | | | | | | 2c4g | | 4.5c4.8g | 8G | 28G | 28G | -| | | rdb-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.64g | 16G | 56G | 56G | -| | Diskless | diskless-redis-6.0-sentinel | 2c4g | | | | | | 2c4g | | 4.5c4.8g | / | 28G | 28G | -| | | diskless-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.8g | | 56G | 56G | -| Redis Cluster | AOF | aof-redis-6.0-cluster | 2c4g | / | 3 | / | | 100m300Mi | 2c4g | | 12.6c25.8g | 需要根据实际写入量评估 | | | -| | | aof-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | | | | -| | RDB | rdb-redis-6.0-cluster | 2c4g | | | | | | 2c4g | | 12.6c25.8g | 24G | 84G | 84G | -| | | rdb-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | 48G | 168G | 168G | -| | Diskless | diskless-redis-6.0-cluster | 2c4g | | | | | | 2c4g | | 12.6c25.8g | / | 84G | 84G | -| | | diskless-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | | 168G | 168G | - -## 3.4 调度策略配置 (Scheduling) - -Redis 集群模式提供三种推荐反亲和策略,所有 Pod 强制反亲和、分片内主从强制反亲和以及分片内主从尽量 -反亲和。 - -**所有 Pod 强制反亲和** - -Redis Cluster 实例会将 Redis 强制分布到不同的节点上。如图所示,假设有三个节点,每个节点上可以 -调度三个 Pod。然而,在本例中,Redis Cluster 策略被设置为强制 Pod 反亲和性。由于这个设置,当尝 -试在相同节点上调度不同的 Redis 分片副本时,部署失败。 - -在可视化图中,三个主节点(master)分布在三个不同的节点上,每个节点有三个可用的 Pod 位置。同 -时,三个从节点(replica)都无法被调度,因为它们与已存在的主节点存在反亲和性约束,会导致部署 -失败。 - -**分片内主从强制反亲和** - -分片内主从强制反亲和策略会将同一个分片的 Redis 节点分布到不同的节点上,确保即使某个节点出现 -故障,数据仍然具有完整性。 - -如图所示,假设有三个节点,每个节点可以调度三个 Pod。在 Node2 和 Node3 上的 Pod 位置已经被占 -满,而 Node1 上仍有空余的 Pod 资源。然而,由于 Redis shard-2 主节点(master)位于 Node1 上,并 -且分片内主从节点被设置为强制反亲和性,Redis shard-2 从节点(replica)不能被调度到 Node1 上。这 -导致 Redis shard-2 从节点无法找到可用的节点进行调度,从而导致部署失败。 - -**分片内主从尽量反亲和** - -在 Redis Cluster 中,为了提高数据的可靠性和容错能力,我们可以采用“分片内主从尽量反亲和”的策 -略。这意味着同一个分片内的主从节点尽量被调度到不同的节点上。这样,即使某个节点出现故障,数 -据仍然具有完整性。 - -采用“分片内主从尽量反亲和”的策略时,调度器会优先尝试将主从节点部署在不同的节点上。然而,如果 -无法找到符合条件的节点,调度器仍然会允许将主从节点部署在同一个节点上。这种策略在节点资源有 -限的情况下,可以确保 Redis 集群的正常部署和运行,同时在尽量遵循反亲和原则的基础上,提高数据 -的可靠性。 - -如图,Node1 在资源不足的情况下部署了 redis shard-2 的主从节点。 - -您可以根据下表,对您的实际业务情景设置集群模式 Redis 的反亲和策略。 - -|所有 Pod 之间强制要求在
不同节点上部署,不允许在
同一节点上部署。|可以最大程度地
保证负载均衡和
系统的高可用
性。|节点资源有限时可
能导致部署失败。|保障
自愈| -|---|---|---|---| -|同一个分片内的主从节点必
须部署在不同节点上。|保证分片内主从
节点故障隔离,
提高数据可靠
性。|节点资源有限时可
能导致部署失败。|有可
能自
愈| - -在哨兵模式下,默认采用分片内主从尽量反亲和策略。这意味着同一个分片内的主从节点会尽量部署在不同的 -节点上,但在资源不足的情况下,也允许主从节点部署在同一个节点上。采用这种策略的目的是为了在保证数 -据可靠性的同时,适应有限的节点资源。 - -根据上文描述,分片内主从尽量反亲和策略在资源有限的情况下,可以在保证集群正常部署的基础上,尽可能 -提高数据可靠性。这种策略在某些情况下可以实现单点故障的自愈。然而,在同一个节点上部署主从节点时, -数据可靠性相对较低。因此,在实际应用中,我们需要根据实际情况和需求来选择合适的策略,以实现 Redis -高可用性和故障自动切换的目标。 - -# 4. 资源规划 (Resource Planning) - -## 4.1 内存规格 (Memory) - -Redis 采用快照机制将内存中的数据异步复制到磁盘上进行长期存储。这种机制使得 Redis 保持了高性能,但 -也存在在快照之间可能丢失数据的风险。 - -在 Kubernetes 容器化环境下,我们建议采用分级内存管理策略: -* **✅ 标准规格 (< 8GB)**:**强烈推荐**。能够保证极低的 Fork 延迟和快速的故障恢复(RTO < 60s),是最稳健的生产选择。 -* **⚠️ 高性能规格 (8GB - 16GB)**:**可接受**。需配合高性能宿主机且**必须关闭 THP**。在此规格下,Fork 虽然可控,但在高负载时仍可能造成 100ms 级别的抖动。 -* **❌ 高风险规格 (> 16GB)**:**不推荐**。单点故障影响面过大,且全量同步极易打满网络带宽。建议水平拆分为 Cluster 模式。 - -### 为什么限制 8GB?(技术深究) -虽然物理机时代单实例常运行 32GB+,但在云原生环境下,8GB 限制是基于以下核心技术原理的“黄金法则”: - -1. **Fork 阻塞与页表复制 (The Fork Blocking)** - * Redis 执行 RDB/AOF Rewrite 时调用 `fork()`。虽然内存页是 CoW(写时复制),但**进程页表 (Page Table) 必须全量拷贝**,此过程会**阻塞主线程**。 - * *估算公式*:10GB 内存 ≈ 20MB 页表 ≈ 10~50ms 阻塞(视虚拟化层开销而定)。一旦超过 8GB,分钟级的阻塞风险将呈指数级上升,严重影响 SLA。 - -2. **故障恢复效率 (RTO)** - * 容器重启加载 RDB 是**单线程 CPU 密集型任务**(对象反序列化)。测试表明,受限于 CPU 解析效率,加载 8GB 数据通常需 **30-50秒**(即使使用 SSD)。若维持在 32GB,重启时间可能长达数分钟,这与 Kubernetes "快速自愈" 的设计理念背道而驰。 - -### 内存配置最佳实践 (MaxMemory vs Limit) - -为了避免 Redis 在持久化期间因内存膨胀而导致 OOM(OOM Kill),需严格遵循以下配置原则: - -1. **设置 MaxMemory**: 不要将 `maxmemory` 设置为容器内存 Limit 的 100%。建议设置为 Limit 的 **70% ~ 80%**。 -2. **预留 CoW 空间**: Redis 在 RDB/AOF Rewrite 时会 Fork 子进程。若此时有大量写入更新,OS 的写时复制 (Copy-on-Write) 机制会导致内存页被复制,极端情况下内存占用可能由 8GB 翻倍至 16GB。 -3. **Overcommit 配置**: 确保宿主机开启 `vm.overcommit_memory = 1`,允许内核在 fork 时无需申请等量物理内存(依赖 CoW),防止 fork 失败。 - -> [!IMPORTANT] -> **资源预留公式**: `Container_Memory_Limit` ≈ `Redis_MaxMemory` / 0.7 -> * 示例: 如果希望 Redis 存储 8GB 数据,容器内存 Limit 建议配置为 10GB ~ 12GB,留出 2GB+ 给 CoW 和碎片开销。 - -## 4.2 计算资源 (Compute) - -### 基础建议:2 Core -Redis 核心命令执行是单线程的,但持久化(Fork)等操作需要子进程。因此,建议为每个 Redis 实例分配 **至少 2 Core** 的 CPU 资源: -* **Core 1**: 处理主线程的请求与命令。 -* **Core 2**: 应对持久化 fork、后台任务以及系统开销。 - -### 多线程 (Multi-threading) -Redis 6.0+ 引入了多线程 I/O(默认关闭)。 -* **作用**: 利用多核处理网络数据的读写(Socket I/O),优化单核 IO 限制。 -* **局限性**: 命令执行仍然是单线程的。因此,增加 CPU 核心数只能提升网络吞吐,无法提升命令处理速度(Latency)。 -* **配置建议**: 如果网络是瓶颈(如大流量场景),可开启多线程,但建议总核心数配置不超过 4~8 Core,过多的核心并不能带来线性提升。 - -## 4.3 存储规划 (Storage) - -### 容量规划 (Capacity) -持久化模式直接决定了磁盘空间的配额需求,请参考以下计算公式: - -| 模式 | 建议磁盘配额公式 | 详细说明 | -| :--- | :--- | :--- | -| **Diskless (纯缓存)** | `0` (无需 PVC) | 仅作为纯缓存使用,不开启 RDB/AOF。K8s 环境下日志通过标准输出 (stdout) 采集,无需挂载持久化数据盘。 | -| **RDB (快照)** | `MaxMemory * 2` | RDB 也是写时复制(CoW)。在生成新快照时,磁盘上同时存在“旧快照”和“正在写入的新快照”。
**建议**:至少预留 2 倍内存空间。 | -| **AOF (追加写)** | `MaxMemory * 3` | AOF 文件会随着写操作不断增长。默认配置下 (`auto-aof-rewrite-percentage 100`),AOF 文件会增长到数据量的 **2倍** 才触发重写。此时磁盘需同时容纳:
1. 旧 AOF 文件 (2x)
2. 重写生成的新 AOF 文件 (1x)
**峰值总计 3x**。建议预留至少 3 倍空间以防磁盘写满。 | - -### 性能要求 (IOPS & Latency) -* **AOF 开启时**: 磁盘性能至关重要。如果磁盘 IOPS 不足或 fsync 延迟过高,会直接阻塞主线程(当 `appendfsync everysec` 时)。 -* **推荐介质**: 生产环境强烈建议使用 SSD/NVMe 本地盘或高性能云盘。 - -## 4.4 内核参数调优 (Kernel Tuning) - -为了确保 Redis 在生产环境中的稳定性和高性能,建议在 Kubernetes 节点层面进行以下内核参数优化: - -1. **内存分配控制 (`vm.overcommit_memory`)** - * **建议值**: `1` - * **说明**: 设置为 `1` (Always) 确保内核在 Redis 执行 Fork 操作(RDB 快照/AOF 重写)时允许分配内存,即使物理内存看似不足。这能有效防止因内存分配失败导致的持久化失败。 - -2. **连接队列大小 (`net.core.somaxconn`)** - * **建议值**: `2048` 或更高 - * **说明**: Redis 默认的 tcp-backlog 为 511。在高并发场景下,应将系统的 `net.core.somaxconn` 提高,以避免客户端连接请求被丢弃。 - -3. **透明大页 (Transparent Huge Pages, THP)** - * **建议操作**: **禁用** (`never`) - * **说明**: THP 会导致 Redis 在发生内存分配时产生显著的延迟尖峰,尤其是在 Fork 后的写时复制(CoW)期间。建议在宿主机或启动脚本中禁用。 - - -# 5. 客户端接入 (Client Integration) - -## 5.1 网络访问策略 (Network Access) - -您可以根据下表,选择适合实际业务尝尽的访问方式。 - -|架构|访问方式|推荐|特性说明| -|---|---|---|---| -|哨兵
模式|集群内访
问|哨兵访问地址|用于从集群内连接哨兵(哨兵会去发现底层的数据节
点)。提供包括哨兵节点对应的内部路由的名称、IP地
址和端口号。 端口号默认为 26379。| -|集群
模式|集群外访
问—无代理|集群模式暴露 底层数据节
点(Pod)的 NodePort|暴露底层数据节点 的 NodePort,直连底层数据节点,
Pod本身不具备高可用性,但可以写多个节点地址。
应用连接时需要写多个 节点IP+端口,部分节点宕机
时,连接仍保留。| - -> [!TIP] -> **外网访问推荐**:对于生产环境的集群外访问,**强烈推荐使用 LoadBalancer 方式**。 -> * **LoadBalancer**:为 Redis 实例配置外部负载均衡器,提供稳定的访问入口。 -> * **NodePort 限制**:虽然支持 NodePort,但客户端需要直接连接到具体的节点端口。在多网卡环境或网络策略复杂的场景下,直接使用 NodePort 可能会因为网络接口绑定问题导致连接失败(Redis Sentinel/Cluster 默认绑定默认网卡)。且 NodePort 不支持通过额外的负载均衡器进行代理(因为 Sentinel/Cluster 协议要求客户端直连感知拓扑)。 - -## 5.2 代码接入示例 (Code Examples) - -以下是使用 **Go (go-redis/v9)** 客户端连接 Redis 的示例代码: - -#### 哨兵模式 (Sentinel Mode) - -哨兵模式下,客户端连接 Sentinel 节点,通过 `mymaster`(默认主节点组名)发现主节点。 - -```go -package main - -import ( - "context" - "fmt" - "time" - "github.com/redis/go-redis/v9" -) - -func main() { - client := redis.NewFailoverClient(&redis.FailoverOptions{ - SentinelAddrs: []string{":26379"}, // 替换为 Sentinel 地址 - MasterName: "mymaster", // 固定名称 - Password: "", // Redis 密码 - DialTimeout: 3 * time.Second, - ReadTimeout: 5 * time.Second, - WriteTimeout: 10 * time.Second, - PoolSize: 100, - }) - defer client.Close() - - if val, err := client.Get(context.TODO(), "test").Result(); err != nil { - panic(err) - } else { - fmt.Println(val) - } -} -``` - -#### 集群模式 (Cluster Mode) - -集群模式下,客户端连接任一集群节点获取拓扑。建议配置多个种子节点地址。 - -```go -package main - -import ( - "context" - "fmt" - "time" - "github.com/redis/go-redis/v9" -) - -func main() { - client := redis.NewClusterClient(&redis.ClusterOptions{ - Addrs: []string{":6379"}, // 替换为集群节点地址 - Password: "", - DialTimeout: 3 * time.Second, - ReadTimeout: 5 * time.Second, - WriteTimeout: 10 * time.Second, - PoolSize: 100, - }) - defer client.Close() - - if val, err := client.Get(context.TODO(), "test").Result(); err != nil { - panic(err) - } else { - fmt.Println(val) - } -} -``` - -## 5.3 客户端可靠性最佳实践 - -除了正确的连接配置,客户端的健壮性设计对于保障业务稳定性至关重要: - -1. **超时设置 (Timeouts)** - * **连接超时 (Connect Timeout)**: 建议设为 1-3秒,避免网络不可达时长时间阻塞。 - * **读写超时 (Read/Write Timeout)**: 根据业务 SLA 设置,通常只有几百毫秒。不要设置过长,防止慢查询拖垮应用线程池。 - -2. **重试策略 (Retry Strategy)** - * **指数退避 (Exponential Backoff)**: 遇到网络抖动或主从切换时,不要立即重试,而应使用指数退避算法(如 wait 100ms, 200ms, 400ms...),避免重试风暴加剧 Redis 负载。 - -3. **连接池 (Connection Pooling)** - * **预热与复用**: 务必使用连接池(如 JedisPool, go-redis Pool)。Redis 连接握手成本较高,频繁创建/销毁连接会严重影响性能。 - * **最大连接数**: 根据客户端并发量合理设置 `MaxTotal`,避免超过 Redis 的 `maxclients` 限制。 - -4. **拓扑刷新机制 (Topology Refresh) - 集群模式关键** - * **自动刷新**: 确保客户端开启了基于 `MOVED/ASK` 转向错误的自动拓扑更新功能(绝大多数 Smart Client 默认开启)。 - * **周期性刷新**: 在拓扑变化频繁(如弹性伸缩)或网络不稳定的场景下,建议配置客户端进行周期性拓扑刷新(例如每 60 秒),以主动感知节点变更,减少请求重定向带来的延迟。 - - - - - -# 6. 可观测性与运维 (Observability & Operations) - -## 6.1 数据安全与备份 - -平台备份中心为您提供便捷的数据备份和管理解决方案,您可以在各实例下进行备份,并在备份中心中统一管 -理。此外,备份中心支持外接S3存储,可实现安全备份和恢复。您还可以恢复历史备份到特定实例,以及查看 -和删除备份记录。 - -## 6.3 认证与权限管理 - -Redis Operator(6.0+ 版本)支持通过 `RedisUser` CRD 进行声明式用户管理,支持 ACL 权限控制。 - -**管理操作:** -1. **查看用户**:查看 `RedisUser` 资源。 - ```bash - kubectl -n get RedisUser - ``` -2. **修改密码**:密码存储在 Secret 中。找到 `RedisUser` 关联的 `passwordSecrets`,修改对应的 Secret 内容(Base64 编码)。修改后,Operator 会自动同步,状态暂时变为 Pending,同步完成后变回 Success。 -3. **修改权限**:直接编辑 `RedisUser` 的 `spec.aclRules` 字段。例如:`+@all ~*` 表示所有权限。 -4. **创建新用户**:创建一个 `RedisUser` yaml 文件,指定 `redisName` 关联到 Redis 实例。 - -## 6.4 变更与维护 - -### 更新注意事项 -在对 Redis 实例进行规格变更(CPU/Memory)或扩容时: -1. **资源评估**:确保集群有足够的剩余资源。 -2. **渐进式执行**:Operator 会尽量逐个 Pod 进行滚动更新,以减少服务中断。 -3. **避开高峰期**:对于大数据量实例,建议在业务低峰期进行变更。 - -> [!CAUTION] -> 减少副本数或降低规格时,务必确认当前数据量和负载可以被新规格承载,否则可能导致数据丢失或服务崩溃。 - -### 重启注意事项 -通过管理控制台或 kubectl 触发重启时: -1. **逐个重启**:Pod 会被逐个重启(Rolling Restart)。 -2. **服务中断**:虽然是滚动重启,但在主从切换或节点重连期间,客户端可能会遇到短暂的连接断开或超时,客户端应具备重试机制。 -3. 
**前置条件**:实例必须处于 `Running` 状态才能执行重启操作。 - -## 6.5 监控告警体系 - -### 关键监控指标 - -生产环境建议重点关注以下指标,并配置相应的告警规则: - -| 类别 | 指标名称 (Prometheus) | 含义 | 告警阈值建议 | -| :--- | :--- | :--- | :--- | -| **性能** | `redis_cpu_user_seconds_total` | CPU 使用率 | 持续 > 80% (单核) | -| | `redis_memory_used_bytes` | 内存使用量 | > 80% limit | -| **可靠性** | `redis_connected_clients` | 连接客户端数 | 接近 maxclients | -| | `redis_evicted_keys_total` | 驱逐 Key 数量 | > 0 (若非预期) | -| | `redis_rejected_connections_total` | 拒绝连接数 | > 0 | -| **缓存效果** | **Hit Rate** (计算得出) | 缓存命中率 | < 80% (视业务而定) | - -**缓存命中率计算 (PromQL 示例):** -```promql -rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) -``` - -## 6.6 故障排查指南 - -您可以在 [Customer Portal](https://cloud.alauda.cn/knowledges) 查询您遇到的问题。 - -查询关键词:Redis、集群模式、监控、崩溃恢复、污点容忍、部署优化…… - - - -# 7. 其他 - -## 7.1 参考资料 - -[https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) - -[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) - -[https://redis.io/docs/](https://redis.io/docs/) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx new file mode 100644 index 0000000..40be42e --- /dev/null +++ b/docs/en/how_to/redis-best-practices.mdx @@ -0,0 +1,833 @@ +--- +weight: 5 +title: Redis Best Practices +--- + +# Redis Best Practices + +## Overview + +As the de facto standard for caching and key-value storage in cloud-native architectures, Redis handles core requirements for high-concurrency read/write operations and low latency. Running stateful Redis services in a Kubernetes containerized environment presents challenges distinct from traditional physical machine environments, including **persistence stability**, **dynamic network topology changes**, and **resource isolation and scheduling**. + +This Best Practices document aims to provide a standardized reference guide for Redis deployments in production environments. It covers the full lifecycle management from **architecture selection**, **resource planning**, **client integration** to **observability and operations**. By following this guide, users can build an enterprise-class Redis data service that is **High Availability (HA)**, **High Performance**, and **Maintainability**. + +## Architecture Selection + +The Full Stack Cloud Native Open Platform offers two standard Redis management architectures based on customer business scale and SLA requirements: + +### Sentinel Mode + +**Positioning: Classic High Availability Architecture, suitable for small to medium-scale businesses.** + +Sentinel mode is based on Redis's native master-replica replication mechanism. By deploying independent Sentinel process groups to monitor the status of master and replica nodes, it automatically executes Failover and notifies clients when the master node fails. +* **Pros**: Simple architecture, mature operations, lower requirements for client protocols. +* **Cons**: Write capacity is limited to a single node; storage capacity cannot scale horizontally. + +### Cluster Mode + +**Positioning: Distributed Sharding Architecture, suitable for large-scale high-concurrency businesses.** + +Cluster mode automatically shards data across multiple nodes using Hash Slots, enabling horizontal scaling (Scale-out) of storage capacity and read/write performance. +* **Pros**: True high availability distributed storage, supports dynamic Resharding. 
+* **Cons**: Complex client protocol; specific multi-key commands (e.g., `MGET`) are restricted by Slot distribution. + +### Selection Guide + +When selecting a Redis architecture, consider business requirements for availability, scalability, and complexity. + +| Feature | Sentinel Mode | Cluster Mode | +| :--- | :--- | :--- | +| **Scenarios** | Small/Medium business, Read-heavy/Write-light, moderate data | Large business, High concurrency R/W, massive data | +| **High Availability** | Via Sentinel monitoring and auto-failover | Via node auto-failure detection and recovery | +| **Scalability** | Vertical (Scale-up), Horizontal (Read-only) | Horizontal (R/W), supports dynamic resharding | +| **Read/Write Separation** | Supported (Client support required) | Supported (Usually direct connection to shard master, client support required) | +| **Data Sharding** | None (Single node stores full data) | Yes (Data auto-sharded across multiple nodes) | +| **Ops Complexity** | Lower, simple architecture | Higher, involves sharding, hash slots, migration | +| **Network Constraints** | Requires client support for Sentinel protocol | Requires client support for Cluster protocol | + +**Recommendations:** +* If data volume is small (fits in single node memory) and simplicity/stability is priority, **Sentinel Mode** is preferred. +* If data volume is massive or write pressure is extremely high and cannot be supported by a single node, choose **Cluster Mode**. + +## Version Selection + +Alauda Cache Service for Redis OSS currently supports `5.0`, `6.0`, and `7.2` stable versions. All three versions have undergone complete automated testing and production verification. + +**For new deployments, we strongly recommend choosing Redis `7.2`:** + +1. **Lifecycle** + * **`5.0` / `6.0`**: Community versions are End of Life (EOL) and no longer receive new features or security patches. Recommended only for compatibility with legacy applications. + * **`7.2`**: As the current Long Term Support (LTS) version, it has the longest lifecycle, ensuring operational stability and security updates for years to come. + +2. **Compatibility** + * Redis `7.2` maintains high compatibility with `5.0` and `6.0` data commands. Most business code can migrate smoothly without modification. + * *Note*: RDB persistence file format (v11) is not backward compatible (i.e., RDB generated by `7.2` cannot be loaded by `6.0`), but this does not affect new services. + +3. **Key Features** + * **ACL v2**: Provides granular access control (Key-based permission selectors), significantly enhancing security in multi-tenant environments. + * **Redis Functions**: Introduces Server-side Scripting standards, resolving issues with Lua script loss and replication, keeping logic closer to data. + * **Sharded Pub/Sub**: Resolves network storm issues caused by Pub/Sub broadcasting in Cluster mode, significantly improving messaging scalability via sharding. + * **Performance Optimization**: Deep optimizations in data structures (especially Sorted Sets) and memory management provide higher throughput and lower latency. + +> For more details on Redis 7.2 features, please refer to the official [Redis 7.2 Release Notes](https://github.com/redis/redis/blob/7.2/00-RELEASENOTES). + +## Resource Planning + +### Kernel Tuning + +To ensure stability and high performance in production, the following kernel parameter optimizations are recommended at the Kubernetes node level: + +1. 
**Memory Allocation (`vm.overcommit_memory`)** + * **Recommended**: `1` + * **Explanation**: Setting to `1` (Always) ensures the kernel allows memory allocation during Redis Fork operations (RDB snapshot/AOF rewrite), even if physical memory appears insufficient. This effectively prevents persistence failures due to allocation errors. + +2. **Connection Queue (`net.core.somaxconn`)** + * **Recommended**: `2048` or higher + * **Explanation**: Redis default tcp-backlog is 511. In high concurrency scenarios, system `net.core.somaxconn` should be increased to avoid dropping client connection requests. + +3. **Transparent Huge Pages (THP)** + * **Action**: **Disable** (`never`) + * **Explanation**: THP causes significant latency spikes during memory allocation in Redis, especially during Copy-on-Write (CoW) after Fork. It is recommended to disable this on the host or via startup scripts. + +### Memory Specifications + +Redis uses a snapshot mechanism to asynchronously replicate in-memory data to disk for long-term storage. This keeps Redis high-performing but carries a risk of data loss between snapshots. + +In Kubernetes containerized environments, we recommend a tiered memory management strategy: +* **✅ Standard Specs (< 8GB)**: **Strongly Recommended**. Ensures extremely low Fork latency and fast failure recovery (RTO < 60s); the most robust production choice. +* **⚠️ High-Performance Specs (8GB - 16GB)**: **Acceptable**. Requires high-performance host and **THP must be disabled**. Fork is controllable but may cause ~100ms jitter under high load. +* **❌ High-Risk Specs (> 16GB)**: **Not Recommended**. Single point of failure impact is too large, and full synchronization can easily saturate network bandwidth. Recommend horizontal splitting into Cluster mode. + +#### Why Limit to 8GB? +While single instances on physical machines often run 32GB+, the 8GB limit in cloud-native environments is based on the "Golden Rule" of these core technologies: + +1. **Fork Blocking & Page Table Copy** + * Redis calls `fork()` during RDB/AOF Rewrite. Although memory pages are CoW, **Process Page Tables must be fully copied**, blocking the **main thread**. + * *Estimation*: 10GB memory ≈ 20MB page table ≈ 10~50ms blocking (depending on virtualization overhead). Exceeding 8GB increases blocking risk exponentially, impacting SLA. + +2. **Failure Recovery Efficiency (RTO)** + * Container restart loading RDB is a **single-threaded CPU-bound task** (object deserialization). Tests show loading 8GB data takes **30-50s** (even with SSD). Maintaining 32GB could result in multi-minute start times, contradicting K8s "fast self-healing" philosophy. + +#### Memory Configuration Best Practices + +To avoid OOM (OOM Kill) during persistence due to memory expansion, strict adherence to these principles is required: + +1. **Set MaxMemory**: Do not set `maxmemory` to 100% of the container Memory Limit. Recommend setting to **70% ~ 80%** of the Limit. +2. **Reserve CoW Space**: Redis Forks a child process during RDB/AOF Rewrite. If there are heavy write updates, OS Copy-on-Write mechanisms duplicate memory pages; in extreme cases, memory usage can double from 8GB to 16GB. +3. **Overcommit Config**: Ensure host `vm.overcommit_memory = 1` to allow kernel forks without requesting equivalent physical memory (relying on CoW), preventing fork failures. 
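
As a rough illustration of the 70%–80% rule above, the sketch below pairs an 8Gi container memory limit with a ~6GB `maxmemory`. The `customConfig` field follows the parameter-update examples later in this guide; the `resources` layout is an assumption and may differ from the exact CR schema of your operator version.

```yaml
# Sketch only: verify field names against your Redis CR before applying.
spec:
  resources:
    limits:
      memory: 8Gi                       # container memory limit
  customConfig:
    maxmemory: "6gb"                    # ~75% of the limit, leaving headroom for CoW and fragmentation
    maxmemory-policy: "allkeys-lru"     # example eviction policy for cache-style workloads (assumption)
```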
+ +> [!INFO] +> +> **Resource Reservation Formula**: `Container_Memory_Limit` ≈ `Redis_MaxMemory` / 0.7 +> * Example: To store 8GB data, configure Container Memory Limit to 10GB ~ 12GB, leaving 2GB+ for CoW and fragmentation overhead. + +### CPU Resources + +Redis core command execution is single-threaded, but persistence (Fork) and other operations require child processes. Therefore, allocate **at least 2 Cores** per Redis instance: +* **Core 1**: Handles main thread requests and commands. +* **Core 2**: Handles persistence fork, background tasks, and system overhead. + +#### Multi-threading + +Redis 6.0+ introduced multi-threaded I/O (disabled by default) to overcome single-thread network I/O bottlenecks. + +* **When to Enable?** + * **Bottleneck Analysis**: When Redis CPU usage nears 100% and analysis shows time spent on Kernel State Network I/O (System CPU) rather than user-space command execution. + * **Traffic Profile**: Typically beneficial when single instance QPS > 80,000 or network traffic is huge (> 1GB/s). + * **Resource Conditions**: Ensure node has sufficient CPU cores (at least 4 cores). + +* **Configuration Best Practices**: + * **Thread Count**: Recommend 4~8 I/O threads. Exceeding 8 threads rarely yields significant gain. + * **Config Example**: + ```yaml + io-threads 4 + io-threads-do-reads yes + ``` + * **Note**: Multi-threaded I/O only improves network throughput; it **does NOT** improve execution speed of single complex commands (e.g., `SORT`, `KEYS`). + +### Storage Planning + +#### Capacity Planning +Persistence mode directly determines disk quota requirements. Refer to the following calculation formula: + +| Mode | Recommended Quota Formula | Details | +| :--- | :--- | :--- | +| **Diskless (Cache)** | `0` (No PVC) | Used as pure cache, no RDB/AOF. Logs collected via stdout in K8s, no persistence disk needed. | +| **RDB (Snapshot)** | `MaxMemory * 2` | RDB uses CoW. During snapshot generation, both "old snapshot" and "new snapshot being written" exist on disk.
**Recommendation**: Reserve at least 2x memory space. | +| **AOF (Append Only)** | `MaxMemory * 3` | AOF grows with write operations. Default config (`auto-aof-rewrite-percentage 100`) triggers rewrite when AOF reaches **2x** data size. Disk must hold:
1. Old AOF file (2x)
2. New AOF file from rewrite (1x)
**Peak total 3x**. Recommend reserving at least 3x space. | + +#### Performance Requirements +* **With AOF**: Disk performance is critical. Insufficient IOPS or high fsync latency will directly block the main thread (when `appendfsync everysec`). +* **Media**: Production environments strongly recommend SSD/NVMe local disks or high-performance cloud disks. + +### Parameter Configuration + +Alauda Cache Service for Redis OSS parameters are specified via Custom Resource (CR) fields. + +#### Built-in Templates + +Alauda Cache Service for Redis OSS provides multiple parameter templates for different business scenarios. Selection depends on the trade-off between persistence (Diskless/AOF/RDB) and performance. + +| Template Name | Description | Scenarios | Risks | +| :--- | :--- | :--- | :--- | +| **rdb-redis-<version>-<sentinel\|cluster>** | Enables RDB persistence, periodic snapshots to disk. | **Balanced**: Limited resources, balances performance/reliability, accepts minute-level data loss. | Data loss depends on `save` config, usually minute-level RPO. | +| **aof-redis-<version>-<sentinel\|cluster>** | Enables AOF persistence, logs every write op. | **Secure**: Ample resources, high data security (second-level loss), slight performance compromise. | Frequent fsync requires high-performance storage, high IO pressure. | +| **diskless-redis-<version>-<sentinel\|cluster>** | Disables persistence, pure in-memory. | **High-Perf Cache**: Acceleration only, data loss acceptable or rebuildable from source. | Restart or failure leads to **full data loss**. | + +> `` represents Redis version, e.g., `6.0`, `7.2`. + +Key parameter differences: + +| Parameter | RDB Template | AOF Template | Diskless Template | Explanation | +| :--- | :--- | :--- | :--- | :--- | +| `appendonly` | `no` | `yes` | `no` | Enable AOF logging. | +| `save` | `60 10000 300 100 600 1` | `""` (Disabled) | `""` (Disabled) | RDB snapshot triggers. | +| `repl-diskless-sync` | `no` | `no` | `yes` | Master-replica full sync via socket without disk. | +| `repl-diskless-sync-delay` | `5` | `5` | `0` | Delay for diskless sync; 0 for Diskless to speed up sync. | + +##### Persistence Selection Recommendations + +1. **Pure Cache**: Choose **Diskless Template**. Data rebuildable, no overhead, best performance. +2. **General Business**: Choose **RDB Template**. Periodic snapshots provide minute-level RPO, moderate resource usage. +3. **Financial/High-Reliability**: Choose **AOF Template** with `appendfsync everysec` for second-level protection. + +> [!WARNING] +> **Should RDB + AOF be enabled simultaneously?** +> +> Redis supports running RDB and AOF together, but it is **generally not recommended** in Kubernetes: +> * **Performance**: AOF fsync creates IO pressure; adding RDB fork + disk write significantly increases resource contention. +> * **Storage Doubling**: Requires space for both RDB snapshots and AOF files, complicating PVC planning. +> * **Recovery Priority**: Redis loads AOF first on start (more complete data); RDB acts only as backup, offering limited benefit. +> * **Platform Backup**: Alauda Cache Service for Redis OSS provides independent auto/manual backup, removing reliance on RDB snapshots for extra insurance. +> +> **Recommendation**: Choose **Single Persistence Mode** (RDB or AOF) based on needs, and use platform backup for disaster recovery. If mixed mode is necessary, ensure sufficient Storage IOPS (SSD) and reserve 5x data volume disk space. 
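
For example, when the "Financial/High-Reliability" profile is required, the AOF-related settings can be made explicit through the `customConfig` mechanism described in the next section. This is a minimal sketch using standard Redis parameters; whether a given built-in template permits overriding each of them is an assumption to verify in your environment.

```yaml
spec:
  customConfig:
    appendonly: "yes"                    # enable AOF logging
    appendfsync: "everysec"              # second-level RPO with moderate IO pressure
    auto-aof-rewrite-percentage: "100"   # rewrite when the AOF grows to 2x the data size
    auto-aof-rewrite-min-size: "64mb"    # avoid rewriting very small AOF files
```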
+ +#### Parameter Update + +Redis parameters are categorized by application method: + +| Category | Parameters | Behavior | +| :--- | :--- | :--- | +| **Hot Update** | Most runtime params (`maxmemory`, `loglevel`, etc.) | **Immediate effect** after modification, no restart. | +| **Restart Update** | `databases`, `rename-command`, `rdbchecksum`, `tcp-backlog`, `io-threads`, `io-threads-do-reads` | Requires **Instance Restart** to take effect. | +| **Immutable** | `bind`, `protected-mode`, `port`, `supervised`, `pidfile`, `dir`, etc. | Managed by system, modification may cause anomalies. | + +> [!TIP] +> Always assume data backup before modifying parameters requiring restart. + +#### Modification Examples + +**Update Data Node Parameters**: Configure via `spec.customConfig`. + +```bash +# Example: Modify save strategy (Hot update) +kubectl -n patch redis --type=merge --patch='{"spec": {"customConfig": {"save":"600 1"}}}' +``` + +**Update Sentinel Node Parameters**: Configure via `spec.sentinel.monitorConfig`. +> Currently supports `down-after-milliseconds`, `failover-timeout`, `parallel-syncs`. + +```bash +# Example: Modify failover timeout +kubectl -n patch redis --type=merge --patch='{"spec": {"sentinel": {"monitorConfig": {"down-after-milliseconds":"30000"}}}}' +``` + +### Resource Specs + +Deploy resources according to your actual business scenario. + +#### Sentinel Mode Specs + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
<th>Persistence</th>
      <th>Template</th>
      <th>Instance Spec</th>
      <th>Replica / Sentinel</th>
      <th>Sentinel Pod</th>
      <th>redis-exporter</th>
      <th>redis (Spec)</th>
      <th>Backup Pod</th>
      <th>Total Resources</th>
      <th>Storage Quota</th>
      <th>Auto Backup (Keep 7)</th>
      <th>Manual Backup (Keep 7)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2">AOF</td>
      <td>aof-redis-&lt;version&gt;-sentinel</td>
      <td>2c4g</td>
      <td rowspan="6">1 / 3</td>
      <td rowspan="6">100m128Mi</td>
      <td rowspan="6">100m200Mi</td>
      <td>2c4g</td>
      <td rowspan="6">Unlimited (Reserve resources)</td>
      <td>4.5c4.8g</td>
      <td rowspan="2" colspan="3">Evaluate based on actual write volume</td>
    </tr>
    <tr>
      <td>aof-redis-&lt;version&gt;-sentinel</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>8.5c8.8g</td>
    </tr>
    <tr>
      <td rowspan="2">RDB</td>
      <td>rdb-redis-&lt;version&gt;-sentinel</td>
      <td>2c4g</td>
      <td>2c4g</td>
      <td>4.5c4.8g</td>
      <td>8G</td>
      <td>28G</td>
      <td>28G</td>
    </tr>
    <tr>
      <td>rdb-redis-&lt;version&gt;-sentinel</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>8.5c8.64g</td>
      <td>16G</td>
      <td>56G</td>
      <td>56G</td>
    </tr>
    <tr>
      <td rowspan="2">Diskless</td>
      <td>diskless-redis-&lt;version&gt;-sentinel</td>
      <td>2c4g</td>
      <td>2c4g</td>
      <td>4.5c4.8g</td>
      <td rowspan="2">/</td>
      <td>28G</td>
      <td>28G</td>
    </tr>
    <tr>
      <td>diskless-redis-&lt;version&gt;-sentinel</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>8.5c8.8g</td>
      <td>56G</td>
      <td>56G</td>
    </tr>
  </tbody>
</table>
+ +#### Cluster Mode Specs + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
<th>Persistence</th>
      <th>Template</th>
      <th>Instance Spec</th>
      <th>Sharding / Replica</th>
      <th>redis-exporter</th>
      <th>redis (Spec)</th>
      <th>Backup Pod</th>
      <th>Total Resources</th>
      <th>Storage Quota</th>
      <th>Auto Backup (Keep 7)</th>
      <th>Manual Backup (Keep 7)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2">AOF</td>
      <td>aof-redis-&lt;version&gt;-cluster</td>
      <td>2c4g</td>
      <td rowspan="6">3 / 1</td>
      <td rowspan="6">100m300Mi</td>
      <td>2c4g</td>
      <td rowspan="6">Unlimited (Reserve resources)</td>
      <td>12.6c25.8g</td>
      <td rowspan="2" colspan="3">Evaluate based on actual write volume</td>
    </tr>
    <tr>
      <td>aof-redis-&lt;version&gt;-cluster</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>24.6c49.8g</td>
    </tr>
    <tr>
      <td rowspan="2">RDB</td>
      <td>rdb-redis-&lt;version&gt;-cluster</td>
      <td>2c4g</td>
      <td>2c4g</td>
      <td>12.6c25.8g</td>
      <td>24G</td>
      <td>84G</td>
      <td>84G</td>
    </tr>
    <tr>
      <td>rdb-redis-&lt;version&gt;-cluster</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>24.6c49.8g</td>
      <td>48G</td>
      <td>168G</td>
      <td>168G</td>
    </tr>
    <tr>
      <td rowspan="2">Diskless</td>
      <td>diskless-redis-&lt;version&gt;-cluster</td>
      <td>2c4g</td>
      <td>2c4g</td>
      <td>12.6c25.8g</td>
      <td rowspan="2">/</td>
      <td>84G</td>
      <td>84G</td>
    </tr>
    <tr>
      <td>diskless-redis-&lt;version&gt;-cluster</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>24.6c49.8g</td>
      <td>168G</td>
      <td>168G</td>
    </tr>
  </tbody>
</table>
+ +> `` represents Redis version, e.g., `6.0`, `7.2`. + +## Scheduling + +Alauda Cache Service for Redis OSS offers flexible scheduling strategies, supporting node selection, taint toleration, and various anti-affinity configurations to meet high availability needs in different resource environments. + +### Node Selection + +You can use the `spec.nodeSelector` field to specify which nodes Redis Pods should be scheduled on. This is typically used with Kubernetes Node Labels to isolate database workloads to dedicated node pools. + +> [!WARNING] +> **Persistence Limitation**: If your Redis instance mounts **Non-Network Storage** (e.g., Local PV) PVCs, be cautious when updating `nodeSelector`. Since local data resides on specific nodes and cannot migrate with Pods, the updated `nodeSelector` set **MUST include the node where the Pod currently resides**. If the original node is excluded, the Pod will fail to access data or start. Network storage (Ceph RBD, NFS) follows the Pod and is not subject to this restriction. + +### Taint Toleration + +Use `spec.tolerations` to allow Redis Pods to tolerate node Taints. This allows deploying Redis on dedicated nodes with specific taints (e.g., `key=redis:NoSchedule`), preventing other non-critical workloads from preempting resources. + +### Anti-Affinity + +To prevent single points of failure, Alauda Cache Service for Redis OSS provides anti-affinity configuration. Configuration differs by architecture mode. + +> [!CAUTION] +> **Immutable**: To ensure consistency and reliability, anti-affinity configurations (both `affinityPolicy` and `affinity`) **cannot be modified** after instance creation. Please plan ahead. + +#### Cluster Mode + +In Cluster mode, the system **prioritizes `spec.affinityPolicy`**. Alauda Cache Service for Redis OSS uses this enum to abstract complex topology rules, automatically generating affinity rules for each shard's StatefulSet. + +* **Priority**: `spec.affinityPolicy` > `spec.affinity`. +* **If `affinityPolicy` is unset**: Alauda Cache Service for Redis OSS checks `spec.affinity`. If you need custom topology rules beyond the enums below, leave `affinityPolicy` empty and configure native `spec.affinity`. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
<th>Policy Name</th>
      <th>affinityPolicy Value</th>
      <th>Behavior</th>
      <th>Pros/Cons</th>
      <th>Scenario</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>All Pods Forced Anti-Affinity</td>
      <td>AntiAffinity</td>
      <td>Forces ALL Pods in the cluster (including master/replicas of different shards) to be on different nodes. Fails if node count &lt; total Pod count.</td>
      <td>
        <ul>
          <li>Pros: Highest disaster recovery, minimal single-node failure impact.</li>
          <li>Cons: Extremely high resource requirement, node count must be &gt;= total Pods.</li>
        </ul>
      </td>
      <td>Cluster Mode Core Business<br/>Ample resources, strict HA requirements.</td>
    </tr>
    <tr>
      <td>Shard Master-Replica Forced Anti-Affinity</td>
      <td>AntiAffinityInSharding</td>
      <td>Forces Master and Replicas within same shard to be on different nodes. Pods from different shards can coexist.</td>
      <td>
        <ul>
          <li>Pros: Guarantees physical isolation of data replicas, preventing shard migration data loss.</li>
          <li>Cons: Scheduling fails if live nodes &lt; replica count. Masters of different shards might land on same node (SPOF risk).</li>
        </ul>
      </td>
      <td>Production Standard<br/>Balances resource usage and data safety.</td>
    </tr>
    <tr>
      <td>Shard Master-Replica Soft Anti-Affinity</td>
      <td>SoftAntiAffinity</td>
      <td>Prioritizes spreading shard master/replicas. If impossible (e.g., insufficient nodes), allows scheduling on same node.</td>
      <td>
        <ul>
          <li>Pros: Highest deployment success rate, runs with limited resources.</li>
          <li>Cons: Master/Replica may share node in extreme cases, risking data loss.</li>
        </ul>
      </td>
      <td>Test/Dev Environments<br/>Or resource-constrained edge environments.</td>
    </tr>
  </tbody>
</table>
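
For instance, to apply the "Production Standard" policy from the table above, the cluster-mode instance only needs the enum value set on the CR (a minimal sketch; all other fields are omitted, and the exact CR kind and schema follow your installed operator version):

```yaml
spec:
  affinityPolicy: AntiAffinityInSharding   # or: AntiAffinity | SoftAntiAffinity
```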
+ +#### Sentinel Mode + +> **Important** +> **Sentinel Mode does not support `spec.affinityPolicy`**. + +For Sentinel mode, Redis Data Nodes and Sentinel Nodes require separate Kubernetes native Affinity rules: + +* **Redis Data Nodes**: Configured via **`spec.affinity`**. +* **Sentinel Nodes**: Configured via **`spec.sentinel.affinity`**. + +You need to manually write complete `Affinity` rules. Example for forcing anti-affinity for both Data and Sentinel nodes: + +```yaml +spec: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: app.kubernetes.io/component + operator: In + values: + - redis + - key: redisfailovers.databases.spotahome.com/name + operator: In + values: + - + topologyKey: kubernetes.io/hostname + sentinel: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: app.kubernetes.io/component + operator: In + values: + - sentinel + - key: redissentinels.databases.spotahome.com/name + operator: In + values: + - + topologyKey: kubernetes.io/hostname +``` + +To force anti-affinity across ALL nodes (Data + Sentinel), refer to: + +```yaml +spec: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: middleware.instance/type + operator: In + values: + - redis-failover + - key: middleware.instance/name: + operator: In + values: + - + topologyKey: kubernetes.io/hostname + sentinel: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: middleware.instance/type + operator: In + values: + - redis-failover + - key: middleware.instance/name: + operator: In + values: + - + topologyKey: kubernetes.io/hostname +``` + +## User Management + +Alauda Cache Service for Redis OSS (v6.0+) provides declarative user management via `RedisUser` CRD, supporting ACLs. + +> [!TIP] +> **Compatibility**: Redis 5.0 only supports single-user auth; Redis 6.0+ implements full ACLs for multi-user/granular control. + +### Permission Profiles + +The platform pre-defines permission profiles for common scenarios: + +| Profile | ACL Rule | Explanation | +| :--- | :--- | :--- | +| **NotDangerous** | `+@all -@dangerous ~*` | Allows all commands except dangerous ones (e.g., `FLUSHDB`). | +| **ReadWrite** | `-@all +@write +@read -@dangerous ~*` | Allows read/write, blocks dangerous ops. | +| **ReadOnly** | `-@all +@read -keys ~*` | Allows read-only operations. | +| **Administrator** | `+@all -acl ~*` | Admin privileges, allows all commands except ACL management. | + +For custom ACLs, see [Redis ACL Documentation](https://redis.io/topics/acl). + +### Security Mechanisms + +1. **ACL Force Revocation**: All `RedisUser` creation/updates undergo Webhook validation to **force remove `acl` permissions**, preventing privilege escalation. +2. **Cluster Command Injection**: For **Cluster Mode**, Alauda Cache Service for Redis OSS automatically injects topology commands: `cluster|slots`, `cluster|nodes`, `cluster|info`, `cluster|keyslot`, `cluster|getkeysinslot`, `cluster|countkeysinslot` to ensure client awareness. +3. **6.0 -> 7.2 Upgrade Compatibility**: When upgrading 6.0 -> 7.2, the operator adds `&*` (Pub/Sub Channel) permission to ensure consistency with 7.x's new Channel ACLs. + +### System Account + +Each Redis instance automatically generates a system account named `operator`. Its roles include: + +1. 
**Cluster Init**: Slot assignment, node joining. +2. **Config Simplification**: Unified system account reduces user configuration complexity. +3. **Operations**: Used for health checks, failovers, scaling. +4. **Avoid Restarts**: Password updates for business users don't affect this account, avoiding restarts. + +> [!CAUTION] +> `operator` is a **Reserved System Account**: +> * **Complexity**: Random 64-char string (alphanumeric+special). +> * **Privilege**: Highest level (includes user management). +> * **Restriction**: **No online password update** and **DO NOT manually modify/delete**, as it may cause irreversible failure. + +### Production Best Practices + +1. **App Isolation**: Create **independent user accounts** for each app/microservice. Avoid sharing accounts to enable auditing and isolation. +2. **Principle of Least Privilege**: + * **Read-Only App**: Use `ReadOnly`. + * **Read-Write App**: Use `ReadWrite`. + * **Ops Tools**: Use `NotDangerous` or custom permissions. + * **Avoid `Administrator`**: Unless absolutely necessary. +3. **Key Namespace Isolation**: Combine ACL Key patterns (e.g., `~app1:*`) to restrict apps to specific key prefixes. +4. **Password Rotation**: Establish mechanisms to regularly rotate app passwords. + +For operation steps, see [User Management Docs](../functions/20-user.mdx). + +## Client Access + +### Topology Discovery + +Both **Sentinel** and **Cluster** modes rely on clients actively discovering and connecting to data nodes, differing from traditional LB proxy modes: + +#### Sentinel Mode + +1. Client connects to **Sentinel Node**. +2. Client sends `SENTINEL get-master-addr-by-name mymaster` to get Master **IP/Port**. +3. Client **directly connects** to Master. +4. On failover, Sentinel notifies client (or client polls) to switch to new Master. + +#### Cluster Mode + +1. Client connects to any **Cluster Node**. +2. Sends `CLUSTER SLOTS` / `CLUSTER NODES` to get **Slot Distribution**. +3. Calculates hash slot for Key and **directly connects** to target node. +4. If slot migrates, node returns `MOVED`/`ASK`; client must refresh topology. + +Both protocols return **Real Node IPs**. If a reverse proxy (HAProxy/Nginx) is used, clients still get backend real IPs, which may be unreachable from outside the cluster. +Thus, **Each Redis Pod needs an independent external address** (NodePort/LoadBalancer), not a single proxy address. + +### Network Access strategies + +Alauda Cache Service for Redis OSS supports multiple access methods: + +#### Sentinel Mode + +| Method | Recommended | Description | +| :--- | :--- | :--- | +| **ClusterIP** | ✅ **Internal Preferred** | Access Sentinel via K8s Service (port 26379). Clients auto-discover Master. Lowest latency, highest security. | +| **LoadBalancer** | ✅ **External Preferred** | Exposes Sentinel via MetalLB/Cloud LB. Stable external entry, no port management. | +| **NodePort** | ⚠️ External Backup | Exposes Sentinel via Node ports. Requires manual port management, risky, potential multi-NIC binding issues. | + +#### Cluster Mode + +| Method | Recommended | Description | +| :--- | :--- | :--- | +| **ClusterIP** | ✅ **Internal Preferred** | Access via K8s Service. Client must support Cluster protocol. | +| **LoadBalancer** | ✅ **External Preferred** | Configure LB for each shard Master. Stable external access. Client must handle MOVED/ASK. | +| **NodePort** | ⚠️ External Backup | Expose underlying Pod NodePorts. Client connects directly. Complex port management. 
| + +> [!WARNING] +> **NodePort Notes**: +> * **Port Management**: Range limited (30000-32767), conflicts easy in multi-instance. +> * **Security**: Increases attack surface. +> * **Multi-NIC**: Redis binds default NIC; clients may fail to connect if IPs mismatch. +> * **No LB Proxy**: Sentinel/Cluster protocols require direct node connection; cannot be proxied by standard LBs. + +> [!INFO] +> **Resource Usage**: LB/NodePort creates a **Service per Pod**. +> * **Sentinel** (1M1S + 3 Sentinels): Needs **8 NodePorts/LBs**. +> * **Cluster** (3 Shards x 1M1S): Needs **7 NodePorts/LBs**. + +### Code Examples + +We provide best practice examples for **go-redis**, **Jedis**, **Lettuce**, and **Redisson**: + +* **Sentinel Access**: [How to Access Sentinel Instance](./access/10-sentinel.mdx) +* **Cluster Access**: [How to Access Cluster Instance](./access/20-cluster.mdx) + +> [!INFO] +> **Master Group Name**: In Sentinel mode, the master name is fixed to `mymaster`. + +### Client Reliability Best Practices + +1. **Timeouts** + * **Connect Timeout**: distinct from Read Timeout. Recommend 1-3s. + * **Read/Write Timeout**: Based on SLA, usually hundreds of ms. + +2. **Retry Strategy** + * **Exponential Backoff**: Do not retry immediately on failure; use backoff (100ms, 200ms...) to avoid retry storms. + +3. **Connection Pooling** + * **Reuse**: Always use pooling (JedisPool, go-redis Pool) to save handshake costs. + * **Max Connections**: Set `MaxTotal` reasonably to avoid hitting Redis `maxclients`. + +4. **Topology Refresh (Cluster)** + * **Auto-Refresh**: Ensure client enables `MOVED/ASK` handling. + * **Periodic Refresh**: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes. + +## Observability & Operations + +### Backup & Security + +The platform Backup Center provides convenient data management. You can backup instances, manage them centrally, and support S3 offloading. Support for restoring history to specific instances. + +See **[Backup & Restore](../functions/70-backup-restore.mdx)**. + +### Upgrade & Scaling + +#### Upgrade +See **[Upgrade](../upgrade.mdx)**. + +#### Scaling Notes +When changing specs (CPU/Mem) or expanding: +1. **Assess Resources**: Ensure cluster has capacity. +2. **Progressive**: Rolling updates to minimize interruption. +3. **Off-peak**: Execute during low traffic. + +> [!CAUTION] +> When reducing replicas or specs, ensure current data/load fits new specs to avoid data loss/crash. + +### Monitoring + +Alauda Cache Service for Redis OSS has built-in metrics integrated with Prometheus. + +#### Built-in Metrics + +Variables `{{.namespace}}` and `{{.name}}` should be replaced with actual values. + +##### Key Hit Rate + +* **Desc**: Cache hit rate. +* **Unit**: % +* **Expr**: + ```text + 1/(1+(avg(irate(redis_keyspace_misses_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) by(namespace,service) / (avg(irate(redis_keyspace_hits_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) by(namespace,service)+1))) + ``` + +##### Average Response Time + +* **Desc**: Avg command latency. High = slow queries/bottleneck. +* **Unit**: s +* **Expr**: + ```text + avg((redis_commands_duration_seconds_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"} / redis_commands_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"})) by (namespace,service) + ``` + +##### Role Switching + +* **Desc**: Master-Replica switches in 5m. Non-zero = failover occurred. 
+* **Unit**: Count +* **Expr**: + ```text + sum by(namespace,service) (changes((sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",role="master"}) OR (sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",}) * 0))[5m:10s])) + ``` + +##### Instance Status + +* **Desc**: Health status. 0 = Abnormal. +* **Expr**: + ```text + ((count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster"}) % count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"})) == bool 0 and count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"}) >= bool 3) or (count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"rfr-({{.name}})",redisarch="sentinel",role="master"})) > bool 0 + ``` + +##### Node Input Bandwidth + +* **Desc**: Peak ingress traffic. +* **Unit**: Bps +* **Expr**: + ```text + max by(namespace,service)(irate(redis_net_input_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) + ``` + +##### Node Output Bandwidth + +* **Desc**: Peak egress traffic. +* **Unit**: Bps +* **Expr**: + ```text + max by(namespace,service)(irate(redis_net_output_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) + ``` + +##### Node Connections + +* **Desc**: Peak client connections. Watch if near `maxclients`. +* **Unit**: Count +* **Expr**: + ```text + max by(namespace,service)(redis_connected_clients{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*"}) + ``` + +##### CPU Usage + +* **Desc**: Node CPU usage. Sustained high = perf impact. +* **Unit**: % +* **Expr**: + ```text + avg by(namespace,pod_name)(irate(container_cpu_usage_seconds_total{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}[5m]))/avg by(namespace,pod_name)(container_spec_cpu_quota{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"})*100000 + ``` + +##### Memory Usage + +* **Desc**: Node memory usage. >80% suggest scaling. +* **Unit**: % +* **Expr**: + ```text + avg by(namespace,pod_name)(container_memory_usage_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"} - container_memory_cache{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}) / avg by(namespace,pod_name)(container_spec_memory_limit_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}) + ``` + +##### Storage Usage + +* **Desc**: PVC usage. Full = persistence failure. +* **Unit**: % +* **Expr**: + ```text + avg(kubelet_volume_stats_used_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim) / avg(kubelet_volume_stats_capacity_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim) + ``` + +#### Key Metrics & Alert Recommendations + +Recommended production alerts: + +| Metric | Threshold | Note | +| :--- | :--- | :--- | +| **Memory Usage** | > 80% | Risk of eviction/OOM. | +| **CPU Usage** | > 80% (Sustained) | Latency spikes. | +| **Hit Rate** | < 80% | Strategy issue or capacity missing. | +| **Failovers** | > 0 | Check network/node health. 
| +| **Connections** | Near maxclients | New connections rejected. | +| **Storage Usage** | > 80% | Ensure space for AOF/RDB. | +| **Response Time** | > 10ms | Slow queries/bottlenecks. | + +### Troubleshooting + +For specific issues, search the [Customer Portal](https://cloud.alauda.cn/kb). + +## References + +[https://redis.io/docs/](https://redis.io/docs/) + +[https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) + +[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) \ No newline at end of file From 4944d48ede4b0a611c318e0322ab046070ff2753 Mon Sep 17 00:00:00 2001 From: Seer Date: Wed, 10 Dec 2025 18:04:36 +0800 Subject: [PATCH 3/9] Update docs/en/how_to/redis-best-practices.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/en/how_to/redis-best-practices.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index 40be42e..27e45f7 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -288,7 +288,7 @@ Deploy resources according to your actual business scenario. rdb-redis-<version>-sentinel 4c8g 4c8g - 8.5c8.64g + 8.5c8.8g 16G 56G 56G From 7c3144796b59c77b1313f91d2763ea5d7674d787 Mon Sep 17 00:00:00 2001 From: Seer Date: Wed, 10 Dec 2025 18:04:57 +0800 Subject: [PATCH 4/9] Update docs/en/how_to/redis-best-practices.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/en/how_to/redis-best-practices.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index 27e45f7..b7cdb2d 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -685,7 +685,7 @@ We provide best practice examples for **go-redis**, **Jedis**, **Lettuce**, and 4. **Topology Refresh (Cluster)** * **Auto-Refresh**: Ensure client enables `MOVED/ASK` handling. - * **Periodic Refresh**: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes. + * **Periodic refresh**: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes. ## Observability & Operations From 6fda14c931c3c83e068b22e625e73c493e52b3b9 Mon Sep 17 00:00:00 2001 From: Seer Date: Wed, 10 Dec 2025 18:05:13 +0800 Subject: [PATCH 5/9] Update docs/en/how_to/redis-best-practices.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/en/how_to/redis-best-practices.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index b7cdb2d..e45c988 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -628,7 +628,7 @@ Both **Sentinel** and **Cluster** modes rely on clients actively discovering and Both protocols return **Real Node IPs**. If a reverse proxy (HAProxy/Nginx) is used, clients still get backend real IPs, which may be unreachable from outside the cluster. Thus, **Each Redis Pod needs an independent external address** (NodePort/LoadBalancer), not a single proxy address. 
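To make the discovery flow above concrete, here is a minimal go-redis (v9) sketch of the Sentinel path: the client asks the Sentinel endpoints for the current master of `mymaster` and then talks to that node directly, re-resolving the address after a failover. The Sentinel address, password, key name, and timeout values below are illustrative placeholders rather than platform defaults; Jedis, Lettuce, and Redisson expose equivalent sentinel/failover modes.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Failover (Sentinel) client: it queries the Sentinel endpoints for the
	// current master of "mymaster" and connects to that node directly,
	// re-resolving the master address after a failover.
	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName: "mymaster", // master group name is fixed in Sentinel mode
		// Placeholder Sentinel endpoint (ClusterIP Service, or per-Pod LB/NodePort).
		SentinelAddrs: []string{"redis-sentinel.my-namespace.svc:26379"},
		Password:      "app-user-password", // business account, not the reserved operator account
		DialTimeout:   2 * time.Second,
		ReadTimeout:   500 * time.Millisecond,
		WriteTimeout:  500 * time.Millisecond,
	})
	defer rdb.Close()

	// Simple round trip to verify the resolved master is reachable.
	if err := rdb.Set(ctx, "app1:greeting", "hello", time.Minute).Err(); err != nil {
		panic(err)
	}
	val, err := rdb.Get(ctx, "app1:greeting").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("read back:", val)
}
```

The same consideration applies to Cluster mode: the client library, not a load balancer, must be able to reach every node address returned by `CLUSTER SLOTS`.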
-### Network Access strategies +### Network Access Strategies Alauda Cache Service for Redis OSS supports multiple access methods: From dca6a234f604a28a2ef50b187355f6660660ae87 Mon Sep 17 00:00:00 2001 From: hfxia Date: Wed, 10 Dec 2025 18:06:27 +0800 Subject: [PATCH 6/9] fix: fix affinity label error --- docs/en/how_to/redis-best-practices.mdx | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index e45c988..4d8056b 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -498,7 +498,7 @@ spec: operator: In values: - - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname sentinel: affinity: podAntiAffinity: @@ -513,7 +513,7 @@ spec: operator: In values: - - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname ``` To force anti-affinity across ALL nodes (Data + Sentinel), refer to: @@ -529,11 +529,11 @@ spec: operator: In values: - redis-failover - - key: middleware.instance/name: + - key: middleware.instance/name operator: In values: - - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname sentinel: affinity: podAntiAffinity: @@ -544,11 +544,11 @@ spec: operator: In values: - redis-failover - - key: middleware.instance/name: + - key: middleware.instance/name operator: In values: - - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname ``` ## User Management @@ -830,4 +830,4 @@ For specific issues, search the [Customer Portal](https://cloud.alauda.cn/kb). [https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) -[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) \ No newline at end of file +[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) From a2114a31df412806b42d0accede1a416e6aa833b Mon Sep 17 00:00:00 2001 From: Seer Date: Wed, 10 Dec 2025 18:06:56 +0800 Subject: [PATCH 7/9] Update docs/en/how_to/redis-best-practices.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/en/how_to/redis-best-practices.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index 4d8056b..0ec4002 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -684,7 +684,7 @@ We provide best practice examples for **go-redis**, **Jedis**, **Lettuce**, and * **Max Connections**: Set `MaxTotal` reasonably to avoid hitting Redis `maxclients`. 4. **Topology Refresh (Cluster)** - * **Auto-Refresh**: Ensure client enables `MOVED/ASK` handling. + * **Auto-refresh**: Ensure client enables `MOVED/ASK` handling. * **Periodic refresh**: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes. 
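As a minimal sketch of the reliability settings above, the go-redis (v9) Cluster client below sets explicit connect/read/write timeouts, bounded retries with exponential backoff, and a capped connection pool. The endpoint, password, and sizing values are placeholder assumptions to be tuned against your own SLA and the server `maxclients` limit; go-redis refreshes its slot map when it receives `MOVED`/`ASK` redirections, while a fixed periodic refresh interval is a client-specific option (for example in Lettuce) that is not shown here.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Cluster client with the reliability settings recommended above.
	// The address and password are placeholders; list one reachable endpoint
	// (or the ClusterIP Service) and the client discovers the remaining nodes.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:    []string{"redis-cluster.my-namespace.svc:6379"},
		Password: "app-user-password",

		// Timeouts: keep connect distinct from read/write, and keep both tight.
		DialTimeout:  2 * time.Second,
		ReadTimeout:  500 * time.Millisecond,
		WriteTimeout: 500 * time.Millisecond,

		// Retries with exponential backoff to avoid retry storms.
		MaxRetries:      3,
		MinRetryBackoff: 100 * time.Millisecond,
		MaxRetryBackoff: 1 * time.Second,

		// Connection pooling: keep the pool well below the server maxclients.
		PoolSize:     50,
		MinIdleConns: 5,
	})
	defer rdb.Close()

	// MOVED/ASK redirections returned during resharding are handled by the
	// client, which refreshes its slot map and re-issues the command.
	if err := rdb.Set(ctx, "app1:counter", 1, 0).Err(); err != nil {
		panic(err)
	}
	n, err := rdb.Incr(ctx, "app1:counter").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("counter:", n)
}
```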
## Observability & Operations From 40f6e1ba9368f5c32ba62bcd1a69afcc6c1078d2 Mon Sep 17 00:00:00 2001 From: hfxia Date: Thu, 11 Dec 2025 10:07:10 +0800 Subject: [PATCH 8/9] fix: fix lint --- docs/en/how_to/redis-best-practices.mdx | 42 ++++++++++++------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index 0ec4002..93dcf02 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -265,21 +265,21 @@ Deploy resources according to your actual business scenario. 100m200Mi 2c4g Unlimited (Reserve resources) - 4.5c4.8g + 4.5c4.8G Evaluate based on actual write volume aof-redis-<version>-sentinel 4c8g 4c8g - 8.5c8.8g + 8.5c8.8G RDB rdb-redis-<version>-sentinel 2c4g 2c4g - 4.5c4.8g + 4.5c4.8G 8G 28G 28G @@ -288,7 +288,7 @@ Deploy resources according to your actual business scenario. rdb-redis-<version>-sentinel 4c8g 4c8g - 8.5c8.8g + 8.5c8.8G 16G 56G 56G @@ -298,7 +298,7 @@ Deploy resources according to your actual business scenario. diskless-redis-<version>-sentinel 2c4g 2c4g - 4.5c4.8g + 4.5c4.8G / 28G 28G @@ -307,7 +307,7 @@ Deploy resources according to your actual business scenario. diskless-redis-<version>-sentinel 4c8g 4c8g - 8.5c8.8g + 8.5c8.8G 56G 56G @@ -341,21 +341,21 @@ Deploy resources according to your actual business scenario. 100m300Mi 2c4g Unlimited (Reserve resources) - 12.6c25.8g + 12.6c25.8G Evaluate based on actual write volume aof-redis-<version>-cluster 4c8g 4c8g - 24.6c49.8g + 24.6c49.8G RDB rdb-redis-<version>-cluster 2c4g 2c4g - 12.6c25.8g + 12.6c25.8G 24G 84G 84G @@ -364,7 +364,7 @@ Deploy resources according to your actual business scenario. rdb-redis-<version>-cluster 4c8g 4c8g - 24.6c49.8g + 24.6c49.8G 48G 168G 168G @@ -374,7 +374,7 @@ Deploy resources according to your actual business scenario. diskless-redis-<version>-cluster 2c4g 2c4g - 12.6c25.8g + 12.6c25.8G / 84G 84G @@ -383,7 +383,7 @@ Deploy resources according to your actual business scenario. diskless-redis-<version>-cluster 4c8g 4c8g - 24.6c49.8g + 24.6c49.8G 168G 168G @@ -435,7 +435,7 @@ In Cluster mode, the system **prioritizes `spec.affinityPolicy`**. Alauda Cache All Pods Forced Anti-Affinity AntiAffinity - Forces ALL Pods in the cluster (including master/replicas of different shards) to be on different nodes. Fails if node count < total Pod count. + Forces ALL Pods in the cluster (including primary/replicas of different shards) to be on different nodes. Fails if node count < total Pod count.
  • Pros: Highest disaster recovery, minimal single-node failure impact.
  • @@ -445,25 +445,25 @@ In Cluster mode, the system **prioritizes `spec.affinityPolicy`**. Alauda Cache Cluster Mode Core Business
    Ample resources, strict HA requirements. - Shard Master-Replica Forced Anti-Affinity + Shard Primary-Replica Forced Anti-Affinity AntiAffinityInSharding - Forces Master and Replicas within same shard to be on different nodes. Pods from different shards can coexist. + Forces Primary and Replicas within same shard to be on different nodes. Pods from different shards can coexist.
    • Pros: Guarantees physical isolation of each shard's replicas, so the shard's data survives the failure of any single node.<br />
    • -
    • Cons: Scheduling fails if live nodes < replica count. Masters of different shards might land on same node (SPOF risk).
    • +
    • Cons: Scheduling fails if live nodes < replica count. Primaries of different shards might land on same node (single point of failure risk).
    Production Standard
    Balances resource usage and data safety. - Shard Master-Replica Soft Anti-Affinity + Shard Primary-Replica Soft Anti-Affinity SoftAntiAffinity - Prioritizes spreading shard master/replicas. If impossible (e.g., insufficient nodes), allows scheduling on same node. + Prioritizes spreading shard primary/replicas. If impossible (e.g., insufficient nodes), allows scheduling on same node.
    • Pros: Highest deployment success rate, runs with limited resources.
    • -
    • Cons: Master/Replica may share node in extreme cases, risking data loss.
    • +
    • Cons: Primary/Replica may share node in extreme cases, risking data loss.
    Test/Dev Environments
    Or resource-constrained edge environments. @@ -657,8 +657,8 @@ Alauda Cache Service for Redis OSS supports multiple access methods: > [!INFO] > **Resource Usage**: LB/NodePort creates a **Service per Pod**. -> * **Sentinel** (1M1S + 3 Sentinels): Needs **8 NodePorts/LBs**. -> * **Cluster** (3 Shards x 1M1S): Needs **7 NodePorts/LBs**. +> * **Sentinel** (1P1R + 3 Sentinels): Needs **8 NodePorts/LBs**. +> * **Cluster** (3 Shards x 1P1R): Needs **7 NodePorts/LBs**. ### Code Examples From 4793fdb2bfa18c48c68e86a618dca31cf3e86767 Mon Sep 17 00:00:00 2001 From: hfxia Date: Thu, 11 Dec 2025 11:32:25 +0800 Subject: [PATCH 9/9] fix: updated the references --- ...redis-best-practices.mdx => best-practices.mdx} | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) rename docs/en/how_to/{redis-best-practices.mdx => best-practices.mdx} (98%) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/best-practices.mdx similarity index 98% rename from docs/en/how_to/redis-best-practices.mdx rename to docs/en/how_to/best-practices.mdx index 93dcf02..74aa922 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/best-practices.mdx @@ -1,9 +1,9 @@ --- weight: 5 -title: Redis Best Practices +title: Best Practices --- -# Redis Best Practices +# Best Practices ## Overview @@ -826,8 +826,8 @@ For specific issues, search the [Customer Portal](https://cloud.alauda.cn/kb). ## References -[https://redis.io/docs/](https://redis.io/docs/) - -[https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) - -[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) +* [High availability with Redis Sentinel](https://redis.io/docs/latest/operate/oss_and_stack/management/sentinel/) +* [Redis cluster specification](https://redis.io/docs/latest/operate/oss_and_stack/reference/cluster-spec/) +* [Redis persistence](https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/) +* [Scale with Redis Cluster](https://redis.io/docs/latest/operate/oss_and_stack/management/scaling/) +* [Optimizing Redis](https://redis.io/docs/latest/operate/oss_and_stack/management/optimization/)