From 1171f3458e1a3c2211155e9d136e105dc9e4f431 Mon Sep 17 00:00:00 2001 From: hfxia Date: Tue, 9 Dec 2025 16:14:58 +0800 Subject: [PATCH 1/9] backup --- docs/en/how_to/90-redis-references.mdx | 401 +++++++++++++++++++++++++ 1 file changed, 401 insertions(+) create mode 100644 docs/en/how_to/90-redis-references.mdx diff --git a/docs/en/how_to/90-redis-references.mdx b/docs/en/how_to/90-redis-references.mdx new file mode 100644 index 0000000..5f37e50 --- /dev/null +++ b/docs/en/how_to/90-redis-references.mdx @@ -0,0 +1,401 @@ +# Redis 最佳实践 + +# 1. 概述 (Overview) + +Redis 作为云原生架构中事实标准的缓存与键值存储组件,承载着业务高并发读写与低延迟的核心需求。在 Kubernetes 容器化环境下运行有状态(Stateful)的 Redis 服务,面临着与传统物理机环境截然不同的挑战,包括**存储持久化的稳定性**、**网络拓扑的动态变化**以及**资源隔离与调度**等问题。 + +本最佳实践文档旨在为生产环境下的 Redis 部署提供一套标准化的参考指南。涵盖从**架构选型**、**资源规划**、**客户端对接**到**可观测性运维**的全生命周期管理规范。通过遵循本指南,旨在帮助用户构建一个**高可用 (High Availability)**、**高性能 (High Performance)** 且**易维护 (Maintainability)** 的企业级 Redis 数据服务。 + +# 2. 架构设计 (Architecture Design) + +全栈云原生开放平台根据客户业务规模与 SLA 需求,提供两种标准的 Redis 管理架构: + +## 2.1 高可用架构:哨兵模式 (HA: Sentinel) +**定位:经典高可用架构,适用于中小规模业务。** + +哨兵模式基于 Redis 原生的主从复制机制。通过部署独立的 Sentinel 进程组监控主从节点状态,在主节点故障时自动执行 Failover(故障转移)并通知客户端。 +* **优点**:架构简单,运维成熟,对客户端协议要求较低。 +* **局限**:写能力受限于单节点,无法水平扩展存储容量。 + +## 2.2 分布式架构:集群模式 (Distributed: Cluster) +**定位:分布式分片架构,适用于大规模高并发业务。** + +集群模式通过 Hash Slot(哈希槽)将数据自动分片到多个节点,实现存储容量和读写性能的水平扩展(Scale-out)。 +* **优点**:真正的高可用分布式存储,支持动态扩缩容(Resharding)。 +* **局限**:客户端协议复杂,特定多 Key 命令(如 `MGET`)受限于 Slot 分布。 + +## 2.3 选型指南 (Selection Guide) + +在选择 Redis 架构时,需要考虑业务对可用性、扩展性和复杂度的要求。 + +| 特性 | 哨兵模式 (Sentinel) | 集群模式 (Cluster) | +| :--- | :--- | :--- | +| **适用场景** | 中小型业务,读多写少,数据量适中 | 大型业务,高并发读写,海量数据 | +| **高可用性** | 通过哨兵监控和自动故障转移实现 | 通过节点自动故障检测和恢复实现 | +| **扩展性** | 垂直扩展(升级规格),水平扩展仅限读能力 | 水平扩展(读写能力),支持动态扩缩容 | +| **读写分离** | 支持(需要客户端配合或中间件) | 不支持(通常直接连接分片主节点) | +| **数据分片** | 无(单节点存储全量数据) | 有(数据自动分片到多个节点) | +| **运维复杂度** | 较低,架构简单 | 较高,涉及分片、哈希槽、数据迁移等 | +| **网络限制** | 需要客户端支持 Sentinel 协议 | 需要客户端支持 Cluster 协议 | + +**建议:** +* 如果数据量较小(单节点内存可容纳),且主要追求简单稳定,**哨兵模式**是首选。 +* 如果数据量巨大或写入压力极高,单节点无法支撑,选择**集群模式**。 + +# 3. 
部署配置 (Deployment Configuration) + +## 3.1 实例部署 (Deployment) + +在 **Data Services** 页面,单击 **Redis** ,选择您的 项目 及 命名空间 后,单击 创建 **Redis** 实例 ,根据实际业务场 +景配置实例,推荐您使用最新的 Redis 6.0 版本。 + +## 3.2 参数模板选择 (Parameter Templates) + +Redis 实例的参数配置通过 Custom Resource (CR) 的字段进行指定。 + +* **数据节点参数**:通过 `spec.customConfig` 字段配置。例如设置 `save` 策略: + ```yaml + spec: + customConfig: + save: "600 1" + ``` +* **哨兵节点参数**(仅哨兵模式):通过 `spec.sentinel.monitorConfig` 字段配置。支持 `down-after-milliseconds`, `failover-timeout`, `parallel-syncs` 等关键参数。 + +选择模板时,应根据业务对持久化(AOF vs RDB)和性能的权衡来决定。 + +## 3.3 资源规格配置 (Resource Specs) + +您可以根据业务场景的实际情况,部署资源规格。 + +| 集群架构 | 持久化 | 模板 | 实例规格 | Replica | Sentinel | Sharding | Sentinel容器组 | redis-exporter | redis (实例规格) | 备份容器 (backup) | 资源总量 | 实例存储配额 | 自动备份 (保留7份) | 手动备份 (保留7份) | +|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| +| Redis Sentinel | AOF | aof-redis-6.0-sentinel | 2c4g | 1 | 3 | / | 100m128Mi | 100m200Mi | 2c4g | 不限制(需要预留资源) | 4.5c4.8g | 需要根据实际写入量评估 | | | +| | | aof-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.8g | | | | +| | RDB | rdb-redis-6.0-sentinel | 2c4g | | | | | | 2c4g | | 4.5c4.8g | 8G | 28G | 28G | +| | | rdb-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.64g | 16G | 56G | 56G | +| | Diskless | diskless-redis-6.0-sentinel | 2c4g | | | | | | 2c4g | | 4.5c4.8g | / | 28G | 28G | +| | | diskless-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.8g | | 56G | 56G | +| Redis Cluster | AOF | aof-redis-6.0-cluster | 2c4g | / | 3 | / | | 100m300Mi | 2c4g | | 12.6c25.8g | 需要根据实际写入量评估 | | | +| | | aof-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | | | | +| | RDB | rdb-redis-6.0-cluster | 2c4g | | | | | | 2c4g | | 12.6c25.8g | 24G | 84G | 84G | +| | | rdb-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | 48G | 168G | 168G | +| | Diskless | diskless-redis-6.0-cluster | 2c4g | | | | | | 2c4g | | 12.6c25.8g | / | 84G | 84G | +| | | diskless-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | | 168G | 168G | + +## 3.4 调度策略配置 (Scheduling) + +Redis 集群模式提供三种推荐反亲和策略,所有 Pod 强制反亲和、分片内主从强制反亲和以及分片内主从尽量 +反亲和。 + +**所有 Pod 强制反亲和** + +Redis Cluster 实例会将 Redis 强制分布到不同的节点上。如图所示,假设有三个节点,每个节点上可以 +调度三个 Pod。然而,在本例中,Redis Cluster 策略被设置为强制 Pod 反亲和性。由于这个设置,当尝 +试在相同节点上调度不同的 Redis 分片副本时,部署失败。 + +在可视化图中,三个主节点(master)分布在三个不同的节点上,每个节点有三个可用的 Pod 位置。同 +时,三个从节点(replica)都无法被调度,因为它们与已存在的主节点存在反亲和性约束,会导致部署 +失败。 + +**分片内主从强制反亲和** + +分片内主从强制反亲和策略会将同一个分片的 Redis 节点分布到不同的节点上,确保即使某个节点出现 +故障,数据仍然具有完整性。 + +如图所示,假设有三个节点,每个节点可以调度三个 Pod。在 Node2 和 Node3 上的 Pod 位置已经被占 +满,而 Node1 上仍有空余的 Pod 资源。然而,由于 Redis shard-2 主节点(master)位于 Node1 上,并 +且分片内主从节点被设置为强制反亲和性,Redis shard-2 从节点(replica)不能被调度到 Node1 上。这 +导致 Redis shard-2 从节点无法找到可用的节点进行调度,从而导致部署失败。 + +**分片内主从尽量反亲和** + +在 Redis Cluster 中,为了提高数据的可靠性和容错能力,我们可以采用“分片内主从尽量反亲和”的策 +略。这意味着同一个分片内的主从节点尽量被调度到不同的节点上。这样,即使某个节点出现故障,数 +据仍然具有完整性。 + +采用“分片内主从尽量反亲和”的策略时,调度器会优先尝试将主从节点部署在不同的节点上。然而,如果 +无法找到符合条件的节点,调度器仍然会允许将主从节点部署在同一个节点上。这种策略在节点资源有 +限的情况下,可以确保 Redis 集群的正常部署和运行,同时在尽量遵循反亲和原则的基础上,提高数据 +的可靠性。 + +如图,Node1 在资源不足的情况下部署了 redis shard-2 的主从节点。 + +您可以根据下表,对您的实际业务情景设置集群模式 Redis 的反亲和策略。 + +|所有 Pod 之间强制要求在
不同节点上部署,不允许在
同一节点上部署。|可以最大程度地
保证负载均衡和
系统的高可用
性。|节点资源有限时可
能导致部署失败。|保障
自愈| +|---|---|---|---| +|同一个分片内的主从节点必
须部署在不同节点上。|保证分片内主从
节点故障隔离,
提高数据可靠
性。|节点资源有限时可
能导致部署失败。|有可
能自
愈| + +在哨兵模式下,默认采用分片内主从尽量反亲和策略。这意味着同一个分片内的主从节点会尽量部署在不同的 +节点上,但在资源不足的情况下,也允许主从节点部署在同一个节点上。采用这种策略的目的是为了在保证数 +据可靠性的同时,适应有限的节点资源。 + +根据上文描述,分片内主从尽量反亲和策略在资源有限的情况下,可以在保证集群正常部署的基础上,尽可能 +提高数据可靠性。这种策略在某些情况下可以实现单点故障的自愈。然而,在同一个节点上部署主从节点时, +数据可靠性相对较低。因此,在实际应用中,我们需要根据实际情况和需求来选择合适的策略,以实现 Redis +高可用性和故障自动切换的目标。 + +# 4. 资源规划 (Resource Planning) + +## 4.1 内存规格 (Memory) + +Redis 采用快照机制将内存中的数据异步复制到磁盘上进行长期存储。这种机制使得 Redis 保持了高性能,但 +也存在在快照之间可能丢失数据的风险。 + +在 Kubernetes 容器化环境下,我们建议采用分级内存管理策略: +* **✅ 标准规格 (< 8GB)**:**强烈推荐**。能够保证极低的 Fork 延迟和快速的故障恢复(RTO < 60s),是最稳健的生产选择。 +* **⚠️ 高性能规格 (8GB - 16GB)**:**可接受**。需配合高性能宿主机且**必须关闭 THP**。在此规格下,Fork 虽然可控,但在高负载时仍可能造成 100ms 级别的抖动。 +* **❌ 高风险规格 (> 16GB)**:**不推荐**。单点故障影响面过大,且全量同步极易打满网络带宽。建议水平拆分为 Cluster 模式。 + +### 为什么限制 8GB?(技术深究) +虽然物理机时代单实例常运行 32GB+,但在云原生环境下,8GB 限制是基于以下核心技术原理的“黄金法则”: + +1. **Fork 阻塞与页表复制 (The Fork Blocking)** + * Redis 执行 RDB/AOF Rewrite 时调用 `fork()`。虽然内存页是 CoW(写时复制),但**进程页表 (Page Table) 必须全量拷贝**,此过程会**阻塞主线程**。 + * *估算公式*:10GB 内存 ≈ 20MB 页表 ≈ 10~50ms 阻塞(视虚拟化层开销而定)。一旦超过 8GB,分钟级的阻塞风险将呈指数级上升,严重影响 SLA。 + +2. **故障恢复效率 (RTO)** + * 容器重启加载 RDB 是**单线程 CPU 密集型任务**(对象反序列化)。测试表明,受限于 CPU 解析效率,加载 8GB 数据通常需 **30-50秒**(即使使用 SSD)。若维持在 32GB,重启时间可能长达数分钟,这与 Kubernetes "快速自愈" 的设计理念背道而驰。 + +### 内存配置最佳实践 (MaxMemory vs Limit) + +为了避免 Redis 在持久化期间因内存膨胀而导致 OOM(OOM Kill),需严格遵循以下配置原则: + +1. **设置 MaxMemory**: 不要将 `maxmemory` 设置为容器内存 Limit 的 100%。建议设置为 Limit 的 **70% ~ 80%**。 +2. **预留 CoW 空间**: Redis 在 RDB/AOF Rewrite 时会 Fork 子进程。若此时有大量写入更新,OS 的写时复制 (Copy-on-Write) 机制会导致内存页被复制,极端情况下内存占用可能由 8GB 翻倍至 16GB。 +3. **Overcommit 配置**: 确保宿主机开启 `vm.overcommit_memory = 1`,允许内核在 fork 时无需申请等量物理内存(依赖 CoW),防止 fork 失败。 + +> [!IMPORTANT] +> **资源预留公式**: `Container_Memory_Limit` ≈ `Redis_MaxMemory` / 0.7 +> * 示例: 如果希望 Redis 存储 8GB 数据,容器内存 Limit 建议配置为 10GB ~ 12GB,留出 2GB+ 给 CoW 和碎片开销。 + +## 4.2 计算资源 (Compute) + +### 基础建议:2 Core +Redis 核心命令执行是单线程的,但持久化(Fork)等操作需要子进程。因此,建议为每个 Redis 实例分配 **至少 2 Core** 的 CPU 资源: +* **Core 1**: 处理主线程的请求与命令。 +* **Core 2**: 应对持久化 fork、后台任务以及系统开销。 + +### 多线程 (Multi-threading) +Redis 6.0+ 引入了多线程 I/O(默认关闭)。 +* **作用**: 利用多核处理网络数据的读写(Socket I/O),优化单核 IO 限制。 +* **局限性**: 命令执行仍然是单线程的。因此,增加 CPU 核心数只能提升网络吞吐,无法提升命令处理速度(Latency)。 +* **配置建议**: 如果网络是瓶颈(如大流量场景),可开启多线程,但建议总核心数配置不超过 4~8 Core,过多的核心并不能带来线性提升。 + +## 4.3 存储规划 (Storage) + +### 容量规划 (Capacity) +持久化模式直接决定了磁盘空间的配额需求,请参考以下计算公式: + +| 模式 | 建议磁盘配额公式 | 详细说明 | +| :--- | :--- | :--- | +| **Diskless (纯缓存)** | `0` (无需 PVC) | 仅作为纯缓存使用,不开启 RDB/AOF。K8s 环境下日志通过标准输出 (stdout) 采集,无需挂载持久化数据盘。 | +| **RDB (快照)** | `MaxMemory * 2` | RDB 也是写时复制(CoW)。在生成新快照时,磁盘上同时存在“旧快照”和“正在写入的新快照”。
**建议**:至少预留 2 倍内存空间。 | +| **AOF (追加写)** | `MaxMemory * 3` | AOF 文件会随着写操作不断增长。默认配置下 (`auto-aof-rewrite-percentage 100`),AOF 文件会增长到数据量的 **2倍** 才触发重写。此时磁盘需同时容纳:
1. 旧 AOF 文件 (2x)
2. 重写生成的新 AOF 文件 (1x)
**峰值总计 3x**。建议预留至少 3 倍空间以防磁盘写满。 | + +### 性能要求 (IOPS & Latency) +* **AOF 开启时**: 磁盘性能至关重要。如果磁盘 IOPS 不足或 fsync 延迟过高,会直接阻塞主线程(当 `appendfsync everysec` 时)。 +* **推荐介质**: 生产环境强烈建议使用 SSD/NVMe 本地盘或高性能云盘。 + +## 4.4 内核参数调优 (Kernel Tuning) + +为了确保 Redis 在生产环境中的稳定性和高性能,建议在 Kubernetes 节点层面进行以下内核参数优化: + +1. **内存分配控制 (`vm.overcommit_memory`)** + * **建议值**: `1` + * **说明**: 设置为 `1` (Always) 确保内核在 Redis 执行 Fork 操作(RDB 快照/AOF 重写)时允许分配内存,即使物理内存看似不足。这能有效防止因内存分配失败导致的持久化失败。 + +2. **连接队列大小 (`net.core.somaxconn`)** + * **建议值**: `2048` 或更高 + * **说明**: Redis 默认的 tcp-backlog 为 511。在高并发场景下,应将系统的 `net.core.somaxconn` 提高,以避免客户端连接请求被丢弃。 + +3. **透明大页 (Transparent Huge Pages, THP)** + * **建议操作**: **禁用** (`never`) + * **说明**: THP 会导致 Redis 在发生内存分配时产生显著的延迟尖峰,尤其是在 Fork 后的写时复制(CoW)期间。建议在宿主机或启动脚本中禁用。 + + +# 5. 客户端接入 (Client Integration) + +## 5.1 网络访问策略 (Network Access) + +您可以根据下表,选择适合实际业务尝尽的访问方式。 + +|架构|访问方式|推荐|特性说明| +|---|---|---|---| +|哨兵
模式|集群内访
问|哨兵访问地址|用于从集群内连接哨兵(哨兵会去发现底层的数据节
点)。提供包括哨兵节点对应的内部路由的名称、IP地
址和端口号。 端口号默认为 26379。| +|集群
模式|集群外访
问—无代理|集群模式暴露 底层数据节
点(Pod)的 NodePort|暴露底层数据节点 的 NodePort,直连底层数据节点,
Pod本身不具备高可用性,但可以写多个节点地址。
应用连接时需要写多个 节点IP+端口,部分节点宕机
时,连接仍保留。| + +> [!TIP] +> **外网访问推荐**:对于生产环境的集群外访问,**强烈推荐使用 LoadBalancer 方式**。 +> * **LoadBalancer**:为 Redis 实例配置外部负载均衡器,提供稳定的访问入口。 +> * **NodePort 限制**:虽然支持 NodePort,但客户端需要直接连接到具体的节点端口。在多网卡环境或网络策略复杂的场景下,直接使用 NodePort 可能会因为网络接口绑定问题导致连接失败(Redis Sentinel/Cluster 默认绑定默认网卡)。且 NodePort 不支持通过额外的负载均衡器进行代理(因为 Sentinel/Cluster 协议要求客户端直连感知拓扑)。 + +## 5.2 代码接入示例 (Code Examples) + +以下是使用 **Go (go-redis/v9)** 客户端连接 Redis 的示例代码: + +#### 哨兵模式 (Sentinel Mode) + +哨兵模式下,客户端连接 Sentinel 节点,通过 `mymaster`(默认主节点组名)发现主节点。 + +```go +package main + +import ( + "context" + "fmt" + "time" + "github.com/redis/go-redis/v9" +) + +func main() { + client := redis.NewFailoverClient(&redis.FailoverOptions{ + SentinelAddrs: []string{":26379"}, // 替换为 Sentinel 地址 + MasterName: "mymaster", // 固定名称 + Password: "", // Redis 密码 + DialTimeout: 3 * time.Second, + ReadTimeout: 5 * time.Second, + WriteTimeout: 10 * time.Second, + PoolSize: 100, + }) + defer client.Close() + + if val, err := client.Get(context.TODO(), "test").Result(); err != nil { + panic(err) + } else { + fmt.Println(val) + } +} +``` + +#### 集群模式 (Cluster Mode) + +集群模式下,客户端连接任一集群节点获取拓扑。建议配置多个种子节点地址。 + +```go +package main + +import ( + "context" + "fmt" + "time" + "github.com/redis/go-redis/v9" +) + +func main() { + client := redis.NewClusterClient(&redis.ClusterOptions{ + Addrs: []string{":6379"}, // 替换为集群节点地址 + Password: "", + DialTimeout: 3 * time.Second, + ReadTimeout: 5 * time.Second, + WriteTimeout: 10 * time.Second, + PoolSize: 100, + }) + defer client.Close() + + if val, err := client.Get(context.TODO(), "test").Result(); err != nil { + panic(err) + } else { + fmt.Println(val) + } +} +``` + +## 5.3 客户端可靠性最佳实践 + +除了正确的连接配置,客户端的健壮性设计对于保障业务稳定性至关重要: + +1. **超时设置 (Timeouts)** + * **连接超时 (Connect Timeout)**: 建议设为 1-3秒,避免网络不可达时长时间阻塞。 + * **读写超时 (Read/Write Timeout)**: 根据业务 SLA 设置,通常只有几百毫秒。不要设置过长,防止慢查询拖垮应用线程池。 + +2. **重试策略 (Retry Strategy)** + * **指数退避 (Exponential Backoff)**: 遇到网络抖动或主从切换时,不要立即重试,而应使用指数退避算法(如 wait 100ms, 200ms, 400ms...),避免重试风暴加剧 Redis 负载。 + +3. **连接池 (Connection Pooling)** + * **预热与复用**: 务必使用连接池(如 JedisPool, go-redis Pool)。Redis 连接握手成本较高,频繁创建/销毁连接会严重影响性能。 + * **最大连接数**: 根据客户端并发量合理设置 `MaxTotal`,避免超过 Redis 的 `maxclients` 限制。 + +4. **拓扑刷新机制 (Topology Refresh) - 集群模式关键** + * **自动刷新**: 确保客户端开启了基于 `MOVED/ASK` 转向错误的自动拓扑更新功能(绝大多数 Smart Client 默认开启)。 + * **周期性刷新**: 在拓扑变化频繁(如弹性伸缩)或网络不稳定的场景下,建议配置客户端进行周期性拓扑刷新(例如每 60 秒),以主动感知节点变更,减少请求重定向带来的延迟。 + + + + + +# 6. 可观测性与运维 (Observability & Operations) + +## 6.1 数据安全与备份 + +平台备份中心为您提供便捷的数据备份和管理解决方案,您可以在各实例下进行备份,并在备份中心中统一管 +理。此外,备份中心支持外接S3存储,可实现安全备份和恢复。您还可以恢复历史备份到特定实例,以及查看 +和删除备份记录。 + +## 6.3 认证与权限管理 + +Redis Operator(6.0+ 版本)支持通过 `RedisUser` CRD 进行声明式用户管理,支持 ACL 权限控制。 + +**管理操作:** +1. **查看用户**:查看 `RedisUser` 资源。 + ```bash + kubectl -n get RedisUser + ``` +2. **修改密码**:密码存储在 Secret 中。找到 `RedisUser` 关联的 `passwordSecrets`,修改对应的 Secret 内容(Base64 编码)。修改后,Operator 会自动同步,状态暂时变为 Pending,同步完成后变回 Success。 +3. **修改权限**:直接编辑 `RedisUser` 的 `spec.aclRules` 字段。例如:`+@all ~*` 表示所有权限。 +4. **创建新用户**:创建一个 `RedisUser` yaml 文件,指定 `redisName` 关联到 Redis 实例。 + +## 6.4 变更与维护 + +### 更新注意事项 +在对 Redis 实例进行规格变更(CPU/Memory)或扩容时: +1. **资源评估**:确保集群有足够的剩余资源。 +2. **渐进式执行**:Operator 会尽量逐个 Pod 进行滚动更新,以减少服务中断。 +3. **避开高峰期**:对于大数据量实例,建议在业务低峰期进行变更。 + +> [!CAUTION] +> 减少副本数或降低规格时,务必确认当前数据量和负载可以被新规格承载,否则可能导致数据丢失或服务崩溃。 + +### 重启注意事项 +通过管理控制台或 kubectl 触发重启时: +1. **逐个重启**:Pod 会被逐个重启(Rolling Restart)。 +2. **服务中断**:虽然是滚动重启,但在主从切换或节点重连期间,客户端可能会遇到短暂的连接断开或超时,客户端应具备重试机制。 +3. 
**前置条件**:实例必须处于 `Running` 状态才能执行重启操作。 + +## 6.5 监控告警体系 + +### 关键监控指标 + +生产环境建议重点关注以下指标,并配置相应的告警规则: + +| 类别 | 指标名称 (Prometheus) | 含义 | 告警阈值建议 | +| :--- | :--- | :--- | :--- | +| **性能** | `redis_cpu_user_seconds_total` | CPU 使用率 | 持续 > 80% (单核) | +| | `redis_memory_used_bytes` | 内存使用量 | > 80% limit | +| **可靠性** | `redis_connected_clients` | 连接客户端数 | 接近 maxclients | +| | `redis_evicted_keys_total` | 驱逐 Key 数量 | > 0 (若非预期) | +| | `redis_rejected_connections_total` | 拒绝连接数 | > 0 | +| **缓存效果** | **Hit Rate** (计算得出) | 缓存命中率 | < 80% (视业务而定) | + +**缓存命中率计算 (PromQL 示例):** +```promql +rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) +``` + +## 6.6 故障排查指南 + +您可以在 [Customer Portal](https://cloud.alauda.cn/knowledges) 查询您遇到的问题。 + +查询关键词:Redis、集群模式、监控、崩溃恢复、污点容忍、部署优化…… + + + +# 7. 其他 + +## 7.1 参考资料 + +[https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) + +[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) + +[https://redis.io/docs/](https://redis.io/docs/) From ad7182dabee379e257d1333aec20f9ba1eb70f2f Mon Sep 17 00:00:00 2001 From: hfxia Date: Wed, 10 Dec 2025 17:56:55 +0800 Subject: [PATCH 2/9] feat: added redis best practices --- docs/en/how_to/90-redis-references.mdx | 401 ------------ docs/en/how_to/redis-best-practices.mdx | 833 ++++++++++++++++++++++++ 2 files changed, 833 insertions(+), 401 deletions(-) delete mode 100644 docs/en/how_to/90-redis-references.mdx create mode 100644 docs/en/how_to/redis-best-practices.mdx diff --git a/docs/en/how_to/90-redis-references.mdx b/docs/en/how_to/90-redis-references.mdx deleted file mode 100644 index 5f37e50..0000000 --- a/docs/en/how_to/90-redis-references.mdx +++ /dev/null @@ -1,401 +0,0 @@ -# Redis 最佳实践 - -# 1. 概述 (Overview) - -Redis 作为云原生架构中事实标准的缓存与键值存储组件,承载着业务高并发读写与低延迟的核心需求。在 Kubernetes 容器化环境下运行有状态(Stateful)的 Redis 服务,面临着与传统物理机环境截然不同的挑战,包括**存储持久化的稳定性**、**网络拓扑的动态变化**以及**资源隔离与调度**等问题。 - -本最佳实践文档旨在为生产环境下的 Redis 部署提供一套标准化的参考指南。涵盖从**架构选型**、**资源规划**、**客户端对接**到**可观测性运维**的全生命周期管理规范。通过遵循本指南,旨在帮助用户构建一个**高可用 (High Availability)**、**高性能 (High Performance)** 且**易维护 (Maintainability)** 的企业级 Redis 数据服务。 - -# 2. 架构设计 (Architecture Design) - -全栈云原生开放平台根据客户业务规模与 SLA 需求,提供两种标准的 Redis 管理架构: - -## 2.1 高可用架构:哨兵模式 (HA: Sentinel) -**定位:经典高可用架构,适用于中小规模业务。** - -哨兵模式基于 Redis 原生的主从复制机制。通过部署独立的 Sentinel 进程组监控主从节点状态,在主节点故障时自动执行 Failover(故障转移)并通知客户端。 -* **优点**:架构简单,运维成熟,对客户端协议要求较低。 -* **局限**:写能力受限于单节点,无法水平扩展存储容量。 - -## 2.2 分布式架构:集群模式 (Distributed: Cluster) -**定位:分布式分片架构,适用于大规模高并发业务。** - -集群模式通过 Hash Slot(哈希槽)将数据自动分片到多个节点,实现存储容量和读写性能的水平扩展(Scale-out)。 -* **优点**:真正的高可用分布式存储,支持动态扩缩容(Resharding)。 -* **局限**:客户端协议复杂,特定多 Key 命令(如 `MGET`)受限于 Slot 分布。 - -## 2.3 选型指南 (Selection Guide) - -在选择 Redis 架构时,需要考虑业务对可用性、扩展性和复杂度的要求。 - -| 特性 | 哨兵模式 (Sentinel) | 集群模式 (Cluster) | -| :--- | :--- | :--- | -| **适用场景** | 中小型业务,读多写少,数据量适中 | 大型业务,高并发读写,海量数据 | -| **高可用性** | 通过哨兵监控和自动故障转移实现 | 通过节点自动故障检测和恢复实现 | -| **扩展性** | 垂直扩展(升级规格),水平扩展仅限读能力 | 水平扩展(读写能力),支持动态扩缩容 | -| **读写分离** | 支持(需要客户端配合或中间件) | 不支持(通常直接连接分片主节点) | -| **数据分片** | 无(单节点存储全量数据) | 有(数据自动分片到多个节点) | -| **运维复杂度** | 较低,架构简单 | 较高,涉及分片、哈希槽、数据迁移等 | -| **网络限制** | 需要客户端支持 Sentinel 协议 | 需要客户端支持 Cluster 协议 | - -**建议:** -* 如果数据量较小(单节点内存可容纳),且主要追求简单稳定,**哨兵模式**是首选。 -* 如果数据量巨大或写入压力极高,单节点无法支撑,选择**集群模式**。 - -# 3. 
部署配置 (Deployment Configuration) - -## 3.1 实例部署 (Deployment) - -在 **Data Services** 页面,单击 **Redis** ,选择您的 项目 及 命名空间 后,单击 创建 **Redis** 实例 ,根据实际业务场 -景配置实例,推荐您使用最新的 Redis 6.0 版本。 - -## 3.2 参数模板选择 (Parameter Templates) - -Redis 实例的参数配置通过 Custom Resource (CR) 的字段进行指定。 - -* **数据节点参数**:通过 `spec.customConfig` 字段配置。例如设置 `save` 策略: - ```yaml - spec: - customConfig: - save: "600 1" - ``` -* **哨兵节点参数**(仅哨兵模式):通过 `spec.sentinel.monitorConfig` 字段配置。支持 `down-after-milliseconds`, `failover-timeout`, `parallel-syncs` 等关键参数。 - -选择模板时,应根据业务对持久化(AOF vs RDB)和性能的权衡来决定。 - -## 3.3 资源规格配置 (Resource Specs) - -您可以根据业务场景的实际情况,部署资源规格。 - -| 集群架构 | 持久化 | 模板 | 实例规格 | Replica | Sentinel | Sharding | Sentinel容器组 | redis-exporter | redis (实例规格) | 备份容器 (backup) | 资源总量 | 实例存储配额 | 自动备份 (保留7份) | 手动备份 (保留7份) | -|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| -| Redis Sentinel | AOF | aof-redis-6.0-sentinel | 2c4g | 1 | 3 | / | 100m128Mi | 100m200Mi | 2c4g | 不限制(需要预留资源) | 4.5c4.8g | 需要根据实际写入量评估 | | | -| | | aof-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.8g | | | | -| | RDB | rdb-redis-6.0-sentinel | 2c4g | | | | | | 2c4g | | 4.5c4.8g | 8G | 28G | 28G | -| | | rdb-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.64g | 16G | 56G | 56G | -| | Diskless | diskless-redis-6.0-sentinel | 2c4g | | | | | | 2c4g | | 4.5c4.8g | / | 28G | 28G | -| | | diskless-redis-6.0-sentinel | 4c8g | | | | | | 4c8g | | 8.5c8.8g | | 56G | 56G | -| Redis Cluster | AOF | aof-redis-6.0-cluster | 2c4g | / | 3 | / | | 100m300Mi | 2c4g | | 12.6c25.8g | 需要根据实际写入量评估 | | | -| | | aof-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | | | | -| | RDB | rdb-redis-6.0-cluster | 2c4g | | | | | | 2c4g | | 12.6c25.8g | 24G | 84G | 84G | -| | | rdb-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | 48G | 168G | 168G | -| | Diskless | diskless-redis-6.0-cluster | 2c4g | | | | | | 2c4g | | 12.6c25.8g | / | 84G | 84G | -| | | diskless-redis-6.0-cluster | 4c8g | | | | | | 4c8g | | 24.6c49.8g | | 168G | 168G | - -## 3.4 调度策略配置 (Scheduling) - -Redis 集群模式提供三种推荐反亲和策略,所有 Pod 强制反亲和、分片内主从强制反亲和以及分片内主从尽量 -反亲和。 - -**所有 Pod 强制反亲和** - -Redis Cluster 实例会将 Redis 强制分布到不同的节点上。如图所示,假设有三个节点,每个节点上可以 -调度三个 Pod。然而,在本例中,Redis Cluster 策略被设置为强制 Pod 反亲和性。由于这个设置,当尝 -试在相同节点上调度不同的 Redis 分片副本时,部署失败。 - -在可视化图中,三个主节点(master)分布在三个不同的节点上,每个节点有三个可用的 Pod 位置。同 -时,三个从节点(replica)都无法被调度,因为它们与已存在的主节点存在反亲和性约束,会导致部署 -失败。 - -**分片内主从强制反亲和** - -分片内主从强制反亲和策略会将同一个分片的 Redis 节点分布到不同的节点上,确保即使某个节点出现 -故障,数据仍然具有完整性。 - -如图所示,假设有三个节点,每个节点可以调度三个 Pod。在 Node2 和 Node3 上的 Pod 位置已经被占 -满,而 Node1 上仍有空余的 Pod 资源。然而,由于 Redis shard-2 主节点(master)位于 Node1 上,并 -且分片内主从节点被设置为强制反亲和性,Redis shard-2 从节点(replica)不能被调度到 Node1 上。这 -导致 Redis shard-2 从节点无法找到可用的节点进行调度,从而导致部署失败。 - -**分片内主从尽量反亲和** - -在 Redis Cluster 中,为了提高数据的可靠性和容错能力,我们可以采用“分片内主从尽量反亲和”的策 -略。这意味着同一个分片内的主从节点尽量被调度到不同的节点上。这样,即使某个节点出现故障,数 -据仍然具有完整性。 - -采用“分片内主从尽量反亲和”的策略时,调度器会优先尝试将主从节点部署在不同的节点上。然而,如果 -无法找到符合条件的节点,调度器仍然会允许将主从节点部署在同一个节点上。这种策略在节点资源有 -限的情况下,可以确保 Redis 集群的正常部署和运行,同时在尽量遵循反亲和原则的基础上,提高数据 -的可靠性。 - -如图,Node1 在资源不足的情况下部署了 redis shard-2 的主从节点。 - -您可以根据下表,对您的实际业务情景设置集群模式 Redis 的反亲和策略。 - -|所有 Pod 之间强制要求在
不同节点上部署,不允许在
同一节点上部署。|可以最大程度地
保证负载均衡和
系统的高可用
性。|节点资源有限时可
能导致部署失败。|保障
自愈| -|---|---|---|---| -|同一个分片内的主从节点必
须部署在不同节点上。|保证分片内主从
节点故障隔离,
提高数据可靠
性。|节点资源有限时可
能导致部署失败。|有可
能自
愈| - -在哨兵模式下,默认采用分片内主从尽量反亲和策略。这意味着同一个分片内的主从节点会尽量部署在不同的 -节点上,但在资源不足的情况下,也允许主从节点部署在同一个节点上。采用这种策略的目的是为了在保证数 -据可靠性的同时,适应有限的节点资源。 - -根据上文描述,分片内主从尽量反亲和策略在资源有限的情况下,可以在保证集群正常部署的基础上,尽可能 -提高数据可靠性。这种策略在某些情况下可以实现单点故障的自愈。然而,在同一个节点上部署主从节点时, -数据可靠性相对较低。因此,在实际应用中,我们需要根据实际情况和需求来选择合适的策略,以实现 Redis -高可用性和故障自动切换的目标。 - -# 4. 资源规划 (Resource Planning) - -## 4.1 内存规格 (Memory) - -Redis 采用快照机制将内存中的数据异步复制到磁盘上进行长期存储。这种机制使得 Redis 保持了高性能,但 -也存在在快照之间可能丢失数据的风险。 - -在 Kubernetes 容器化环境下,我们建议采用分级内存管理策略: -* **✅ 标准规格 (< 8GB)**:**强烈推荐**。能够保证极低的 Fork 延迟和快速的故障恢复(RTO < 60s),是最稳健的生产选择。 -* **⚠️ 高性能规格 (8GB - 16GB)**:**可接受**。需配合高性能宿主机且**必须关闭 THP**。在此规格下,Fork 虽然可控,但在高负载时仍可能造成 100ms 级别的抖动。 -* **❌ 高风险规格 (> 16GB)**:**不推荐**。单点故障影响面过大,且全量同步极易打满网络带宽。建议水平拆分为 Cluster 模式。 - -### 为什么限制 8GB?(技术深究) -虽然物理机时代单实例常运行 32GB+,但在云原生环境下,8GB 限制是基于以下核心技术原理的“黄金法则”: - -1. **Fork 阻塞与页表复制 (The Fork Blocking)** - * Redis 执行 RDB/AOF Rewrite 时调用 `fork()`。虽然内存页是 CoW(写时复制),但**进程页表 (Page Table) 必须全量拷贝**,此过程会**阻塞主线程**。 - * *估算公式*:10GB 内存 ≈ 20MB 页表 ≈ 10~50ms 阻塞(视虚拟化层开销而定)。一旦超过 8GB,分钟级的阻塞风险将呈指数级上升,严重影响 SLA。 - -2. **故障恢复效率 (RTO)** - * 容器重启加载 RDB 是**单线程 CPU 密集型任务**(对象反序列化)。测试表明,受限于 CPU 解析效率,加载 8GB 数据通常需 **30-50秒**(即使使用 SSD)。若维持在 32GB,重启时间可能长达数分钟,这与 Kubernetes "快速自愈" 的设计理念背道而驰。 - -### 内存配置最佳实践 (MaxMemory vs Limit) - -为了避免 Redis 在持久化期间因内存膨胀而导致 OOM(OOM Kill),需严格遵循以下配置原则: - -1. **设置 MaxMemory**: 不要将 `maxmemory` 设置为容器内存 Limit 的 100%。建议设置为 Limit 的 **70% ~ 80%**。 -2. **预留 CoW 空间**: Redis 在 RDB/AOF Rewrite 时会 Fork 子进程。若此时有大量写入更新,OS 的写时复制 (Copy-on-Write) 机制会导致内存页被复制,极端情况下内存占用可能由 8GB 翻倍至 16GB。 -3. **Overcommit 配置**: 确保宿主机开启 `vm.overcommit_memory = 1`,允许内核在 fork 时无需申请等量物理内存(依赖 CoW),防止 fork 失败。 - -> [!IMPORTANT] -> **资源预留公式**: `Container_Memory_Limit` ≈ `Redis_MaxMemory` / 0.7 -> * 示例: 如果希望 Redis 存储 8GB 数据,容器内存 Limit 建议配置为 10GB ~ 12GB,留出 2GB+ 给 CoW 和碎片开销。 - -## 4.2 计算资源 (Compute) - -### 基础建议:2 Core -Redis 核心命令执行是单线程的,但持久化(Fork)等操作需要子进程。因此,建议为每个 Redis 实例分配 **至少 2 Core** 的 CPU 资源: -* **Core 1**: 处理主线程的请求与命令。 -* **Core 2**: 应对持久化 fork、后台任务以及系统开销。 - -### 多线程 (Multi-threading) -Redis 6.0+ 引入了多线程 I/O(默认关闭)。 -* **作用**: 利用多核处理网络数据的读写(Socket I/O),优化单核 IO 限制。 -* **局限性**: 命令执行仍然是单线程的。因此,增加 CPU 核心数只能提升网络吞吐,无法提升命令处理速度(Latency)。 -* **配置建议**: 如果网络是瓶颈(如大流量场景),可开启多线程,但建议总核心数配置不超过 4~8 Core,过多的核心并不能带来线性提升。 - -## 4.3 存储规划 (Storage) - -### 容量规划 (Capacity) -持久化模式直接决定了磁盘空间的配额需求,请参考以下计算公式: - -| 模式 | 建议磁盘配额公式 | 详细说明 | -| :--- | :--- | :--- | -| **Diskless (纯缓存)** | `0` (无需 PVC) | 仅作为纯缓存使用,不开启 RDB/AOF。K8s 环境下日志通过标准输出 (stdout) 采集,无需挂载持久化数据盘。 | -| **RDB (快照)** | `MaxMemory * 2` | RDB 也是写时复制(CoW)。在生成新快照时,磁盘上同时存在“旧快照”和“正在写入的新快照”。
**建议**:至少预留 2 倍内存空间。 | -| **AOF (追加写)** | `MaxMemory * 3` | AOF 文件会随着写操作不断增长。默认配置下 (`auto-aof-rewrite-percentage 100`),AOF 文件会增长到数据量的 **2倍** 才触发重写。此时磁盘需同时容纳:
1. 旧 AOF 文件 (2x)
2. 重写生成的新 AOF 文件 (1x)
**峰值总计 3x**。建议预留至少 3 倍空间以防磁盘写满。 | - -### 性能要求 (IOPS & Latency) -* **AOF 开启时**: 磁盘性能至关重要。如果磁盘 IOPS 不足或 fsync 延迟过高,会直接阻塞主线程(当 `appendfsync everysec` 时)。 -* **推荐介质**: 生产环境强烈建议使用 SSD/NVMe 本地盘或高性能云盘。 - -## 4.4 内核参数调优 (Kernel Tuning) - -为了确保 Redis 在生产环境中的稳定性和高性能,建议在 Kubernetes 节点层面进行以下内核参数优化: - -1. **内存分配控制 (`vm.overcommit_memory`)** - * **建议值**: `1` - * **说明**: 设置为 `1` (Always) 确保内核在 Redis 执行 Fork 操作(RDB 快照/AOF 重写)时允许分配内存,即使物理内存看似不足。这能有效防止因内存分配失败导致的持久化失败。 - -2. **连接队列大小 (`net.core.somaxconn`)** - * **建议值**: `2048` 或更高 - * **说明**: Redis 默认的 tcp-backlog 为 511。在高并发场景下,应将系统的 `net.core.somaxconn` 提高,以避免客户端连接请求被丢弃。 - -3. **透明大页 (Transparent Huge Pages, THP)** - * **建议操作**: **禁用** (`never`) - * **说明**: THP 会导致 Redis 在发生内存分配时产生显著的延迟尖峰,尤其是在 Fork 后的写时复制(CoW)期间。建议在宿主机或启动脚本中禁用。 - - -# 5. 客户端接入 (Client Integration) - -## 5.1 网络访问策略 (Network Access) - -您可以根据下表,选择适合实际业务尝尽的访问方式。 - -|架构|访问方式|推荐|特性说明| -|---|---|---|---| -|哨兵
模式|集群内访
问|哨兵访问地址|用于从集群内连接哨兵(哨兵会去发现底层的数据节
点)。提供包括哨兵节点对应的内部路由的名称、IP地
址和端口号。 端口号默认为 26379。| -|集群
模式|集群外访
问—无代理|集群模式暴露 底层数据节
点(Pod)的 NodePort|暴露底层数据节点 的 NodePort,直连底层数据节点,
Pod本身不具备高可用性,但可以写多个节点地址。
应用连接时需要写多个 节点IP+端口,部分节点宕机
时,连接仍保留。| - -> [!TIP] -> **外网访问推荐**:对于生产环境的集群外访问,**强烈推荐使用 LoadBalancer 方式**。 -> * **LoadBalancer**:为 Redis 实例配置外部负载均衡器,提供稳定的访问入口。 -> * **NodePort 限制**:虽然支持 NodePort,但客户端需要直接连接到具体的节点端口。在多网卡环境或网络策略复杂的场景下,直接使用 NodePort 可能会因为网络接口绑定问题导致连接失败(Redis Sentinel/Cluster 默认绑定默认网卡)。且 NodePort 不支持通过额外的负载均衡器进行代理(因为 Sentinel/Cluster 协议要求客户端直连感知拓扑)。 - -## 5.2 代码接入示例 (Code Examples) - -以下是使用 **Go (go-redis/v9)** 客户端连接 Redis 的示例代码: - -#### 哨兵模式 (Sentinel Mode) - -哨兵模式下,客户端连接 Sentinel 节点,通过 `mymaster`(默认主节点组名)发现主节点。 - -```go -package main - -import ( - "context" - "fmt" - "time" - "github.com/redis/go-redis/v9" -) - -func main() { - client := redis.NewFailoverClient(&redis.FailoverOptions{ - SentinelAddrs: []string{":26379"}, // 替换为 Sentinel 地址 - MasterName: "mymaster", // 固定名称 - Password: "", // Redis 密码 - DialTimeout: 3 * time.Second, - ReadTimeout: 5 * time.Second, - WriteTimeout: 10 * time.Second, - PoolSize: 100, - }) - defer client.Close() - - if val, err := client.Get(context.TODO(), "test").Result(); err != nil { - panic(err) - } else { - fmt.Println(val) - } -} -``` - -#### 集群模式 (Cluster Mode) - -集群模式下,客户端连接任一集群节点获取拓扑。建议配置多个种子节点地址。 - -```go -package main - -import ( - "context" - "fmt" - "time" - "github.com/redis/go-redis/v9" -) - -func main() { - client := redis.NewClusterClient(&redis.ClusterOptions{ - Addrs: []string{":6379"}, // 替换为集群节点地址 - Password: "", - DialTimeout: 3 * time.Second, - ReadTimeout: 5 * time.Second, - WriteTimeout: 10 * time.Second, - PoolSize: 100, - }) - defer client.Close() - - if val, err := client.Get(context.TODO(), "test").Result(); err != nil { - panic(err) - } else { - fmt.Println(val) - } -} -``` - -## 5.3 客户端可靠性最佳实践 - -除了正确的连接配置,客户端的健壮性设计对于保障业务稳定性至关重要: - -1. **超时设置 (Timeouts)** - * **连接超时 (Connect Timeout)**: 建议设为 1-3秒,避免网络不可达时长时间阻塞。 - * **读写超时 (Read/Write Timeout)**: 根据业务 SLA 设置,通常只有几百毫秒。不要设置过长,防止慢查询拖垮应用线程池。 - -2. **重试策略 (Retry Strategy)** - * **指数退避 (Exponential Backoff)**: 遇到网络抖动或主从切换时,不要立即重试,而应使用指数退避算法(如 wait 100ms, 200ms, 400ms...),避免重试风暴加剧 Redis 负载。 - -3. **连接池 (Connection Pooling)** - * **预热与复用**: 务必使用连接池(如 JedisPool, go-redis Pool)。Redis 连接握手成本较高,频繁创建/销毁连接会严重影响性能。 - * **最大连接数**: 根据客户端并发量合理设置 `MaxTotal`,避免超过 Redis 的 `maxclients` 限制。 - -4. **拓扑刷新机制 (Topology Refresh) - 集群模式关键** - * **自动刷新**: 确保客户端开启了基于 `MOVED/ASK` 转向错误的自动拓扑更新功能(绝大多数 Smart Client 默认开启)。 - * **周期性刷新**: 在拓扑变化频繁(如弹性伸缩)或网络不稳定的场景下,建议配置客户端进行周期性拓扑刷新(例如每 60 秒),以主动感知节点变更,减少请求重定向带来的延迟。 - - - - - -# 6. 可观测性与运维 (Observability & Operations) - -## 6.1 数据安全与备份 - -平台备份中心为您提供便捷的数据备份和管理解决方案,您可以在各实例下进行备份,并在备份中心中统一管 -理。此外,备份中心支持外接S3存储,可实现安全备份和恢复。您还可以恢复历史备份到特定实例,以及查看 -和删除备份记录。 - -## 6.3 认证与权限管理 - -Redis Operator(6.0+ 版本)支持通过 `RedisUser` CRD 进行声明式用户管理,支持 ACL 权限控制。 - -**管理操作:** -1. **查看用户**:查看 `RedisUser` 资源。 - ```bash - kubectl -n get RedisUser - ``` -2. **修改密码**:密码存储在 Secret 中。找到 `RedisUser` 关联的 `passwordSecrets`,修改对应的 Secret 内容(Base64 编码)。修改后,Operator 会自动同步,状态暂时变为 Pending,同步完成后变回 Success。 -3. **修改权限**:直接编辑 `RedisUser` 的 `spec.aclRules` 字段。例如:`+@all ~*` 表示所有权限。 -4. **创建新用户**:创建一个 `RedisUser` yaml 文件,指定 `redisName` 关联到 Redis 实例。 - -## 6.4 变更与维护 - -### 更新注意事项 -在对 Redis 实例进行规格变更(CPU/Memory)或扩容时: -1. **资源评估**:确保集群有足够的剩余资源。 -2. **渐进式执行**:Operator 会尽量逐个 Pod 进行滚动更新,以减少服务中断。 -3. **避开高峰期**:对于大数据量实例,建议在业务低峰期进行变更。 - -> [!CAUTION] -> 减少副本数或降低规格时,务必确认当前数据量和负载可以被新规格承载,否则可能导致数据丢失或服务崩溃。 - -### 重启注意事项 -通过管理控制台或 kubectl 触发重启时: -1. **逐个重启**:Pod 会被逐个重启(Rolling Restart)。 -2. **服务中断**:虽然是滚动重启,但在主从切换或节点重连期间,客户端可能会遇到短暂的连接断开或超时,客户端应具备重试机制。 -3. 
**前置条件**:实例必须处于 `Running` 状态才能执行重启操作。 - -## 6.5 监控告警体系 - -### 关键监控指标 - -生产环境建议重点关注以下指标,并配置相应的告警规则: - -| 类别 | 指标名称 (Prometheus) | 含义 | 告警阈值建议 | -| :--- | :--- | :--- | :--- | -| **性能** | `redis_cpu_user_seconds_total` | CPU 使用率 | 持续 > 80% (单核) | -| | `redis_memory_used_bytes` | 内存使用量 | > 80% limit | -| **可靠性** | `redis_connected_clients` | 连接客户端数 | 接近 maxclients | -| | `redis_evicted_keys_total` | 驱逐 Key 数量 | > 0 (若非预期) | -| | `redis_rejected_connections_total` | 拒绝连接数 | > 0 | -| **缓存效果** | **Hit Rate** (计算得出) | 缓存命中率 | < 80% (视业务而定) | - -**缓存命中率计算 (PromQL 示例):** -```promql -rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) -``` - -## 6.6 故障排查指南 - -您可以在 [Customer Portal](https://cloud.alauda.cn/knowledges) 查询您遇到的问题。 - -查询关键词:Redis、集群模式、监控、崩溃恢复、污点容忍、部署优化…… - - - -# 7. 其他 - -## 7.1 参考资料 - -[https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) - -[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) - -[https://redis.io/docs/](https://redis.io/docs/) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx new file mode 100644 index 0000000..40be42e --- /dev/null +++ b/docs/en/how_to/redis-best-practices.mdx @@ -0,0 +1,833 @@ +--- +weight: 5 +title: Redis Best Practices +--- + +# Redis Best Practices + +## Overview + +As the de facto standard for caching and key-value storage in cloud-native architectures, Redis handles core requirements for high-concurrency read/write operations and low latency. Running stateful Redis services in a Kubernetes containerized environment presents challenges distinct from traditional physical machine environments, including **persistence stability**, **dynamic network topology changes**, and **resource isolation and scheduling**. + +This Best Practices document aims to provide a standardized reference guide for Redis deployments in production environments. It covers the full lifecycle management from **architecture selection**, **resource planning**, **client integration** to **observability and operations**. By following this guide, users can build an enterprise-class Redis data service that is **High Availability (HA)**, **High Performance**, and **Maintainability**. + +## Architecture Selection + +The Full Stack Cloud Native Open Platform offers two standard Redis management architectures based on customer business scale and SLA requirements: + +### Sentinel Mode + +**Positioning: Classic High Availability Architecture, suitable for small to medium-scale businesses.** + +Sentinel mode is based on Redis's native master-replica replication mechanism. By deploying independent Sentinel process groups to monitor the status of master and replica nodes, it automatically executes Failover and notifies clients when the master node fails. +* **Pros**: Simple architecture, mature operations, lower requirements for client protocols. +* **Cons**: Write capacity is limited to a single node; storage capacity cannot scale horizontally. + +### Cluster Mode + +**Positioning: Distributed Sharding Architecture, suitable for large-scale high-concurrency businesses.** + +Cluster mode automatically shards data across multiple nodes using Hash Slots, enabling horizontal scaling (Scale-out) of storage capacity and read/write performance. +* **Pros**: True high availability distributed storage, supports dynamic Resharding. 
+* **Cons**: Complex client protocol; specific multi-key commands (e.g., `MGET`) are restricted by Slot distribution. + +### Selection Guide + +When selecting a Redis architecture, consider business requirements for availability, scalability, and complexity. + +| Feature | Sentinel Mode | Cluster Mode | +| :--- | :--- | :--- | +| **Scenarios** | Small/Medium business, Read-heavy/Write-light, moderate data | Large business, High concurrency R/W, massive data | +| **High Availability** | Via Sentinel monitoring and auto-failover | Via node auto-failure detection and recovery | +| **Scalability** | Vertical (Scale-up), Horizontal (Read-only) | Horizontal (R/W), supports dynamic resharding | +| **Read/Write Separation** | Supported (Client support required) | Supported (Usually direct connection to shard master, client support required) | +| **Data Sharding** | None (Single node stores full data) | Yes (Data auto-sharded across multiple nodes) | +| **Ops Complexity** | Lower, simple architecture | Higher, involves sharding, hash slots, migration | +| **Network Constraints** | Requires client support for Sentinel protocol | Requires client support for Cluster protocol | + +**Recommendations:** +* If data volume is small (fits in single node memory) and simplicity/stability is priority, **Sentinel Mode** is preferred. +* If data volume is massive or write pressure is extremely high and cannot be supported by a single node, choose **Cluster Mode**. + +## Version Selection + +Alauda Cache Service for Redis OSS currently supports `5.0`, `6.0`, and `7.2` stable versions. All three versions have undergone complete automated testing and production verification. + +**For new deployments, we strongly recommend choosing Redis `7.2`:** + +1. **Lifecycle** + * **`5.0` / `6.0`**: Community versions are End of Life (EOL) and no longer receive new features or security patches. Recommended only for compatibility with legacy applications. + * **`7.2`**: As the current Long Term Support (LTS) version, it has the longest lifecycle, ensuring operational stability and security updates for years to come. + +2. **Compatibility** + * Redis `7.2` maintains high compatibility with `5.0` and `6.0` data commands. Most business code can migrate smoothly without modification. + * *Note*: RDB persistence file format (v11) is not backward compatible (i.e., RDB generated by `7.2` cannot be loaded by `6.0`), but this does not affect new services. + +3. **Key Features** + * **ACL v2**: Provides granular access control (Key-based permission selectors), significantly enhancing security in multi-tenant environments. + * **Redis Functions**: Introduces Server-side Scripting standards, resolving issues with Lua script loss and replication, keeping logic closer to data. + * **Sharded Pub/Sub**: Resolves network storm issues caused by Pub/Sub broadcasting in Cluster mode, significantly improving messaging scalability via sharding. + * **Performance Optimization**: Deep optimizations in data structures (especially Sorted Sets) and memory management provide higher throughput and lower latency. + +> For more details on Redis 7.2 features, please refer to the official [Redis 7.2 Release Notes](https://github.com/redis/redis/blob/7.2/00-RELEASENOTES). + +## Resource Planning + +### Kernel Tuning + +To ensure stability and high performance in production, the following kernel parameter optimizations are recommended at the Kubernetes node level: + +1. 
**Memory Allocation (`vm.overcommit_memory`)** + * **Recommended**: `1` + * **Explanation**: Setting to `1` (Always) ensures the kernel allows memory allocation during Redis Fork operations (RDB snapshot/AOF rewrite), even if physical memory appears insufficient. This effectively prevents persistence failures due to allocation errors. + +2. **Connection Queue (`net.core.somaxconn`)** + * **Recommended**: `2048` or higher + * **Explanation**: Redis default tcp-backlog is 511. In high concurrency scenarios, system `net.core.somaxconn` should be increased to avoid dropping client connection requests. + +3. **Transparent Huge Pages (THP)** + * **Action**: **Disable** (`never`) + * **Explanation**: THP causes significant latency spikes during memory allocation in Redis, especially during Copy-on-Write (CoW) after Fork. It is recommended to disable this on the host or via startup scripts. + +### Memory Specifications + +Redis uses a snapshot mechanism to asynchronously replicate in-memory data to disk for long-term storage. This keeps Redis high-performing but carries a risk of data loss between snapshots. + +In Kubernetes containerized environments, we recommend a tiered memory management strategy: +* **✅ Standard Specs (< 8GB)**: **Strongly Recommended**. Ensures extremely low Fork latency and fast failure recovery (RTO < 60s); the most robust production choice. +* **⚠️ High-Performance Specs (8GB - 16GB)**: **Acceptable**. Requires high-performance host and **THP must be disabled**. Fork is controllable but may cause ~100ms jitter under high load. +* **❌ High-Risk Specs (> 16GB)**: **Not Recommended**. Single point of failure impact is too large, and full synchronization can easily saturate network bandwidth. Recommend horizontal splitting into Cluster mode. + +#### Why Limit to 8GB? +While single instances on physical machines often run 32GB+, the 8GB limit in cloud-native environments is based on the "Golden Rule" of these core technologies: + +1. **Fork Blocking & Page Table Copy** + * Redis calls `fork()` during RDB/AOF Rewrite. Although memory pages are CoW, **Process Page Tables must be fully copied**, blocking the **main thread**. + * *Estimation*: 10GB memory ≈ 20MB page table ≈ 10~50ms blocking (depending on virtualization overhead). Exceeding 8GB increases blocking risk exponentially, impacting SLA. + +2. **Failure Recovery Efficiency (RTO)** + * Container restart loading RDB is a **single-threaded CPU-bound task** (object deserialization). Tests show loading 8GB data takes **30-50s** (even with SSD). Maintaining 32GB could result in multi-minute start times, contradicting K8s "fast self-healing" philosophy. + +#### Memory Configuration Best Practices + +To avoid OOM (OOM Kill) during persistence due to memory expansion, strict adherence to these principles is required: + +1. **Set MaxMemory**: Do not set `maxmemory` to 100% of the container Memory Limit. Recommend setting to **70% ~ 80%** of the Limit. +2. **Reserve CoW Space**: Redis Forks a child process during RDB/AOF Rewrite. If there are heavy write updates, OS Copy-on-Write mechanisms duplicate memory pages; in extreme cases, memory usage can double from 8GB to 16GB. +3. **Overcommit Config**: Ensure host `vm.overcommit_memory = 1` to allow kernel forks without requesting equivalent physical memory (relying on CoW), preventing fork failures. 
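
As a rough illustration of the 70%–80% rule above, the sketch below pairs an 8Gi container memory limit with a ~6GB `maxmemory`. The `customConfig` field follows the parameter-update examples later in this guide; the `resources` layout is an assumption and may differ from the exact CR schema of your operator version.

```yaml
# Sketch only: verify field names against your Redis CR before applying.
spec:
  resources:
    limits:
      memory: 8Gi                       # container memory limit
  customConfig:
    maxmemory: "6gb"                    # ~75% of the limit, leaving headroom for CoW and fragmentation
    maxmemory-policy: "allkeys-lru"     # example eviction policy for cache-style workloads (assumption)
```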
+ +> [!INFO] +> +> **Resource Reservation Formula**: `Container_Memory_Limit` ≈ `Redis_MaxMemory` / 0.7 +> * Example: To store 8GB data, configure Container Memory Limit to 10GB ~ 12GB, leaving 2GB+ for CoW and fragmentation overhead. + +### CPU Resources + +Redis core command execution is single-threaded, but persistence (Fork) and other operations require child processes. Therefore, allocate **at least 2 Cores** per Redis instance: +* **Core 1**: Handles main thread requests and commands. +* **Core 2**: Handles persistence fork, background tasks, and system overhead. + +#### Multi-threading + +Redis 6.0+ introduced multi-threaded I/O (disabled by default) to overcome single-thread network I/O bottlenecks. + +* **When to Enable?** + * **Bottleneck Analysis**: When Redis CPU usage nears 100% and analysis shows time spent on Kernel State Network I/O (System CPU) rather than user-space command execution. + * **Traffic Profile**: Typically beneficial when single instance QPS > 80,000 or network traffic is huge (> 1GB/s). + * **Resource Conditions**: Ensure node has sufficient CPU cores (at least 4 cores). + +* **Configuration Best Practices**: + * **Thread Count**: Recommend 4~8 I/O threads. Exceeding 8 threads rarely yields significant gain. + * **Config Example**: + ```yaml + io-threads 4 + io-threads-do-reads yes + ``` + * **Note**: Multi-threaded I/O only improves network throughput; it **does NOT** improve execution speed of single complex commands (e.g., `SORT`, `KEYS`). + +### Storage Planning + +#### Capacity Planning +Persistence mode directly determines disk quota requirements. Refer to the following calculation formula: + +| Mode | Recommended Quota Formula | Details | +| :--- | :--- | :--- | +| **Diskless (Cache)** | `0` (No PVC) | Used as pure cache, no RDB/AOF. Logs collected via stdout in K8s, no persistence disk needed. | +| **RDB (Snapshot)** | `MaxMemory * 2` | RDB uses CoW. During snapshot generation, both "old snapshot" and "new snapshot being written" exist on disk.
**Recommendation**: Reserve at least 2x memory space. | +| **AOF (Append Only)** | `MaxMemory * 3` | AOF grows with write operations. Default config (`auto-aof-rewrite-percentage 100`) triggers rewrite when AOF reaches **2x** data size. Disk must hold:
1. Old AOF file (2x)
2. New AOF file from rewrite (1x)
**Peak total 3x**. Recommend reserving at least 3x space. | + +#### Performance Requirements +* **With AOF**: Disk performance is critical. Insufficient IOPS or high fsync latency will directly block the main thread (when `appendfsync everysec`). +* **Media**: Production environments strongly recommend SSD/NVMe local disks or high-performance cloud disks. + +### Parameter Configuration + +Alauda Cache Service for Redis OSS parameters are specified via Custom Resource (CR) fields. + +#### Built-in Templates + +Alauda Cache Service for Redis OSS provides multiple parameter templates for different business scenarios. Selection depends on the trade-off between persistence (Diskless/AOF/RDB) and performance. + +| Template Name | Description | Scenarios | Risks | +| :--- | :--- | :--- | :--- | +| **rdb-redis-<version>-<sentinel\|cluster>** | Enables RDB persistence, periodic snapshots to disk. | **Balanced**: Limited resources, balances performance/reliability, accepts minute-level data loss. | Data loss depends on `save` config, usually minute-level RPO. | +| **aof-redis-<version>-<sentinel\|cluster>** | Enables AOF persistence, logs every write op. | **Secure**: Ample resources, high data security (second-level loss), slight performance compromise. | Frequent fsync requires high-performance storage, high IO pressure. | +| **diskless-redis-<version>-<sentinel\|cluster>** | Disables persistence, pure in-memory. | **High-Perf Cache**: Acceleration only, data loss acceptable or rebuildable from source. | Restart or failure leads to **full data loss**. | + +> `` represents Redis version, e.g., `6.0`, `7.2`. + +Key parameter differences: + +| Parameter | RDB Template | AOF Template | Diskless Template | Explanation | +| :--- | :--- | :--- | :--- | :--- | +| `appendonly` | `no` | `yes` | `no` | Enable AOF logging. | +| `save` | `60 10000 300 100 600 1` | `""` (Disabled) | `""` (Disabled) | RDB snapshot triggers. | +| `repl-diskless-sync` | `no` | `no` | `yes` | Master-replica full sync via socket without disk. | +| `repl-diskless-sync-delay` | `5` | `5` | `0` | Delay for diskless sync; 0 for Diskless to speed up sync. | + +##### Persistence Selection Recommendations + +1. **Pure Cache**: Choose **Diskless Template**. Data rebuildable, no overhead, best performance. +2. **General Business**: Choose **RDB Template**. Periodic snapshots provide minute-level RPO, moderate resource usage. +3. **Financial/High-Reliability**: Choose **AOF Template** with `appendfsync everysec` for second-level protection. + +> [!WARNING] +> **Should RDB + AOF be enabled simultaneously?** +> +> Redis supports running RDB and AOF together, but it is **generally not recommended** in Kubernetes: +> * **Performance**: AOF fsync creates IO pressure; adding RDB fork + disk write significantly increases resource contention. +> * **Storage Doubling**: Requires space for both RDB snapshots and AOF files, complicating PVC planning. +> * **Recovery Priority**: Redis loads AOF first on start (more complete data); RDB acts only as backup, offering limited benefit. +> * **Platform Backup**: Alauda Cache Service for Redis OSS provides independent auto/manual backup, removing reliance on RDB snapshots for extra insurance. +> +> **Recommendation**: Choose **Single Persistence Mode** (RDB or AOF) based on needs, and use platform backup for disaster recovery. If mixed mode is necessary, ensure sufficient Storage IOPS (SSD) and reserve 5x data volume disk space. 
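
For example, when the "Financial/High-Reliability" profile is required, the AOF-related settings can be made explicit through the `customConfig` mechanism described in the next section. This is a minimal sketch using standard Redis parameters; whether a given built-in template permits overriding each of them is an assumption to verify in your environment.

```yaml
spec:
  customConfig:
    appendonly: "yes"                    # enable AOF logging
    appendfsync: "everysec"              # second-level RPO with moderate IO pressure
    auto-aof-rewrite-percentage: "100"   # rewrite when the AOF grows to 2x the data size
    auto-aof-rewrite-min-size: "64mb"    # avoid rewriting very small AOF files
```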
+ +#### Parameter Update + +Redis parameters are categorized by application method: + +| Category | Parameters | Behavior | +| :--- | :--- | :--- | +| **Hot Update** | Most runtime params (`maxmemory`, `loglevel`, etc.) | **Immediate effect** after modification, no restart. | +| **Restart Update** | `databases`, `rename-command`, `rdbchecksum`, `tcp-backlog`, `io-threads`, `io-threads-do-reads` | Requires **Instance Restart** to take effect. | +| **Immutable** | `bind`, `protected-mode`, `port`, `supervised`, `pidfile`, `dir`, etc. | Managed by system, modification may cause anomalies. | + +> [!TIP] +> Always assume data backup before modifying parameters requiring restart. + +#### Modification Examples + +**Update Data Node Parameters**: Configure via `spec.customConfig`. + +```bash +# Example: Modify save strategy (Hot update) +kubectl -n patch redis --type=merge --patch='{"spec": {"customConfig": {"save":"600 1"}}}' +``` + +**Update Sentinel Node Parameters**: Configure via `spec.sentinel.monitorConfig`. +> Currently supports `down-after-milliseconds`, `failover-timeout`, `parallel-syncs`. + +```bash +# Example: Modify failover timeout +kubectl -n patch redis --type=merge --patch='{"spec": {"sentinel": {"monitorConfig": {"down-after-milliseconds":"30000"}}}}' +``` + +### Resource Specs + +Deploy resources according to your actual business scenario. + +#### Sentinel Mode Specs + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
<th>Persistence</th>
      <th>Template</th>
      <th>Instance Spec</th>
      <th>Replica / Sentinel</th>
      <th>Sentinel Pod</th>
      <th>redis-exporter</th>
      <th>redis (Spec)</th>
      <th>Backup Pod</th>
      <th>Total Resources</th>
      <th>Storage Quota</th>
      <th>Auto Backup (Keep 7)</th>
      <th>Manual Backup (Keep 7)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2">AOF</td>
      <td>aof-redis-&lt;version&gt;-sentinel</td>
      <td>2c4g</td>
      <td rowspan="6">1 / 3</td>
      <td rowspan="6">100m128Mi</td>
      <td rowspan="6">100m200Mi</td>
      <td>2c4g</td>
      <td rowspan="6">Unlimited (Reserve resources)</td>
      <td>4.5c4.8g</td>
      <td rowspan="2" colspan="3">Evaluate based on actual write volume</td>
    </tr>
    <tr>
      <td>aof-redis-&lt;version&gt;-sentinel</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>8.5c8.8g</td>
    </tr>
    <tr>
      <td rowspan="2">RDB</td>
      <td>rdb-redis-&lt;version&gt;-sentinel</td>
      <td>2c4g</td>
      <td>2c4g</td>
      <td>4.5c4.8g</td>
      <td>8G</td>
      <td>28G</td>
      <td>28G</td>
    </tr>
    <tr>
      <td>rdb-redis-&lt;version&gt;-sentinel</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>8.5c8.64g</td>
      <td>16G</td>
      <td>56G</td>
      <td>56G</td>
    </tr>
    <tr>
      <td rowspan="2">Diskless</td>
      <td>diskless-redis-&lt;version&gt;-sentinel</td>
      <td>2c4g</td>
      <td>2c4g</td>
      <td>4.5c4.8g</td>
      <td rowspan="2">/</td>
      <td>28G</td>
      <td>28G</td>
    </tr>
    <tr>
      <td>diskless-redis-&lt;version&gt;-sentinel</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>8.5c8.8g</td>
      <td>56G</td>
      <td>56G</td>
    </tr>
  </tbody>
</table>
+ +#### Cluster Mode Specs + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
<th>Persistence</th>
      <th>Template</th>
      <th>Instance Spec</th>
      <th>Sharding / Replica</th>
      <th>redis-exporter</th>
      <th>redis (Spec)</th>
      <th>Backup Pod</th>
      <th>Total Resources</th>
      <th>Storage Quota</th>
      <th>Auto Backup (Keep 7)</th>
      <th>Manual Backup (Keep 7)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2">AOF</td>
      <td>aof-redis-&lt;version&gt;-cluster</td>
      <td>2c4g</td>
      <td rowspan="6">3 / 1</td>
      <td rowspan="6">100m300Mi</td>
      <td>2c4g</td>
      <td rowspan="6">Unlimited (Reserve resources)</td>
      <td>12.6c25.8g</td>
      <td rowspan="2" colspan="3">Evaluate based on actual write volume</td>
    </tr>
    <tr>
      <td>aof-redis-&lt;version&gt;-cluster</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>24.6c49.8g</td>
    </tr>
    <tr>
      <td rowspan="2">RDB</td>
      <td>rdb-redis-&lt;version&gt;-cluster</td>
      <td>2c4g</td>
      <td>2c4g</td>
      <td>12.6c25.8g</td>
      <td>24G</td>
      <td>84G</td>
      <td>84G</td>
    </tr>
    <tr>
      <td>rdb-redis-&lt;version&gt;-cluster</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>24.6c49.8g</td>
      <td>48G</td>
      <td>168G</td>
      <td>168G</td>
    </tr>
    <tr>
      <td rowspan="2">Diskless</td>
      <td>diskless-redis-&lt;version&gt;-cluster</td>
      <td>2c4g</td>
      <td>2c4g</td>
      <td>12.6c25.8g</td>
      <td rowspan="2">/</td>
      <td>84G</td>
      <td>84G</td>
    </tr>
    <tr>
      <td>diskless-redis-&lt;version&gt;-cluster</td>
      <td>4c8g</td>
      <td>4c8g</td>
      <td>24.6c49.8g</td>
      <td>168G</td>
      <td>168G</td>
    </tr>
  </tbody>
</table>
+ +> `` represents Redis version, e.g., `6.0`, `7.2`. + +## Scheduling + +Alauda Cache Service for Redis OSS offers flexible scheduling strategies, supporting node selection, taint toleration, and various anti-affinity configurations to meet high availability needs in different resource environments. + +### Node Selection + +You can use the `spec.nodeSelector` field to specify which nodes Redis Pods should be scheduled on. This is typically used with Kubernetes Node Labels to isolate database workloads to dedicated node pools. + +> [!WARNING] +> **Persistence Limitation**: If your Redis instance mounts **Non-Network Storage** (e.g., Local PV) PVCs, be cautious when updating `nodeSelector`. Since local data resides on specific nodes and cannot migrate with Pods, the updated `nodeSelector` set **MUST include the node where the Pod currently resides**. If the original node is excluded, the Pod will fail to access data or start. Network storage (Ceph RBD, NFS) follows the Pod and is not subject to this restriction. + +### Taint Toleration + +Use `spec.tolerations` to allow Redis Pods to tolerate node Taints. This allows deploying Redis on dedicated nodes with specific taints (e.g., `key=redis:NoSchedule`), preventing other non-critical workloads from preempting resources. + +### Anti-Affinity + +To prevent single points of failure, Alauda Cache Service for Redis OSS provides anti-affinity configuration. Configuration differs by architecture mode. + +> [!CAUTION] +> **Immutable**: To ensure consistency and reliability, anti-affinity configurations (both `affinityPolicy` and `affinity`) **cannot be modified** after instance creation. Please plan ahead. + +#### Cluster Mode + +In Cluster mode, the system **prioritizes `spec.affinityPolicy`**. Alauda Cache Service for Redis OSS uses this enum to abstract complex topology rules, automatically generating affinity rules for each shard's StatefulSet. + +* **Priority**: `spec.affinityPolicy` > `spec.affinity`. +* **If `affinityPolicy` is unset**: Alauda Cache Service for Redis OSS checks `spec.affinity`. If you need custom topology rules beyond the enums below, leave `affinityPolicy` empty and configure native `spec.affinity`. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
<th>Policy Name</th>
      <th>affinityPolicy Value</th>
      <th>Behavior</th>
      <th>Pros/Cons</th>
      <th>Scenario</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>All Pods Forced Anti-Affinity</td>
      <td>AntiAffinity</td>
      <td>Forces ALL Pods in the cluster (including master/replicas of different shards) to be on different nodes. Fails if node count &lt; total Pod count.</td>
      <td>
        <ul>
          <li>Pros: Highest disaster recovery, minimal single-node failure impact.</li>
          <li>Cons: Extremely high resource requirement, node count must be &gt;= total Pods.</li>
        </ul>
      </td>
      <td>Cluster Mode Core Business<br/>Ample resources, strict HA requirements.</td>
    </tr>
    <tr>
      <td>Shard Master-Replica Forced Anti-Affinity</td>
      <td>AntiAffinityInSharding</td>
      <td>Forces Master and Replicas within same shard to be on different nodes. Pods from different shards can coexist.</td>
      <td>
        <ul>
          <li>Pros: Guarantees physical isolation of data replicas, preventing shard migration data loss.</li>
          <li>Cons: Scheduling fails if live nodes &lt; replica count. Masters of different shards might land on same node (SPOF risk).</li>
        </ul>
      </td>
      <td>Production Standard<br/>Balances resource usage and data safety.</td>
    </tr>
    <tr>
      <td>Shard Master-Replica Soft Anti-Affinity</td>
      <td>SoftAntiAffinity</td>
      <td>Prioritizes spreading shard master/replicas. If impossible (e.g., insufficient nodes), allows scheduling on same node.</td>
      <td>
        <ul>
          <li>Pros: Highest deployment success rate, runs with limited resources.</li>
          <li>Cons: Master/Replica may share node in extreme cases, risking data loss.</li>
        </ul>
      </td>
      <td>Test/Dev Environments<br/>Or resource-constrained edge environments.</td>
    </tr>
  </tbody>
</table>
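
For instance, to apply the "Production Standard" policy from the table above, the cluster-mode instance only needs the enum value set on the CR (a minimal sketch; all other fields are omitted, and the exact CR kind and schema follow your installed operator version):

```yaml
spec:
  affinityPolicy: AntiAffinityInSharding   # or: AntiAffinity | SoftAntiAffinity
```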
+ +#### Sentinel Mode + +> **Important** +> **Sentinel Mode does not support `spec.affinityPolicy`**. + +For Sentinel mode, Redis Data Nodes and Sentinel Nodes require separate Kubernetes native Affinity rules: + +* **Redis Data Nodes**: Configured via **`spec.affinity`**. +* **Sentinel Nodes**: Configured via **`spec.sentinel.affinity`**. + +You need to manually write complete `Affinity` rules. Example for forcing anti-affinity for both Data and Sentinel nodes: + +```yaml +spec: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: app.kubernetes.io/component + operator: In + values: + - redis + - key: redisfailovers.databases.spotahome.com/name + operator: In + values: + - + topologyKey: kubernetes.io/hostname + sentinel: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: app.kubernetes.io/component + operator: In + values: + - sentinel + - key: redissentinels.databases.spotahome.com/name + operator: In + values: + - + topologyKey: kubernetes.io/hostname +``` + +To force anti-affinity across ALL nodes (Data + Sentinel), refer to: + +```yaml +spec: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: middleware.instance/type + operator: In + values: + - redis-failover + - key: middleware.instance/name: + operator: In + values: + - + topologyKey: kubernetes.io/hostname + sentinel: + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - key: middleware.instance/type + operator: In + values: + - redis-failover + - key: middleware.instance/name: + operator: In + values: + - + topologyKey: kubernetes.io/hostname +``` + +## User Management + +Alauda Cache Service for Redis OSS (v6.0+) provides declarative user management via `RedisUser` CRD, supporting ACLs. + +> [!TIP] +> **Compatibility**: Redis 5.0 only supports single-user auth; Redis 6.0+ implements full ACLs for multi-user/granular control. + +### Permission Profiles + +The platform pre-defines permission profiles for common scenarios: + +| Profile | ACL Rule | Explanation | +| :--- | :--- | :--- | +| **NotDangerous** | `+@all -@dangerous ~*` | Allows all commands except dangerous ones (e.g., `FLUSHDB`). | +| **ReadWrite** | `-@all +@write +@read -@dangerous ~*` | Allows read/write, blocks dangerous ops. | +| **ReadOnly** | `-@all +@read -keys ~*` | Allows read-only operations. | +| **Administrator** | `+@all -acl ~*` | Admin privileges, allows all commands except ACL management. | + +For custom ACLs, see [Redis ACL Documentation](https://redis.io/topics/acl). + +### Security Mechanisms + +1. **ACL Force Revocation**: All `RedisUser` creation/updates undergo Webhook validation to **force remove `acl` permissions**, preventing privilege escalation. +2. **Cluster Command Injection**: For **Cluster Mode**, Alauda Cache Service for Redis OSS automatically injects topology commands: `cluster|slots`, `cluster|nodes`, `cluster|info`, `cluster|keyslot`, `cluster|getkeysinslot`, `cluster|countkeysinslot` to ensure client awareness. +3. **6.0 -> 7.2 Upgrade Compatibility**: When upgrading 6.0 -> 7.2, the operator adds `&*` (Pub/Sub Channel) permission to ensure consistency with 7.x's new Channel ACLs. + +### System Account + +Each Redis instance automatically generates a system account named `operator`. Its roles include: + +1. 
**Cluster Init**: Slot assignment, node joining. +2. **Config Simplification**: Unified system account reduces user configuration complexity. +3. **Operations**: Used for health checks, failovers, scaling. +4. **Avoid Restarts**: Password updates for business users don't affect this account, avoiding restarts. + +> [!CAUTION] +> `operator` is a **Reserved System Account**: +> * **Complexity**: Random 64-char string (alphanumeric+special). +> * **Privilege**: Highest level (includes user management). +> * **Restriction**: **No online password update** and **DO NOT manually modify/delete**, as it may cause irreversible failure. + +### Production Best Practices + +1. **App Isolation**: Create **independent user accounts** for each app/microservice. Avoid sharing accounts to enable auditing and isolation. +2. **Principle of Least Privilege**: + * **Read-Only App**: Use `ReadOnly`. + * **Read-Write App**: Use `ReadWrite`. + * **Ops Tools**: Use `NotDangerous` or custom permissions. + * **Avoid `Administrator`**: Unless absolutely necessary. +3. **Key Namespace Isolation**: Combine ACL Key patterns (e.g., `~app1:*`) to restrict apps to specific key prefixes. +4. **Password Rotation**: Establish mechanisms to regularly rotate app passwords. + +For operation steps, see [User Management Docs](../functions/20-user.mdx). + +## Client Access + +### Topology Discovery + +Both **Sentinel** and **Cluster** modes rely on clients actively discovering and connecting to data nodes, differing from traditional LB proxy modes: + +#### Sentinel Mode + +1. Client connects to **Sentinel Node**. +2. Client sends `SENTINEL get-master-addr-by-name mymaster` to get Master **IP/Port**. +3. Client **directly connects** to Master. +4. On failover, Sentinel notifies client (or client polls) to switch to new Master. + +#### Cluster Mode + +1. Client connects to any **Cluster Node**. +2. Sends `CLUSTER SLOTS` / `CLUSTER NODES` to get **Slot Distribution**. +3. Calculates hash slot for Key and **directly connects** to target node. +4. If slot migrates, node returns `MOVED`/`ASK`; client must refresh topology. + +Both protocols return **Real Node IPs**. If a reverse proxy (HAProxy/Nginx) is used, clients still get backend real IPs, which may be unreachable from outside the cluster. +Thus, **Each Redis Pod needs an independent external address** (NodePort/LoadBalancer), not a single proxy address. + +### Network Access strategies + +Alauda Cache Service for Redis OSS supports multiple access methods: + +#### Sentinel Mode + +| Method | Recommended | Description | +| :--- | :--- | :--- | +| **ClusterIP** | ✅ **Internal Preferred** | Access Sentinel via K8s Service (port 26379). Clients auto-discover Master. Lowest latency, highest security. | +| **LoadBalancer** | ✅ **External Preferred** | Exposes Sentinel via MetalLB/Cloud LB. Stable external entry, no port management. | +| **NodePort** | ⚠️ External Backup | Exposes Sentinel via Node ports. Requires manual port management, risky, potential multi-NIC binding issues. | + +#### Cluster Mode + +| Method | Recommended | Description | +| :--- | :--- | :--- | +| **ClusterIP** | ✅ **Internal Preferred** | Access via K8s Service. Client must support Cluster protocol. | +| **LoadBalancer** | ✅ **External Preferred** | Configure LB for each shard Master. Stable external access. Client must handle MOVED/ASK. | +| **NodePort** | ⚠️ External Backup | Expose underlying Pod NodePorts. Client connects directly. Complex port management. 
| + +> [!WARNING] +> **NodePort Notes**: +> * **Port Management**: Range limited (30000-32767), conflicts easy in multi-instance. +> * **Security**: Increases attack surface. +> * **Multi-NIC**: Redis binds default NIC; clients may fail to connect if IPs mismatch. +> * **No LB Proxy**: Sentinel/Cluster protocols require direct node connection; cannot be proxied by standard LBs. + +> [!INFO] +> **Resource Usage**: LB/NodePort creates a **Service per Pod**. +> * **Sentinel** (1M1S + 3 Sentinels): Needs **8 NodePorts/LBs**. +> * **Cluster** (3 Shards x 1M1S): Needs **7 NodePorts/LBs**. + +### Code Examples + +We provide best practice examples for **go-redis**, **Jedis**, **Lettuce**, and **Redisson**: + +* **Sentinel Access**: [How to Access Sentinel Instance](./access/10-sentinel.mdx) +* **Cluster Access**: [How to Access Cluster Instance](./access/20-cluster.mdx) + +> [!INFO] +> **Master Group Name**: In Sentinel mode, the master name is fixed to `mymaster`. + +### Client Reliability Best Practices + +1. **Timeouts** + * **Connect Timeout**: distinct from Read Timeout. Recommend 1-3s. + * **Read/Write Timeout**: Based on SLA, usually hundreds of ms. + +2. **Retry Strategy** + * **Exponential Backoff**: Do not retry immediately on failure; use backoff (100ms, 200ms...) to avoid retry storms. + +3. **Connection Pooling** + * **Reuse**: Always use pooling (JedisPool, go-redis Pool) to save handshake costs. + * **Max Connections**: Set `MaxTotal` reasonably to avoid hitting Redis `maxclients`. + +4. **Topology Refresh (Cluster)** + * **Auto-Refresh**: Ensure client enables `MOVED/ASK` handling. + * **Periodic Refresh**: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes. + +## Observability & Operations + +### Backup & Security + +The platform Backup Center provides convenient data management. You can backup instances, manage them centrally, and support S3 offloading. Support for restoring history to specific instances. + +See **[Backup & Restore](../functions/70-backup-restore.mdx)**. + +### Upgrade & Scaling + +#### Upgrade +See **[Upgrade](../upgrade.mdx)**. + +#### Scaling Notes +When changing specs (CPU/Mem) or expanding: +1. **Assess Resources**: Ensure cluster has capacity. +2. **Progressive**: Rolling updates to minimize interruption. +3. **Off-peak**: Execute during low traffic. + +> [!CAUTION] +> When reducing replicas or specs, ensure current data/load fits new specs to avoid data loss/crash. + +### Monitoring + +Alauda Cache Service for Redis OSS has built-in metrics integrated with Prometheus. + +#### Built-in Metrics + +Variables `{{.namespace}}` and `{{.name}}` should be replaced with actual values. + +##### Key Hit Rate + +* **Desc**: Cache hit rate. +* **Unit**: % +* **Expr**: + ```text + 1/(1+(avg(irate(redis_keyspace_misses_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) by(namespace,service) / (avg(irate(redis_keyspace_hits_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) by(namespace,service)+1))) + ``` + +##### Average Response Time + +* **Desc**: Avg command latency. High = slow queries/bottleneck. +* **Unit**: s +* **Expr**: + ```text + avg((redis_commands_duration_seconds_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"} / redis_commands_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"})) by (namespace,service) + ``` + +##### Role Switching + +* **Desc**: Master-Replica switches in 5m. Non-zero = failover occurred. 
+* **Unit**: Count +* **Expr**: + ```text + sum by(namespace,service) (changes((sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",role="master"}) OR (sum by(namespace,service,pod)(redis_instance_info{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*",}) * 0))[5m:10s])) + ``` + +##### Instance Status + +* **Desc**: Health status. 0 = Abnormal. +* **Expr**: + ```text + ((count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster"}) % count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"})) == bool 0 and count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"{{.name}}",redisarch="cluster",role="master"}) >= bool 3) or (count by(namespace,service)(redis_instance_info{namespace=~"{{.namespace}}",service=~"rfr-({{.name}})",redisarch="sentinel",role="master"})) > bool 0 + ``` + +##### Node Input Bandwidth + +* **Desc**: Peak ingress traffic. +* **Unit**: Bps +* **Expr**: + ```text + max by(namespace,service)(irate(redis_net_input_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) + ``` + +##### Node Output Bandwidth + +* **Desc**: Peak egress traffic. +* **Unit**: Bps +* **Expr**: + ```text + max by(namespace,service)(irate(redis_net_output_bytes_total{namespace=~"{{.namespace}}", pod=~"(drc|rfr)-({{.name}})-.*"}[5m])) + ``` + +##### Node Connections + +* **Desc**: Peak client connections. Watch if near `maxclients`. +* **Unit**: Count +* **Expr**: + ```text + max by(namespace,service)(redis_connected_clients{namespace=~"{{.namespace}}",pod=~"(drc|rfr)-({{.name}})-.*"}) + ``` + +##### CPU Usage + +* **Desc**: Node CPU usage. Sustained high = perf impact. +* **Unit**: % +* **Expr**: + ```text + avg by(namespace,pod_name)(irate(container_cpu_usage_seconds_total{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}[5m]))/avg by(namespace,pod_name)(container_spec_cpu_quota{namespace=~"{{.namespace}}",pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"})*100000 + ``` + +##### Memory Usage + +* **Desc**: Node memory usage. >80% suggest scaling. +* **Unit**: % +* **Expr**: + ```text + avg by(namespace,pod_name)(container_memory_usage_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"} - container_memory_cache{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}) / avg by(namespace,pod_name)(container_spec_memory_limit_bytes{namespace=~"{{.namespace}}", pod_name=~"(drc|rfr)-({{.name}})-.*",container_name="redis"}) + ``` + +##### Storage Usage + +* **Desc**: PVC usage. Full = persistence failure. +* **Unit**: % +* **Expr**: + ```text + avg(kubelet_volume_stats_used_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim) / avg(kubelet_volume_stats_capacity_bytes{namespace=~"{{.namespace}}",persistentvolumeclaim=~"redis-data-(drc|rfr)-({{.name}})-.*"}) by(namespace,persistentvolumeclaim) + ``` + +#### Key Metrics & Alert Recommendations + +Recommended production alerts: + +| Metric | Threshold | Note | +| :--- | :--- | :--- | +| **Memory Usage** | > 80% | Risk of eviction/OOM. | +| **CPU Usage** | > 80% (Sustained) | Latency spikes. | +| **Hit Rate** | < 80% | Strategy issue or capacity missing. | +| **Failovers** | > 0 | Check network/node health. 
| +| **Connections** | Near maxclients | New connections rejected. | +| **Storage Usage** | > 80% | Ensure space for AOF/RDB. | +| **Response Time** | > 10ms | Slow queries/bottlenecks. | + +### Troubleshooting + +For specific issues, search the [Customer Portal](https://cloud.alauda.cn/kb). + +## References + +[https://redis.io/docs/](https://redis.io/docs/) + +[https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) + +[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) \ No newline at end of file From 4944d48ede4b0a611c318e0322ab046070ff2753 Mon Sep 17 00:00:00 2001 From: Seer Date: Wed, 10 Dec 2025 18:04:36 +0800 Subject: [PATCH 3/9] Update docs/en/how_to/redis-best-practices.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/en/how_to/redis-best-practices.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index 40be42e..27e45f7 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -288,7 +288,7 @@ Deploy resources according to your actual business scenario. rdb-redis-<version>-sentinel 4c8g 4c8g - 8.5c8.64g + 8.5c8.8g 16G 56G 56G From 7c3144796b59c77b1313f91d2763ea5d7674d787 Mon Sep 17 00:00:00 2001 From: Seer Date: Wed, 10 Dec 2025 18:04:57 +0800 Subject: [PATCH 4/9] Update docs/en/how_to/redis-best-practices.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/en/how_to/redis-best-practices.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index 27e45f7..b7cdb2d 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -685,7 +685,7 @@ We provide best practice examples for **go-redis**, **Jedis**, **Lettuce**, and 4. **Topology Refresh (Cluster)** * **Auto-Refresh**: Ensure client enables `MOVED/ASK` handling. - * **Periodic Refresh**: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes. + * **Periodic refresh**: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes. ## Observability & Operations From 6fda14c931c3c83e068b22e625e73c493e52b3b9 Mon Sep 17 00:00:00 2001 From: Seer Date: Wed, 10 Dec 2025 18:05:13 +0800 Subject: [PATCH 5/9] Update docs/en/how_to/redis-best-practices.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/en/how_to/redis-best-practices.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index b7cdb2d..e45c988 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -628,7 +628,7 @@ Both **Sentinel** and **Cluster** modes rely on clients actively discovering and Both protocols return **Real Node IPs**. If a reverse proxy (HAProxy/Nginx) is used, clients still get backend real IPs, which may be unreachable from outside the cluster. Thus, **Each Redis Pod needs an independent external address** (NodePort/LoadBalancer), not a single proxy address. 
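To make the discovery flow above concrete, here is a minimal go-redis (v9) sketch of the Sentinel path: the client asks the Sentinel endpoints for the current master of `mymaster` and then talks to that node directly, re-resolving the address after a failover. The Sentinel address, password, key name, and timeout values below are illustrative placeholders rather than platform defaults; Jedis, Lettuce, and Redisson expose equivalent sentinel/failover modes.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Failover (Sentinel) client: it queries the Sentinel endpoints for the
	// current master of "mymaster" and connects to that node directly,
	// re-resolving the master address after a failover.
	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName: "mymaster", // master group name is fixed in Sentinel mode
		// Placeholder Sentinel endpoint (ClusterIP Service, or per-Pod LB/NodePort).
		SentinelAddrs: []string{"redis-sentinel.my-namespace.svc:26379"},
		Password:      "app-user-password", // business account, not the reserved operator account
		DialTimeout:   2 * time.Second,
		ReadTimeout:   500 * time.Millisecond,
		WriteTimeout:  500 * time.Millisecond,
	})
	defer rdb.Close()

	// Simple round trip to verify the resolved master is reachable.
	if err := rdb.Set(ctx, "app1:greeting", "hello", time.Minute).Err(); err != nil {
		panic(err)
	}
	val, err := rdb.Get(ctx, "app1:greeting").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("read back:", val)
}
```

The same consideration applies to Cluster mode: the client library, not a load balancer, must be able to reach every node address returned by `CLUSTER SLOTS`.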
-### Network Access strategies +### Network Access Strategies Alauda Cache Service for Redis OSS supports multiple access methods: From dca6a234f604a28a2ef50b187355f6660660ae87 Mon Sep 17 00:00:00 2001 From: hfxia Date: Wed, 10 Dec 2025 18:06:27 +0800 Subject: [PATCH 6/9] fix: fix affinity label error --- docs/en/how_to/redis-best-practices.mdx | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index e45c988..4d8056b 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -498,7 +498,7 @@ spec: operator: In values: - - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname sentinel: affinity: podAntiAffinity: @@ -513,7 +513,7 @@ spec: operator: In values: - - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname ``` To force anti-affinity across ALL nodes (Data + Sentinel), refer to: @@ -529,11 +529,11 @@ spec: operator: In values: - redis-failover - - key: middleware.instance/name: + - key: middleware.instance/name operator: In values: - - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname sentinel: affinity: podAntiAffinity: @@ -544,11 +544,11 @@ spec: operator: In values: - redis-failover - - key: middleware.instance/name: + - key: middleware.instance/name operator: In values: - - topologyKey: kubernetes.io/hostname + topologyKey: kubernetes.io/hostname ``` ## User Management @@ -830,4 +830,4 @@ For specific issues, search the [Customer Portal](https://cloud.alauda.cn/kb). [https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) -[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) \ No newline at end of file +[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) From a2114a31df412806b42d0accede1a416e6aa833b Mon Sep 17 00:00:00 2001 From: Seer Date: Wed, 10 Dec 2025 18:06:56 +0800 Subject: [PATCH 7/9] Update docs/en/how_to/redis-best-practices.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- docs/en/how_to/redis-best-practices.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index 4d8056b..0ec4002 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -684,7 +684,7 @@ We provide best practice examples for **go-redis**, **Jedis**, **Lettuce**, and * **Max Connections**: Set `MaxTotal` reasonably to avoid hitting Redis `maxclients`. 4. **Topology Refresh (Cluster)** - * **Auto-Refresh**: Ensure client enables `MOVED/ASK` handling. + * **Auto-refresh**: Ensure client enables `MOVED/ASK` handling. * **Periodic refresh**: In unstable/scaling environments, configure periodic refresh (e.g., 60s) to proactively detect changes. 
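As a minimal sketch of the reliability settings above, the go-redis (v9) Cluster client below sets explicit connect/read/write timeouts, bounded retries with exponential backoff, and a capped connection pool. The endpoint, password, and sizing values are placeholder assumptions to be tuned against your own SLA and the server `maxclients` limit; go-redis refreshes its slot map when it receives `MOVED`/`ASK` redirections, while a fixed periodic refresh interval is a client-specific option (for example in Lettuce) that is not shown here.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Cluster client with the reliability settings recommended above.
	// The address and password are placeholders; list one reachable endpoint
	// (or the ClusterIP Service) and the client discovers the remaining nodes.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:    []string{"redis-cluster.my-namespace.svc:6379"},
		Password: "app-user-password",

		// Timeouts: keep connect distinct from read/write, and keep both tight.
		DialTimeout:  2 * time.Second,
		ReadTimeout:  500 * time.Millisecond,
		WriteTimeout: 500 * time.Millisecond,

		// Retries with exponential backoff to avoid retry storms.
		MaxRetries:      3,
		MinRetryBackoff: 100 * time.Millisecond,
		MaxRetryBackoff: 1 * time.Second,

		// Connection pooling: keep the pool well below the server maxclients.
		PoolSize:     50,
		MinIdleConns: 5,
	})
	defer rdb.Close()

	// MOVED/ASK redirections returned during resharding are handled by the
	// client, which refreshes its slot map and re-issues the command.
	if err := rdb.Set(ctx, "app1:counter", 1, 0).Err(); err != nil {
		panic(err)
	}
	n, err := rdb.Incr(ctx, "app1:counter").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println("counter:", n)
}
```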
## Observability & Operations From 40f6e1ba9368f5c32ba62bcd1a69afcc6c1078d2 Mon Sep 17 00:00:00 2001 From: hfxia Date: Thu, 11 Dec 2025 10:07:10 +0800 Subject: [PATCH 8/9] fix: fix lint --- docs/en/how_to/redis-best-practices.mdx | 42 ++++++++++++------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/redis-best-practices.mdx index 0ec4002..93dcf02 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/redis-best-practices.mdx @@ -265,21 +265,21 @@ Deploy resources according to your actual business scenario. 100m200Mi 2c4g Unlimited (Reserve resources) - 4.5c4.8g + 4.5c4.8G Evaluate based on actual write volume aof-redis-<version>-sentinel 4c8g 4c8g - 8.5c8.8g + 8.5c8.8G RDB rdb-redis-<version>-sentinel 2c4g 2c4g - 4.5c4.8g + 4.5c4.8G 8G 28G 28G @@ -288,7 +288,7 @@ Deploy resources according to your actual business scenario. rdb-redis-<version>-sentinel 4c8g 4c8g - 8.5c8.8g + 8.5c8.8G 16G 56G 56G @@ -298,7 +298,7 @@ Deploy resources according to your actual business scenario. diskless-redis-<version>-sentinel 2c4g 2c4g - 4.5c4.8g + 4.5c4.8G / 28G 28G @@ -307,7 +307,7 @@ Deploy resources according to your actual business scenario. diskless-redis-<version>-sentinel 4c8g 4c8g - 8.5c8.8g + 8.5c8.8G 56G 56G @@ -341,21 +341,21 @@ Deploy resources according to your actual business scenario. 100m300Mi 2c4g Unlimited (Reserve resources) - 12.6c25.8g + 12.6c25.8G Evaluate based on actual write volume aof-redis-<version>-cluster 4c8g 4c8g - 24.6c49.8g + 24.6c49.8G RDB rdb-redis-<version>-cluster 2c4g 2c4g - 12.6c25.8g + 12.6c25.8G 24G 84G 84G @@ -364,7 +364,7 @@ Deploy resources according to your actual business scenario. rdb-redis-<version>-cluster 4c8g 4c8g - 24.6c49.8g + 24.6c49.8G 48G 168G 168G @@ -374,7 +374,7 @@ Deploy resources according to your actual business scenario. diskless-redis-<version>-cluster 2c4g 2c4g - 12.6c25.8g + 12.6c25.8G / 84G 84G @@ -383,7 +383,7 @@ Deploy resources according to your actual business scenario. diskless-redis-<version>-cluster 4c8g 4c8g - 24.6c49.8g + 24.6c49.8G 168G 168G @@ -435,7 +435,7 @@ In Cluster mode, the system **prioritizes `spec.affinityPolicy`**. Alauda Cache All Pods Forced Anti-Affinity AntiAffinity - Forces ALL Pods in the cluster (including master/replicas of different shards) to be on different nodes. Fails if node count < total Pod count. + Forces ALL Pods in the cluster (including primary/replicas of different shards) to be on different nodes. Fails if node count < total Pod count.
  • Pros: Highest disaster recovery, minimal single-node failure impact.
  • @@ -445,25 +445,25 @@ In Cluster mode, the system **prioritizes `spec.affinityPolicy`**. Alauda Cache Cluster Mode Core Business
    Ample resources, strict HA requirements. - Shard Master-Replica Forced Anti-Affinity + Shard Primary-Replica Forced Anti-Affinity AntiAffinityInSharding - Forces Master and Replicas within same shard to be on different nodes. Pods from different shards can coexist. + Forces Primary and Replicas within same shard to be on different nodes. Pods from different shards can coexist.
    • Pros: Guarantees physical isolation of each shard's replicas, so the shard's data survives the failure of any single node.<br />
    • -
    • Cons: Scheduling fails if live nodes < replica count. Masters of different shards might land on same node (SPOF risk).
    • +
    • Cons: Scheduling fails if live nodes < replica count. Primaries of different shards might land on same node (single point of failure risk).
    Production Standard
    Balances resource usage and data safety. - Shard Master-Replica Soft Anti-Affinity + Shard Primary-Replica Soft Anti-Affinity SoftAntiAffinity - Prioritizes spreading shard master/replicas. If impossible (e.g., insufficient nodes), allows scheduling on same node. + Prioritizes spreading shard primary/replicas. If impossible (e.g., insufficient nodes), allows scheduling on same node.
    • Pros: Highest deployment success rate, runs with limited resources.
    • -
    • Cons: Master/Replica may share node in extreme cases, risking data loss.
    • +
    • Cons: Primary/Replica may share node in extreme cases, risking data loss.
    Test/Dev Environments
    Or resource-constrained edge environments. @@ -657,8 +657,8 @@ Alauda Cache Service for Redis OSS supports multiple access methods: > [!INFO] > **Resource Usage**: LB/NodePort creates a **Service per Pod**. -> * **Sentinel** (1M1S + 3 Sentinels): Needs **8 NodePorts/LBs**. -> * **Cluster** (3 Shards x 1M1S): Needs **7 NodePorts/LBs**. +> * **Sentinel** (1P1R + 3 Sentinels): Needs **8 NodePorts/LBs**. +> * **Cluster** (3 Shards x 1P1R): Needs **7 NodePorts/LBs**. ### Code Examples From 4793fdb2bfa18c48c68e86a618dca31cf3e86767 Mon Sep 17 00:00:00 2001 From: hfxia Date: Thu, 11 Dec 2025 11:32:25 +0800 Subject: [PATCH 9/9] fix: updated the references --- ...redis-best-practices.mdx => best-practices.mdx} | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) rename docs/en/how_to/{redis-best-practices.mdx => best-practices.mdx} (98%) diff --git a/docs/en/how_to/redis-best-practices.mdx b/docs/en/how_to/best-practices.mdx similarity index 98% rename from docs/en/how_to/redis-best-practices.mdx rename to docs/en/how_to/best-practices.mdx index 93dcf02..74aa922 100644 --- a/docs/en/how_to/redis-best-practices.mdx +++ b/docs/en/how_to/best-practices.mdx @@ -1,9 +1,9 @@ --- weight: 5 -title: Redis Best Practices +title: Best Practices --- -# Redis Best Practices +# Best Practices ## Overview @@ -826,8 +826,8 @@ For specific issues, search the [Customer Portal](https://cloud.alauda.cn/kb). ## References -[https://redis.io/docs/](https://redis.io/docs/) - -[https://docs.redis.com/latest/ri/memory-optimizations/](https://docs.redis.com/latest/ri/memory-optimizations/) - -[https://architecturenotes.co/redis/](https://architecturenotes.co/redis/) +* [High availability with Redis Sentinel](https://redis.io/docs/latest/operate/oss_and_stack/management/sentinel/) +* [Redis cluster specification](https://redis.io/docs/latest/operate/oss_and_stack/reference/cluster-spec/) +* [Redis persistence](https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/) +* [Scale with Redis Cluster](https://redis.io/docs/latest/operate/oss_and_stack/management/scaling/) +* [Optimizing Redis](https://redis.io/docs/latest/operate/oss_and_stack/management/optimization/)