2 changes: 1 addition & 1 deletion docs/concepts/resources/index.md
@@ -46,7 +46,7 @@ Manage model storage and access patterns:

### LocalModel & LocalModelNode
Enables local model caching and management:
- **[Concepts](../../model-serving/generative-inference/modelcache/localmodel.md)**: Overview of local model caching in KServe.
**Contributor** commented:
I think ModelCache works for both scenarios; however, it is more useful for generative AI, where the models can get very big, no?

**@Billy99** (Author) replied on Dec 15, 2025:

When you read the feature description, it only describes the InferenceService CRD, which is Predictive AI.
https://kserve.github.io/website/docs/model-serving/generative-inference/modelcache/localmodel

I recall from a previous discussion that it may work with the LLMInferenceService CRD longer term, but it's not there yet.

**Contributor** replied:

You can run generative AI with the InferenceService type; the LLMInferenceService is designed for deployments that use LLM-D.

**@Billy99** (Author) replied:

Ok, let me reread some of the docs with that perspective.

**@Billy99** (Author) replied:

From Evolution: Dual-Track Strategy:

Strategic Separation:

  • InferenceService: Remains the standard for Predictive AI (classification, regression, recommendations)
  • LLMInferenceService: Dedicated to Generative AI with specialized optimizations
  • Can you use InferenceService for LLMs? Yes, but only for basic single-node deployments. Advanced features like prefill-decode separation, multi-node orchestration, and intelligent scheduling are not available

So it's clear that InferenceService can be used for both Predictive AI and Generative AI, and I misunderstood. However, it reads like "you can use it for Generative AI, but why would you, since you lose all these features?".

And the QuickStart makes it sound like it's one or the other, which is probably where I got my original thought that InferenceService was only for Predictive AI.

Welcome to the KServe Quickstart Guide! This guide will help you set up a KServe Quickstart environment for testing and experimentation. KServe provides two deployment paths based on your use case:

  • Generative AI (LLMInferenceService): For Large Language Models and generative AI workload
  • Predictive AI (InferenceService): For traditional ML models and predictive inference workloads

So the description for this PR is wrong: Local Model Cache can be used for both Generative AI and Predictive AI. Totally fine closing this PR. But is it true that it is probably used more for Predictive AI? If so, does it make sense to move the Local Model Cache description to Predictive AI? If not, then to remove confusion, maybe it should be moved up a level so it's common, or placed under Model Storage, instead of tucked under Generative AI? Once again, happy to close this PR off, just brainstorming.

- **[Concepts](../../model-serving/predictive-inference/modelcache/localmodel.md)**: Overview of local model caching in KServe.
- **[LocalModelCache](../../reference/crd-api.mdx)**: CRD that defines local model caching requirements and policies
- **[LocalModelNode](../../reference/crd-api.mdx)**: CRD that handles node-level model cache management
- **[LocalModelNodeGroup](../../reference/crd-api.mdx)**: CRD that groups local model nodes for management and orchestration of cached models
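As a minimal sketch of how these CRDs fit together (field names follow the v1alpha1 `LocalModelCache` API; the cache name, model URI, size, and node group name below are illustrative, not from this PR):

```yaml
# Hypothetical example: cache a model on the nodes of a "workers"
# LocalModelNodeGroup so InferenceServices can load it from local disk.
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: example-model-cache          # illustrative name
spec:
  sourceModelUri: "hf://meta-llama/Llama-3.1-8B-Instruct"  # illustrative URI
  modelSize: 16Gi                    # approximate size, used for capacity checks
  nodeGroups:
    - workers                        # must match an existing LocalModelNodeGroup
```

KServe then creates LocalModelNode resources to track the download state of the cached model on each node in the group.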
1 change: 0 additions & 1 deletion docs/getting-started/genai-first-isvc.md
@@ -319,5 +319,4 @@ Now that you have successfully deployed a generative AI service using KServe, yo
- 📖 **[Supported Tasks](../model-serving/generative-inference/overview.md#supported-generative-tasks)** - Discover the various tasks that KServe can handle.
- 📖 **[Autoscaling](../model-serving/generative-inference/autoscaling/autoscaling.md)**: Automatically scale your service based on traffic and resource usage / metrics.
- 📖 **[KV Cache Offloading](../model-serving/generative-inference/kvcache-offloading/kvcache-offloading.md)** - Learn how to offload key-value caches to external storage for improved performance and reduced latency.
- 📖 **[Model Caching](../model-serving/generative-inference/modelcache/localmodel.md)** - Learn how to cache models for faster startup time.
- 📖 **[Token Rate Limiting](../model-serving/generative-inference/ai-gateway/envoy-ai-gateway.md)** - Rate limit users based on token usage.
1 change: 1 addition & 0 deletions docs/getting-started/predictive-first-isvc.md
@@ -297,3 +297,4 @@ Now that you have successfully deployed your first Predictive InferenceService,
- 📖 **[Supported Frameworks](../model-serving/predictive-inference/frameworks/overview.md)** - Explore Supported Frameworks.
- 📖 **[Batch InferenceService](../model-serving/predictive-inference/batcher/batcher.md)** - Deploy your first Batch InferenceService.
- 📖 **[Canary Deployments](../model-serving/predictive-inference/rollout-strategies/canary-example.md)**: Gradually roll out new model versions to test their performance before full deployment.
- 📖 **[Model Caching](../model-serving/predictive-inference/modelcache/localmodel.md)** - Learn how to cache models for faster startup time.
2 changes: 1 addition & 1 deletion docs/intro.md
@@ -35,7 +35,6 @@ Enterprise authentication, network policies, and compliance features built-in. D
#### Generative Inference Benefits
✅ **LLM Multi-framework Support** - Deploy LLMs from Hugging Face, vLLM, and custom generative models
✅ **OpenAI-Compatible APIs** - Chat completion, completion, streaming, and embedding endpoints
✅ **LocalModelCache for LLMs** - Cache large models locally to reduce startup time from 15-20 minutes to ~1 minute
✅ **KV Cache Offloading** - Optimized memory management for long conversations and large contexts
✅ **Multi-node Inference** - Distributed LLM serving
✅ **Envoy AI Gateway Integration** - Enterprise-grade API management and routing for AI workloads
@@ -50,6 +49,7 @@ Enterprise authentication, network policies, and compliance features built-in. D
✅ **Real-time Scoring** - Low-latency prediction serving for real-time applications
✅ **Production ML Monitoring** - Comprehensive observability, drift detection, and explainability
✅ **Standard Inference Protocols** - Support for Open Inference Protocol (V1/V2) across frameworks
✅ **LocalModelCache for LLMs** - Cache large models locally to reduce startup time from 15-20 minutes to ~1 minute

#### Universal Benefits (Both Inference Types)
✅ **Serverless Inference Workloads** - Automatic scaling including scale-to-zero on both CPU and GPU
1 change: 0 additions & 1 deletion docs/model-serving/generative-inference/overview.md
@@ -156,7 +156,6 @@ The following examples demonstrate how to deploy and perform inference using the

## Advanced Features
The Hugging Face runtime supports several advanced features to enhance model serving capabilities:
- [**Model Caching**](./modelcache/localmodel.md): Cache models on local storage for faster loading and reduced latency. This is particularly useful for large models that are frequently accessed.
- [**KV Cache Offloading**](./kvcache-offloading/kvcache-offloading.md): Offload key-value caches to CPU memory to reduce GPU memory usage, allowing larger models to be served on GPUs with limited memory.
- [**Distributed LLM Serving**](./multi-node/multi-node.md): Scale model serving across multiple nodes and GPUs for high throughput and low latency. This is useful for serving large models or handling high request volumes.
- [**AI Gateway**](./ai-gateway/envoy-ai-gateway.md): Use the AI Gateway to manage rate-limiting based on tokens and route requests to different models, providing a unified API for various generative tasks.
@@ -276,7 +276,6 @@ After integrating your LLM with an SDK, consider exploring:

1. **Advanced serving options** like [multi-node inference](../multi-node/multi-node.md) for large models
2. **Exploring other inference tasks** such as [text-to-text generation](../tasks/text2text-generation/text2text-generation.md) and [embeddings](../tasks/embedding/embedding.md)
3. **Optimizing performance** with features like [model caching](../modelcache/localmodel.md) and [KV cache offloading](../kvcache-offloading/kvcache-offloading.md)
4. **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../autoscaling/autoscaling.md)
3. **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../autoscaling/autoscaling.md)

By connecting your KServe-deployed models with these popular SDKs, you can quickly build sophisticated AI applications while maintaining control over your model infrastructure.
@@ -262,7 +262,7 @@ Once you've successfully deployed your embedding model, consider:

- **Advanced serving options** like [multi-node inference](../../multi-node/multi-node.md) for large models
- **Exploring other inference tasks** such as [text-to-text generation](../text2text-generation/text2text-generation.md) and [reranking](../reranking/rerank.md)
- **Optimizing performance** with features like [model caching](../../modelcache/localmodel.md) and [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Optimizing performance** with features like [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../../autoscaling/autoscaling.md)
- **Token based rate limiting** to control usage with [AI Gateway](../../ai-gateway/envoy-ai-gateway.md) for serving models.

@@ -221,7 +221,7 @@ Once you've successfully deployed your reranker model, consider:

- **Advanced serving options** like [multi-node inference](../../multi-node/multi-node.md) for large models
- **Exploring other inference tasks** such as [text-to-text generation](../text2text-generation/text2text-generation.md) and [embedding](../embedding/embedding.md)
- **Optimizing performance** with features like [model caching](../../modelcache/localmodel.md) and [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Optimizing performance** with features like [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../../autoscaling/autoscaling.md)
- **Token based rate limiting** to control usage with [AI Gateway](../../ai-gateway/envoy-ai-gateway.md) for serving models.

@@ -326,7 +326,7 @@ Once you've successfully deployed your text generation model, consider:

- **Advanced serving options** like [multi-node inference](../../multi-node/multi-node.md) for large models
- **Exploring other inference tasks** such as [text-to-text generation](../text2text-generation/text2text-generation.md) and [embedding](../embedding/embedding.md)
- **Optimizing performance** with features like [model caching](../../modelcache/localmodel.md) and [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Optimizing performance** with features like [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../../autoscaling/autoscaling.md)
- **Token based rate limiting** to control usage with [AI Gateway](../../ai-gateway/envoy-ai-gateway.md) for serving models.

@@ -202,7 +202,7 @@ Once you've successfully deployed your text generation model, consider:

- **Advanced serving options** like [multi-node inference](../../multi-node/multi-node.md) for large models
- **Exploring other inference tasks** such as [reranking](../reranking/rerank.md) and [embedding](../embedding/embedding.md)
- **Optimizing performance** with features like [model caching](../../modelcache/localmodel.md) and [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Optimizing performance** with features like [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../../autoscaling/autoscaling.md)
- **Token based rate limiting** to control usage with [AI Gateway](../../ai-gateway/envoy-ai-gateway.md) for serving models.

@@ -130,3 +130,4 @@ spec:
- Learn about [custom model serving](https://github.com/kserve/kserve/tree/master/docs/samples/v1beta1/custom)
- Check out the [sample implementations](https://github.com/kserve/kserve/tree/master/docs/samples/v1beta1) for hands-on tutorials
- Read the [KServe developer guide](https://github.com/kserve/kserve/blob/master/docs/DEVELOPER_GUIDE.md)
- Optimize performance with features like [model caching](../modelcache/localmodel.md)
2 changes: 1 addition & 1 deletion sidebars.ts
@@ -113,7 +113,6 @@ const sidebars: SidebarsConfig = {
},
"model-serving/generative-inference/sdk-integration/sdk-integration",
"model-serving/generative-inference/kvcache-offloading/kvcache-offloading",
"model-serving/generative-inference/modelcache/localmodel",
"model-serving/generative-inference/autoscaling/autoscaling",
"model-serving/generative-inference/multi-node/multi-node",
"model-serving/generative-inference/ai-gateway/envoy-ai-gateway",
@@ -181,6 +180,7 @@ const sidebars: SidebarsConfig = {
"model-serving/predictive-inference/transformers/feast-feature-store/feast-feature-store",
]
},
"model-serving/predictive-inference/modelcache/localmodel",
{
type: 'category',
label: 'Model Explainability',
8 changes: 4 additions & 4 deletions src/components/HomepageBenefits/index.tsx
@@ -31,10 +31,6 @@ export default function HomepageBenefits() {
<h4>🚅 GPU Acceleration</h4>
<p>High-performance serving with GPU support and optimized memory management for large models</p>
</div>
<div className={styles.benefitCard}>
<h4>💾 Model Caching</h4>
<p>Intelligent model caching to reduce loading times and improve response latency for frequently used models</p>
</div>
<div className={styles.benefitCard}>
<h4>🗂️ KV Cache Offloading</h4>
<p>Advanced memory management with KV cache offloading to CPU/disk for handling longer sequences efficiently</p>
@@ -71,6 +67,10 @@ export default function HomepageBenefits() {
<h4>⚡ Auto-scaling</h4>
<p>Request-based autoscaling with scale-to-zero for predictive workloads</p>
</div>
<div className={styles.benefitCard}>
<h4>💾 Model Caching</h4>
<p>Intelligent model caching to reduce loading times and improve response latency for frequently used models</p>
</div>
<div className={styles.benefitCard}>
<h4>🔍 Model Explainability</h4>
<p>Built-in support for model explanations and feature attribution to understand prediction reasoning</p>