2 changes: 1 addition & 1 deletion docs/concepts/resources/index.md
@@ -46,7 +46,7 @@ Manage model storage and access patterns:

### LocalModel & LocalModelNode
Enables local model caching and management:
- **[Concepts](../../model-serving/generative-inference/modelcache/localmodel.md)**: Overview of local model caching in KServe.
**Contributor** commented:
I think ModelCache works for both scenarios; however, it is more useful for generative AI, where the models can get very big, no?

**@Billy99** (Author) replied on Dec 15, 2025:

When you read the feature description, it only describes the InferenceService CRD, which is Predictive AI.
https://kserve.github.io/website/docs/model-serving/generative-inference/modelcache/localmodel

I recall from a previous discussion that it may work with the LLMInferenceService CRD longer term, but it's not there yet.

**Contributor** replied:

You can run generative AI with the InferenceService type; the LLMInferenceService is designed for deployments that use LLM-D.

**@Billy99** (Author) replied:

Ok, let me reread some of the docs with that perspective.

**@Billy99** (Author) replied:

From Evolution: Dual-Track Strategy:

Strategic Separation:

  • InferenceService: Remains the standard for Predictive AI (classification, regression, recommendations)
  • LLMInferenceService: Dedicated to Generative AI with specialized optimizations
  • Can you use InferenceService for LLMs? Yes, but only for basic single-node deployments. Advanced features like prefill-decode separation, multi-node orchestration, and intelligent scheduling are not available

So it's clear that InferenceService can be used for both Predictive AI and Generative AI, and I misunderstood. However, it reads like "you can use it for Generative AI, but why would you, since you lose all these features?".

And the QuickStart makes it sound like it's one or the other, which is probably where I got my original thought that InferenceService was only for Predictive AI.

Welcome to the KServe Quickstart Guide! This guide will help you set up a KServe Quickstart environment for testing and experimentation. KServe provides two deployment paths based on your use case:

  • Generative AI (LLMInferenceService): For Large Language Models and generative AI workload
  • Predictive AI (InferenceService): For traditional ML models and predictive inference workloads

So the description for this PR is wrong: Local Model Cache can be used for both Generative AI and Predictive AI. Totally fine closing this PR. But is it true that it is probably used more for Predictive AI? If so, does it make sense to move the Local Model Cache description to Predictive AI? If not, then to remove confusion, maybe it should be moved up a level so it's common, or placed under Model Storage, instead of tucked under Generative AI? Once again, happy to close this PR off, just brainstorming.

- **[Concepts](../../model-serving/predictive-inference/modelcache/localmodel.md)**: Overview of local model caching in KServe.
- **[LocalModelCache](../../reference/crd-api.mdx)**: CRD that defines local model caching requirements and policies
- **[LocalModelNode](../../reference/crd-api.mdx)**: CRD that handles node-level model cache management
- **[LocalModelNodeGroup](../../reference/crd-api.mdx)**: CRD that groups local model nodes for management and orchestration of cached models
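As a minimal sketch of how these CRDs fit together (field names follow the v1alpha1 `LocalModelCache` API; the cache name, model URI, size, and node group name below are illustrative, not from this PR):

```yaml
# Hypothetical example: cache a model on the nodes of a "workers"
# LocalModelNodeGroup so InferenceServices can load it from local disk.
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: example-model-cache          # illustrative name
spec:
  sourceModelUri: "hf://meta-llama/Llama-3.1-8B-Instruct"  # illustrative URI
  modelSize: 16Gi                    # approximate size, used for capacity checks
  nodeGroups:
    - workers                        # must match an existing LocalModelNodeGroup
```

KServe then creates LocalModelNode resources to track the download state of the cached model on each node in the group.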
1 change: 0 additions & 1 deletion docs/getting-started/genai-first-isvc.md
@@ -319,5 +319,4 @@ Now that you have successfully deployed a generative AI service using KServe, yo
- 📖 **[Supported Tasks](../model-serving/generative-inference/overview.md#supported-generative-tasks)** - Discover the various tasks that KServe can handle.
- 📖 **[Autoscaling](../model-serving/generative-inference/autoscaling/autoscaling.md)**: Automatically scale your service based on traffic and resource usage / metrics.
- 📖 **[KV Cache Offloading](../model-serving/generative-inference/kvcache-offloading/kvcache-offloading.md)** - Learn how to offload key-value caches to external storage for improved performance and reduced latency.
- 📖 **[Model Caching](../model-serving/generative-inference/modelcache/localmodel.md)** - Learn how to cache models for faster startup time.
- 📖 **[Token Rate Limiting](../model-serving/generative-inference/ai-gateway/envoy-ai-gateway.md)** - Rate limit users based on token usage.
1 change: 1 addition & 0 deletions docs/getting-started/predictive-first-isvc.md
@@ -297,3 +297,4 @@ Now that you have successfully deployed your first Predictive InferenceService,
- 📖 **[Supported Frameworks](../model-serving/predictive-inference/frameworks/overview.md)** - Explore Supported Frameworks.
- 📖 **[Batch InferenceService](../model-serving/predictive-inference/batcher/batcher.md)** - Deploy your first Batch InferenceService.
- 📖 **[Canary Deployments](../model-serving/predictive-inference/rollout-strategies/canary-example.md)**: Gradually roll out new model versions to test their performance before full deployment.
- 📖 **[Model Caching](../model-serving/predictive-inference/modelcache/localmodel.md)** - Learn how to cache models for faster startup time.
2 changes: 1 addition & 1 deletion docs/intro.md
@@ -35,7 +35,6 @@ Enterprise authentication, network policies, and compliance features built-in. D
#### Generative Inference Benefits
✅ **LLM Multi-framework Support** - Deploy LLMs from Hugging Face, vLLM, and custom generative models
✅ **OpenAI-Compatible APIs** - Chat completion, completion, streaming, and embedding endpoints
✅ **LocalModelCache for LLMs** - Cache large models locally to reduce startup time from 15-20 minutes to ~1 minute
✅ **KV Cache Offloading** - Optimized memory management for long conversations and large contexts
✅ **Multi-node Inference** - Distributed LLM serving
✅ **Envoy AI Gateway Integration** - Enterprise-grade API management and routing for AI workloads
@@ -50,6 +49,7 @@ Enterprise authentication, network policies, and compliance features built-in. D
✅ **Real-time Scoring** - Low-latency prediction serving for real-time applications
✅ **Production ML Monitoring** - Comprehensive observability, drift detection, and explainability
✅ **Standard Inference Protocols** - Support for Open Inference Protocol (V1/V2) across frameworks
✅ **LocalModelCache for LLMs** - Cache large models locally to reduce startup time from 15-20 minutes to ~1 minute

#### Universal Benefits (Both Inference Types)
✅ **Serverless Inference Workloads** - Automatic scaling including scale-to-zero on both CPU and GPU
1 change: 0 additions & 1 deletion docs/model-serving/generative-inference/overview.md
@@ -156,7 +156,6 @@ The following examples demonstrate how to deploy and perform inference using the

## Advanced Features
The Hugging Face runtime supports several advanced features to enhance model serving capabilities:
- [**Model Caching**](./modelcache/localmodel.md): Cache models on local storage for faster loading and reduced latency. This is particularly useful for large models that are frequently accessed.
- [**KV Cache Offloading**](./kvcache-offloading/kvcache-offloading.md): Offload key-value caches to CPU memory to reduce GPU memory usage, allowing larger models to be served on GPUs with limited memory.
- [**Distributed LLM Serving**](./multi-node/multi-node.md): Scale model serving across multiple nodes and GPUs for high throughput and low latency. This is useful for serving large models or handling high request volumes.
- [**AI Gateway**](./ai-gateway/envoy-ai-gateway.md): Use the AI Gateway to manage rate-limiting based on tokens and route requests to different models, providing a unified API for various generative tasks.
@@ -276,7 +276,6 @@ After integrating your LLM with an SDK, consider exploring:

1. **Advanced serving options** like [multi-node inference](../multi-node/multi-node.md) for large models
2. **Exploring other inference tasks** such as [text-to-text generation](../tasks/text2text-generation/text2text-generation.md) and [embeddings](../tasks/embedding/embedding.md)
3. **Optimizing performance** with features like [model caching](../modelcache/localmodel.md) and [KV cache offloading](../kvcache-offloading/kvcache-offloading.md)
4. **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../autoscaling/autoscaling.md)
3. **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../autoscaling/autoscaling.md)

By connecting your KServe-deployed models with these popular SDKs, you can quickly build sophisticated AI applications while maintaining control over your model infrastructure.
@@ -262,7 +262,7 @@ Once you've successfully deployed your embedding model, consider:

- **Advanced serving options** like [multi-node inference](../../multi-node/multi-node.md) for large models
- **Exploring other inference tasks** such as [text-to-text generation](../text2text-generation/text2text-generation.md) and [reranking](../reranking/rerank.md)
- **Optimizing performance** with features like [model caching](../../modelcache/localmodel.md) and [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Optimizing performance** with features like [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../../autoscaling/autoscaling.md)
- **Token based rate limiting** to control usage with [AI Gateway](../../ai-gateway/envoy-ai-gateway.md) for serving models.

@@ -221,7 +221,7 @@ Once you've successfully deployed your reranker model, consider:

- **Advanced serving options** like [multi-node inference](../../multi-node/multi-node.md) for large models
- **Exploring other inference tasks** such as [text-to-text generation](../text2text-generation/text2text-generation.md) and [embedding](../embedding/embedding.md)
- **Optimizing performance** with features like [model caching](../../modelcache/localmodel.md) and [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Optimizing performance** with features like [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../../autoscaling/autoscaling.md)
- **Token based rate limiting** to control usage with [AI Gateway](../../ai-gateway/envoy-ai-gateway.md) for serving models.

@@ -326,7 +326,7 @@ Once you've successfully deployed your text generation model, consider:

- **Advanced serving options** like [multi-node inference](../../multi-node/multi-node.md) for large models
- **Exploring other inference tasks** such as [text-to-text generation](../text2text-generation/text2text-generation.md) and [embedding](../embedding/embedding.md)
- **Optimizing performance** with features like [model caching](../../modelcache/localmodel.md) and [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Optimizing performance** with features like [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../../autoscaling/autoscaling.md)
- **Token based rate limiting** to control usage with [AI Gateway](../../ai-gateway/envoy-ai-gateway.md) for serving models.

@@ -202,7 +202,7 @@ Once you've successfully deployed your text generation model, consider:

- **Advanced serving options** like [multi-node inference](../../multi-node/multi-node.md) for large models
- **Exploring other inference tasks** such as [reranking](../reranking/rerank.md) and [embedding](../embedding/embedding.md)
- **Optimizing performance** with features like [model caching](../../modelcache/localmodel.md) and [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Optimizing performance** with features like [KV cache offloading](../../kvcache-offloading/kvcache-offloading.md)
- **Auto-scaling** your inference services based on traffic patterns using [KServe's auto-scaling capabilities](../../autoscaling/autoscaling.md)
- **Token based rate limiting** to control usage with [AI Gateway](../../ai-gateway/envoy-ai-gateway.md) for serving models.

@@ -130,3 +130,4 @@ spec:
- Learn about [custom model serving](https://github.com/kserve/kserve/tree/master/docs/samples/v1beta1/custom)
- Check out the [sample implementations](https://github.com/kserve/kserve/tree/master/docs/samples/v1beta1) for hands-on tutorials
- Read the [KServe developer guide](https://github.com/kserve/kserve/blob/master/docs/DEVELOPER_GUIDE.md)
- Optimize performance with features like [model caching](../modelcache/localmodel.md)
2 changes: 1 addition & 1 deletion sidebars.ts
@@ -113,7 +113,6 @@ const sidebars: SidebarsConfig = {
},
"model-serving/generative-inference/sdk-integration/sdk-integration",
"model-serving/generative-inference/kvcache-offloading/kvcache-offloading",
"model-serving/generative-inference/modelcache/localmodel",
"model-serving/generative-inference/autoscaling/autoscaling",
"model-serving/generative-inference/multi-node/multi-node",
"model-serving/generative-inference/ai-gateway/envoy-ai-gateway",
@@ -181,6 +180,7 @@ const sidebars: SidebarsConfig = {
"model-serving/predictive-inference/transformers/feast-feature-store/feast-feature-store",
]
},
"model-serving/predictive-inference/modelcache/localmodel",
{
type: 'category',
label: 'Model Explainability',
8 changes: 4 additions & 4 deletions src/components/HomepageBenefits/index.tsx
@@ -31,10 +31,6 @@ export default function HomepageBenefits() {
<h4>🚅 GPU Acceleration</h4>
<p>High-performance serving with GPU support and optimized memory management for large models</p>
</div>
<div className={styles.benefitCard}>
<h4>💾 Model Caching</h4>
<p>Intelligent model caching to reduce loading times and improve response latency for frequently used models</p>
</div>
<div className={styles.benefitCard}>
<h4>🗂️ KV Cache Offloading</h4>
<p>Advanced memory management with KV cache offloading to CPU/disk for handling longer sequences efficiently</p>
@@ -71,6 +67,10 @@ export default function HomepageBenefits() {
<h4>⚡ Auto-scaling</h4>
<p>Request-based autoscaling with scale-to-zero for predictive workloads</p>
</div>
<div className={styles.benefitCard}>
<h4>💾 Model Caching</h4>
<p>Intelligent model caching to reduce loading times and improve response latency for frequently used models</p>
</div>
<div className={styles.benefitCard}>
<h4>🔍 Model Explainability</h4>
<p>Built-in support for model explanations and feature attribution to understand prediction reasoning</p>