109 changes: 67 additions & 42 deletions integrations/llms/vertex-ai.mdx
@@ -195,56 +195,23 @@ This route only works with Claude models. For other models, use the standard Ope
Portkey supports the [Google Vertex AI CountTokens API](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/count-tokens) to estimate token usage before sending requests. Check out the count-tokens guide for more details.
</Card>

## Vertex AI context caching

Vertex AI supports [context caching](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-create) to reduce costs and latency for repeated prompts with large amounts of context. You can create a cache and then reference it in subsequent inference requests.

<Info>
**This is Vertex AI's native context caching** - a feature specific to Gemini models on Vertex AI. This is different from [Portkey's gateway caching](/product/ai-gateway/cache-simple-and-semantic) which provides simple and semantic caching modes at the Portkey layer for any provider.

Use Vertex AI context caching when you need provider-native cache management with TTL controls. Use Portkey's gateway caching for cross-provider caching without provider-specific setup.
</Info>
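
For comparison, Portkey's gateway caching is enabled entirely through a Portkey config and works with any provider. The following is a minimal sketch, assuming the inline `x-portkey-config` header and the `cache` config block from Portkey's caching docs; the model and provider slugs are placeholders for your own setup.

```sh cURL
curl https://api.portkey.ai/v1/chat/completions \
--header 'Content-Type: application/json' \
--header 'x-portkey-api-key: {{your_api_key}}' \
--header 'x-portkey-provider: {{@my-vertex-ai-provider}}' \
--header 'x-portkey-config: {"cache": {"mode": "simple", "max_age": 3600}}' \
--data '{
  "model": "{{MODEL_ID}}",
  "messages": [{ "role": "user", "content": "Summarize the attached contract." }]
}'
```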

<Note>
Context caching on Vertex AI is only available for **Gemini models**. It is not supported for Anthropic, Meta, or other models hosted on Vertex AI.
</Note>

### Use case 1: Using existing context caches

If you have already created a context cache using Vertex AI's APIs or console, you can reference it in your Portkey requests using the `cached_content` parameter.

<Tabs>
<Tab title="cURL">
@@ -308,6 +275,64 @@ console.log(completion);
The model and region used in the inference request must match the model and region used when creating the cache.
</Warning>

### Use case 2: Creating new context caches

Use Portkey's proxy capability with the `x-portkey-custom-host` header to call Vertex AI's native caching endpoints directly. This allows you to create and manage caches through Portkey while leveraging Vertex AI's native caching infrastructure.

```sh cURL
curl --location 'https://api.portkey.ai/v1/projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/cachedContents' \
--header 'x-portkey-provider: {{@my-vertex-ai-provider}}' \
--header 'Content-Type: application/json' \
--header 'x-portkey-api-key: {{your_api_key}}' \
--header 'x-portkey-custom-host: https://aiplatform.googleapis.com/v1' \
--data '{
  "model": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/publishers/google/models/{{MODEL_ID}}",
  "displayName": "{{my-cache-display-name}}",
  "contents": [
    {
      "role": "user",
      "parts": [{
        "text": "This is sample text to demonstrate explicit caching. (You need a minimum of 1024 tokens.)"
      }]
    },
    {
      "role": "model",
      "parts": [{
        "text": "Thank you, I am your helpful assistant."
      }]
    }
  ]
}'
```

**Request variables:**

| Variable | Description |
|----------|-------------|
| `YOUR_PROJECT_ID` | Your Google Cloud project ID. |
| `LOCATION` | The region where your model is deployed (e.g., `us-central1`). |
| `MODEL_ID` | The model identifier (e.g., `gemini-1.5-pro-001`). |
| `my-cache-display-name` | A unique name to identify your cache. |
| `your_api_key` | Your Portkey API key. |
| `@my-vertex-ai-provider` | Your Vertex AI provider slug from Portkey's Model Catalog. |
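
The create call returns a cached content resource. Below is a sketch of the relevant response fields (field names follow Vertex AI's `cachedContents` resource; the values are illustrative placeholders). The `name` value is what you later pass as `cached_content`:

```json
{
  "name": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/cachedContents/{{CACHE_ID}}",
  "model": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/publishers/google/models/{{MODEL_ID}}",
  "createTime": "2025-01-01T00:00:00Z",
  "expireTime": "2025-01-01T01:00:00Z"
}
```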

<Note>
Context caching requires a minimum of 1024 tokens in the cached content. The cache has a default TTL (time-to-live) which you can configure using the `ttl` parameter.
</Note>
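
For example, you can set the lifetime at creation time by adding a `ttl` field to the request body shown above. This sketch assumes Vertex AI's duration-string format (for example, `"3600s"` for one hour):

```json
{
  "model": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/publishers/google/models/{{MODEL_ID}}",
  "displayName": "{{my-cache-display-name}}",
  "ttl": "3600s",
  "contents": [{
    "role": "user",
    "parts": [{ "text": "Large context to cache (minimum 1024 tokens) ..." }]
  }]
}
```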

### Use the cache in inference requests

Once the cache is created, reference it in your chat completion requests using the `cached_content` parameter (see [Use case 1](#use-case-1-using-existing-context-caches) above).
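
A minimal sketch of such a request, assuming the standard Portkey chat completions route and that `cached_content` is forwarded in the request body (the cache resource name and model are placeholders):

```sh cURL
curl https://api.portkey.ai/v1/chat/completions \
--header 'Content-Type: application/json' \
--header 'x-portkey-api-key: {{your_api_key}}' \
--header 'x-portkey-provider: {{@my-vertex-ai-provider}}' \
--data '{
  "model": "{{MODEL_ID}}",
  "cached_content": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/cachedContents/{{CACHE_ID}}",
  "messages": [
    { "role": "user", "content": "Using the cached context, what are the key terms?" }
  ]
}'
```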

### Context caching pricing

Vertex AI context caching uses separate pricing for cache operations:

| Token type | Price per token |
|------------|-----------------|
| Cache write input tokens | $0.000625 |
| Cache read input tokens | $0.00005 |

Cache read tokens are significantly cheaper than standard input tokens, making context caching cost-effective for repeated queries against the same large context.
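
For example, at the per-token rates listed above, writing a 100,000-token context to the cache costs 100,000 × $0.000625 = $62.50 once, while each subsequent request that reads that cached context incurs 100,000 × $0.00005 = $5.00 in cache read charges.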

For more details on context caching options like TTL configuration and cache management, refer to the [Vertex AI context caching documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-create).

---