109 changes: 67 additions & 42 deletions integrations/llms/vertex-ai.mdx
@@ -195,56 +195,23 @@ This route only works with Claude models. For other models, use the standard Ope
Portkey supports the [Google Vertex AI CountTokens API](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/count-tokens) to estimate token usage before sending requests. Check out the count-tokens guide for more details.
</Card>

## Vertex AI context caching

Vertex AI supports [context caching](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-create) to reduce costs and latency for repeated prompts with large amounts of context. You can create a cache and then reference it in subsequent inference requests.

<Info>
**This is Vertex AI's native context caching** - a feature specific to Gemini models on Vertex AI. This is different from [Portkey's gateway caching](/product/ai-gateway/cache-simple-and-semantic) which provides simple and semantic caching modes at the Portkey layer for any provider.

Use Vertex AI context caching when you need provider-native cache management with TTL controls. Use Portkey's gateway caching for cross-provider caching without provider-specific setup.
</Info>
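
For comparison, Portkey's gateway caching is enabled entirely through a Portkey config and works with any provider. The following is a minimal sketch, assuming the inline `x-portkey-config` header and the `cache` config block from Portkey's caching docs; the model and provider slugs are placeholders for your own setup.

```sh cURL
curl https://api.portkey.ai/v1/chat/completions \
--header 'Content-Type: application/json' \
--header 'x-portkey-api-key: {{your_api_key}}' \
--header 'x-portkey-provider: {{@my-vertex-ai-provider}}' \
--header 'x-portkey-config: {"cache": {"mode": "simple", "max_age": 3600}}' \
--data '{
  "model": "{{MODEL_ID}}",
  "messages": [{ "role": "user", "content": "Summarize the attached contract." }]
}'
```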

<Note>
Context caching on Vertex AI is only available for **Gemini models**. It is not supported for Anthropic, Meta, or other models hosted on Vertex AI.
</Note>

### Use case 1: Using existing context caches

If you have already created a context cache using Vertex AI's APIs or console, you can reference it in your Portkey requests using the `cached_content` parameter.

<Tabs>
<Tab title="cURL">
@@ -308,6 +275,64 @@ console.log(completion);
The model and region used in the inference request must match the model and region used when creating the cache.
</Warning>

### Use case 2: Creating new context caches

Use Portkey's proxy capability with the `x-portkey-custom-host` header to call Vertex AI's native caching endpoints directly. This allows you to create and manage caches through Portkey while leveraging Vertex AI's native caching infrastructure.

```sh cURL
curl --location 'https://api.portkey.ai/v1/projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/cachedContents' \
--header 'x-portkey-provider: {{@my-vertex-ai-provider}}' \
--header 'Content-Type: application/json' \
--header 'x-portkey-api-key: {{your_api_key}}' \
--header 'x-portkey-custom-host: https://aiplatform.googleapis.com/v1' \
--data '{
  "model": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/publishers/google/models/{{MODEL_ID}}",
  "displayName": "{{my-cache-display-name}}",
  "contents": [
    {
      "role": "user",
      "parts": [{
        "text": "This is sample text to demonstrate explicit caching. (You need a minimum of 1024 tokens.)"
      }]
    },
    {
      "role": "model",
      "parts": [{
        "text": "Thank you, I am your helpful assistant."
      }]
    }
  ]
}'
```

**Request variables:**

| Variable | Description |
|----------|-------------|
| `YOUR_PROJECT_ID` | Your Google Cloud project ID. |
| `LOCATION` | The region where your model is deployed (e.g., `us-central1`). |
| `MODEL_ID` | The model identifier (e.g., `gemini-1.5-pro-001`). |
| `my-cache-display-name` | A unique name to identify your cache. |
| `your_api_key` | Your Portkey API key. |
| `@my-vertex-ai-provider` | Your Vertex AI provider slug from Portkey's Model Catalog. |
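
The create call returns a cached content resource. Below is a sketch of the relevant response fields (field names follow Vertex AI's `cachedContents` resource; the values are illustrative placeholders). The `name` value is what you later pass as `cached_content`:

```json
{
  "name": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/cachedContents/{{CACHE_ID}}",
  "model": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/publishers/google/models/{{MODEL_ID}}",
  "createTime": "2025-01-01T00:00:00Z",
  "expireTime": "2025-01-01T01:00:00Z"
}
```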

<Note>
Context caching requires a minimum of 1024 tokens in the cached content. The cache has a default TTL (time-to-live) which you can configure using the `ttl` parameter.
</Note>
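
For example, you can set the lifetime at creation time by adding a `ttl` field to the request body shown above. This sketch assumes Vertex AI's duration-string format (for example, `"3600s"` for one hour):

```json
{
  "model": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/publishers/google/models/{{MODEL_ID}}",
  "displayName": "{{my-cache-display-name}}",
  "ttl": "3600s",
  "contents": [{
    "role": "user",
    "parts": [{ "text": "Large context to cache (minimum 1024 tokens) ..." }]
  }]
}
```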

### Use the cache in inference requests

Once the cache is created, reference it in your chat completion requests using the `cached_content` parameter (see [Use case 1](#use-case-1-using-existing-context-caches) above).
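
A minimal sketch of such a request, assuming the standard Portkey chat completions route and that `cached_content` is forwarded in the request body (the cache resource name and model are placeholders):

```sh cURL
curl https://api.portkey.ai/v1/chat/completions \
--header 'Content-Type: application/json' \
--header 'x-portkey-api-key: {{your_api_key}}' \
--header 'x-portkey-provider: {{@my-vertex-ai-provider}}' \
--data '{
  "model": "{{MODEL_ID}}",
  "cached_content": "projects/{{YOUR_PROJECT_ID}}/locations/{{LOCATION}}/cachedContents/{{CACHE_ID}}",
  "messages": [
    { "role": "user", "content": "Using the cached context, what are the key terms?" }
  ]
}'
```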

### Context caching pricing

Vertex AI context caching uses separate pricing for cache operations:

| Token type | Price per token |
|------------|-----------------|
| Cache write input tokens | $0.000625 |
| Cache read input tokens | $0.00005 |

Cache read tokens are significantly cheaper than standard input tokens, making context caching cost-effective for repeated queries against the same large context.
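
For example, at the per-token rates listed above, writing a 100,000-token context to the cache costs 100,000 × $0.000625 = $62.50 once, while each subsequent request that reads that cached context incurs 100,000 × $0.00005 = $5.00 in cache read charges.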

For more details on context caching options like TTL configuration and cache management, refer to the [Vertex AI context caching documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-create).

---