@techiejd
Owner

Release v0.3.0: Enhanced Chunking API, Extension Fields, and Filterable Vector Search

🎯 Overview

This release refactors the chunking API, adds support for extension fields, and lets vector search results be filtered with where clauses. The plugin now uses Drizzle ORM throughout, replacing raw SQL with a more maintainable, type-safe codebase.

⚠️ Breaking Changes

1. Field-Based Chunking Replaced with toKnowledgePool Functions

Before (v0.2.x):

collections: {
  posts: {
    fields: {
      title: { chunker: chunkText },
      content: { chunker: chunkRichText },
    },
  },
}

After (v0.3.0):

collections: {
  posts: {
    toKnowledgePool: async (doc, payload) => {
      const entries = []
      const titleChunks = await chunkText(doc.title ?? '', payload)
      titleChunks.forEach(chunk => entries.push({ chunk, category: doc.category }))
      
      const contentChunks = await chunkRichText(doc.content, payload)
      contentChunks.forEach(chunk => entries.push({ chunk, category: doc.category }))
      
      return entries
    },
  },
}

Why: This change provides full control over chunking logic, allowing you to:

  • Combine multiple fields into single chunks
  • Attach custom metadata (extension fields) to each chunk
  • Implement complex chunking strategies that weren't possible with the field-based approach
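
As an illustration of the first point, here is a minimal sketch that merges two fields into one text before chunking. The PostDoc and KnowledgePoolEntry shapes, the summary field, and the chunkBySize stand-in chunker are all assumptions made for this example, not part of the plugin's API; the function also ignores the payload argument for brevity.

```typescript
// Hypothetical document shape, for illustration only.
interface PostDoc {
  title?: string | null
  summary?: string | null
  category?: string | null
}

// Hypothetical entry shape: a chunk plus any extension field values.
interface KnowledgePoolEntry {
  chunk: string
  [extensionField: string]: unknown
}

// Stand-in chunker: splits text into pieces of at most maxLength characters.
// A real configuration would call its own chunker here.
function chunkBySize(text: string, maxLength = 200): string[] {
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += maxLength) {
    chunks.push(text.slice(start, start + maxLength))
  }
  return chunks
}

// Joins the non-empty fields so they are chunked and embedded together.
function combineFields(doc: PostDoc): string {
  return [doc.title, doc.summary]
    .filter((part): part is string => typeof part === 'string' && part.trim() !== '')
    .join('\n\n')
}

// Sketch of a toKnowledgePool-style function built from the helpers above.
const combinedToKnowledgePool = async (doc: PostDoc): Promise<KnowledgePoolEntry[]> =>
  chunkBySize(combineFields(doc)).map((chunk) => ({
    chunk,
    category: doc.category ?? null,
  }))
```

Empty or missing fields are dropped before joining, so a post without a summary still produces chunks from its title alone.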

2. fieldPath Removed from Search Results

The fieldPath property has been removed from VectorSearchResult. If you were using this to identify which field a chunk came from, you'll need to track this via extension fields or your chunking logic.

Before:

{
  id: "...",
  fieldPath: "content", // ❌ No longer available
  chunkText: "...",
  // ...
}

After:

{
  id: "...",
  chunkText: "...",
  category: "guides", // ✅ Use extension fields instead
  priority: 4,
  // ...
}
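
If you relied on fieldPath, one possible replacement is to record the originating field yourself. The sketch below assumes a hypothetical extension field named sourceField (declared in extensionFields as { name: 'sourceField', type: 'text' }); the name and the tagChunks helper are illustrative, not something the plugin provides.

```typescript
// Entry shape for this example: a chunk plus the hypothetical sourceField value.
interface TaggedEntry {
  chunk: string
  sourceField: string
}

// Attaches the name of the originating field to every chunk.
function tagChunks(chunks: string[], sourceField: string): TaggedEntry[] {
  return chunks.map((chunk) => ({ chunk, sourceField }))
}

// Chunks are hard-coded here; in a toKnowledgePool function they would
// come from your chunkers.
const taggedEntries: TaggedEntry[] = [
  ...tagChunks(['My post title'], 'title'),
  ...tagChunks(['First paragraph', 'Second paragraph'], 'content'),
]
```

Search results would then carry sourceField alongside chunkText, and it could be filtered on like any other extension field.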

✨ New Features

1. Extension Fields

Add custom fields to the embeddings collection schema and persist values per chunk:

collections: {
  posts: {
    toKnowledgePool: postsToKnowledgePool,
    extensionFields: [
      { name: 'category', type: 'text' },
      { name: 'priority', type: 'number' },
    ],
  },
}

Extension fields are:

  • Added to the embeddings table schema automatically
  • Stored with each chunk when vectorizing
  • Returned in search results
  • Queryable via the where clause (see below)

2. Filterable Vector Search

The vector search endpoint now accepts Payload-style where clauses and a limit parameter:

const response = await fetch('/api/vector-search', {
  method: 'POST',
  body: JSON.stringify({
    query: 'machine learning',
    knowledgePool: 'main',
    where: {
      category: { equals: 'guides' },
      priority: { greater_than_equal: 3 },
    },
    limit: 5,
  }),
})

Supported operators:

  • equals, not_equals / notEquals
  • in, not_in / notIn
  • like, contains
  • greater_than / greaterThan, greater_than_equal / greaterThanEqual
  • less_than / lessThan, less_than_equal / lessThanEqual
  • exists (null checks)
  • Nested and / or conditions

You can filter on:

  • Default embedding columns: sourceCollection, docId, chunkIndex, chunkText, embeddingVersion
  • Any extension fields you've defined
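
A sketch of a request body that uses nested conditions on both a default column and extension fields. It assumes the and / or keys take arrays of conditions, as in Payload's own where syntax; the category and priority fields are the example extension fields from above.

```typescript
// Matches chunks from the posts collection that are either in one of two
// categories or have a priority of at least 8.
const searchBody = {
  query: 'machine learning',
  knowledgePool: 'main',
  where: {
    and: [
      { sourceCollection: { equals: 'posts' } },
      {
        or: [
          { category: { in: ['guides', 'tutorials'] } },
          { priority: { greater_than_equal: 8 } },
        ],
      },
    ],
  },
  limit: 10,
}

// Serialized form, ready to send as the POST body.
const requestJson = JSON.stringify(searchBody)
```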

3. Improved Chunking Control

The toKnowledgePool function gives you complete control over:

  • What gets chunked: Combine any fields, transform content, or skip fields entirely
  • How it's chunked: Use different chunkers for different parts of a document
  • Metadata per chunk: Attach extension field values that vary per chunk

Example: Chunk a blog post's title separately from its content, and attach different metadata to each:

const postsToKnowledgePool: ToKnowledgePoolFn = async (doc, payload) => {
  const entries = []
  
  // Title chunks get high priority
  const titleChunks = await chunkText(doc.title ?? '', payload)
  titleChunks.forEach(chunk => 
    entries.push({ 
      chunk, 
      category: doc.category,
      priority: 10, // High priority for titles
    })
  )
  
  // Content chunks get normal priority
  const contentChunks = await chunkRichText(doc.content, payload)
  contentChunks.forEach(chunk => 
    entries.push({ 
      chunk, 
      category: doc.category,
      priority: doc.priority ?? 0,
    })
  )
  
  return entries
}

🔧 Technical Improvements

Drizzle ORM Integration

  • Eliminated raw SQL: All queries now use Drizzle's query builder and type-safe functions
  • Uses public API only: No reliance on Drizzle's private _ properties, ensuring forward compatibility
  • Better type safety: Leverages Drizzle's type system throughout
  • Maintainable: Easier to understand and modify query logic

Custom WHERE Clause Converter

Implemented a custom convertWhereToDrizzle function that:

  • Converts Payload's Where objects to Drizzle conditions
  • Handles all common operators and nested and/or logic
  • Works with both default columns and extension fields
  • Provides clear error messages for invalid queries
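
To make the operator semantics concrete, here is a small in-memory model that evaluates a where clause against a plain object. This is not the plugin's convertWhereToDrizzle (which produces Drizzle conditions rather than booleans); it covers only a subset of the operators, and treating contains as a case-insensitive substring match is an assumption.

```typescript
type Row = Record<string, unknown>

interface Where {
  and?: Where[]
  or?: Where[]
  [field: string]: unknown
}

// Non-numbers become NaN, so every comparison against them is false.
const asNumber = (value: unknown): number => (typeof value === 'number' ? value : NaN)

function matchesOperator(value: unknown, operator: string, operand: unknown): boolean {
  switch (operator) {
    case 'equals':
      return value === operand
    case 'not_equals':
    case 'notEquals':
      return value !== operand
    case 'in':
      return Array.isArray(operand) && operand.includes(value)
    case 'not_in':
    case 'notIn':
      return Array.isArray(operand) && !operand.includes(value)
    case 'greater_than':
    case 'greaterThan':
      return asNumber(value) > asNumber(operand)
    case 'greater_than_equal':
    case 'greaterThanEqual':
      return asNumber(value) >= asNumber(operand)
    case 'less_than':
    case 'lessThan':
      return asNumber(value) < asNumber(operand)
    case 'less_than_equal':
    case 'lessThanEqual':
      return asNumber(value) <= asNumber(operand)
    case 'exists':
      return (value !== null && value !== undefined) === Boolean(operand)
    case 'contains':
      return (
        typeof value === 'string' &&
        typeof operand === 'string' &&
        value.toLowerCase().includes(operand.toLowerCase())
      )
    default:
      throw new Error(`Unsupported operator: ${operator}`)
  }
}

// Every key in a where object must hold; and / or recurse into sub-clauses.
function matchesWhere(row: Row, where: Where): boolean {
  return Object.entries(where).every(([key, value]) => {
    if (key === 'and') return (value as Where[]).every((sub) => matchesWhere(row, sub))
    if (key === 'or') return (value as Where[]).some((sub) => matchesWhere(row, sub))
    return Object.entries(value as Record<string, unknown>).every(([operator, operand]) =>
      matchesOperator(row[key], operator, operand),
    )
  })
}
```

Sibling keys in one where object combine with AND, which is why the earlier example with category and priority side by side requires both conditions to hold.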

Dynamic Table Registration

The plugin now dynamically generates Drizzle table definitions during schema initialization and stores them in a registry. This allows:

  • Direct access to table columns without introspection
  • Type-safe column references
  • Clean separation between schema definition and query building

📝 Migration Guide

Step 1: Update Your Collection Configuration

Replace field-based chunking with toKnowledgePool functions:

// OLD
collections: {
  posts: {
    fields: {
      title: { chunker: chunkText },
      content: { chunker: chunkRichText },
    },
  },
}

// NEW
collections: {
  posts: {
    toKnowledgePool: async (doc, payload) => {
      const entries = []
      const titleChunks = await chunkText(doc.title ?? '', payload)
      titleChunks.forEach(chunk => entries.push({ chunk }))
      
      const contentChunks = await chunkRichText(doc.content, payload)
      contentChunks.forEach(chunk => entries.push({ chunk }))
      
      return entries
    },
  },
}

Step 2: Update Search Result Handling

Remove any code that references fieldPath:

// OLD
results.forEach(result => {
  console.log(`Field: ${result.fieldPath}`) // ❌ No longer exists
})

// NEW
results.forEach(result => {
  console.log(`Chunk: ${result.chunkText}`)
  // Use extension fields if you need metadata
  if (result.category) {
    console.log(`Category: ${result.category}`)
  }
})

Step 3: (Optional) Add Extension Fields

If you want to store and query custom metadata:

collections: {
  posts: {
    toKnowledgePool: postsToKnowledgePool,
    extensionFields: [
      { name: 'category', type: 'text' },
      { name: 'priority', type: 'number' },
    ],
  },
}

Then update your toKnowledgePool function to return these values:

const postsToKnowledgePool: ToKnowledgePoolFn = async (doc, payload) => {
  return [
    { chunk: '...', category: doc.category, priority: doc.priority },
    // ...
  ]
}

Step 4: Re-vectorize Your Content

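After updating your configuration, you'll need to re-vectorize existing documents; embeddings created under v0.2.x are not rewritten until their source document is updated. On each document update, the plugin will automatically: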

  • Delete old embeddings when documents are updated
  • Create new embeddings with the updated structure
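
Since re-vectorization happens on update, one way to refresh a whole collection is to page through it and update each document. The sketch below is written against a minimal PayloadLike interface that mirrors the find and update calls of Payload's Local API; the resaveAll helper is hypothetical, and whether an update with empty data triggers vectorization in your setup is an assumption you should verify on a small collection first.

```typescript
// Minimal structural type covering only the calls this sketch needs.
interface PayloadLike {
  find(args: { collection: string; limit: number; page: number }): Promise<{
    docs: { id: string | number }[]
    hasNextPage: boolean
  }>
  update(args: {
    collection: string
    id: string | number
    data: Record<string, unknown>
  }): Promise<unknown>
}

// Updates every document in the collection, one page at a time,
// and returns the number of documents touched.
async function resaveAll(
  payload: PayloadLike,
  collection: string,
  pageSize = 50,
): Promise<number> {
  let page = 1
  let count = 0
  while (true) {
    const { docs, hasNextPage } = await payload.find({ collection, limit: pageSize, page })
    for (const doc of docs) {
      // Empty data: the goal is only to trigger the update path.
      await payload.update({ collection, id: doc.id, data: {} })
      count += 1
    }
    if (!hasNextPage) break
    page += 1
  }
  return count
}
```

Documents are updated sequentially to keep the load on the embedding provider predictable; batching with Promise.all is possible but may hit rate limits.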

🧪 Testing

  • ✅ All existing tests updated and passing
  • ✅ New test suite for extension fields (dev/specs/extensionFields.spec.ts)
  • ✅ Expanded vector search tests with WHERE clause filtering
  • ✅ Integration tests verify end-to-end functionality

📚 Documentation

  • Updated README with new API examples
  • Added comprehensive CHANGELOG.md
  • Migration guide included in CHANGELOG
  • All code examples updated to reflect new API

Full Changelog: See CHANGELOG.md for complete details.

@techiejd techiejd merged commit 78b1409 into main Nov 19, 2025
1 check passed