
Conversation

@Davidyz (Owner) commented Jun 9, 2025

This PR aims to implement summarisation for retrieval results.

This would:

  1. Use fewer tokens in the main chat because long documents can be replaced by their summaries.
  2. Allow users to use a smaller/cheaper model/adapter for the summarisation, and hence save on cost.
  • Refactor the existing code to move the VectorCode.Result handling into a function
  • Accept an adapter as a config option
  • Implement the summarisation request
    • make it not block the main thread
  • Implement a thresholding mechanism that provides a callback that decides whether summarisation should kick in for each tool call

Example config:

opts.extensions.vectorcode = {
  ---@type VectorCode.CodeCompanion.ExtensionOpts
  opts = {
    tool_opts = {
      query = {
        summarise = {
          ---@type boolean|fun(chat: CodeCompanion.Chat, results: VectorCode.QueryResult[]):boolean
          enabled = true,
          adapter = function()
            return require("codecompanion.adapters").extend("gemini", {
              name = "Summariser",
              schema = {
                model = { default = "gemini-2.0-flash-lite" },
              },
              opts = { stream = false },
            })
          end,
        },
      },
    },
  },
}

[image]

Related PR:

@Davidyz added the enhancement and feature labels on Jun 9, 2025
codecov bot commented Jun 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.49%. Comparing base (b3a8fa2) to head (9ff39fd).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #179   +/-   ##
=======================================
  Coverage   99.49%   99.49%           
=======================================
  Files          21       21           
  Lines        1589     1589           
=======================================
  Hits         1581     1581           
  Misses          8        8           

☔ View full report in Codecov by Sentry.

@Davidyz force-pushed the nvim/result_summary branch from eba101d to 6ce9a03 on June 9, 2025 12:12
@Davidyz (Owner, Author) commented Jun 10, 2025

I've managed to make requests from an adapter, but handling the async requests from a sync context is TRICKY. @olimorris any suggestions on how I might be able to simplify this?

@olimorris (Contributor) commented:

Could you hook into CodeCompanion's event system and listen for CodeCompanionRequestStarted and CodeCompanionRequestFinished?

Alternatively, we could add a sync method on CodeCompanion's http.lua file, something like :request_sync. I think I've mentioned on a few posts that I'm looking to implement a background strategy that will make it much easier for external plugins to leverage the adapters and http module to make calls to LLMs. I intend on adding a sync method for that too.
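
For illustration, a minimal sketch of listening for those events, assuming they are fired as User autocommands like CodeCompanion's other events (the callback body is a placeholder):

vim.api.nvim_create_autocmd("User", {
  pattern = { "CodeCompanionRequestStarted", "CodeCompanionRequestFinished" },
  callback = function(event)
    if event.match == "CodeCompanionRequestFinished" then
      -- placeholder: resume whatever was waiting on the summarisation request
    end
  end,
})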

@Davidyz (Owner, Author) commented Jun 10, 2025

Could you hook into CodeCompanion's event system and listen for CodeCompanionRequestStarted and CodeCompanionRequestFinished?

I hadn't thought of that; I'll look into it. I'll probably still need to work out how to make the wait non-blocking for the main thread, though.

I think I've mentioned on a few posts that I'm looking to implement a background strategy that will make it much easier for external plugins to leverage the adapters and http module to make calls to LLMs. I intend on adding a sync method for that too.

That would be very nice to have for this PR. For me, the really tricky bit is making it NOT block the main UI. To be fair, I feel like I've been spoiled by modern async like Python's asyncio, and I have no idea how to work with coroutines directly 😭

@olimorris (Contributor) commented:

That would be very nice to have for this PR. For me, the really tricky bit is making it NOT block the main UI. To be fair, I feel like I've been spoiled by modern async like Python's asyncio, and I have no idea how to work with coroutines directly

It's high up on my list after I've got the agent mode sorted in CodeCompanion. I've had this exact conversation with sooooo many LLMs over the last 12 months 😆.

I took a lot of inspiration from the lua-async-await library some time ago. I never ended up using it but what she's done in 90 LOC blew my mind.
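
For context, a tiny illustration of the coroutine trick that libraries like lua-async-await build on (not their actual code, and assuming the callback always fires asynchronously): a callback-style function is "awaited" by yielding the current coroutine and resuming it from the callback.

-- Run `fn` inside a coroutine so it can use `await` without blocking the editor.
local function async(fn)
  return function(...) coroutine.wrap(fn)(...) end
end

-- Await a callback-style function of the form `async_fn(arg, callback)`.
local function await(async_fn, arg)
  local co = assert(coroutine.running(), "await() must run inside a coroutine")
  async_fn(arg, function(...)
    coroutine.resume(co, ...) -- hand control back once the callback fires
  end)
  return coroutine.yield()    -- suspend here until the callback resumes us
end

-- Usage: a callback-style sleep, awaited without freezing the UI.
local function sleep(ms, cb) vim.defer_fn(cb, ms) end
local demo = async(function()
  await(sleep, 100)
  print("100ms later, and the main thread was never blocked")
end)
demo()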

@Davidyz force-pushed the nvim/result_summary branch 5 times, most recently from 0ced3a4 to 1c562a1, on June 17, 2025 01:42
@Davidyz force-pushed the nvim/result_summary branch from 1c562a1 to 86b59b4 on June 21, 2025 04:53
@Davidyz (Owner, Author) commented Jun 21, 2025

@olimorris I've managed to implement this without blocking the main UI by putting the summarisation logic into the cmds function instead of the output handler. This way we can take advantage of the async tool callback and use the existing async request. The tradeoff is, apparently, deeper nested callbacks 😢 Also, we'd still need to work out some sort of throttling. Maybe this should be done in CodeCompanion, so that the adapter (if reused by different extensions) doesn't hit the rate limit too often.
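
For illustration, the rough shape of that control flow, with hypothetical stand-ins for the real VectorCode/CodeCompanion pieces (the cmds signature and the callback payload below are simplified assumptions, not the actual tool code):

-- Hypothetical async summariser: sends the results to the configured adapter
-- and invokes `on_done` with the summary once the response arrives.
local function summarise_async(results, on_done)
  vim.defer_fn(function() -- stand-in for the real asynchronous HTTP request
    on_done(("summary of %d results"):format(#results))
  end, 0)
end

-- Sketch of a cmds-style step: `cb` is the async completion callback, so the
-- tool only reports back once the summary is ready and the UI never blocks.
local function query_cmd(args, cb)
  local results = { "chunk 1", "chunk 2" } -- stand-in for the retrieval results
  summarise_async(results, function(summary)
    cb({ status = "success", data = summary }) -- payload shape is an assumption
  end)
end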

@Davidyz (Owner, Author) commented Jun 21, 2025

Or, we could move most of the result handling from the output handler to the cmds function, concatenate all results, and send one single request to the summariser... I'm not sure about this.

@Davidyz changed the title from "[WIP] Summarised retrieval results in CodeCompanion.nvim tool" to "Optional retrieval result summarisation in CodeCompanion.nvim query tool" on Jun 22, 2025
@Davidyz marked this pull request as ready for review on June 22, 2025 06:51
@Davidyz (Owner, Author) commented Jun 22, 2025

I've managed to work around the rate limit by including the full results in one request (obviously, this needs the summariser to be good at long context, but it's MUCH easier to implement than a rate limiter).
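
For illustration, the single-request idea boils down to something like the sketch below (the `path` and `document` field names are assumptions about the query result shape):

-- Concatenate every retrieved document into one prompt so that only a single
-- summarisation request is sent, sidestepping per-request rate limits.
local function build_summary_prompt(results)
  local parts = {}
  for _, result in ipairs(results) do
    table.insert(parts, ("<file path=%q>\n%s\n</file>"):format(result.path, result.document))
  end
  return table.concat(parts, "\n\n")
end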

@Davidyz (Owner, Author) commented Jun 22, 2025

@ravitemer, any suggestions on this feature? I'm asking because you've also done summarisation (from a different perspective), and maybe you can spot something I'm missing?

@Davidyz force-pushed the nvim/result_summary branch from 4e90c51 to b517c76 on June 22, 2025 09:31
@ravitemer commented:

@Davidyz This looks amazing! I didn't follow the previous commits but the current implementation seems solid.

From a user's perspective, when I want to go through some repo and let the LLM get an overview (kind of a repo map, but better), I would certainly use the summary feature. The only thing I can think of is whether you can make the summary option dynamic, through some variable or by adding a summarize field to the query tool, so that the LLM decides whether it needs just an overview (to understand the project) or the accurate file content (to do some edits).

And just an observation: another edge case might be managing context window and max_tokens limits. As you know, for the history summarization (which is less complex than this) we have a hard limit for each summary request, and if there are messages left over we prepend the generated summary to the remaining messages to generate the final summary. That strategy might not work here, I think. We could tweak the max results, or maybe split the files into batches of, say, 5 files per chunk, send multiple requests, and combine all the summaries? I know this adds complexity, and I'm totally okay with not having this at all!
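
Concretely, the batching idea could look roughly like the sketch below (purely illustrative; the batch size of 5 is just the number mentioned above):

-- Split the retrieval results into fixed-size batches so that each summary
-- request stays within the summariser's context window; the per-batch
-- summaries would then be combined into a final summary.
local function batch_results(results, batch_size)
  local batches = {}
  for i = 1, #results, batch_size do
    local batch = {}
    for j = i, math.min(i + batch_size - 1, #results) do
      table.insert(batch, results[j])
    end
    table.insert(batches, batch)
  end
  return batches
end

-- e.g. batch_results(query_results, 5) -> one summarisation request per batch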

@Davidyz (Owner, Author) commented Jun 22, 2025

another edge case might be managing context window and max_tokens limits

I thought this could be done through the adapter configuration, so I didn't do it here. The OpenAI API, for example, offers a max_tokens parameter that caps the number of tokens generated. I prefer to have a single source of truth, so I intentionally chose not to implement my own hard limit.
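
Concretely, that could look something like the snippet below, based on the example config from the PR description; whether the output-limit key is actually called max_tokens for a given adapter/model is an assumption here, mirroring the OpenAI-style parameter:

summarise = {
  enabled = true,
  adapter = function()
    return require("codecompanion.adapters").extend("gemini", {
      name = "Summariser",
      schema = {
        model = { default = "gemini-2.0-flash-lite" },
        max_tokens = { default = 1024 }, -- assumed key name; adjust per adapter
      },
      opts = { stream = false },
    })
  end,
},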

We could tweak the max results, or maybe split the files into batches of, say, 5 files per chunk, send multiple requests, and combine all the summaries?

In the initial iterations, I sent a request for each result (document or chunk). This simply doesn't work out of the box because of rate limits (imagine 50 simultaneous requests hitting a server with a rate limit of 10 per minute). I'm open to the possibility, but it'll be very tricky to implement, and I'll have to think about it. Maybe this should be upstreamed, so that each adapter instance has its own debounce counter and multiple requests can't all fire at the same time. With more extensions making their own requests (outside of the chat buffer itself), I think this would actually make sense.

@Davidyz (Owner, Author) commented Jun 22, 2025

The only thing I can think of is whether you can make the summary option dynamic, through some variable or by adding a summarize field to the query tool, so that the LLM decides whether it needs just an overview (to understand the project) or the accurate file content (to do some edits).

As for this one, currently there's the enabled option that can be a function (see the type annotation), which allows you to write some custom logic (for example, a hard switch based on the length of the retrieval results) to turn the summarisation on or off. I'm not so sure about letting the LLM decide this. To my knowledge, LLMs don't usually have a good understanding of how much of their context window has been used. There are also providers that automatically truncate the input, which makes matters even worse.
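
A minimal sketch of such an enabled callback, using the signature from the type annotation in the example config (the document field name and the 20000-character threshold are illustrative assumptions):

summarise = {
  ---@type boolean|fun(chat: CodeCompanion.Chat, results: VectorCode.QueryResult[]): boolean
  enabled = function(chat, results)
    -- Only summarise when the combined retrieval results are long enough
    -- to be worth the extra request.
    local total = 0
    for _, result in ipairs(results) do
      total = total + #(result.document or "")
    end
    return total > 20000
  end,
},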

@ravitemer commented:

currently there's the enabled option that can be a function (see the type annotation),

Thanks. Didn't see that. That solves it then.

To my knowledge, LLMs don't usually have a good understanding of how much of their context window has been used. There are also providers that automatically truncate the input, which makes matters even worse.

Agreed. It looks solid to me.

@Davidyz (Owner, Author) commented Jun 23, 2025

In a quick (non-rigorous) test, the summarisation reduced the query result from a 50k+ character string to an 18k+ character string, which is roughly a 60% reduction in the token count for the tool result!

@Davidyz merged commit bb3d169 into main on Jun 23, 2025 (13 checks passed)
@Davidyz deleted the nvim/result_summary branch on June 23, 2025 11:53