Merged
22 changes: 9 additions & 13 deletions README.md
@@ -7,18 +7,13 @@
VectorCode is a code repository indexing tool. It helps you build better prompts
for your coding LLMs by indexing and providing information about the code
repository you're working on. This repository also contains the corresponding
neovim plugin because that's what I used to write this tool.
neovim plugin that provides a set of APIs for you to build or enhance AI plugins,
and integrations for some of the popular plugins.

> [!NOTE]
> This project is of beta quality and is undergoing rapid iteration.
> I know there is plenty of room for improvement, and any help is welcome.

> [!NOTE]
> [Chromadb](https://www.trychroma.com/), the vector database backend behind
> this project, supports multiple embedding engines. I developed this tool using
> SentenceTransformer, but if you encounter any issues with a different embedding
> function, please open an issue (or even better, a pull request :D).

<!-- mtoc-start -->

* [Why VectorCode?](#why-vectorcode)
@@ -37,22 +32,23 @@ releases. Their capabilities on these projects are quite limited. With
VectorCode, you can easily (and programmatically) inject task-relevant context
from the project into the prompt. This significantly improves the quality of the
model output and reduces hallucination.
![](./images/codecompanion_chat.png)

[![asciicast](https://asciinema.org/a/8WP8QJHNAR9lEllZSSx3poLPD.svg)](https://asciinema.org/a/8WP8QJHNAR9lEllZSSx3poLPD?t=3)

## Documentation

> [!NOTE]
> The documentation on the `main` branch reflects the code on the latest commit
> (apologies if I forget to update the docs, but this will be what I aim for). To
> check for the documentation for the version you're using, you can [check out
> The documentation on the `main` branch reflects the code on the latest commit.
> To check for the documentation for the version you're using, you can [check out
> the corresponding tags](https://github.com/Davidyz/VectorCode/tags).

- For the setup and usage of the command-line tool, see [the CLI documentation](./docs/cli.md);
- For neovim users, after you've gone through the CLI documentation, please refer to
[the neovim plugin documentation](./docs/neovim.md) for further instructions.
- Additional resources:
- the [wiki](https://github.com/Davidyz/VectorCode/wiki) for extra tricks and
tips that will help you get the most out of VectorCode;
tips that will help you get the most out of VectorCode, as well as
instructions to set up VectorCode to work with some other neovim plugins;
- the [discussions](https://github.com/Davidyz/VectorCode/discussions) where
you can ask general questions and share your cool uses of VectorCode.

@@ -98,7 +94,7 @@ This project follows an adapted semantic versioning:
- [ ] ability to view and delete files in a collection (atm you can only `drop`
and `vectorise` again);
- [x] joint search (kinda, using codecompanion.nvim/MCP);
- [ ] Nix support (#144);
- [x] Nix support (unofficial packages [here](https://search.nixos.org/packages?channel=unstable&from=0&size=50&sort=relevance&type=packages&query=vectorcode));
- [ ] Query rewriting (#124).


111 changes: 53 additions & 58 deletions doc/VectorCode-cli.txt
@@ -121,8 +121,7 @@ significantly reduce the IO overhead and avoid potential race condition.


If you’re setting up a standalone ChromaDB server, I recommend sticking to
v0.6.3. ChromaDB recently released v1.0.0, which may not work with VectorCode.
I’m testing with v1.0.0 and will publish a new release when it’s ready.
v0.6.3, because VectorCode is not ready for the upgrade to ChromaDB 1.0 yet.

FOR WINDOWS USERS ~

@@ -146,6 +145,8 @@ NIX ~

A community-maintained Nix package is available here
<https://search.nixos.org/packages?channel=unstable&from=0&size=50&sort=relevance&type=packages&query=vectorcode>.
If you’re using nix to install a standalone Chromadb server, make sure to
stick to 0.6.3 <https://github.com/NixOS/nixpkgs/pull/412528>.


GETTING STARTED *VectorCode-cli-vectorcode-command-line-tool-getting-started*
@@ -212,7 +213,7 @@ REFRESHING EMBEDDINGS ~

To maintain the accuracy of the vector search, it’s important to keep your
embeddings up-to-date. You can simply run the `vectorise` subcommand on a file
to refresh the embedding for a particular file, and the CLI provides a
to refresh the embedding for that file. Apart from that, the CLI provides a
`vectorcode update` subcommand, which updates the embeddings for all files that
are currently indexed by VectorCode for the current project.

@@ -241,8 +242,8 @@ For each project, VectorCode creates a collection (similar to tables in
traditional databases) and puts the code embeddings in the corresponding
collection. In the root directory of a project, you may run `vectorcode init`.
This will initialise the repository with a subdirectory
`project_root/.vectorcode/`. This will mark this directory a _project root_, a
concept that will later be used to construct the collection. You may put a
`project_root/.vectorcode/`. This will mark this directory as a _project root_,
a concept that will later be used to construct the collection. You may put a
`config.json` file in `project_root/.vectorcode`. This file may be used to
store project-specific settings such as embedding functions and database entry
point (more on this later). If you already have a global configuration file at
@@ -272,31 +273,22 @@ hooks. The `init` subcommand provides a `--hooks` flag which helps you manage
hooks when working with a git repository. You can put some custom hooks in
`~/.config/vectorcode/hooks/` and the `vectorcode init --hooks` command will
pick them up and append them to your existing hooks, or create new hook scripts
if they don’t exist yet. The hook files should be named the same as they
would be under the `.git/hooks` directory. For example, a pre-commit hook would
be named `~/.config/vectorcode/hooks/pre-commit`.
if they don’t exist yet. The custom hook files should be named the same as
they would be under the `.git/hooks` directory. For example, a pre-commit hook
would be named `~/.config/vectorcode/hooks/pre-commit`.

By default, there are 2 pre-defined hooks:

>bash
# pre-commit hook that vectorise changed files before you commit.
diff_files=$(git diff --cached --name-only)
[ -z "$diff_files" ] || vectorcode vectorise $diff_files
<
1. A pre-commit hook that vectorises the modified files.
2. A post-checkout hook that:
   - vectorises the full repository if it’s an initial commit/clone and a
     `vectorcode.include` spec is available (either locally in the project or
     globally);
   - vectorises the files changed by the checkout.


>bash
# post-checkout hook that vectorise changed files when you checkout to a
# different branch/tag/commit
files=$(git diff --name-only "$1" "$2")
[ -z "$files" ] || vectorcode vectorise $files
<

When you run `vectorcode init --hooks` in a git repo, these 2 hooks will be
added to your `.git/hooks/`. Hooks that are managed by VectorCode will be
wrapped by `# VECTORCODE_HOOK_START` and `# VECTORCODE_HOOK_END` comment lines.
They help VectorCode determine whether hooks have been added, so don’t delete
the markers unless you know what you’re doing. To remove the hooks, simply
delete the lines wrapped by these 2 comment strings.
Both hooks will only be triggered on repositories that have a `.vectorcode`
directory in them.
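
As a rough illustration of what the pre-commit hook described above does, here is a minimal Python sketch. It assumes the hook only needs the staged file list from `git diff --cached --name-only` and a check for the `.vectorcode` directory. The helper names (`has_vectorcode_dir`, `build_vectorise_command`, `staged_files`, `run_pre_commit`) are hypothetical and are not part of VectorCode; the hooks VectorCode actually installs may differ.

```python
import subprocess
from pathlib import Path


def has_vectorcode_dir(repo_root: str) -> bool:
    """Return True if the repository contains a `.vectorcode` directory."""
    return (Path(repo_root) / ".vectorcode").is_dir()


def build_vectorise_command(files: list[str]) -> list[str]:
    """Build the `vectorcode vectorise` invocation for the given files.

    Returns an empty list when there is nothing to vectorise.
    """
    files = [f for f in files if f.strip()]
    if not files:
        return []
    return ["vectorcode", "vectorise", *files]


def staged_files(repo_root: str) -> list[str]:
    """List the files staged for commit, one path per entry."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return out.splitlines()


def run_pre_commit(repo_root: str) -> None:
    """Vectorise staged files, but only in repositories with `.vectorcode`."""
    if not has_vectorcode_dir(repo_root):
        return
    command = build_vectorise_command(staged_files(repo_root))
    if command:
        subprocess.run(command, cwd=repo_root, check=True)
```

Passing the file names as a list, rather than through an unquoted shell variable, keeps paths that contain spaces intact.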


CONFIGURING VECTORCODE ~
Expand Down Expand Up @@ -328,31 +320,32 @@ model_name="nomic-embed-text")`. Default: `{}`; - `db_url`string, the url that
points to the Chromadb server. VectorCode will start an HTTP server for
Chromadb at a randomly picked free port on `localhost` if your configured
`http://host:port` is not accessible. Default: `http://127.0.0.1:8000`; -
`db_path`string, Path to local persistent database. This is where the files for
your database will be stored. Default: `~/.local/share/vectorcode/chromadb/`; -
`db_log_path`string, path to the _directory_ where the built-in chromadb server
will write the log to. Default: `~/.local/share/vectorcode/`; -
`chunk_size`integer, the maximum number of characters per chunk. A larger value
reduces the number of items in the database, and hence accelerates the search,
but at the cost of potentially truncated data and lost information. Default:
`2500`. To disable chunking, set it to a negative number; -
`overlap_ratio`float between 0 and 1, the ratio of overlapping/shared content
between 2 adjacent chunks. A larger ratio improves the coherences of chunks,
but at the cost of increasing number of entries in the database and hence
slowing down the search. Default: `0.2`. _Starting from 0.4.11, VectorCode will
use treesitter to parse languages that it can automatically detect. It uses
pygments to guess the language from filename, and tree-sitter-language-pack to
fetch the correct parser. overlap_ratio has no effects when treesitter works.
If VectorCode fails to find an appropriate parser, it’ll fallback to the
legacy naive parser, in which case overlap_ratio works exactly in the same way
as before;_ - `query_multiplier`integer, when you use the `query` command to
retrieve `n` documents, VectorCode will check `n * query_multiplier` chunks and
return at most `n` documents. A larger value of `query_multiplier` guarantees
the return of `n` documents, but with the risk of including too many
less-relevant chunks that may affect the document selection. Default: `-1` (any
negative value means selecting documents based on all indexed chunks); -
`reranker`string, the reranking method to use. Currently supports
`CrossEncoderReranker` (default, using sentence-transformers cross-encoder
`db_path`string, Path to local persistent database. If you didn’t set up a
standalone Chromadb server, this is where the files for your database will be
stored. Default: `~/.local/share/vectorcode/chromadb/`; - `db_log_path`string,
path to the _directory_ where the built-in chromadb server will write the log
to. Default: `~/.local/share/vectorcode/`; - `chunk_size`integer, the maximum
number of characters per chunk. A larger value reduces the number of items in
the database, and hence accelerates the search, but at the cost of potentially
truncated data and lost information. Default: `2500`. To disable chunking, set
it to a negative number; - `overlap_ratio`float between 0 and 1, the ratio of
overlapping/shared content between 2 adjacent chunks. A larger ratio improves
the coherence of chunks, but at the cost of increasing number of entries in the
database and hence slowing down the search. Default: `0.2`. _Starting from
0.4.11, VectorCode will use treesitter to parse languages that it can
automatically detect. It uses pygments to guess the language from filename, and
tree-sitter-language-pack to fetch the correct parser. overlap_ratio has no
effects when treesitter works. If VectorCode fails to find an appropriate
parser, it’ll fallback to the legacy naive parser, in which case
overlap_ratio works exactly in the same way as before;_ -
`query_multiplier`integer, when you use the `query` command to retrieve `n`
documents, VectorCode will check `n * query_multiplier` chunks and return at
most `n` documents. A larger value of `query_multiplier` guarantees the return
of `n` documents, but with the risk of including too many less-relevant chunks
that may affect the document selection. Default: `-1` (any negative value means
selecting documents based on all indexed chunks); - `reranker`string, the
reranking method to use. Currently supports `CrossEncoderReranker` (default,
using sentence-transformers cross-encoder
<https://sbert.net/docs/package_reference/cross_encoder/cross_encoder.html> )
and `NaiveReranker` (sort chunks by the "distance" between the embedding
vectors); - `reranker_params`dictionary, similar to `embedding_params`. The
@@ -361,17 +354,16 @@ these are the options passed to the `CrossEncoder`
<https://sbert.net/docs/package_reference/cross_encoder/cross_encoder.html#id1>
class. For example, if you want to use a non-default model, you can use the
following: `json { "reranker_params": { "model_name_or_path": "your_model_here"
} }` ; - `db_settings`dictionary, works in a similar way to `embedding_params`,
} }` - `db_settings`dictionary, works in a similar way to `embedding_params`,
but for Chromadb client settings so that you can configure authentication for
remote Chromadb <https://docs.trychroma.com/production/administration/auth>; -
`hnsw`a dictionary of hnsw settings
<https://cookbook.chromadb.dev/core/configuration/#hnsw-configuration> that may
improve the query performances or avoid runtime errors during queries. **It’s
recommended to re-vectorise the collection after modifying these options,
because some of the options can only be set during collection creation.**
Example: `json5 // the following is the default value. "hnsw": { "hnsw:M": 64,
}` - `filetype_map``dict[str, list[str]]`, a dictionary where keys are language
name
Example (and default): `json5 "hnsw": { "hnsw:M": 64, }` -
`filetype_map``dict[str, list[str]]`, a dictionary where keys are language name
<https://github.com/Goldziher/tree-sitter-language-pack?tab=readme-ov-file#available-languages>
and values are lists of Python regex patterns
<https://docs.python.org/3/library/re.html> that will match file extensions.
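
To make the interaction between `chunk_size` and `overlap_ratio` more concrete, the following Python sketch shows one way a naive character-based chunker could behave. It is an assumption for illustration only, not VectorCode's actual chunker: the function name `chunk_text` and the choice of stepping by `chunk_size * (1 - overlap_ratio)` characters are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 2500, overlap_ratio: float = 0.2) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Adjacent chunks share roughly chunk_size * overlap_ratio characters.
    A negative chunk_size disables chunking, as the option description states.
    """
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    if chunk_size == 0:
        raise ValueError("chunk_size must be non-zero")
    if not text:
        return []
    if chunk_size < 0 or len(text) <= chunk_size:
        return [text]

    # Assumed stepping rule: advance by the non-overlapping part of a chunk.
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

Under this sketch, a larger `overlap_ratio` shrinks the step and therefore produces more chunks for the same input, which matches the trade-off described above: better continuity between chunks at the cost of more database entries.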
@@ -566,7 +558,7 @@ the `VECTORCODE_LOG_LEVEL` variable to one of `ERROR`, `WARN` (`WARNING`),
`INFO` or `DEBUG`. For the CLI that you interact with in your shell, this will
output logs to `STDERR` and write a log file to
`~/.local/share/vectorcode/logs/`. For LSP and MCP servers, because `STDIO` is
used for the RPC, only the log file will be written.
used for the RPC, the logs will only be written to the log file, not `STDERR`.

For example:

@@ -575,6 +567,9 @@ For example:
<


Depending on the MCP/LSP client implementation, you may need to take extra
steps to make sure the environment variables are captured by VectorCode.
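
If you launch the server from your own client code, one way to make sure the variable reaches VectorCode is to pass the environment explicitly when spawning the process. The sketch below is an assumption about how a client might do this in Python; `env_with_log_level` and `launch_server` are hypothetical helpers, and upper-casing the level is a convenience of the sketch, not documented VectorCode behaviour.

```python
import os
import subprocess

# Levels listed in the documentation above.
VALID_LEVELS = {"ERROR", "WARN", "WARNING", "INFO", "DEBUG"}


def env_with_log_level(level: str, base_env=None) -> dict:
    """Return a copy of the environment with VECTORCODE_LOG_LEVEL set."""
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    env = dict(os.environ if base_env is None else base_env)
    env["VECTORCODE_LOG_LEVEL"] = level
    return env


def launch_server(command: list[str], level: str = "INFO") -> subprocess.Popen:
    """Start a server that talks over stdio, passing the log level explicitly."""
    return subprocess.Popen(
        command,
        env=env_with_log_level(level),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
```

A call such as `launch_server(["vectorcode-mcp-server"], "DEBUG")` would then start the process with the variable set, assuming that is the name of the server executable on your system.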

SHELL COMPLETION *VectorCode-cli-vectorcode-command-line-tool-shell-completion*

VectorCode supports shell completion for bash/zsh/tcsh. You can use `vectorcode
@@ -602,9 +597,9 @@ following options in the JSON config file:
For Intel users, sentence transformer <https://www.sbert.net/index.html>
supports OpenVINO
<https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html>
backend for supported GPU. Run `pipx install vectorcode[intel]` which will
bundle the relevant libraries when you install VectorCode. After that, you will
need to configure `SentenceTransformer` to use `openvino` backend. In your
backend for supported GPUs. Run `uv tool install vectorcode[intel]`, which will
bundle the relevant libraries when you install VectorCode. After that, you will
need to configure `SentenceTransformer` to use the `openvino` backend. In your
`config.json`, set the `backend` key in `embedding_params` to `"openvino"`:

>json