Skip to content

small docs fix.#1

Merged
vincentzed merged 2 commits intomainfrom
vz/010926
Jan 9, 2026
Merged

small docs fix.#1
vincentzed merged 2 commits intomainfrom
vz/010926

Conversation

@vincentzed
Copy link
Owner

@vincentzed vincentzed commented Jan 9, 2026

Summary by cubic

Improved Python bindings docs and packaging for decon. Added clear install steps and a Quickstart, clarified the “decontaminate” package vs “decon” import, linked API reference details, and refined pyproject classifiers/maintainers for better PyPI visibility.

Written for commit 5631890. Summary will update on new commits. Review in cubic

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
@gemini-code-assist
Copy link

Summary of Changes

Hello @vincentzed, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the documentation for the Python bindings of the decon project. The changes aim to provide clearer installation instructions, more comprehensive usage examples, and better-structured API references, making it easier for Python users to get started and understand the library's capabilities.

Highlights

  • Documentation Clarity: Improved the clarity and detail of Python binding documentation across README.md, crates/decon-py/README.md, and doc/python.md to enhance user experience.
  • Installation Instructions: Expanded installation instructions to include uv pip install and recommendations for the datasets library, along with important notes on package vs. import names.
  • Enhanced Code Examples: Updated and expanded code examples in the 'Quickstart' sections to better demonstrate detect, Tokenizer, and clean_text usage.
  • Project Metadata Updates: Updated pyproject.toml with additional author information, refined project classifiers, and standardized URLs for better project discoverability and categorization.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly improves the documentation and packaging for the Python bindings. The updated READMEs are much clearer for new users, with better installation instructions, a quickstart guide, and API reference details. The pyproject.toml changes also enhance the package's metadata on PyPI.

I've left a few minor suggestions to further polish the language in the documentation for clarity and conciseness. I also noticed that crates/decon-py/README.md and doc/python.md have identical content. While this is fine, you might consider symlinking one to the other to avoid maintaining duplicate documentation in the future.

Overall, these are great changes that will improve the developer experience for Python users.

Comment on lines +11 to +13
> The goal of this fork is to simply expose the API transparently for python users.
> Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`.
> Use `pip install decontaminate` to install the package to get started.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This is a helpful note for new users. The wording could be made clearer and more concise. The installation command is also mentioned in the 'Python Quick Start' section, so it might be redundant here. Consider consolidating this information for better readability.

Suggested change
> The goal of this fork is to simply expose the API transparently for python users.
> Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`.
> Use `pip install decontaminate` to install the package to get started.
> The goal of this fork is to provide transparent Python bindings for the core Rust API.
>
> Please note that the package is installed from PyPI as `decontaminate` but imported into Python as `decon`.

Comment on lines +19 to +22
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
`datasets` for easy management:

So in the same environment you can do `pip install datasets`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This section has a typo ('recomend' should be 'recommend') and the phrasing is a bit conversational and split across lines. I suggest rephrasing for clarity and correcting the typo.

Suggested change
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
`datasets` for easy management:
So in the same environment you can do `pip install datasets`.
There are no dependencies installed by default. Since it is common to load datasets from Python, we recommend using the `datasets` library for easy management.
In the same environment, you can run:
`pip install datasets`

```

## API
We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This sentence could be phrased more formally and clearly. 'Loss of quality' is a bit vague, and 'please help report it' is informal.

Suggested change
We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it.
We strive to maintain parity with the Rust API. If you encounter any discrepancies or issues, please report them.

The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).

## Documentation
Please refer to these sections for the full detail of API. `lib.rs`:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This sentence could be slightly rephrased for better flow before the list of items from lib.rs.

Suggested change
Please refer to these sections for the full detail of API. `lib.rs`:
For full API details, please refer to the following sections in `lib.rs`:

- **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()`

Full documentation: https://github.com/vincentzed/decon
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The example ngram_size= could be confusing as it's incomplete. It would be clearer to show it as a keyword argument.

Suggested change
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python).
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable (e.g., `ngram_size` in Rust becomes the `ngram_size` keyword argument in Python).

Comment on lines +19 to +22
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
`datasets` for easy management:

So in the same environment you can do `pip install datasets`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This section has a typo ('recomend' should be 'recommend') and the phrasing is a bit conversational and split across lines. I suggest rephrasing for clarity and correcting the typo.

Suggested change
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
`datasets` for easy management:
So in the same environment you can do `pip install datasets`.
There are no dependencies installed by default. Since it is common to load datasets from Python, we recommend using the `datasets` library for easy management.
In the same environment, you can run:
`pip install datasets`

cleaned = decon.clean_text("Hello, World!") # "hello world"
```

We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This sentence could be phrased more formally and clearly. 'Loss of quality' is a bit vague, and 'please help report it' is informal.

Suggested change
We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it.
We strive to maintain parity with the Rust API. If you encounter any discrepancies or issues, please report them.

The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).

Key sections in `lib.rs`:
Please refer to these sections for the full detail of API. `lib.rs`:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This sentence could be slightly rephrased for better flow before the list of items from lib.rs.

Suggested change
Please refer to these sections for the full detail of API. `lib.rs`:
For full API details, please refer to the following sections in `lib.rs`:

- **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()`

The Rust parameter names map directly to Python kwargs (e.g., `ngram_size` in Rust = `ngram_size=` in Python).
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The example ngram_size= could be confusing as it's incomplete. It would be clearer to show it as a keyword argument.

Suggested change
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python).
The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable (e.g., `ngram_size` in Rust becomes the `ngram_size` keyword argument in Python).

@vincentzed vincentzed merged commit 5881e84 into main Jan 9, 2026
4 checks passed
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 issues found across 4 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="README.md">

<violation number="1" location="README.md:11">
P3: Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.</violation>
</file>

<file name="doc/python.md">

<violation number="1" location="doc/python.md:19">
P2: Spelling error: 'recomend' should be 'recommend'</violation>
</file>

<file name="crates/decon-py/README.md">

<violation number="1" location="crates/decon-py/README.md:19">
P3: Spelling error: 'recomend' should be 'recommend'</violation>

<violation number="2" location="crates/decon-py/README.md:54">
P1: Incorrect relative path: `../crates/decon-py/src/lib.rs` should be `src/lib.rs`</violation>

<violation number="3" location="crates/decon-py/README.md:56">
P3: Grammar issue: 'full detail of API' should be 'full details of the API'</violation>
</file>

<file name="crates/decon-py/pyproject.toml">

<violation number="1" location="crates/decon-py/pyproject.toml:53">
P2: Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

- `detect()`, `review()`, `compare()`, `evals()`, `server()`
- `Tokenizer` (encode/decode with cl100k, o200k, etc.)
- `clean_text()` (text normalization)
The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
Copy link

@cubic-dev-ai cubic-dev-ai bot Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Incorrect relative path: ../crates/decon-py/src/lib.rs should be src/lib.rs

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 54:

<comment>Incorrect relative path: `../crates/decon-py/src/lib.rs` should be `src/lib.rs`</comment>

<file context>
@@ -1,34 +1,61 @@
-- `detect()`, `review()`, `compare()`, `evals()`, `server()`
-- `Tokenizer` (encode/decode with cl100k, o200k, etc.)
-- `clean_text()` (text normalization)
+The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
 
-## Documentation
</file context>
Suggested change
The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](src/lib.rs).
Fix with Cubic

uv pip install decontaminate
```

There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
Copy link

@cubic-dev-ai cubic-dev-ai bot Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Spelling error: 'recomend' should be 'recommend'

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At doc/python.md, line 19:

<comment>Spelling error: 'recomend' should be 'recommend'</comment>

<file context>
@@ -4,13 +4,29 @@ Python bindings for decon via [PyO3](https://pyo3.rs/).
+uv pip install decontaminate
+```
+
+There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
+`datasets` for easy management: 
+
</file context>
Suggested change
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
There are no dependencies installed by default. Since it is common to load datasets from python, we recommend using
Fix with Cubic


[project.urls]
Homepage = "https://github.com/allenai/decon"
Homepage = "https://github.com/allenai/decon"
Copy link

@cubic-dev-ai cubic-dev-ai bot Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/pyproject.toml, line 53:

<comment>Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.</comment>

<file context>
@@ -12,32 +12,45 @@ license-files = ["LICENSE"]
 
 [project.urls]
-Homepage = "https://github.com/allenai/decon"
+    Homepage = "https://github.com/allenai/decon"
 Repository = "https://github.com/vincentzed/decon"
 Documentation = "https://github.com/vincentzed/decon/blob/main/doc/python.md"
</file context>
Suggested change
Homepage = "https://github.com/allenai/decon"
Homepage = "https://github.com/allenai/decon"
Fix with Cubic


> [!NOTE]
> **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs).
> The goal of this fork is to simply expose the API transparently for python users.
Copy link

@cubic-dev-ai cubic-dev-ai bot Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At README.md, line 11:

<comment>Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.</comment>

<file context>
@@ -6,7 +6,11 @@ It uses [simple](doc/simple.md) token based sampling and counting methods, makin
 
+> [!NOTE]
 > **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs).
+> The goal of this fork is to simply expose the API transparently for python users.
+> Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`.
+> Use `pip install decontaminate` to install the package to get started.
</file context>
Suggested change
> The goal of this fork is to simply expose the API transparently for python users.
> The goal of this fork is to simply expose the API transparently for Python users.
Fix with Cubic

The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).

## Documentation
Please refer to these sections for the full detail of API. `lib.rs`:
Copy link

@cubic-dev-ai cubic-dev-ai bot Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Grammar issue: 'full detail of API' should be 'full details of the API'

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 56:

<comment>Grammar issue: 'full detail of API' should be 'full details of the API'</comment>

<file context>
@@ -1,34 +1,61 @@
+The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
 
-## Documentation
+Please refer to these sections for the full detail of API. `lib.rs`:
+- **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block
+- **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()`
</file context>
Suggested change
Please refer to these sections for the full detail of API. `lib.rs`:
Please refer to these sections for the full details of the API. `lib.rs`:
Fix with Cubic

uv pip install decontaminate
```

There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
Copy link

@cubic-dev-ai cubic-dev-ai bot Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Spelling error: 'recomend' should be 'recommend'

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 19:

<comment>Spelling error: 'recomend' should be 'recommend'</comment>

<file context>
@@ -1,34 +1,61 @@
+uv pip install decontaminate
+```
+
+There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
+`datasets` for easy management: 
+
</file context>
Suggested change
There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
There are no dependencies installed by default. Since it is common to load dataset from python, we recommend using
Fix with Cubic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant