Conversation
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Summary of ChangesHello @vincentzed, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request primarily focuses on enhancing the documentation for the Python bindings of the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly improves the documentation and packaging for the Python bindings. The updated READMEs are much clearer for new users, with better installation instructions, a quickstart guide, and API reference details. The pyproject.toml changes also enhance the package's metadata on PyPI.
I've left a few minor suggestions to further polish the language in the documentation for clarity and conciseness. I also noticed that crates/decon-py/README.md and doc/python.md have identical content. While this is fine, you might consider symlinking one to the other to avoid maintaining duplicate documentation in the future.
Overall, these are great changes that will improve the developer experience for Python users.
| > The goal of this fork is to simply expose the API transparently for python users. | ||
| > Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`. | ||
| > Use `pip install decontaminate` to install the package to get started. |
There was a problem hiding this comment.
This is a helpful note for new users. The wording could be made clearer and more concise. The installation command is also mentioned in the 'Python Quick Start' section, so it might be redundant here. Consider consolidating this information for better readability.
| > The goal of this fork is to simply expose the API transparently for python users. | |
| > Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`. | |
| > Use `pip install decontaminate` to install the package to get started. | |
| > The goal of this fork is to provide transparent Python bindings for the core Rust API. | |
| > | |
| > Please note that the package is installed from PyPI as `decontaminate` but imported into Python as `decon`. |
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using | ||
| `datasets` for easy management: | ||
|
|
||
| So in the same environment you can do `pip install datasets`. |
There was a problem hiding this comment.
This section has a typo ('recomend' should be 'recommend') and the phrasing is a bit conversational and split across lines. I suggest rephrasing for clarity and correcting the typo.
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using | |
| `datasets` for easy management: | |
| So in the same environment you can do `pip install datasets`. | |
| There are no dependencies installed by default. Since it is common to load datasets from Python, we recommend using the `datasets` library for easy management. | |
| In the same environment, you can run: | |
| `pip install datasets` |
| ``` | ||
|
|
||
| ## API | ||
| We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it. |
There was a problem hiding this comment.
This sentence could be phrased more formally and clearly. 'Loss of quality' is a bit vague, and 'please help report it' is informal.
| We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it. | |
| We strive to maintain parity with the Rust API. If you encounter any discrepancies or issues, please report them. |
| The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs). | ||
|
|
||
| ## Documentation | ||
| Please refer to these sections for the full detail of API. `lib.rs`: |
| - **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()` | ||
|
|
||
| Full documentation: https://github.com/vincentzed/decon | ||
| The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python). |
There was a problem hiding this comment.
The example ngram_size= could be confusing as it's incomplete. It would be clearer to show it as a keyword argument.
| The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python). | |
| The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable (e.g., `ngram_size` in Rust becomes the `ngram_size` keyword argument in Python). |
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using | ||
| `datasets` for easy management: | ||
|
|
||
| So in the same environment you can do `pip install datasets`. |
There was a problem hiding this comment.
This section has a typo ('recomend' should be 'recommend') and the phrasing is a bit conversational and split across lines. I suggest rephrasing for clarity and correcting the typo.
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using | |
| `datasets` for easy management: | |
| So in the same environment you can do `pip install datasets`. | |
| There are no dependencies installed by default. Since it is common to load datasets from Python, we recommend using the `datasets` library for easy management. | |
| In the same environment, you can run: | |
| `pip install datasets` |
| cleaned = decon.clean_text("Hello, World!") # "hello world" | ||
| ``` | ||
|
|
||
| We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it. |
There was a problem hiding this comment.
This sentence could be phrased more formally and clearly. 'Loss of quality' is a bit vague, and 'please help report it' is informal.
| We strive to keep parity with Rust API, if there are any issues with loss of quality, please help report it. | |
| We strive to maintain parity with the Rust API. If you encounter any discrepancies or issues, please report them. |
| The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs). | ||
|
|
||
| Key sections in `lib.rs`: | ||
| Please refer to these sections for the full detail of API. `lib.rs`: |
| - **Functions** (line ~830+): `detect()`, `clean_text()`, `review()`, `compare()`, `evals()`, `server()` | ||
|
|
||
| The Rust parameter names map directly to Python kwargs (e.g., `ngram_size` in Rust = `ngram_size=` in Python). | ||
| The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python). |
There was a problem hiding this comment.
The example ngram_size= could be confusing as it's incomplete. It would be clearer to show it as a keyword argument.
| The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable. (e.g., `ngram_size` in Rust = `ngram_size=` in Python). | |
| The Rust parameter names map directly to Python kwargs, so they are easily reusable and recognizable (e.g., `ngram_size` in Rust becomes the `ngram_size` keyword argument in Python). |
There was a problem hiding this comment.
6 issues found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="README.md">
<violation number="1" location="README.md:11">
P3: Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.</violation>
</file>
<file name="doc/python.md">
<violation number="1" location="doc/python.md:19">
P2: Spelling error: 'recomend' should be 'recommend'</violation>
</file>
<file name="crates/decon-py/README.md">
<violation number="1" location="crates/decon-py/README.md:19">
P3: Spelling error: 'recomend' should be 'recommend'</violation>
<violation number="2" location="crates/decon-py/README.md:54">
P1: Incorrect relative path: `../crates/decon-py/src/lib.rs` should be `src/lib.rs`</violation>
<violation number="3" location="crates/decon-py/README.md:56">
P3: Grammar issue: 'full detail of API' should be 'full details of the API'</violation>
</file>
<file name="crates/decon-py/pyproject.toml">
<violation number="1" location="crates/decon-py/pyproject.toml:53">
P2: Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| - `detect()`, `review()`, `compare()`, `evals()`, `server()` | ||
| - `Tokenizer` (encode/decode with cl100k, o200k, etc.) | ||
| - `clean_text()` (text normalization) | ||
| The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs). |
There was a problem hiding this comment.
P1: Incorrect relative path: ../crates/decon-py/src/lib.rs should be src/lib.rs
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 54:
<comment>Incorrect relative path: `../crates/decon-py/src/lib.rs` should be `src/lib.rs`</comment>
<file context>
@@ -1,34 +1,61 @@
-- `detect()`, `review()`, `compare()`, `evals()`, `server()`
-- `Tokenizer` (encode/decode with cl100k, o200k, etc.)
-- `clean_text()` (text normalization)
+The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
-## Documentation
</file context>
| The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs). | |
| The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](src/lib.rs). |
| uv pip install decontaminate | ||
| ``` | ||
|
|
||
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using |
There was a problem hiding this comment.
P2: Spelling error: 'recomend' should be 'recommend'
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At doc/python.md, line 19:
<comment>Spelling error: 'recomend' should be 'recommend'</comment>
<file context>
@@ -4,13 +4,29 @@ Python bindings for decon via [PyO3](https://pyo3.rs/).
+uv pip install decontaminate
+```
+
+There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
+`datasets` for easy management:
+
</file context>
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using | |
| There are no dependencies installed by default. Since it is common to load datasets from python, we recommend using |
|
|
||
| [project.urls] | ||
| Homepage = "https://github.com/allenai/decon" | ||
| Homepage = "https://github.com/allenai/decon" |
There was a problem hiding this comment.
P2: Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/pyproject.toml, line 53:
<comment>Inconsistent indentation: Homepage has 4 leading spaces while other keys in [project.urls] have none. Remove the leading spaces for consistency.</comment>
<file context>
@@ -12,32 +12,45 @@ license-files = ["LICENSE"]
[project.urls]
-Homepage = "https://github.com/allenai/decon"
+ Homepage = "https://github.com/allenai/decon"
Repository = "https://github.com/vincentzed/decon"
Documentation = "https://github.com/vincentzed/decon/blob/main/doc/python.md"
</file context>
| Homepage = "https://github.com/allenai/decon" | |
| Homepage = "https://github.com/allenai/decon" |
|
|
||
| > [!NOTE] | ||
| > **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs). | ||
| > The goal of this fork is to simply expose the API transparently for python users. |
There was a problem hiding this comment.
P3: Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At README.md, line 11:
<comment>Capitalize 'python' to 'Python' to maintain consistency with Python naming conventions and the rest of the document.</comment>
<file context>
@@ -6,7 +6,11 @@ It uses [simple](doc/simple.md) token based sampling and counting methods, makin
+> [!NOTE]
> **🐍 This fork adds Python bindings** — the core Rust functionality is unchanged. Skip to [Python Quick Start](#python) to get started, or see the [Architecture](#architecture) section to understand how bindings are structured. For the full Python API signature, see [`crates/decon-py/src/lib.rs`](crates/decon-py/src/lib.rs).
+> The goal of this fork is to simply expose the API transparently for python users.
+> Please note, the package currently used by `decon` is `decontaminate`, import is `import decon`.
+> Use `pip install decontaminate` to install the package to get started.
</file context>
| > The goal of this fork is to simply expose the API transparently for python users. | |
| > The goal of this fork is to simply expose the API transparently for Python users. |
| The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs). | ||
|
|
||
| ## Documentation | ||
| Please refer to these sections for the full detail of API. `lib.rs`: |
There was a problem hiding this comment.
P3: Grammar issue: 'full detail of API' should be 'full details of the API'
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 56:
<comment>Grammar issue: 'full detail of API' should be 'full details of the API'</comment>
<file context>
@@ -1,34 +1,61 @@
+The Python API is a thin wrapper over the Rust implementation. All parameters and their defaults are defined in [`crates/decon-py/src/lib.rs`](../crates/decon-py/src/lib.rs).
-## Documentation
+Please refer to these sections for the full detail of API. `lib.rs`:
+- **`PyConfig`** (line ~230): All `Config` parameters with defaults in the `#[pyo3(signature = ...)]` block
+- **`PyTokenizer`** (line ~740): Tokenizer with `encode()`, `decode()`, `is_space_token()`
</file context>
| Please refer to these sections for the full detail of API. `lib.rs`: | |
| Please refer to these sections for the full details of the API. `lib.rs`: |
| uv pip install decontaminate | ||
| ``` | ||
|
|
||
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using |
There was a problem hiding this comment.
P3: Spelling error: 'recomend' should be 'recommend'
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/decon-py/README.md, line 19:
<comment>Spelling error: 'recomend' should be 'recommend'</comment>
<file context>
@@ -1,34 +1,61 @@
+uv pip install decontaminate
+```
+
+There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using
+`datasets` for easy management:
+
</file context>
| There are no dependencies installed by default. Since it is common to load dataset from python, we recomend using | |
| There are no dependencies installed by default. Since it is common to load dataset from python, we recommend using |
Summary by cubic
Improved Python bindings docs and packaging for decon. Added clear install steps and a Quickstart, clarified the “decontaminate” package vs “decon” import, linked API reference details, and refined pyproject classifiers/maintainers for better PyPI visibility.
Written for commit 5631890. Summary will update on new commits. Review in cubic