
Conversation

@RobotSail (Member) commented Nov 25, 2025

This PR exposes a new API that allows the ingestion of datasets containing a document field, which is then tokenized into input_ids.
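For orientation, here is a minimal sketch of the intended flow. The import path and parameter names (data_path, data_output_path, model_path, num_cpu_procs, document_column_name) are taken from the review discussion below and should be read as illustrative rather than the final signature.

# Sketch only: the input is a JSONL dataset where each row carries raw text in a
# "document" field, e.g. {"document": "Some raw text to pretrain on."}.
from instructlab.training.data_process import process_documents_for_pretraining

process_documents_for_pretraining(
    data_path="corpus.jsonl",          # input documents
    data_output_path="processed/",     # directory that receives data.jsonl
    model_path="path/to/model",        # source of the tokenizer (must define EOS)
    num_cpu_procs=8,
    document_column_name="document",
)
# Output: processed/data.jsonl with one record per document: {"input_ids": [...], "len": N}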

Summary by CodeRabbit

  • New Features

    • Continual pretraining mode with configurable block size and document column; public PretrainingConfig exported and available via CLI and programmatic config.
    • Document preprocessing tool that tokenizes documents, enforces EOS, logs stats, and outputs JSONL for block sampling.
    • Block-based dataset/loader path that concatenates tokens into fixed-size padded blocks for pretraining.
  • Tests

    • Extensive unit tests covering preprocessing, block dataset behavior, loader batching, and end-to-end pretraining flow.
  • Documentation

    • Added "Continual pretraining mode" section with CLI and code examples.
  • Bug Fixes

    • Improved process handling and updated tokenizer call behavior in tests.


@coderabbitai coderabbitai bot commented Nov 25, 2025

Warning

Rate limit exceeded

@RobotSail has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 51 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 5f8f2a5 and 77d39f6.

📒 Files selected for processing (4)
  • src/instructlab/training/accelerator.py
  • src/instructlab/training/data_process.py
  • src/instructlab/training/main_ds.py
  • tests/unit/test_pretraining_data_process.py
📝 Walkthrough

Walkthrough

Adds continual pretraining: new PretrainingConfig and TrainingArgs field, document-to-JSONL preprocessing, PretrainingBlockDataset and loader branch for fixed-size block sampling, CLI flags for block-size/document-column-name, wiring through main orchestration, and unit tests for the new flow.

Changes

  • Configuration (src/instructlab/training/config.py): Added PretrainingConfig(BaseModel) with block_size: int and document_column_name: str = "document". Extended DataProcessArgs with is_pretraining: bool = False and pretraining_column_name: str = "document". Added pretraining_config: Optional[PretrainingConfig] = None to TrainingArgs (see the sketch after this list).
  • Pretraining data processing (src/instructlab/training/data_process.py): Added process_documents_for_pretraining(...) to load a JSON dataset, require a document column and tokenizer EOS, tokenize documents (appending EOS if missing), emit per-document records { "input_ids": [...], "len": N } into data.jsonl, and log stats. Also updated tokenizer calls to use return_dict=False.
  • Data loading & sampling (src/instructlab/training/sampler.py): Added PretrainingBlockDataset, which concatenates input_ids, splits them into fixed-size blocks with padding, and provides from_jsonl_file, __len__, __getitem__, and get_lengths. get_data_loader now accepts pretraining_config: Optional[PretrainingConfig] and selects the pretraining dataset when provided. Added logging imports.
  • Training entry / CLI / orchestration (src/instructlab/training/main_ds.py): Wires pretraining_config through the CLI and runtime: adds --block-size and --document-column-name flags, calls process_documents_for_pretraining when configured, passes pretraining_config into get_data_loader, and includes the block args in the torchrun invocation. Replaces process.poll usage with process.wait(timeout=60) and updates logs.
  • Public exports (src/instructlab/training/__init__.py): Exported PretrainingConfig by importing it from .config and adding it to __all__.
  • Tests (tests/unit/test_pretraining_mode.py, tests/unit/test_pretraining_data_process.py, tests/unit/test_pretraining_sampler.py, tests/unit/test_data_process.py): Added comprehensive tests for process_documents_for_pretraining, PretrainingBlockDataset, and pretraining data loader behavior; updated mocks to accept return_dict/**kwargs.
  • Docs (README.md): Added a "Continual pretraining mode" section with CLI and programmatic examples for --block-size and --document-column-name, a navigation entry, and a minor heading adjustment.
  • Dev tooling (tox.ini): Added pydantic to the [testenv:mypy] deps.
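To make the configuration change above concrete, here is a minimal sketch of the programmatic opt-in. Field names come from the summary; the surrounding TrainingArgs fields are omitted and the exact wiring may differ.

from instructlab.training import PretrainingConfig

# block_size is required; document_column_name defaults to "document".
pretraining = PretrainingConfig(block_size=2048, document_column_name="document")

# Per the summary, this is attached to the training configuration as
# TrainingArgs(..., pretraining_config=pretraining); leaving it as None keeps
# the standard instruction-tuning path.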

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as CLI
    participant Main as main_ds.train
    participant Proc as data_process.process_documents_for_pretraining
    participant Tok as AutoTokenizer
    participant Loader as sampler.get_data_loader
    participant Trainer as torchrun/Trainer

    CLI->>Main: parse args (may include --block-size, --document-column-name)
    Main->>Main: set train_args.pretraining_config (optional)
    alt pretraining_config provided
        Main->>Proc: process_documents_for_pretraining(data_path, out_path, model_path, ...)
        Proc->>Tok: AutoTokenizer.from_pretrained(model_path)
        Tok-->>Proc: tokenizer (must have eos_token_id)
        loop per document
            Proc->>Tok: tokenizer(doc, add_special_tokens=True, return_dict=False)
            Tok-->>Proc: input_ids
            Proc->>Proc: append EOS if missing, write record to data.jsonl
        end
        Proc-->>Main: data.jsonl ready
        Main->>Loader: get_data_loader(..., pretraining_config)
        Loader-->>Main: DataLoader(PretrainingBlockDataset)
        Main->>Trainer: assemble torchrun with --block-size & --document-column-name
    else standard training
        Main->>Main: process_data(...) (existing pipeline)
        Main->>Loader: get_data_loader(..., pretraining_config=None)
        Loader-->>Main: DataLoader(TokenDataset)
        Main->>Trainer: assemble standard torchrun
    end
    Main->>Trainer: launch distributed training
    Trainer-->>Main: exit (return code)
sequenceDiagram
    autonumber
    participant Caller as process_documents_for_pretraining()
    participant FS as FileSystem
    participant Tok as AutoTokenizer
    Note over Caller: per-document tokenize -> jsonl output

    Caller->>FS: load JSON dataset
    FS-->>Caller: dataset
    Caller->>Tok: AutoTokenizer.from_pretrained(model_path)
    Tok-->>Caller: tokenizer (must have eos_token_id)

    loop per document
      Caller->>Tok: tokenizer(doc, add_special_tokens=True, return_dict=False)
      Tok-->>Caller: input_ids
      Caller->>Caller: ensure EOS present
      Caller->>FS: append `{ "input_ids": [...], "len": N }` to data.jsonl
    end

    Caller->>FS: finalize/write file
    Caller-->>Caller: log totals and averages

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I'm a rabbit with a tokenizer hat,
I hop through docs and stitch them flat.
Blocks of tokens, tidy and neat,
JSONL trails for models to eat.
🐇✨ Pretraining ready — bounce on repeat!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 70.59%, which is insufficient; the required threshold is 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Exposes API for processing pretraining data' accurately and concisely summarizes the main objective of the pull request: introducing a new public API for processing pretraining datasets.


@mergify mergify bot added the ci-failure label Nov 25, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
src/instructlab/training/data_process.py (2)

1202-1204: Prefer lazy formatting in logger calls.

Using f-strings in logger.info() evaluates the string even when the log level is disabled. Use lazy formatting for better performance.

-    logger.info(f"Processed {len(tokenized_data):,} documents")
-    logger.info(f"Total tokens: {total_tokens:,}")
-    logger.info(f"Average tokens per document: {avg_tokens:.1f}")
+    logger.info("Processed %s documents", f"{len(tokenized_data):,}")
+    logger.info("Total tokens: %s", f"{total_tokens:,}")
+    logger.info("Average tokens per document: %.1f", avg_tokens)

1217-1218: Same lazy formatting suggestion for these logger calls.

-    logger.info(f"Saved tokenized documents to {output_file}")
+    logger.info("Saved tokenized documents to %s", output_file)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 781c36f and 2b3c3ef.

📒 Files selected for processing (2)
  • src/instructlab/training/config.py (1 hunks)
  • src/instructlab/training/data_process.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (4)
src/instructlab/training/config.py (2)

66-70: LGTM!

The is_pretraining field is well-defined with a clear description that explains its purpose.


75-80: LGTM!

The validator correctly enforces mutual exclusivity between is_pretraining mode and chat_tmpl_path, which aligns with the pretraining workflow that tokenizes raw documents without chat templates.
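A minimal sketch of the kind of mutual-exclusivity check being described, assuming pydantic v2 and the field names mentioned above; the actual validator in config.py may be written differently.

from typing import Optional

from pydantic import BaseModel, model_validator

class DataProcessArgsSketch(BaseModel):
    is_pretraining: bool = False
    chat_tmpl_path: Optional[str] = None

    @model_validator(mode="after")
    def _pretraining_excludes_chat_template(self):
        # Raw-document pretraining bypasses chat templates entirely.
        if self.is_pretraining and self.chat_tmpl_path:
            raise ValueError("is_pretraining cannot be combined with chat_tmpl_path")
        return self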

src/instructlab/training/data_process.py (2)

1156-1179: LGTM!

Good defensive programming with proper validation for write permissions, dataset format, field requirements, and tokenizer EOS token.


1136-1141: Verify integration: is_pretraining flag not wired to this function.

The DataProcessArgs.is_pretraining flag was added in config.py, but neither main() nor process_data() in this file dispatches to process_documents_for_pretraining based on that flag. Users would need to call this function directly.

If this is intentional (exposing API only), the approach is fine. Otherwise, consider adding integration in process_data():

def process_data(..., is_pretraining: bool = False):
    if is_pretraining:
        process_documents_for_pretraining(
            data_path=data_path,
            data_output_path=data_output_path,
            model_path=model_path,
            num_cpu_procs=num_cpu_procs,
        )
    elif chat_tmpl_path:
        # existing legacy path...


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/instructlab/training/data_process.py (1)

1152-1152: Fix documentation: field name should be document (singular).

The docstring says {"documents": "text"} but the code validates for 'document' (singular) at Line 1171 and the parameter defaults to "document". This inconsistency will confuse API users.

-        data_path: Path to input JSONL with {"documents": "text"} format
+        data_path: Path to input JSONL with {"document": "text"} format
🧹 Nitpick comments (1)
src/instructlab/training/data_process.py (1)

1176-1180: Consider adding error handling for tokenizer loading and validating BOS token.

For consistency with the dataset loading pattern (lines 1161-1166), consider wrapping the tokenizer loading in a try/except. Additionally, since the docstring at Line 1149 documents that documents follow the [BOS][tokens][EOS] pattern, consider validating that tokenizer.bos_token_id exists, similar to the EOS validation.

 logger.info("Loading tokenizer from %s", model_path)
-tokenizer = AutoTokenizer.from_pretrained(model_path)
+try:
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+except Exception as e:
+    raise ValueError(
+        f"Failed to load tokenizer from {model_path}"
+    ) from e

+if tokenizer.bos_token_id is None:
+    logger.warning("Tokenizer does not have a BOS token defined")
+
 if tokenizer.eos_token_id is None:
     raise ValueError("Tokenizer must have an EOS token defined for pretraining")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b3c3ef and 6b6ed96.

📒 Files selected for processing (2)
  • src/instructlab/training/config.py (1 hunks)
  • src/instructlab/training/data_process.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/instructlab/training/config.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/data_process.py (1)

1190-1192: Good defensive guard against duplicate EOS tokens.

The guard correctly prevents duplicate EOS tokens, addressing the concern from the previous review. This handles tokenizers that may or may not automatically append EOS with add_special_tokens=True.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/instructlab/training/config.py (1)

66-75: Pretraining config fields look sound; consider minor naming and doc tightening

The new is_pretraining and pretraining_column_name fields are backward‑compatible and clearly scoped to pretraining, which is good. A couple of small polish ideas:

  • The naming between config (pretraining_column_name) and the pretraining helper (document_column_name in process_documents_for_pretraining(...) per the summary) is slightly inconsistent. Consider aligning them (e.g., document_column_name everywhere) to reduce cognitive overhead when wiring configs to function parameters.
  • You may want the pretraining_column_name description to explicitly mention the default and mode dependency, e.g., “Name of the dataset column containing raw documents to pretrain on (default: document). Used only when is_pretraining is True.”

These are non‑blocking but would make the API a bit clearer for users and future maintainers.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b6ed96 and dc2f5fd.

📒 Files selected for processing (1)
  • src/instructlab/training/config.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
src/instructlab/training/data_process.py (2)

1163-1177: Validate document contents (None/empty/non‑string) before tokenization.

tokenize_document assumes sample[document_column_name] is a non‑empty string. If a row has None, a non‑string, or an empty/whitespace‑only value, tokenizer.encode(...) and especially input_ids[-1] can fail or produce meaningless sequences.

Consider validating and failing fast with a clear error:

-    def tokenize_document(sample):
-        input_ids = tokenizer.encode(
-            sample[document_column_name], add_special_tokens=True
-        )
+    def tokenize_document(sample):
+        doc_text = sample[document_column_name]
+        if doc_text is None or not isinstance(doc_text, str) or not doc_text.strip():
+            raise ValueError(
+                f"Document in column '{document_column_name}' is missing, non-string, or empty"
+            )
+
+        input_ids = tokenizer.encode(doc_text, add_special_tokens=True)
@@
-        # ensures eos token is present without double-adding it.
-        if input_ids[-1] != tokenizer.eos_token_id:
+        # ensures eos token is present without double-adding it.
+        if input_ids and input_ids[-1] != tokenizer.eos_token_id:
             input_ids.append(tokenizer.eos_token_id)

This makes the API more robust to dirty data and avoids opaque tokenizer/indexing errors during large runs.

Also applies to: 1186-1199


1145-1159: Fix docstring to match document_column_name and actual schema.

The docstring currently documents {"documents": "text"}, but the function defaults document_column_name="document" and later validates that this exact field exists.

Recommend aligning the docstring with the actual API:

-    Args:
-        data_path: Path to input JSONL with {"documents": "text"} format
+    Args:
+        data_path: Path to input JSONL with {"document": "text"} format
         data_output_path: Directory for processed data output
         model_path: Path to model/tokenizer
         num_cpu_procs: Number of parallel processes
         document_column_name: Name of the column containing the documents
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc2f5fd and 2d12ab0.

📒 Files selected for processing (1)
  • src/instructlab/training/data_process.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: pylint
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/data_process.py (1)

691-693: Return type verification confirms the code change is correct.

The web search confirms that tokenizer.apply_chat_template(..., return_dict=False) with the default tokenize=True returns a list[int] (or list[list[int]] for batched inputs) rather than a dictionary. This flat token sequence is exactly what's needed for the subsequent indexing and region-masking logic that follows in the unmasking implementation.

Comment on lines 413 to 420
 data_with_input_ids = data.map(
     lambda x: {
-        "input_ids": tokenizer.apply_chat_template(x["messages"], tokenize=True),
+        # newer versions of transformers have `return_dict=True` by default
+        "input_ids": tokenizer.apply_chat_template(x["messages"], tokenize=True, return_dict=False),
         "unmask": bool(x["unmask"]) if "unmask" in x else False,
     },


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

transformers 4.55.0 PreTrainedTokenizer apply_chat_template default return_dict behavior


<web_search_result>
Short answer: In transformers 4.55.0, PreTrainedTokenizer.apply_chat_template defaults return_dict=False. When you set return_dict=True the method returns the usual tokenizer output dict (input_ids, attention_mask, pixel_values, etc. where applicable); with return_dict=False it returns the token id list (or token array) ready for the model. [1][2][3]

Sources

  • Transformers tokenizer docs (apply_chat_template signature & param default return_dict=False). [1]
  • Chat-template docs / internal tokenizer docs describing return_dict behavior and pixel_values handling. [2]
  • v4.55.0 examples showing usage with return_dict=True. [3]
    </web_search_result>

Script executed:

#!/bin/bash
# Search for other usages of apply_chat_template in the codebase to understand context
rg -n "apply_chat_template" --type=py -A2 -B2

Repository: instructlab/training


<shell_result>
Repository: instructlab/training
Script execution failed with unrecoverable error: [CAPABILITY:RUN_SHELL_SCRIPT] Failed to clone repository instructlab/training: Clone operation failed
</shell_result>


transformers apply_chat_template return_dict default history earlier versions before 4.55.0


<web_search_result>
Short answer: Historically apply_chat_template defaulted to tokenize=True and return_dict=False (so it usually returned token ids or a string). Over a series of changes (notably PR/commit referenced as #36149) callers from Processor/ProcessorMixin started invoking tokenizer.apply_chat_template with tokenize=False and return_dict=False, which effectively forced non-tokenized string output and made options that require tokenization/return_dict (e.g. return_assistant_tokens_mask=True) unusable. See the changelog/issues and docs for details. [1][2][3]

Sources

  • Issue describing the behavior change and how #36149 hard-codes tokenize=False and return_dict=False (blocks return_assistant_tokens_mask). [1]
  • apply_chat_template signature/docs showing return_dict default and behavior (return_dict=False by default; return_assistant_tokens_mask requires return_dict=True/tokenize). [2]
  • Hugging Face chat-templating docs / examples describing apply_chat_template usage and generation-mask behavior. [3]
    </web_search_result>

Clarify the rationale: return_dict=False is already the default, not a future-proofing measure.

The inline comment suggests that "newer versions of transformers have return_dict=True by default," but apply_chat_template defaults to return_dict=False in transformers 4.55.0 and historically has done so. The explicit parameter here is redundant but harmless.

If this is intended as defensive coding against hypothetical future changes, consider clarifying or removing the comment to avoid misleading future readers. If the intent is to document current behavior, the comment should state that the parameter preserves the existing default rather than implying a version-specific workaround.

🤖 Prompt for AI Agents
In src/instructlab/training/data_process.py around lines 413 to 418, the inline
comment claiming "newer versions of transformers have `return_dict=True` by
default" is misleading; update the comment to reflect the actual intent: either
remove the comment entirely (since `return_dict=False` is the current default
for apply_chat_template) or replace it with a concise note like "explicitly set
return_dict=False to preserve current behavior" so readers understand this is
preserving existing behavior rather than referencing a specific version change.

@mergify mergify bot added the testing Relates to testing label Nov 26, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
src/instructlab/training/config.py (1)

80-88: Add "public API" comment for consistency.

Other public config classes in this file (e.g., DataProcessArgs, TorchrunArgs, LoraOptions, TrainingArgs) are marked with # public API comments. Consider adding this marker to PretrainingConfig for consistency since it's exposed via TrainingArgs.

+# public API
 class PretrainingConfig(BaseModel):
     """
     Configuration for pretraining mode.
     """
src/instructlab/training/sampler.py (1)

382-387: Clarify get_lengths() vs __getitem__["len"] semantics.

get_lengths() returns the actual token count for the last block (last_block_len), while __getitem__ always returns "len": self.block_size (the padded length). This inconsistency may be intentional:

  • "len" in samples → padded size for batch allocation
  • get_lengths() → actual tokens for statistics

Consider adding a docstring to get_lengths() clarifying it returns actual token counts, not padded lengths, to prevent confusion.

     def get_lengths(self) -> np.ndarray:
+        """Return actual token counts per block (not padded lengths)."""
         lengths = np.full(self.num_blocks, self.block_size, dtype=np.int64)
         if self.num_blocks and self.last_block_len != self.block_size:
             lengths[-1] = self.last_block_len
         return lengths
tests/unit/test_pretraining_mode.py (1)

102-153: Consider adding edge case tests.

The test validates the happy path but could benefit from additional edge cases:

  • Empty document handling
  • Documents with existing EOS tokens (to verify no double-adding; a sketch follows below)
  • Custom document_column_name parameter

These can be added in a follow-up PR if desired.
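As a self-contained sketch of that second edge case (documents that already end in EOS), the guard logic described in this review is inlined here rather than calling the real process_documents_for_pretraining, so all names are illustrative only.

EOS = 2

def stub_encode(text, append_eos):
    # Hypothetical stub: char-ordinal "tokens", optionally already EOS-terminated.
    ids = [ord(ch) % 50 + 3 for ch in text]
    return ids + [EOS] if append_eos else ids

def tokenize_with_guard(text, append_eos):
    input_ids = stub_encode(text, append_eos)
    if input_ids[-1] != EOS:  # the guard under review
        input_ids.append(EOS)
    return input_ids

def test_eos_is_never_double_added():
    for append_eos in (True, False):
        ids = tokenize_with_guard("some document", append_eos)
        assert ids[-1] == EOS
        assert ids.count(EOS) == 1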

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d12ab0 and 50f8ad8.

📒 Files selected for processing (4)
  • src/instructlab/training/config.py (2 hunks)
  • src/instructlab/training/main_ds.py (7 hunks)
  • src/instructlab/training/sampler.py (5 hunks)
  • tests/unit/test_pretraining_mode.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/test_pretraining_mode.py (3)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-87)
src/instructlab/training/data_process.py (1)
  • process_documents_for_pretraining (1138-1224)
src/instructlab/training/sampler.py (4)
  • PretrainingBlockDataset (298-386)
  • get_data_loader (433-496)
  • get_lengths (382-386)
  • get_lengths (429-430)
src/instructlab/training/sampler.py (1)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-87)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-87)
src/instructlab/training/data_process.py (2)
  • process_documents_for_pretraining (1138-1224)
  • process_data (1452-1495)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (15)
src/instructlab/training/config.py (2)

66-75: LGTM!

The new pretraining fields are well-defined with sensible defaults and clear descriptions. The is_pretraining flag and configurable column name provide good flexibility for the pretraining workflow.


277-283: LGTM!

The pretraining_config field is correctly defined as Optional with a None default, allowing pretraining to be an opt-in feature. The description clearly explains the purpose.

src/instructlab/training/sampler.py (5)

4-22: LGTM!

The new imports are appropriately organized and the logger is correctly initialized at module level.


340-370: Verify labels alignment for causal LM training.

For full blocks (lines 361-363), labels is set identical to input_ids. This pattern relies on the model internally shifting labels during loss computation (common in HuggingFace causal LM models). Confirm this is the expected behavior for the models used with this training library.

Additionally, loss_tokens = self.block_size - 1 correctly accounts for the first token having no prediction target.
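For readers less familiar with this convention, a minimal sketch of the internal label shift that makes block_size - 1 the right count (this is the standard Hugging Face causal-LM loss pattern, not code from this PR):

import torch
import torch.nn.functional as F

block_size, vocab = 8, 100
logits = torch.randn(1, block_size, vocab)         # (batch, seq, vocab)
labels = torch.randint(0, vocab, (1, block_size))  # labels == input_ids for a full block

# Shift so that position i predicts token i + 1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab)  # block_size - 1 predictions
shift_labels = labels[:, 1:].reshape(-1)             # block_size - 1 targets
assert shift_labels.numel() == block_size - 1
loss = F.cross_entropy(shift_logits, shift_labels)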


372-380: LGTM!

The from_jsonl_file factory method correctly loads the dataset and delegates to the constructor.


444-474: LGTM!

The get_data_loader function correctly branches between PretrainingBlockDataset and TokenDataset based on the presence of pretraining_config. The logging provides helpful context about the pretraining configuration.
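For reference, a caller-side sketch of this branching; the argument names mirror a test snippet quoted later in this thread, and the concrete values are illustrative.

from instructlab.training import PretrainingConfig
from instructlab.training.sampler import get_data_loader

loader = get_data_loader(
    data_path="processed/data.jsonl",
    batch_size=2,
    max_tokens_per_gpu=1000,
    seed=42,
    rank=0,
    world_size=1,
    # A PretrainingConfig selects PretrainingBlockDataset; None keeps TokenDataset.
    pretraining_config=PretrainingConfig(block_size=512),
    pad_token_id=0,
)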


298-338: I encountered a repository clone failure, which prevents me from verifying the codebase patterns. However, I can analyze the review comment based on the provided code snippet and context.

Review comment analysis:

The review comment raises a valid technical concern:

  • The code does concatenate all input_ids into a single Python list (lines 312-317)
  • This is indeed memory-intensive for large datasets (billions of tokens)
  • The comment appropriately acknowledges this is acceptable for manageable sizes
  • It flags this as an optional refactor for future iteration

The concern is technically sound and the suggestion is reasonable. However, without access to the full codebase, I cannot verify:

  1. Whether existing patterns for memory-mapped arrays or lazy loading are already implemented elsewhere
  2. What the expected dataset size ranges are in practice
  3. Whether this concern has already been addressed in other parts of the codebase

Since verification cannot be completed through automated checks, the review comment should be flagged for manual verification.

Memory usage consideration for large datasets.

The implementation concatenates all input_ids into a single Python list in memory (lines 312-317). For very large pretraining corpora (billions of tokens), this could lead to high memory consumption during dataset initialization.

For now, this is acceptable if the expected dataset sizes are manageable. If memory becomes an issue, consider lazy loading with memory-mapped arrays or chunked processing in a future iteration.
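If that refactor is ever needed, a rough sketch of the memory-mapped direction; this is illustrative only, the helper names are hypothetical, and nothing like it is part of this PR.

import numpy as np

def write_token_memmap(token_chunks, path, dtype=np.int32):
    # Stream per-document token-id lists to disk instead of one large Python list.
    total = sum(len(chunk) for chunk in token_chunks)  # token_chunks: list of lists
    arr = np.memmap(path, dtype=dtype, mode="w+", shape=(total,))
    offset = 0
    for chunk in token_chunks:
        arr[offset : offset + len(chunk)] = chunk
        offset += len(chunk)
    arr.flush()
    return total

def read_block(path, block_index, block_size, total_tokens, dtype=np.int32):
    # Lazily materialize a single fixed-size block on access.
    arr = np.memmap(path, dtype=dtype, mode="r", shape=(total_tokens,))
    start = block_index * block_size
    return np.asarray(arr[start : start + block_size])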

tests/unit/test_pretraining_mode.py (2)

24-57: LGTM!

Good test coverage for PretrainingBlockDataset. The test correctly validates (a worked example follows this list):

  • Block creation with concatenated tokens
  • Padding behavior for the partial last block
  • Loss token counting for full vs partial blocks
  • get_lengths() returning actual token counts
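A quick companion to these checks: a self-contained sketch of the block math the test pins down. It mirrors the described behavior and is not the dataset class itself.

# 7 concatenated tokens, block_size=4, pad_token_id=0, padded label positions masked with -100.
tokens = [1, 2, 3, 4, 5, 6, 7]
block_size, pad_token_id = 4, 0

blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
tail = blocks[-1]
padded_tail = tail + [pad_token_id] * (block_size - len(tail))
tail_labels = tail + [-100] * (block_size - len(tail))

print(blocks[0])    # [1, 2, 3, 4]   -> full block, loss tokens = block_size - 1 = 3
print(padded_tail)  # [5, 6, 7, 0]   -> partial block, loss tokens = actual_len - 1 = 2
print(tail_labels)  # [5, 6, 7, -100]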

59-100: LGTM!

Comprehensive integration test for the pretraining data loader. The test correctly validates the packed batch structure, sample counts, and loss token aggregation.

src/instructlab/training/main_ds.py (6)

51-52: LGTM!

The PretrainingConfig import is correctly added alongside related config imports.


366-378: LGTM!

The pretraining_config is correctly passed to get_data_loader using defensive getattr access, which handles both CLI-invoked and programmatic calls gracefully.


542-545: LGTM!

The --block-size flag is correctly appended to the subprocess command when pretraining mode is enabled.


788-793: LGTM!

The --block-size CLI argument is well-defined with a clear help message explaining its purpose.


842-845: LGTM!

The CLI argument is correctly parsed into a PretrainingConfig instance when provided, maintaining the same pattern as other config objects.


470-489: I apologize for the technical difficulties with repository access. I'm unable to execute verification scripts due to consistent cloning failures. However, based on the information provided in the review comment itself, I can provide a rewritten version:

Document column name parameter is not exposed through TrainingArgs.

The call to process_documents_for_pretraining doesn't pass the document_column_name parameter (which defaults to "document"), and this parameter is not configurable through TrainingArgs. This prevents users of run_training() from customizing the column name for pretraining workflows.

Consider either:

  • Adding pretraining_document_column_name to PretrainingConfig or TrainingArgs to allow customization, or
  • Documenting this as an intentional limitation if simplicity of the API is preferred.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
src/instructlab/training/data_process.py (3)

693-694: Same misleading comment about return_dict default.

This is the same issue as at line 415. Consider updating or removing the comment.


1155-1156: Docstring field name mismatch: should be document (singular).

The docstring says {"documents": "text"} but the code validates for 'document' (the document_column_name default). This was flagged in a prior review and remains unfixed.

-        data_path: Path to input JSONL with {"documents": "text"} format
+        data_path: Path to input JSONL with {"document": "text"} format

1189-1201: Missing validation for empty or None documents.

The tokenize_document function doesn't validate that sample[document_column_name] is non-empty before tokenization. Empty strings produce minimal [BOS][EOS] tokens, and None values will cause tokenizer.encode() to fail. This was flagged in a prior review.

 def tokenize_document(sample):
+    doc_text = sample[document_column_name]
+    if not doc_text or not isinstance(doc_text, str) or not doc_text.strip():
+        raise ValueError(
+            f"Document at column '{document_column_name}' is empty or invalid"
+        )
+
     input_ids = tokenizer.encode(
-        sample[document_column_name], add_special_tokens=True
+        doc_text, add_special_tokens=True
     )
🧹 Nitpick comments (5)
src/instructlab/training/data_process.py (3)

415-418: Comment is misleading: return_dict=False is already the default.

The comment states newer transformers versions have return_dict=True by default, but apply_chat_template has always defaulted to return_dict=False. The explicit parameter is harmless but the comment is incorrect.

-            # newer versions of transformers have `return_dict=True` by default
+            # explicitly set return_dict=False to ensure we get a list of token ids
             "input_ids": tokenizer.apply_chat_template(
                 x["messages"], tokenize=True, return_dict=False
             ),

1210-1215: Use lazy string formatting for logger calls.

The f-string formatting happens unconditionally even if the log level would suppress the message. Use %-style formatting for performance.

-    logger.info(f"Processed {len(tokenized_data):,} documents")
-    logger.info(f"Total tokens: {total_tokens:,}")
-    logger.info(f"Average tokens per document: {avg_tokens:.1f}")
+    logger.info("Processed %s documents", f"{len(tokenized_data):,}")
+    logger.info("Total tokens: %s", f"{total_tokens:,}")
+    logger.info("Average tokens per document: %.1f", avg_tokens)

1225-1226: Same f-string formatting in logger calls.

-    logger.info(f"Saved tokenized documents to {output_file}")
+    logger.info("Saved tokenized documents to %s", output_file)
tests/unit/test_pretraining_mode.py (1)

112-115: StubTokenizer.encode logic may be inverted.

The stub returns base when add_special_tokens=True and base[1:] when False. This implies BOS is always at index 0, but for a stub that doesn't actually prepend a BOS token (it just uses char ordinals), this is semantically inconsistent. However, since the test just validates the pipeline flow rather than tokenizer correctness, this is acceptable for test purposes.

Consider adding a comment clarifying the stub behavior:

 def encode(self, text, add_special_tokens=True):
+    # Stub: simulate BOS by including first char only when add_special_tokens=True
     base = [ord(ch) % 50 + 1 for ch in text]
     return base if add_special_tokens else base[1:]
src/instructlab/training/sampler.py (1)

375-383: Consider adding error handling for better user experience.

The from_jsonl_file method lacks explicit error handling. If the file is missing or malformed, users will receive raw exceptions from load_dataset. Consider adding try/except blocks with clearer error messages.

Example:

 @classmethod
 def from_jsonl_file(
     cls,
     data_path: str,
     block_size: int,
     pad_token_id: int,
 ) -> "PretrainingBlockDataset":
+    try:
         dataset = load_dataset("json", data_files=data_path, split="train")
+    except FileNotFoundError:
+        raise FileNotFoundError(f"Pretraining data file not found: {data_path}")
+    except Exception as e:
+        raise ValueError(f"Failed to load pretraining data from {data_path}: {e}")
     return cls(dataset, block_size, pad_token_id)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50f8ad8 and 75aa3bc.

📒 Files selected for processing (6)
  • src/instructlab/training/__init__.py (2 hunks)
  • src/instructlab/training/config.py (2 hunks)
  • src/instructlab/training/data_process.py (3 hunks)
  • src/instructlab/training/main_ds.py (7 hunks)
  • src/instructlab/training/sampler.py (5 hunks)
  • tests/unit/test_pretraining_mode.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
src/instructlab/training/__init__.py (1)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
tests/unit/test_pretraining_mode.py (3)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/data_process.py (2)
  • process_documents_for_pretraining (1140-1226)
  • main (1500-1517)
src/instructlab/training/sampler.py (4)
  • PretrainingBlockDataset (299-389)
  • get_data_loader (436-499)
  • get_lengths (385-389)
  • get_lengths (432-433)
src/instructlab/training/sampler.py (3)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/padded_batch_packer.py (1)
  • batch_lengths_to_minibatches_padded (200-244)
src/instructlab/training/type_definitions.py (1)
  • CollatedItem (49-61)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/data_process.py (2)
  • process_documents_for_pretraining (1140-1226)
  • process_data (1454-1497)
src/instructlab/training/data_process.py (1)
tests/unit/test_pretraining_mode.py (1)
  • encode (113-115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (19)
src/instructlab/training/data_process.py (1)

1194-1196: Good: EOS guard now prevents duplicate EOS tokens.

The guard if input_ids[-1] != tokenizer.eos_token_id correctly prevents double-appending EOS. This addresses the previous review feedback.

src/instructlab/training/__init__.py (1)

13-14: LGTM: PretrainingConfig correctly exposed in public API.

The new PretrainingConfig is properly added to __all__ and imported from .config, making it available for external users.

Also applies to: 27-28

src/instructlab/training/config.py (3)

80-92: LGTM: PretrainingConfig is well-defined.

The PretrainingConfig class has a required block_size field (no default forces explicit specification) and a sensible default for document_column_name. Good design for a configuration that requires explicit opt-in.


281-287: LGTM: Optional pretraining_config integration in TrainingArgs.

The pretraining_config field is correctly typed as Optional[PretrainingConfig] with None default, allowing backward compatibility while enabling pretraining mode when explicitly provided.


66-75: I'll verify whether the is_pretraining and pretraining_column_name fields are actually used in the codebase or represent dead code.
Script:

#!/bin/bash
# Search for all usages of is_pretraining and pretraining_column_name
rg -n "is_pretraining|pretraining_column_name" --type=py -C3

Script:

#!/bin/bash
# Check for process_documents_for_pretraining function and its usage
rg -n "process_documents_for_pretraining|def process_" --type=py -C5

tests/unit/test_pretraining_mode.py (3)

27-60: LGTM: Comprehensive test for PretrainingBlockDataset.

The test correctly validates:

  • Block count calculation (7 tokens → 2 blocks of size 4)
  • First block: full block with num_loss_counted_tokens = block_size - 1 = 3
  • Second block: partial with padding, correct labels (-100 for pad), and num_loss_counted_tokens = 2
  • get_lengths() returns actual token counts per block

63-104: LGTM: Good integration test for pretraining data loader.

The test validates:

  • PretrainingBlockDataset is correctly instantiated when pretraining_config is provided
  • Batch structure includes expected keys and shapes
  • Loss token counting is propagated correctly

106-157: LGTM: End-to-end test for document processing pipeline.

The test properly:

  • Mocks AutoTokenizer.from_pretrained to isolate the tokenization logic
  • Validates output JSONL structure (input_ids, len fields)
  • Confirms EOS token is appended correctly
src/instructlab/training/main_ds.py (6)

49-52: LGTM: PretrainingConfig import added correctly.


366-378: LGTM: Data loader integration with pretraining config.

The get_data_loader call correctly passes pretraining_config using getattr with a None fallback, ensuring backward compatibility when the attribute is not present.


470-490: LGTM: Correct branching for pretraining data processing.

When pretraining_config is provided, the code correctly calls process_documents_for_pretraining with the appropriate parameters including document_column_name from the config. Otherwise, it falls back to the standard process_data path.


543-547: LGTM: Command construction propagates pretraining flags.

The CLI arguments --block-size and --document-column-name are correctly appended when pretraining_config is provided.


790-801: LGTM: CLI arguments for pretraining mode.

New --block-size and --document-column-name arguments are correctly defined with appropriate help text.


850-861: LGTM: Validation and PretrainingConfig construction from CLI args.

Good validation that --document-column-name requires --block-size. The PretrainingConfig is correctly constructed only when block_size is provided, and document_column_name is included via kwargs when specified.

src/instructlab/training/sampler.py (5)

5-5: LGTM: Clean imports for pretraining support.

The added imports and logger initialization are appropriate for the new pretraining functionality.

Also applies to: 8-8, 17-17, 23-24


385-389: LGTM: Correct length calculation.

The get_lengths method correctly handles both uniform blocks and the partial last block.


447-447: LGTM: Clean integration of pretraining support.

The conditional logic cleanly separates pretraining and fine-tuning dataset construction. The logging is helpful, and backward compatibility is maintained with the optional parameter.

Also applies to: 462-462, 467-478


362-362: I'm unable to complete the verification task as the repository failed to clone. Without direct access to the codebase, I cannot examine the referenced code sections to verify the loss token counting logic.

However, based on the review comment itself, which is well-reasoned and identifies a specific concern about consistency in loss token counting between PretrainingBlockDataset (lines 362, 366) and TokenDataset (lines 421-423), this requires manual verification by someone with access to the repository.

Loss token counting consistency requires verification against model loss computation.

The comment correctly identifies that for causal language modeling, N tokens should produce N-1 predictions. However, the consistency between PretrainingBlockDataset (which uses N-1 via max(actual_len - 1, 0) and self.block_size - 1) and TokenDataset (which counts labels != -100) needs to be verified by reviewing how the model actually computes loss and whether the -1 is applied consistently across both pretraining and fine-tuning modes.
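A tiny numeric illustration of the discrepancy being flagged, assuming labels equal input_ids for a full block:

block_size = 4
labels = [11, 12, 13, 14]  # labels == input_ids, nothing masked

tokens_counted_by_mask = sum(l != -100 for l in labels)  # TokenDataset-style count -> 4
targets_after_causal_shift = block_size - 1              # PretrainingBlockDataset count -> 3
print(tokens_counted_by_mask, targets_after_causal_shift)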


313-318: I encountered technical difficulties accessing the repository to complete the verification. However, I can provide an assessment based on the code snippet provided:

Memory loading pattern confirmed, but verification incomplete.

The code snippet does exhibit the memory concern described—it loads the entire dataset's input_ids into a single list during initialization. While this pattern is technically valid for smaller datasets, I cannot verify:

  • Target dataset sizes for this project
  • Whether this is a documented limitation or known issue
  • If alternative lazy-loading patterns exist elsewhere in the codebase
  • Real-world usage constraints and typical dataset sizes

Given that I cannot access the repository to validate the severity and context of this concern, manual verification from the development team is needed to confirm whether this is a blocker for your intended use cases.

@mergify mergify bot added documentation Improvements or additions to documentation ci-failure and removed ci-failure labels Nov 26, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (7)
tests/unit/test_pretraining_sampler.py (7)

306-335: Same parameter mismatch issue here.

This test also uses is_pretraining=False which doesn't match the actual function signature. When pretraining_config is None (or not provided), the function should use TokenDataset, so the test should simply omit the pretraining_config parameter or explicitly pass pretraining_config=None.


336-365: Same parameter mismatch issue: use pretraining_config instead.

Update to:

loader = get_data_loader(
    data_path=temp_pretraining_file,
    batch_size=2,
    max_tokens_per_gpu=1000,
    seed=42,
    rank=0,
    world_size=1,
    pretraining_config=PretrainingConfig(block_size=block_size),
    pad_token_id=0,
)

366-396: Same parameter mismatch issue.


397-432: Same parameter mismatch issue.


433-467: Same parameter mismatch issue.


468-507: Same parameter mismatch issue.


508-540: Same parameter mismatch issue.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27c9f55 and fba8ce6.

📒 Files selected for processing (10)
  • README.md
  • src/instructlab/training/__init__.py
  • src/instructlab/training/config.py
  • src/instructlab/training/data_process.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/sampler.py
  • tests/unit/test_data_process.py
  • tests/unit/test_pretraining_data_process.py
  • tests/unit/test_pretraining_mode.py
  • tests/unit/test_pretraining_sampler.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/instructlab/training/__init__.py
  • tests/unit/test_data_process.py
  • src/instructlab/training/config.py
🧰 Additional context used
🧬 Code graph analysis (5)
tests/unit/test_pretraining_data_process.py (1)
src/instructlab/training/data_process.py (1)
  • process_documents_for_pretraining (1140-1226)
tests/unit/test_pretraining_mode.py (3)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/data_process.py (2)
  • process_documents_for_pretraining (1140-1226)
  • main (1500-1517)
src/instructlab/training/sampler.py (3)
  • PretrainingBlockDataset (299-389)
  • get_lengths (385-389)
  • get_lengths (432-433)
src/instructlab/training/sampler.py (2)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/type_definitions.py (1)
  • CollatedItem (49-61)
tests/unit/test_pretraining_sampler.py (1)
src/instructlab/training/sampler.py (5)
  • PretrainingBlockDataset (299-389)
  • get_data_loader (436-499)
  • TokenDataset (392-433)
  • EpochSampler (26-55)
  • MaxTokensPerRankCollator (203-296)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/data_process.py (2)
  • process_documents_for_pretraining (1140-1226)
  • process_data (1454-1497)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (12)
src/instructlab/training/sampler.py (3)

5-24: LGTM!

The new imports and logger initialization are appropriate for the pretraining functionality.


299-390: LGTM!

The PretrainingBlockDataset implementation is well-designed:

  • Proper input validation for block_size, pad_token_id, and required columns
  • Correct concatenation and block-splitting logic
  • Appropriate handling of partial last blocks with padding and label masking
  • Clear separation between complete blocks (loss_tokens = block_size - 1) and partial blocks (loss_tokens = actual_len - 1)
  • Good error messages for index out of range

436-478: LGTM!

The integration of pretraining support into get_data_loader is clean and correct:

  • Accepts optional pretraining_config parameter
  • Conditionally creates PretrainingBlockDataset when config is provided
  • Falls back to TokenDataset for standard instruction-tuning
  • Includes informative logging for the pretraining path
README.md (1)

126-165: LGTM!

The continual pretraining mode documentation is clear and comprehensive:

  • Explains both CLI and programmatic configuration options
  • Provides practical examples
  • Notes the automatic invocation of process_documents_for_pretraining()
  • Well-integrated into the existing documentation structure
tests/unit/test_pretraining_mode.py (1)

1-157: LGTM!

The test suite for pretraining components is comprehensive and well-structured:

  • Validates block padding and loss token counting
  • Checks correct handling of partial blocks
  • Tests length metadata and alignment
  • Uses appropriate mocking for tokenizers and datasets
  • Covers end-to-end processing pipeline
tests/unit/test_pretraining_data_process.py (1)

1-441: LGTM!

Excellent test coverage for the pretraining data processing pipeline:

  • Validates BOS/EOS token handling
  • Tests error conditions (empty dataset, missing fields, tokenizer without EOS)
  • Verifies statistics calculation and logging
  • Checks parallel processing and output format
  • Includes integration test with real tokenizer
  • Comprehensive mocking strategy
src/instructlab/training/main_ds.py (4)

369-380: LGTM!

Correctly passes pretraining_config through to get_data_loader, enabling the pretraining dataset path when configured.


474-496: LGTM!

The conditional data processing flow is well-implemented:

  • Routes to process_documents_for_pretraining when pretraining_config is provided
  • Falls back to standard process_data for instruction-tuning
  • Properly passes through all required parameters including document_column_name

552-557: LGTM!

Correctly appends pretraining-specific flags to the training command when pretraining_config is present.


805-900: LGTM!

The CLI integration is well-designed:

  • Adds --block-size and --document-column-name arguments with clear help text
  • Validates that --document-column-name requires --block-size (line 889-890)
  • Properly constructs PretrainingConfig and assigns to args.pretraining_config
  • Handles the case when pretraining is not enabled (sets to None)
src/instructlab/training/data_process.py (2)

415-420: Explicitly setting return_dict=False is correct.

While the comment suggests this is for "newer versions of transformers," return_dict=False is actually the current default for apply_chat_template. The explicit parameter is defensive coding but harmless. This is consistent with the changes at line 693-694.


1140-1227: LGTM!

The process_documents_for_pretraining implementation is solid:

  • Proper validation of output directory and dataset
  • Clear error messages for missing/empty data
  • Validates tokenizer has EOS token
  • Guards against duplicate EOS tokens (lines 1194-1196)
  • Calculates and logs useful statistics
  • Clean separation from instruction-tuning processing

Note: A past review comment correctly identified that the docstring at line 1156 says {"documents": "text"} but should say {"document": "text"} to match the default document_column_name parameter and validation at line 1175.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
src/instructlab/training/main_ds.py (1)

890-901: LGTM with optional simplification.

The validation correctly ensures --document-column-name requires --block-size. The PretrainingConfig construction is correct.

The kwargs pattern works but could be slightly simplified since PretrainingConfig.document_column_name has a default:

🔎 Optional simplification
     if args.block_size is not None:
-        pretraining_kwargs = {}
-        if args.document_column_name is not None:
-            pretraining_kwargs["document_column_name"] = args.document_column_name
-        args.pretraining_config = PretrainingConfig(
-            block_size=args.block_size, **pretraining_kwargs
-        )
+        args.pretraining_config = PretrainingConfig(
+            block_size=args.block_size,
+            **({"document_column_name": args.document_column_name} if args.document_column_name else {}),
+        )
     else:
         args.pretraining_config = None

Or keep the current explicit form for readability—either approach is fine.

tests/unit/test_pretraining_sampler.py (2)

29-36: Consider using mock configuration methods.

While the current approach works, setting __len__ and __iter__ via lambda functions is unconventional. Consider using MagicMock's configuration methods for cleaner test setup.

Alternative mock setup approach
 @pytest.fixture
 def mock_hf_dataset(self, sample_pretraining_data):
     """Mock HuggingFace dataset."""
     mock_ds = MagicMock()
     mock_ds.column_names = ["input_ids", "len"]
-    mock_ds.__len__ = lambda self: len(sample_pretraining_data)
-    mock_ds.__iter__ = lambda self: iter(sample_pretraining_data)
+    mock_ds.__len__.return_value = len(sample_pretraining_data)
+    mock_ds.__iter__.return_value = iter(sample_pretraining_data)
     return mock_ds

519-520: Use Pythonic boolean assertion.

The assertion == True is not idiomatic Python. Use direct boolean evaluation instead.

Recommended change
-        # When num_workers > 0, persistent_workers should be True
-        assert loader.persistent_workers == True
+        # When num_workers > 0, persistent_workers should be True
+        assert loader.persistent_workers
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fba8ce6 and f57251b.

📒 Files selected for processing (3)
  • src/instructlab/training/main_ds.py
  • tests/unit/test_pretraining_data_process.py
  • tests/unit/test_pretraining_sampler.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_pretraining_data_process.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/instructlab/training/main_ds.py (3)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/logger.py (1)
  • setup_metric_logger (711-800)
src/instructlab/training/data_process.py (2)
  • process_documents_for_pretraining (1140-1226)
  • process_data (1454-1497)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (6)
src/instructlab/training/main_ds.py (6)

46-52: LGTM!

The PretrainingConfig import is correctly added alongside other config imports and is necessary for the new pretraining mode functionality.


367-379: LGTM!

The pretraining_config is correctly passed to get_data_loader. Using getattr with a default of None provides defensive handling, though args.pretraining_config is always set by the CLI parsing logic.


460-462: LGTM!

Good improvement—using train_args.logger_type and train_args.run_name instead of hardcoded defaults respects user configuration and aligns with the setup_metric_logger signature.


475-496: LGTM!

The conditional branching correctly routes to process_documents_for_pretraining when pretraining mode is enabled. The parameter mapping aligns with the function signature, and omitting max_seq_len is appropriate since blocking happens at training time per the function's documented behavior.
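
A sketch of the routing being approved here; the function names come from the PR, but the parameter lists are abbreviated and the bodies are stubs:

    def process_documents_for_pretraining(data_path, data_output_path, model_path, document_column_name):
        # Real version: tokenize each document, enforce EOS, write {"input_ids", "len"} records to JSONL.
        print(f"tokenizing column '{document_column_name}' from {data_path}")


    def process_data(train_args):
        # Existing instruction-tuning preprocessing, unchanged by the PR.
        print("standard chat-template preprocessing")


    def prepare_data(train_args):
        if train_args.pretraining_config is not None:
            # Fixed-size blocking is deferred to training time; only tokenization happens here,
            # which is why block_size is not passed along.
            process_documents_for_pretraining(
                data_path=train_args.data_path,
                data_output_path=train_args.data_output_dir,
                model_path=train_args.model_path,
                document_column_name=train_args.pretraining_config.document_column_name,
            )
        else:
            process_data(train_args)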


553-557: LGTM!

Pretraining arguments are correctly appended to the distributed training command when pretraining_config is present, with argument names matching the CLI definitions.
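
Concretely, the flag propagation is a two-line append guarded by the config. A self-contained sketch with made-up base command values:

    from dataclasses import dataclass


    @dataclass
    class PretrainingConfig:  # stand-in for the pydantic model
        block_size: int
        document_column_name: str = "document"


    command = ["torchrun", "--nnodes=1", "--nproc-per-node=8", "main_ds.py"]  # illustrative
    pretraining_config = PretrainingConfig(block_size=8192)  # assumed example value

    if pretraining_config is not None:
        command.append(f"--block-size={pretraining_config.block_size}")
        command.append(f"--document-column-name={pretraining_config.document_column_name}")

    print(" ".join(command))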


806-817: LGTM!

The CLI arguments are well-defined. When --block-size is provided without --document-column-name, the PretrainingConfig default of "document" will be used, which is the expected behavior.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f57251b and a354284.

📒 Files selected for processing (1)
  • src/instructlab/training/main_ds.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/instructlab/training/main_ds.py (3)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/logger.py (1)
  • setup_metric_logger (711-800)
src/instructlab/training/data_process.py (1)
  • process_documents_for_pretraining (1140-1226)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (7)
src/instructlab/training/main_ds.py (7)

49-49: LGTM!

The PretrainingConfig import is correctly added to expose the pretraining API.


368-379: LGTM!

The get_data_loader call correctly forwards the pretraining_config parameter using a safe getattr pattern with a default of None, which handles both pretraining and non-pretraining modes.


460-462: LGTM!

Replacing hardcoded logger configuration with train_args.logger_type and train_args.run_name improves configurability and flexibility.


476-496: LGTM!

The conditional data processing path cleanly separates pretraining and instruction-tuning workflows. The implementation correctly forwards document_column_name from pretraining_config and maintains backward compatibility for the non-pretraining path.


553-558: LGTM!

The command construction correctly propagates block_size and document_column_name to the distributed training subprocess when pretraining mode is enabled.


803-814: LGTM!

The new CLI arguments --block-size and --document-column-name are properly defined with clear help text and appropriate defaults for optional pretraining mode configuration.


887-898: LGTM!

The argument parsing correctly validates that --document-column-name requires --block-size, and the PretrainingConfig creation logic properly handles both explicit and default column names using the kwargs pattern.
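
The validation-plus-construction flow being approved can be reproduced with plain argparse. Help text and the example invocation below are paraphrased, not copied from the PR:

    import argparse

    from pydantic import BaseModel


    class PretrainingConfig(BaseModel):  # mirrors the model described in the PR summary
        block_size: int
        document_column_name: str = "document"


    parser = argparse.ArgumentParser()
    parser.add_argument("--block-size", type=int, default=None,
                        help="Fixed token-block size enabling continual pretraining mode.")
    parser.add_argument("--document-column-name", type=str, default=None,
                        help="Dataset column that holds the raw documents.")
    args = parser.parse_args(["--block-size", "8192"])  # example invocation

    if args.document_column_name is not None and args.block_size is None:
        parser.error("--document-column-name requires --block-size to be specified.")

    if args.block_size is not None:
        pretraining_kwargs = {}
        if args.document_column_name is not None:
            pretraining_kwargs["document_column_name"] = args.document_column_name
        args.pretraining_config = PretrainingConfig(block_size=args.block_size, **pretraining_kwargs)
    else:
        args.pretraining_config = None

    print(args.pretraining_config)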

Comment on lines 669 to 675
         return_code = process.wait(timeout=60)  # wait for 1 min or error
         if return_code == 0:
             logger.info("Operation completed successfully! 🎉")
         else:
             logger.error(
-                f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
+                f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {return_code}"
             )

⚠️ Potential issue | 🟠 Major

Fix the misleading error message and incorrect process handling logic.

The process.wait(timeout=60) call blocks until the subprocess exits (or raises TimeoutExpired). When it returns a non-zero return_code, the process has already exited with an error. However, the log message on line 674 incorrectly states "has not exited yet" and the subsequent terminate() call is unnecessary since the process is already dead.

🔎 Proposed fix
-        return_code = process.wait(timeout=60)  # wait for 1 min or error
-        if return_code == 0:
+        try:
+            return_code = process.wait(timeout=60)
+        except subprocess.TimeoutExpired:
+            logger.error(
+                "Training subprocess did not exit within 60s. Sending SIGTERM."
+            )
+            process.terminate()
+            try:
+                process.wait(timeout=60)
+            except subprocess.TimeoutExpired:
+                logger.error("Training subprocess did not terminate, sending SIGKILL.")
+                process.kill()
+            raise RuntimeError("Training subprocess timed out")
+        
+        if return_code == 0:
             logger.info("Operation completed successfully! 🎉")
+            return
         else:
             logger.error(
-                f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {return_code}"
+                f"Training subprocess exited with non-zero code: {return_code}"
             )
-
-        process.terminate()
-        try:
-            logger.info("Waiting for process to exit, 60s...")
-            process.wait(timeout=60)
-        except subprocess.TimeoutExpired:
-            logger.error(
-                "Training subprocess did not terminate before timeout, sending SIGKILL."
-            )
-            process.kill()
+            failure = True

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/instructlab/training/main_ds.py around lines 669-675, the code treats a
non-zero return value from process.wait(timeout=60) as "has not exited yet" and
calls terminate() incorrectly; instead, change the logic so that a non-zero
return_code is logged as a process-exited-with-error (include the return_code
and ideally any stderr output) and do NOT call terminate() in that branch; add
an explicit except subprocess.TimeoutExpired handler around
process.wait(timeout=60) that logs a timeout and then calls process.terminate()
(and optionally process.kill() after a short grace period) to handle processes
that truly do not exit within the timeout.
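
Detached from the diff above, the wait/terminate/kill escalation the prompt describes looks like this with a bare subprocess.Popen; the command and the 60-second timeouts are placeholders:

    import subprocess

    process = subprocess.Popen(["sleep", "2"])  # placeholder for the torchrun command

    try:
        return_code = process.wait(timeout=60)
    except subprocess.TimeoutExpired:
        # Only in this branch has the process genuinely failed to exit in time.
        process.terminate()
        try:
            process.wait(timeout=60)
        except subprocess.TimeoutExpired:
            process.kill()
        raise RuntimeError("training subprocess timed out") from None

    if return_code == 0:
        print("Operation completed successfully!")
    else:
        # The process has already exited; report the failure rather than sending signals.
        print(f"Training subprocess exited with non-zero code: {return_code}")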

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/instructlab/training/main_ds.py (1)

667-683: The process exit handling logic remains incorrect.

This issue was previously flagged: when process.wait(timeout=60) returns a non-zero return_code, the process has already exited with an error. The log message "has not exited yet" on line 672 is misleading, and the terminate() call on line 675 is unnecessary since the process is already dead.

The correct approach is to wrap wait() in a try-except for TimeoutExpired to handle actual timeout cases, and treat non-zero return codes as completed-but-failed processes.

🧹 Nitpick comments (1)
src/instructlab/training/main_ds.py (1)

885-896: Consider adding validation for block_size positivity.

The validation correctly ensures --document-column-name requires --block-size. However, there's no check that block_size is a positive integer. A zero or negative block size would likely cause issues during training.

🔎 Proposed validation
 if args.document_column_name is not None and args.block_size is None:
     parser.error("--document-column-name requires --block-size to be specified.")

 if args.block_size is not None:
+    if args.block_size <= 0:
+        parser.error("--block-size must be a positive integer.")
     pretraining_kwargs = {}
     if args.document_column_name is not None:
         pretraining_kwargs["document_column_name"] = args.document_column_name
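
A complementary option, not something this PR implements: enforce positivity on the model itself so every construction path (CLI or programmatic) is covered, using pydantic's built-in constraints:

    from pydantic import BaseModel, Field, ValidationError


    class PretrainingConfig(BaseModel):
        block_size: int = Field(gt=0)  # rejects zero and negative block sizes at validation time
        document_column_name: str = "document"


    PretrainingConfig(block_size=8192)  # ok
    try:
        PretrainingConfig(block_size=0)
    except ValidationError as err:
        print(err)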
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 699ca77 and 9c3905b.

📒 Files selected for processing (1)
  • src/instructlab/training/main_ds.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/instructlab/training/main_ds.py (2)
src/instructlab/training/config.py (1)
  • PretrainingConfig (80-91)
src/instructlab/training/data_process.py (2)
  • process_documents_for_pretraining (1140-1226)
  • process_data (1454-1497)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (5)
src/instructlab/training/main_ds.py (5)

46-52: LGTM!

The PretrainingConfig import is correctly added alongside other config imports and is properly used throughout the file for the new pretraining functionality.


367-379: LGTM!

The pretraining_config parameter is correctly passed to get_data_loader using getattr with a None default, ensuring backward compatibility when pretraining is not configured.


473-494: LGTM!

The conditional branching correctly routes to process_documents_for_pretraining when pretraining_config is set, passing the appropriate document_column_name. The block_size is intentionally omitted here since blocking/chunking happens during training as noted in the process_documents_for_pretraining docstring.


551-556: LGTM!

The pretraining configuration parameters are correctly appended to the torchrun command when pretraining_config is present, ensuring the subprocess receives the necessary configuration.


801-812: LGTM!

The new CLI arguments --block-size and --document-column-name are correctly defined with appropriate types, defaults, and descriptive help text that aligns with the PretrainingConfig fields.

@mergify mergify bot added the CI/CD (Affects CI/CD configuration) label on Dec 23, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/instructlab/training/main_ds.py (1)

667-673: Incorrect error handling for subprocess exit.

This issue was already identified in a previous review. The logic incorrectly treats a non-zero return code from process.wait() as "has not exited yet" when the process has actually already exited with an error. The subsequent terminate() call is unnecessary.

See the previous review comment for the detailed explanation and proposed fix.

🧹 Nitpick comments (1)
src/instructlab/training/main_ds.py (1)

574-575: Consider explicit check for mock_data_len > 0.

The truthiness check if train_args.mock_data_len: will skip appending the flag when mock_data_len is 0. While this is likely intentional (treating 0 as "not set"), an explicit comparison would be clearer:

if train_args.mock_data_len > 0:

This makes the intent more obvious and avoids potential confusion.

🔎 Proposed refactor
     if train_args.mock_data:
         command.append("--mock_data")
-        if train_args.mock_data_len:
+        if train_args.mock_data_len > 0:
             command.append(f"--mock_len={train_args.mock_data_len}")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c3905b and 5f8f2a5.

📒 Files selected for processing (3)
  • src/instructlab/training/config.py
  • src/instructlab/training/main_ds.py
  • tox.ini
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: unit: 3.11 on ubuntu-latest
  • GitHub Check: unit: 3.12 on ubuntu-latest
  • GitHub Check: unit: 3.13 on ubuntu-latest
  • GitHub Check: pylint
  • GitHub Check: Summary
🔇 Additional comments (10)
tox.ini (1)

107-107: LGTM! Necessary dependency for type checking pydantic models.

Adding pydantic to the mypy dependencies is required to properly type-check the new PretrainingConfig and other pydantic-based models introduced in this PR.

src/instructlab/training/config.py (3)

66-74: LGTM! Clean pretraining mode configuration.

The addition of is_pretraining and pretraining_column_name to DataProcessArgs provides a clear toggle for pretraining workflows with sensible defaults.
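
Per the PR summary, the toggle amounts to two new fields with defaults; a sketch with the rest of DataProcessArgs elided:

    from pydantic import BaseModel


    class DataProcessArgs(BaseModel):
        # ... existing preprocessing fields elided ...
        is_pretraining: bool = False
        pretraining_column_name: str = "document"


    print(DataProcessArgs())  # is_pretraining=False pretraining_column_name='document'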


80-92: LGTM! Well-defined pretraining configuration model.

The PretrainingConfig class is properly structured with:

  • Clear field descriptions
  • Sensible default for document_column_name
  • Required block_size to ensure explicit configuration

293-299: LGTM! Proper integration of pretraining config.

The pretraining_config field is correctly defined as optional with a descriptive explanation of when and how it's used. This maintains backward compatibility while enabling the new pretraining mode.
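
Taken together with the PretrainingConfig comment above, the opt-in surface looks roughly like this; the comments paraphrase the field descriptions and the rest of TrainingArgs is elided:

    from typing import Optional

    from pydantic import BaseModel


    class PretrainingConfig(BaseModel):
        block_size: int                          # tokens per fixed-size training block
        document_column_name: str = "document"   # dataset column holding raw text


    class TrainingArgs(BaseModel):
        # ... existing training fields elided ...
        pretraining_config: Optional[PretrainingConfig] = None  # None keeps the instruction-tuning path


    ta = TrainingArgs(pretraining_config=PretrainingConfig(block_size=8192))
    print(ta.pretraining_config.block_size)  # -> 8192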

src/instructlab/training/main_ds.py (6)

49-49: LGTM! Necessary import for pretraining support.

Importing PretrainingConfig enables the pretraining configuration to flow through the training pipeline.


474-494: LGTM! Clean conditional data processing logic.

The branching logic correctly routes to process_documents_for_pretraining when pretraining_config is present, passing the appropriate document_column_name from the config. The fallback to standard process_data maintains backward compatibility.


551-555: LGTM! Proper command construction for pretraining mode.

The conditional appending of --block-size and --document-column-name correctly propagates the pretraining configuration to the subprocess training command.


801-812: LGTM! Well-documented CLI arguments for pretraining.

The new --block-size and --document-column-name arguments are properly defined with clear help text explaining their purpose for continual pretraining.


885-896: LGTM! Correct CLI validation logic.

The validation properly enforces that --document-column-name requires --block-size to be specified, while allowing --block-size alone (which uses the default document column name). The PretrainingConfig construction handles both cases correctly.


378-378: Remove unnecessary defensive getattr() access.

The getattr(args, "pretraining_config", None) is defensive but unnecessary. The CLI parsing logic (lines 891-896) guarantees pretraining_config is always set on args before main(args) is called—either as a PretrainingConfig instance or None. Use direct attribute access args.pretraining_config instead for clarity.


Labels: CI/CD (Affects CI/CD configuration), ci-failure, documentation (Improvements or additions to documentation), testing (Relates to testing)
