Exposes API for processing pretraining data #672
base: main
Conversation
📝 Walkthrough

Adds continual pretraining: new PretrainingConfig and TrainingArgs field, document-to-JSONL preprocessing, PretrainingBlockDataset and loader branch for fixed-size block sampling, CLI flags for block-size/document-column-name, wiring through main orchestration, and unit tests for the new flow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant CLI as CLI
participant Main as main_ds.train
participant Proc as data_process.process_documents_for_pretraining
participant Tok as AutoTokenizer
participant Loader as sampler.get_data_loader
participant Trainer as torchrun/Trainer
CLI->>Main: parse args (may include --block-size, --document-column-name)
Main->>Main: set train_args.pretraining_config (optional)
alt pretraining_config provided
Main->>Proc: process_documents_for_pretraining(data_path, out_path, model_path, ...)
Proc->>Tok: AutoTokenizer.from_pretrained(model_path)
Tok-->>Proc: tokenizer (must have eos_token_id)
loop per document
Proc->>Tok: tokenizer(doc, add_special_tokens=True, return_dict=False)
Tok-->>Proc: input_ids
Proc->>Proc: append EOS if missing, write record to data.jsonl
end
Proc-->>Main: data.jsonl ready
Main->>Loader: get_data_loader(..., pretraining_config)
Loader-->>Main: DataLoader(PretrainingBlockDataset)
Main->>Trainer: assemble torchrun with --block-size & --document-column-name
else standard training
Main->>Main: process_data(...) (existing pipeline)
Main->>Loader: get_data_loader(..., pretraining_config=None)
Loader-->>Main: DataLoader(TokenDataset)
Main->>Trainer: assemble standard torchrun
end
Main->>Trainer: launch distributed training
Trainer-->>Main: exit (return code)
```

```mermaid
sequenceDiagram
autonumber
participant Caller as process_documents_for_pretraining()
participant FS as FileSystem
participant Tok as AutoTokenizer
Note over Caller: per-document tokenize -> jsonl output
Caller->>FS: load JSON dataset
FS-->>Caller: dataset
Caller->>Tok: AutoTokenizer.from_pretrained(model_path)
Tok-->>Caller: tokenizer (must have eos_token_id)
loop per document
Caller->>Tok: tokenizer(doc, add_special_tokens=True, return_dict=False)
Tok-->>Caller: input_ids
Caller->>Caller: ensure EOS present
Caller->>FS: append `{ "input_ids": [...], "len": N }` to data.jsonl
end
Caller->>FS: finalize/write file
Caller-->>Caller: log totals and averages
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/instructlab/training/data_process.py (2)
1202-1204: Prefer lazy formatting in logger calls. Using f-strings in `logger.info()` evaluates the string even when the log level is disabled. Use lazy formatting for better performance.

```diff
- logger.info(f"Processed {len(tokenized_data):,} documents")
- logger.info(f"Total tokens: {total_tokens:,}")
- logger.info(f"Average tokens per document: {avg_tokens:.1f}")
+ logger.info("Processed %s documents", f"{len(tokenized_data):,}")
+ logger.info("Total tokens: %s", f"{total_tokens:,}")
+ logger.info("Average tokens per document: %.1f", avg_tokens)
```

1217-1218: Same lazy formatting suggestion for these logger calls.

```diff
- logger.info(f"Saved tokenized documents to {output_file}")
+ logger.info("Saved tokenized documents to %s", output_file)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/instructlab/training/config.py (1 hunks)
- src/instructlab/training/data_process.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (4)
src/instructlab/training/config.py (2)
66-70: LGTM! The `is_pretraining` field is well-defined with a clear description that explains its purpose.

75-80: LGTM! The validator correctly enforces mutual exclusivity between `is_pretraining` mode and `chat_tmpl_path`, which aligns with the pretraining workflow that tokenizes raw documents without chat templates.

src/instructlab/training/data_process.py (2)

1156-1179: LGTM! Good defensive programming with proper validation for write permissions, dataset format, field requirements, and tokenizer EOS token.
1136-1141: Verify integration: `is_pretraining` flag not wired to this function. The `DataProcessArgs.is_pretraining` flag was added in `config.py`, but neither `main()` nor `process_data()` in this file dispatches to `process_documents_for_pretraining` based on that flag. Users would need to call this function directly.

If this is intentional (exposing API only), the approach is fine. Otherwise, consider adding integration in `process_data()`:

```python
def process_data(..., is_pretraining: bool = False):
    if is_pretraining:
        process_documents_for_pretraining(
            data_path=data_path,
            data_output_path=data_output_path,
            model_path=model_path,
            num_cpu_procs=num_cpu_procs,
        )
    elif chat_tmpl_path:
        # existing legacy path
        ...
```
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/instructlab/training/data_process.py (1)
1152-1152: Fix documentation: field name should be `document` (singular). The docstring says `{"documents": "text"}` but the code validates for `'document'` (singular) at Line 1171 and the parameter defaults to `"document"`. This inconsistency will confuse API users.

```diff
- data_path: Path to input JSONL with {"documents": "text"} format
+ data_path: Path to input JSONL with {"document": "text"} format
```
🧹 Nitpick comments (1)
src/instructlab/training/data_process.py (1)
1176-1180: Consider adding error handling for tokenizer loading and validating BOS token. For consistency with the dataset loading pattern (lines 1161-1166), consider wrapping the tokenizer loading in a try/except. Additionally, since the docstring at Line 1149 documents that documents follow the `[BOS][tokens][EOS]` pattern, consider validating that `tokenizer.bos_token_id` exists, similar to the EOS validation.

```diff
 logger.info("Loading tokenizer from %s", model_path)
-tokenizer = AutoTokenizer.from_pretrained(model_path)
+try:
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+except Exception as e:
+    raise ValueError(
+        f"Failed to load tokenizer from {model_path}"
+    ) from e
+if tokenizer.bos_token_id is None:
+    logger.warning("Tokenizer does not have a BOS token defined")
+
 if tokenizer.eos_token_id is None:
     raise ValueError("Tokenizer must have an EOS token defined for pretraining")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/instructlab/training/config.py (1 hunks)
- src/instructlab/training/data_process.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/instructlab/training/config.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/data_process.py (1)
1190-1192: Good defensive guard against duplicate EOS tokens. The guard correctly prevents duplicate EOS tokens, addressing the concern from the previous review. This handles tokenizers that may or may not automatically append EOS with `add_special_tokens=True`.
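For readers following along, a minimal sketch of the guard pattern being praised here (this paraphrases the idea rather than quoting the PR's code; `"gpt2"` is used purely as an example checkpoint):

```python
from transformers import AutoTokenizer

# Example checkpoint chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
doc_text = "An example document."

input_ids = tokenizer.encode(doc_text, add_special_tokens=True)
# Append EOS only if the tokenizer did not already add it,
# so each document ends with exactly one EOS token.
if input_ids and input_ids[-1] != tokenizer.eos_token_id:
    input_ids.append(tokenizer.eos_token_id)
```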
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/instructlab/training/config.py (1)
66-75: Pretraining config fields look sound; consider minor naming and doc tightening. The new `is_pretraining` and `pretraining_column_name` fields are backward-compatible and clearly scoped to pretraining, which is good. A couple of small polish ideas:

- The naming between config (`pretraining_column_name`) and the pretraining helper (`document_column_name` in `process_documents_for_pretraining(...)` per the summary) is slightly inconsistent. Consider aligning them (e.g., `document_column_name` everywhere) to reduce cognitive overhead when wiring configs to function parameters.
- You may want the `pretraining_column_name` description to explicitly mention the default and mode dependency, e.g., "Name of the dataset column containing raw documents to pretrain on (default: `document`). Used only when `is_pretraining` is True."

These are non-blocking but would make the API a bit clearer for users and future maintainers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/instructlab/training/config.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/instructlab/training/data_process.py (2)
1163-1177: Validate document contents (None/empty/non-string) before tokenization. `tokenize_document` assumes `sample[document_column_name]` is a non-empty string. If a row has `None`, a non-string, or an empty/whitespace-only value, `tokenizer.encode(...)` and especially `input_ids[-1]` can fail or produce meaningless sequences.

Consider validating and failing fast with a clear error:

```diff
     def tokenize_document(sample):
-        input_ids = tokenizer.encode(
-            sample[document_column_name], add_special_tokens=True
-        )
+        doc_text = sample[document_column_name]
+        if doc_text is None or not isinstance(doc_text, str) or not doc_text.strip():
+            raise ValueError(
+                f"Document in column '{document_column_name}' is missing, non-string, or empty"
+            )
+
+        input_ids = tokenizer.encode(doc_text, add_special_tokens=True)
@@
         # ensures eos token is present without double-adding it.
-        if input_ids[-1] != tokenizer.eos_token_id:
+        if input_ids and input_ids[-1] != tokenizer.eos_token_id:
             input_ids.append(tokenizer.eos_token_id)
```

This makes the API more robust to dirty data and avoids opaque tokenizer/indexing errors during large runs.

Also applies to: 1186-1199
1145-1159: Fix docstring to match `document_column_name` and actual schema. The docstring currently documents `{"documents": "text"}`, but the function defaults `document_column_name="document"` and later validates that this exact field exists.

Recommend aligning the docstring with the actual API:

```diff
     Args:
-        data_path: Path to input JSONL with {"documents": "text"} format
+        data_path: Path to input JSONL with {"document": "text"} format
         data_output_path: Directory for processed data output
         model_path: Path to model/tokenizer
         num_cpu_procs: Number of parallel processes
         document_column_name: Name of the column containing the documents
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/instructlab/training/data_process.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: pylint
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: Summary
🔇 Additional comments (1)
src/instructlab/training/data_process.py (1)
691-693: Return type verification confirms the code change is correct. The web search confirms that `tokenizer.apply_chat_template(..., return_dict=False)` with the default `tokenize=True` returns a `list[int]` (or `list[list[int]]` for batched inputs) rather than a dictionary. This flat token sequence is exactly what's needed for the subsequent indexing and region-masking logic that follows in the unmasking implementation.
```diff
 data_with_input_ids = data.map(
     lambda x: {
-        "input_ids": tokenizer.apply_chat_template(x["messages"], tokenize=True),
+        # newer versions of transformers have `return_dict=True` by default
+        "input_ids": tokenizer.apply_chat_template(x["messages"], tokenize=True, return_dict=False),
         "unmask": bool(x["unmask"]) if "unmask" in x else False,
     },
```
🧩 Analysis chain
transformers 4.55.0 PreTrainedTokenizer apply_chat_template default return_dict behavior
<web_search_result>
Short answer: In transformers 4.55.0, PreTrainedTokenizer.apply_chat_template defaults return_dict=False. When you set return_dict=True the method returns the usual tokenizer output dict (input_ids, attention_mask, pixel_values, etc. where applicable); with return_dict=False it returns the token id list (or token array) ready for the model. [1][2][3]
Sources
- Transformers tokenizer docs (apply_chat_template signature & param default return_dict=False). [1]
- Chat-template docs / internal tokenizer docs describing return_dict behavior and pixel_values handling. [2]
- v4.55.0 examples showing usage with return_dict=True. [3]
</web_search_result>
Script executed:
```bash
#!/bin/bash
# Search for other usages of apply_chat_template in the codebase to understand context
rg -n "apply_chat_template" --type=py -A2 -B2
```

Repository: instructlab/training
<shell_result>
Repository: instructlab/training
Script execution failed with unrecoverable error: [CAPABILITY:RUN_SHELL_SCRIPT] Failed to clone repository instructlab/training: Clone operation failed
</shell_result>
transformers apply_chat_template return_dict default history earlier versions before 4.55.0
<web_search_result>
Short answer: Historically apply_chat_template defaulted to tokenize=True and return_dict=False (so it usually returned token ids or a string). Over a series of changes (notably PR/commit referenced as #36149) callers from Processor/ProcessorMixin started invoking tokenizer.apply_chat_template with tokenize=False and return_dict=False, which effectively forced non-tokenized string output and made options that require tokenization/return_dict (e.g. return_assistant_tokens_mask=True) unusable. See the changelog/issues and docs for details. [1][2][3]
Sources
- Issue describing the behavior change and how #36149 hard-codes tokenize=False and return_dict=False (blocks return_assistant_tokens_mask). [1]
- apply_chat_template signature/docs showing return_dict default and behavior (return_dict=False by default; return_assistant_tokens_mask requires return_dict=True/tokenize). [2]
- Hugging Face chat-templating docs / examples describing apply_chat_template usage and generation-mask behavior. [3]
</web_search_result>
Clarify the rationale: return_dict=False is already the default, not a future-proofing measure.
The inline comment suggests that "newer versions of transformers have return_dict=True by default," but apply_chat_template defaults to return_dict=False in transformers 4.55.0 and historically has done so. The explicit parameter here is redundant but harmless.
If this is intended as defensive coding against hypothetical future changes, consider clarifying or removing the comment to avoid misleading future readers. If the intent is to document current behavior, the comment should state that the parameter preserves the existing default rather than implying a version-specific workaround.
🤖 Prompt for AI Agents
In src/instructlab/training/data_process.py around lines 413 to 418, the inline
comment claiming "newer versions of transformers have `return_dict=True` by
default" is misleading; update the comment to reflect the actual intent: either
remove the comment entirely (since `return_dict=False` is the current default
for apply_chat_template) or replace it with a concise note like "explicitly set
return_dict=False to preserve current behavior" so readers understand this is
preserving existing behavior rather than referencing a specific version change.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/instructlab/training/config.py (1)
80-88: Add "public API" comment for consistency.Other public config classes in this file (e.g.,
DataProcessArgs,TorchrunArgs,LoraOptions,TrainingArgs) are marked with# public APIcomments. Consider adding this marker toPretrainingConfigfor consistency since it's exposed viaTrainingArgs.+# public API class PretrainingConfig(BaseModel): """ Configuration for pretraining mode. """src/instructlab/training/sampler.py (1)
382-387: Clarify `get_lengths()` vs `__getitem__["len"]` semantics. `get_lengths()` returns the actual token count for the last block (`last_block_len`), while `__getitem__` always returns `"len": self.block_size` (the padded length). This inconsistency may be intentional:

- `"len"` in samples → padded size for batch allocation
- `get_lengths()` → actual tokens for statistics

Consider adding a docstring to `get_lengths()` clarifying it returns actual token counts, not padded lengths, to prevent confusion.

```diff
 def get_lengths(self) -> np.ndarray:
+    """Return actual token counts per block (not padded lengths)."""
     lengths = np.full(self.num_blocks, self.block_size, dtype=np.int64)
     if self.num_blocks and self.last_block_len != self.block_size:
         lengths[-1] = self.last_block_len
     return lengths
```

tests/unit/test_pretraining_mode.py (1)
102-153: Consider adding edge case tests. The test validates the happy path but could benefit from additional edge cases:

- Empty document handling
- Documents with existing EOS tokens (to verify no double-adding)
- Custom `document_column_name` parameter

These can be added in a follow-up PR if desired.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- src/instructlab/training/config.py (2 hunks)
- src/instructlab/training/main_ds.py (7 hunks)
- src/instructlab/training/sampler.py (5 hunks)
- tests/unit/test_pretraining_mode.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/unit/test_pretraining_mode.py (3)
- src/instructlab/training/config.py (1): PretrainingConfig (80-87)
- src/instructlab/training/data_process.py (1): process_documents_for_pretraining (1138-1224)
- src/instructlab/training/sampler.py (4): PretrainingBlockDataset (298-386), get_data_loader (433-496), get_lengths (382-386), get_lengths (429-430)

src/instructlab/training/sampler.py (1)
- src/instructlab/training/config.py (1): PretrainingConfig (80-87)

src/instructlab/training/main_ds.py (2)
- src/instructlab/training/config.py (1): PretrainingConfig (80-87)
- src/instructlab/training/data_process.py (2): process_documents_for_pretraining (1138-1224), process_data (1452-1495)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (15)
src/instructlab/training/config.py (2)
66-75: LGTM! The new pretraining fields are well-defined with sensible defaults and clear descriptions. The `is_pretraining` flag and configurable column name provide good flexibility for the pretraining workflow.
277-283: LGTM! The `pretraining_config` field is correctly defined as `Optional` with a `None` default, allowing pretraining to be an opt-in feature. The description clearly explains the purpose.
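For illustration, enabling this opt-in field might look roughly like the sketch below. It assumes `PretrainingConfig` is importable from the package root, as the `__init__.py` changes in this PR suggest; all values are placeholders.

```python
# A minimal sketch, not a full training setup.
from instructlab.training import PretrainingConfig

pretraining = PretrainingConfig(
    block_size=2048,                  # required: fixed block length used for sampling
    document_column_name="document",  # optional: defaults to "document"
)

# This object would then be assigned to TrainingArgs.pretraining_config;
# the remaining TrainingArgs fields are not shown in this PR excerpt.
print(pretraining)
```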
src/instructlab/training/sampler.py (5)

4-22: LGTM! The new imports are appropriately organized and the logger is correctly initialized at module level.
340-370: Verify labels alignment for causal LM training. For full blocks (lines 361-363), `labels` is set identical to `input_ids`. This pattern relies on the model internally shifting labels during loss computation (common in HuggingFace causal LM models). Confirm this is the expected behavior for the models used with this training library.

Additionally, `loss_tokens = self.block_size - 1` correctly accounts for the first token having no prediction target.
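For context, the shift referred to here is the standard causal-LM convention, sketched below. This mirrors the usual next-token loss pattern, not code from this PR:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard causal-LM loss: position t predicts token t+1, so labels == input_ids
    still yields block_size - 1 supervised positions after the shift."""
    shift_logits = logits[:, :-1, :]   # drop the prediction made at the last position
    shift_labels = labels[:, 1:]       # drop the first token (it has no prediction target)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # padded positions are excluded from the loss
    )
```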
372-380: LGTM! The `from_jsonl_file` factory method correctly loads the dataset and delegates to the constructor.

444-474: LGTM! The `get_data_loader` function correctly branches between `PretrainingBlockDataset` and `TokenDataset` based on the presence of `pretraining_config`. The logging provides helpful context about the pretraining configuration.
298-338: I encountered a repository clone failure, which prevents me from verifying the codebase patterns. However, I can analyze the review comment based on the provided code snippet and context.

Review comment analysis:

The review comment raises a valid technical concern:

- The code does concatenate all `input_ids` into a single Python list (lines 312-317)
- This is indeed memory-intensive for large datasets (billions of tokens)
- The comment appropriately acknowledges this is acceptable for manageable sizes
- It flags this as an optional refactor for future iteration
The concern is technically sound and the suggestion is reasonable. However, without access to the full codebase, I cannot verify:
- Whether existing patterns for memory-mapped arrays or lazy loading are already implemented elsewhere
- What the expected dataset size ranges are in practice
- Whether this concern has already been addressed in other parts of the codebase
Since verification cannot be completed through automated checks, the review comment should be flagged for manual verification.
Memory usage consideration for large datasets.
The implementation concatenates all `input_ids` into a single Python list in memory (lines 312-317). For very large pretraining corpora (billions of tokens), this could lead to high memory consumption during dataset initialization.

For now, this is acceptable if the expected dataset sizes are manageable. If memory becomes an issue, consider lazy loading with memory-mapped arrays or chunked processing in a future iteration.
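One possible shape for the suggested refactor, purely as an illustrative sketch (the helper name and the on-disk layout are assumptions, not part of this PR):

```python
import numpy as np

def blocks_from_memmap(token_file: str, block_size: int) -> np.ndarray:
    """Illustrative alternative: read a pre-flattened token stream from disk via
    np.memmap instead of concatenating Python lists in RAM, then view it as
    fixed-size blocks. Assumes tokens were saved as a flat int32 array."""
    tokens = np.memmap(token_file, dtype=np.int32, mode="r")
    n_full = len(tokens) // block_size  # the trailing partial block would be padded separately
    return tokens[: n_full * block_size].reshape(n_full, block_size)
```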
tests/unit/test_pretraining_mode.py (2)
24-57: LGTM! Good test coverage for `PretrainingBlockDataset`. The test correctly validates:
- Block creation with concatenated tokens
- Padding behavior for the partial last block
- Loss token counting for full vs partial blocks
- `get_lengths()` returning actual token counts

59-100: LGTM! Comprehensive integration test for the pretraining data loader. The test correctly validates the packed batch structure, sample counts, and loss token aggregation.
src/instructlab/training/main_ds.py (6)
51-52: LGTM! The `PretrainingConfig` import is correctly added alongside related config imports.

366-378: LGTM! The `pretraining_config` is correctly passed to `get_data_loader` using defensive `getattr` access, which handles both CLI-invoked and programmatic calls gracefully.

542-545: LGTM! The `--block-size` flag is correctly appended to the subprocess command when pretraining mode is enabled.

788-793: LGTM! The `--block-size` CLI argument is well-defined with a clear help message explaining its purpose.

842-845: LGTM! The CLI argument is correctly parsed into a `PretrainingConfig` instance when provided, maintaining the same pattern as other config objects.
470-489: I apologize for the technical difficulties with repository access. I'm unable to execute verification scripts due to consistent cloning failures. However, based on the information provided in the review comment itself, I can provide a rewritten version:

Document column name parameter is not exposed through `TrainingArgs`.

The call to `process_documents_for_pretraining` doesn't pass the `document_column_name` parameter (which defaults to `"document"`), and this parameter is not configurable through `TrainingArgs`. This prevents users of `run_training()` from customizing the column name for pretraining workflows.

Consider either:

- Adding `pretraining_document_column_name` to `PretrainingConfig` or `TrainingArgs` to allow customization, or
- Documenting this as an intentional limitation if simplicity of the API is preferred.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/instructlab/training/data_process.py (3)
693-694: Same misleading comment about `return_dict` default. This is the same issue as at line 415. Consider updating or removing the comment.

1155-1156: Docstring field name mismatch: should be `document` (singular). The docstring says `{"documents": "text"}` but the code validates for `'document'` (the `document_column_name` default). This was flagged in a prior review and remains unfixed.

```diff
-        data_path: Path to input JSONL with {"documents": "text"} format
+        data_path: Path to input JSONL with {"document": "text"} format
```

1189-1201: Missing validation for empty or None documents. The `tokenize_document` function doesn't validate that `sample[document_column_name]` is non-empty before tokenization. Empty strings produce minimal `[BOS][EOS]` tokens, and `None` values will cause `tokenizer.encode()` to fail. This was flagged in a prior review.

```diff
     def tokenize_document(sample):
+        doc_text = sample[document_column_name]
+        if not doc_text or not isinstance(doc_text, str) or not doc_text.strip():
+            raise ValueError(
+                f"Document at column '{document_column_name}' is empty or invalid"
+            )
+
         input_ids = tokenizer.encode(
-            sample[document_column_name], add_special_tokens=True
+            doc_text, add_special_tokens=True
         )
```
🧹 Nitpick comments (5)
src/instructlab/training/data_process.py (3)
415-418: Comment is misleading: `return_dict=False` is already the default. The comment states newer transformers versions have `return_dict=True` by default, but `apply_chat_template` has always defaulted to `return_dict=False`. The explicit parameter is harmless but the comment is incorrect.

```diff
-            # newer versions of transformers have `return_dict=True` by default
+            # explicitly set return_dict=False to ensure we get a list of token ids
             "input_ids": tokenizer.apply_chat_template(
                 x["messages"], tokenize=True, return_dict=False
             ),
```

1210-1215: Use lazy string formatting for logger calls. The f-string formatting happens unconditionally even if the log level would suppress the message. Use %-style formatting for performance.

```diff
-    logger.info(f"Processed {len(tokenized_data):,} documents")
-    logger.info(f"Total tokens: {total_tokens:,}")
-    logger.info(f"Average tokens per document: {avg_tokens:.1f}")
+    logger.info("Processed %s documents", f"{len(tokenized_data):,}")
+    logger.info("Total tokens: %s", f"{total_tokens:,}")
+    logger.info("Average tokens per document: %.1f", avg_tokens)
```

1225-1226: Same f-string formatting in logger calls.

```diff
-    logger.info(f"Saved tokenized documents to {output_file}")
+    logger.info("Saved tokenized documents to %s", output_file)
```

tests/unit/test_pretraining_mode.py (1)
112-115: StubTokenizer.encode logic may be inverted. The stub returns `base` when `add_special_tokens=True` and `base[1:]` when `False`. This implies BOS is always at index 0, but for a stub that doesn't actually prepend a BOS token (it just uses char ordinals), this is semantically inconsistent. However, since the test just validates the pipeline flow rather than tokenizer correctness, this is acceptable for test purposes.

Consider adding a comment clarifying the stub behavior:

```diff
     def encode(self, text, add_special_tokens=True):
+        # Stub: simulate BOS by including first char only when add_special_tokens=True
         base = [ord(ch) % 50 + 1 for ch in text]
         return base if add_special_tokens else base[1:]
```

src/instructlab/training/sampler.py (1)
375-383: Consider adding error handling for better user experience. The `from_jsonl_file` method lacks explicit error handling. If the file is missing or malformed, users will receive raw exceptions from `load_dataset`. Consider adding try/except blocks with clearer error messages.

Example:

```diff
 @classmethod
 def from_jsonl_file(
     cls,
     data_path: str,
     block_size: int,
     pad_token_id: int,
 ) -> "PretrainingBlockDataset":
-    dataset = load_dataset("json", data_files=data_path, split="train")
+    try:
+        dataset = load_dataset("json", data_files=data_path, split="train")
+    except FileNotFoundError:
+        raise FileNotFoundError(f"Pretraining data file not found: {data_path}")
+    except Exception as e:
+        raise ValueError(f"Failed to load pretraining data from {data_path}: {e}")
     return cls(dataset, block_size, pad_token_id)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- src/instructlab/training/__init__.py (2 hunks)
- src/instructlab/training/config.py (2 hunks)
- src/instructlab/training/data_process.py (3 hunks)
- src/instructlab/training/main_ds.py (7 hunks)
- src/instructlab/training/sampler.py (5 hunks)
- tests/unit/test_pretraining_mode.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
src/instructlab/training/__init__.py (1)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)

tests/unit/test_pretraining_mode.py (3)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/data_process.py (2): process_documents_for_pretraining (1140-1226), main (1500-1517)
- src/instructlab/training/sampler.py (4): PretrainingBlockDataset (299-389), get_data_loader (436-499), get_lengths (385-389), get_lengths (432-433)

src/instructlab/training/sampler.py (3)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/padded_batch_packer.py (1): batch_lengths_to_minibatches_padded (200-244)
- src/instructlab/training/type_definitions.py (1): CollatedItem (49-61)

src/instructlab/training/main_ds.py (2)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/data_process.py (2): process_documents_for_pretraining (1140-1226), process_data (1454-1497)

src/instructlab/training/data_process.py (1)
- tests/unit/test_pretraining_mode.py (1): encode (113-115)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (19)
src/instructlab/training/data_process.py (1)
1194-1196: Good: EOS guard now prevents duplicate EOS tokens. The guard `if input_ids[-1] != tokenizer.eos_token_id` correctly prevents double-appending EOS. This addresses the previous review feedback.

src/instructlab/training/__init__.py (1)

13-14: LGTM: PretrainingConfig correctly exposed in public API. The new `PretrainingConfig` is properly added to `__all__` and imported from `.config`, making it available for external users.

Also applies to: 27-28
src/instructlab/training/config.py (3)
80-92: LGTM: PretrainingConfig is well-defined. The `PretrainingConfig` class has a required `block_size` field (no default forces explicit specification) and a sensible default for `document_column_name`. Good design for a configuration that requires explicit opt-in.

281-287: LGTM: Optional pretraining_config integration in TrainingArgs. The `pretraining_config` field is correctly typed as `Optional[PretrainingConfig]` with a `None` default, allowing backward compatibility while enabling pretraining mode when explicitly provided.

66-75: I'll verify whether the `is_pretraining` and `pretraining_column_name` fields are actually used in the codebase or represent dead code.
Script executed:

```bash
#!/bin/bash
# Search for all usages of is_pretraining and pretraining_column_name
rg -n "is_pretraining|pretraining_column_name" --type=py -C3
```

Script executed:

```bash
#!/bin/bash
# Check for process_documents_for_pretraining function and its usage
rg -n "process_documents_for_pretraining|def process_" --type=py -C5
```

tests/unit/test_pretraining_mode.py (3)
27-60: LGTM: Comprehensive test for `PretrainingBlockDataset`. The test correctly validates:

- Block count calculation (7 tokens → 2 blocks of size 4)
- First block: full block with `num_loss_counted_tokens = block_size - 1 = 3`
- Second block: partial with padding, correct labels (-100 for pad), and `num_loss_counted_tokens = 2`
- `get_lengths()` returns actual token counts per block
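As a concrete check of that arithmetic, here is a small worked sketch using the values from the bullet list above (everything else is hypothetical, not the project's test code):

```python
# Worked example of the block math described above.
tokens = list(range(1, 8))   # 7 tokens total
block_size = 4
pad_id = 0

blocks = [tokens[i : i + block_size] for i in range(0, len(tokens), block_size)]
# -> [[1, 2, 3, 4], [5, 6, 7]]  => 2 blocks
first, last = blocks

# Full block: every position except the first has a prediction target.
assert len(first) == block_size and block_size - 1 == 3

# Partial block: pad to block_size, mask padding in labels with -100, count 2 loss tokens.
padded = last + [pad_id] * (block_size - len(last))
labels = last + [-100] * (block_size - len(last))
assert padded == [5, 6, 7, 0] and labels == [5, 6, 7, -100]
assert len(last) - 1 == 2
```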
63-104: LGTM: Good integration test for pretraining data loader. The test validates:

- `PretrainingBlockDataset` is correctly instantiated when `pretraining_config` is provided
- Batch structure includes expected keys and shapes
- Loss token counting is propagated correctly

106-157: LGTM: End-to-end test for document processing pipeline. The test properly:

- Mocks `AutoTokenizer.from_pretrained` to isolate the tokenization logic
- Validates output JSONL structure (input_ids, len fields)
- Confirms EOS token is appended correctly
src/instructlab/training/main_ds.py (6)
49-52: LGTM: PretrainingConfig import added correctly.
366-378: LGTM: Data loader integration with pretraining config. The `get_data_loader` call correctly passes `pretraining_config` using `getattr` with a `None` fallback, ensuring backward compatibility when the attribute is not present.

470-490: LGTM: Correct branching for pretraining data processing. When `pretraining_config` is provided, the code correctly calls `process_documents_for_pretraining` with the appropriate parameters including `document_column_name` from the config. Otherwise, it falls back to the standard `process_data` path.

543-547: LGTM: Command construction propagates pretraining flags. The CLI arguments `--block-size` and `--document-column-name` are correctly appended when `pretraining_config` is provided.

790-801: LGTM: CLI arguments for pretraining mode. New `--block-size` and `--document-column-name` arguments are correctly defined with appropriate help text.

850-861: LGTM: Validation and PretrainingConfig construction from CLI args. Good validation that `--document-column-name` requires `--block-size`. The `PretrainingConfig` is correctly constructed only when `block_size` is provided, and `document_column_name` is included via kwargs when specified.

src/instructlab/training/sampler.py (5)
5-5: LGTM: Clean imports for pretraining support. The added imports and logger initialization are appropriate for the new pretraining functionality.
Also applies to: 8-8, 17-17, 23-24
385-389: LGTM: Correct length calculation. The `get_lengths` method correctly handles both uniform blocks and the partial last block.

447-447: LGTM: Clean integration of pretraining support. The conditional logic cleanly separates pretraining and fine-tuning dataset construction. The logging is helpful, and backward compatibility is maintained with the optional parameter.

Also applies to: 462-462, 467-478
362-362: I'm unable to complete the verification task as the repository failed to clone. Without direct access to the codebase, I cannot examine the referenced code sections to verify the loss token counting logic.

However, based on the review comment itself, which is well-reasoned and identifies a specific concern about consistency in loss token counting between `PretrainingBlockDataset` (lines 362, 366) and `TokenDataset` (lines 421-423), this requires manual verification by someone with access to the repository.

Loss token counting consistency requires verification against model loss computation.

The comment correctly identifies that for causal language modeling, N tokens should produce N-1 predictions. However, the consistency between `PretrainingBlockDataset` (which uses N-1 via `max(actual_len - 1, 0)` and `self.block_size - 1`) and `TokenDataset` (which counts `labels != -100`) needs to be verified by reviewing how the model actually computes loss and whether the -1 is applied consistently across both pretraining and fine-tuning modes.
313-318: I encountered technical difficulties accessing the repository to complete the verification. However, I can provide an assessment based on the code snippet provided:

Memory loading pattern confirmed, but verification incomplete.

The code snippet does exhibit the memory concern described: it loads the entire dataset's `input_ids` into a single list during initialization. While this pattern is technically valid for smaller datasets, I cannot verify:
- Target dataset sizes for this project
- Whether this is a documented limitation or known issue
- If alternative lazy-loading patterns exist elsewhere in the codebase
- Real-world usage constraints and typical dataset sizes
Given that I cannot access the repository to validate the severity and context of this concern, manual verification from the development team is needed to confirm whether this is a blocker for your intended use cases.
Actionable comments posted: 1
♻️ Duplicate comments (7)
tests/unit/test_pretraining_sampler.py (7)
306-335: Same parameter mismatch issue here. This test also uses `is_pretraining=False`, which doesn't match the actual function signature. When `pretraining_config` is None (or not provided), the function should use `TokenDataset`, so the test should simply omit the `pretraining_config` parameter or explicitly pass `pretraining_config=None`.

336-365: Same parameter mismatch issue: use `pretraining_config` instead. Update to:

```python
loader = get_data_loader(
    data_path=temp_pretraining_file,
    batch_size=2,
    max_tokens_per_gpu=1000,
    seed=42,
    rank=0,
    world_size=1,
    pretraining_config=PretrainingConfig(block_size=block_size),
    pad_token_id=0,
)
```
366-396: Same parameter mismatch issue.
397-432: Same parameter mismatch issue.
433-467: Same parameter mismatch issue.
468-507: Same parameter mismatch issue.
508-540: Same parameter mismatch issue.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- README.md
- src/instructlab/training/__init__.py
- src/instructlab/training/config.py
- src/instructlab/training/data_process.py
- src/instructlab/training/main_ds.py
- src/instructlab/training/sampler.py
- tests/unit/test_data_process.py
- tests/unit/test_pretraining_data_process.py
- tests/unit/test_pretraining_mode.py
- tests/unit/test_pretraining_sampler.py
🚧 Files skipped from review as they are similar to previous changes (3)
- src/instructlab/training/__init__.py
- tests/unit/test_data_process.py
- src/instructlab/training/config.py
🧰 Additional context used
🧬 Code graph analysis (5)
tests/unit/test_pretraining_data_process.py (1)
- src/instructlab/training/data_process.py (1): process_documents_for_pretraining (1140-1226)

tests/unit/test_pretraining_mode.py (3)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/data_process.py (2): process_documents_for_pretraining (1140-1226), main (1500-1517)
- src/instructlab/training/sampler.py (3): PretrainingBlockDataset (299-389), get_lengths (385-389), get_lengths (432-433)

src/instructlab/training/sampler.py (2)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/type_definitions.py (1): CollatedItem (49-61)

tests/unit/test_pretraining_sampler.py (1)
- src/instructlab/training/sampler.py (5): PretrainingBlockDataset (299-389), get_data_loader (436-499), TokenDataset (392-433), EpochSampler (26-55), MaxTokensPerRankCollator (203-296)

src/instructlab/training/main_ds.py (2)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/data_process.py (2): process_documents_for_pretraining (1140-1226), process_data (1454-1497)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (12)
src/instructlab/training/sampler.py (3)
5-24: LGTM! The new imports and logger initialization are appropriate for the pretraining functionality.
299-390: LGTM! The `PretrainingBlockDataset` implementation is well-designed:
- Proper input validation for block_size, pad_token_id, and required columns
- Correct concatenation and block-splitting logic
- Appropriate handling of partial last blocks with padding and label masking
- Clear separation between complete blocks (loss_tokens = block_size - 1) and partial blocks (loss_tokens = actual_len - 1)
- Good error messages for index out of range
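To make the structure described in this list easier to picture, here is a simplified, self-contained sketch of a block dataset along the same lines. It is an illustrative approximation, not the PR's actual `PretrainingBlockDataset`:

```python
from typing import Dict, List

class SimpleBlockDataset:
    """Illustrative block dataset: concatenate documents, split into fixed-size
    blocks, pad the final partial block, and mask padding in the labels."""

    def __init__(self, docs: List[List[int]], block_size: int, pad_token_id: int):
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self.block_size = block_size
        self.pad_token_id = pad_token_id
        # Flatten every document's token ids into one long stream.
        self.tokens: List[int] = [tok for doc in docs for tok in doc]
        self.num_blocks = (len(self.tokens) + block_size - 1) // block_size

    def __len__(self) -> int:
        return self.num_blocks

    def __getitem__(self, idx: int) -> Dict[str, object]:
        if not 0 <= idx < self.num_blocks:
            raise IndexError(f"index {idx} out of range for {self.num_blocks} blocks")
        chunk = self.tokens[idx * self.block_size : (idx + 1) * self.block_size]
        pad = self.block_size - len(chunk)
        return {
            "input_ids": chunk + [self.pad_token_id] * pad,
            "labels": chunk + [-100] * pad,   # padding never contributes to the loss
            "len": self.block_size,
            # N real tokens yield N-1 next-token predictions.
            "num_loss_counted_tokens": max(len(chunk) - 1, 0),
        }
```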
436-478: LGTM! The integration of pretraining support into `get_data_loader` is clean and correct:

- Accepts optional `pretraining_config` parameter
- Conditionally creates `PretrainingBlockDataset` when config is provided
- Falls back to `TokenDataset` for standard instruction-tuning
- Includes informative logging for the pretraining path
README.md (1)
126-165: LGTM! The continual pretraining mode documentation is clear and comprehensive:

- Explains both CLI and programmatic configuration options
- Provides practical examples
- Notes the automatic invocation of `process_documents_for_pretraining()`
- Well-integrated into the existing documentation structure
tests/unit/test_pretraining_mode.py (1)
1-157: LGTM! The test suite for pretraining components is comprehensive and well-structured:
- Validates block padding and loss token counting
- Checks correct handling of partial blocks
- Tests length metadata and alignment
- Uses appropriate mocking for tokenizers and datasets
- Covers end-to-end processing pipeline
tests/unit/test_pretraining_data_process.py (1)
1-441: LGTM! Excellent test coverage for the pretraining data processing pipeline:
- Validates BOS/EOS token handling
- Tests error conditions (empty dataset, missing fields, tokenizer without EOS)
- Verifies statistics calculation and logging
- Checks parallel processing and output format
- Includes integration test with real tokenizer
- Comprehensive mocking strategy
src/instructlab/training/main_ds.py (4)
369-380: LGTM! Correctly passes `pretraining_config` through to `get_data_loader`, enabling the pretraining dataset path when configured.

474-496: LGTM! The conditional data processing flow is well-implemented:

- Routes to `process_documents_for_pretraining` when pretraining_config is provided
- Falls back to standard `process_data` for instruction-tuning
- Properly passes through all required parameters including `document_column_name`

552-557: LGTM! Correctly appends pretraining-specific flags to the training command when `pretraining_config` is present.

805-900: LGTM! The CLI integration is well-designed:

- Adds `--block-size` and `--document-column-name` arguments with clear help text
- Validates that `--document-column-name` requires `--block-size` (line 889-890)
- Properly constructs `PretrainingConfig` and assigns to `args.pretraining_config`
- Handles the case when pretraining is not enabled (sets to None)
src/instructlab/training/data_process.py (2)
415-420: Explicitly setting `return_dict=False` is correct. While the comment suggests this is for "newer versions of transformers," `return_dict=False` is actually the current default for `apply_chat_template`. The explicit parameter is defensive coding but harmless. This is consistent with the changes at line 693-694.

1140-1227: LGTM! The `process_documents_for_pretraining` implementation is solid:
- Proper validation of output directory and dataset
- Clear error messages for missing/empty data
- Validates tokenizer has EOS token
- Guards against duplicate EOS tokens (lines 1194-1196)
- Calculates and logs useful statistics
- Clean separation from instruction-tuning processing
Note: A past review comment correctly identified that the docstring at line 1156 says `{"documents": "text"}` but should say `{"document": "text"}` to match the default `document_column_name` parameter and validation at line 1175.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/instructlab/training/main_ds.py (1)
890-901: LGTM with optional simplification. The validation correctly ensures `--document-column-name` requires `--block-size`. The `PretrainingConfig` construction is correct.

The kwargs pattern works but could be slightly simplified since `PretrainingConfig.document_column_name` has a default:

🔎 Optional simplification

```diff
 if args.block_size is not None:
-    pretraining_kwargs = {}
-    if args.document_column_name is not None:
-        pretraining_kwargs["document_column_name"] = args.document_column_name
-    args.pretraining_config = PretrainingConfig(
-        block_size=args.block_size, **pretraining_kwargs
-    )
+    args.pretraining_config = PretrainingConfig(
+        block_size=args.block_size,
+        **({"document_column_name": args.document_column_name} if args.document_column_name else {}),
+    )
 else:
     args.pretraining_config = None
```

Or keep the current explicit form for readability; either approach is fine.
tests/unit/test_pretraining_sampler.py (2)
29-36: Consider using mock configuration methods. While the current approach works, setting `__len__` and `__iter__` via lambda functions is unconventional. Consider using MagicMock's configuration methods for cleaner test setup.

Alternative mock setup approach:

```diff
 @pytest.fixture
 def mock_hf_dataset(self, sample_pretraining_data):
     """Mock HuggingFace dataset."""
     mock_ds = MagicMock()
     mock_ds.column_names = ["input_ids", "len"]
-    mock_ds.__len__ = lambda self: len(sample_pretraining_data)
-    mock_ds.__iter__ = lambda self: iter(sample_pretraining_data)
+    mock_ds.__len__.return_value = len(sample_pretraining_data)
+    mock_ds.__iter__.return_value = iter(sample_pretraining_data)
     return mock_ds
```
519-520: Use Pythonic boolean assertion. The assertion `== True` is not idiomatic Python. Use direct boolean evaluation instead.

Recommended change:

```diff
     # When num_workers > 0, persistent_workers should be True
-    assert loader.persistent_workers == True
+    assert loader.persistent_workers
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/instructlab/training/main_ds.py
- tests/unit/test_pretraining_data_process.py
- tests/unit/test_pretraining_sampler.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/test_pretraining_data_process.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/instructlab/training/main_ds.py (3)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/logger.py (1): setup_metric_logger (711-800)
- src/instructlab/training/data_process.py (2): process_documents_for_pretraining (1140-1226), process_data (1454-1497)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (6)
src/instructlab/training/main_ds.py (6)
46-52: LGTM! The `PretrainingConfig` import is correctly added alongside other config imports and is necessary for the new pretraining mode functionality.

367-379: LGTM! The `pretraining_config` is correctly passed to `get_data_loader`. Using `getattr` with a default of `None` provides defensive handling, though `args.pretraining_config` is always set by the CLI parsing logic.

460-462: LGTM! Good improvement: using `train_args.logger_type` and `train_args.run_name` instead of hardcoded defaults respects user configuration and aligns with the `setup_metric_logger` signature.

475-496: LGTM! The conditional branching correctly routes to `process_documents_for_pretraining` when pretraining mode is enabled. The parameter mapping aligns with the function signature, and omitting `max_seq_len` is appropriate since blocking happens at training time per the function's documented behavior.

553-557: LGTM! Pretraining arguments are correctly appended to the distributed training command when `pretraining_config` is present, with argument names matching the CLI definitions.

806-817: LGTM! The CLI arguments are well-defined. When `--block-size` is provided without `--document-column-name`, the `PretrainingConfig` default of `"document"` will be used, which is the expected behavior.
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/instructlab/training/main_ds.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/instructlab/training/main_ds.py (3)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/logger.py (1): setup_metric_logger (711-800)
- src/instructlab/training/data_process.py (1): process_documents_for_pretraining (1140-1226)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (7)
src/instructlab/training/main_ds.py (7)
49-49: LGTM! The PretrainingConfig import is correctly added to expose the pretraining API.

368-379: LGTM! The `get_data_loader` call correctly forwards the `pretraining_config` parameter using a safe `getattr` pattern with a default of `None`, which handles both pretraining and non-pretraining modes.

460-462: LGTM! Replacing hardcoded logger configuration with `train_args.logger_type` and `train_args.run_name` improves configurability and flexibility.

476-496: LGTM! The conditional data processing path cleanly separates pretraining and instruction-tuning workflows. The implementation correctly forwards `document_column_name` from `pretraining_config` and maintains backward compatibility for the non-pretraining path.

553-558: LGTM! The command construction correctly propagates `block_size` and `document_column_name` to the distributed training subprocess when pretraining mode is enabled.

803-814: LGTM! The new CLI arguments `--block-size` and `--document-column-name` are properly defined with clear help text and appropriate defaults for optional pretraining mode configuration.

887-898: LGTM! The argument parsing correctly validates that `--document-column-name` requires `--block-size`, and the `PretrainingConfig` creation logic properly handles both explicit and default column names using the kwargs pattern.
src/instructlab/training/main_ds.py (Outdated)

```diff
 return_code = process.wait(timeout=60)  # wait for 1 min or error
 if return_code == 0:
     logger.info("Operation completed successfully! 🎉")
 else:
     logger.error(
-        f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {process_code}"
+        f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {return_code}"
     )
```
Fix the misleading error message and incorrect process handling logic.
The process.wait(timeout=60) call blocks until the subprocess exits (or raises TimeoutExpired). When it returns a non-zero return_code, the process has already exited with an error. However, the log message on line 674 incorrectly states "has not exited yet" and the subsequent terminate() call is unnecessary since the process is already dead.
🔎 Proposed fix

```diff
- return_code = process.wait(timeout=60) # wait for 1 min or error
- if return_code == 0:
+ try:
+ return_code = process.wait(timeout=60)
+ except subprocess.TimeoutExpired:
+ logger.error(
+ "Training subprocess did not exit within 60s. Sending SIGTERM."
+ )
+ process.terminate()
+ try:
+ process.wait(timeout=60)
+ except subprocess.TimeoutExpired:
+ logger.error("Training subprocess did not terminate, sending SIGKILL.")
+ process.kill()
+ raise RuntimeError("Training subprocess timed out")
+
+ if return_code == 0:
logger.info("Operation completed successfully! 🎉")
+ return
else:
logger.error(
- f"Training subprocess has not exited yet. Sending SIGTERM. Process code: {return_code}"
+ f"Training subprocess exited with non-zero code: {return_code}"
)
-
- process.terminate()
- try:
- logger.info("Waiting for process to exit, 60s...")
- process.wait(timeout=60)
- except subprocess.TimeoutExpired:
- logger.error(
- "Training subprocess did not terminate before timeout, sending SIGKILL."
- )
- process.kill()
+ failure = True
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/instructlab/training/main_ds.py around lines 669-675, the code treats a
non-zero return value from process.wait(timeout=60) as "has not exited yet" and
calls terminate() incorrectly; instead, change the logic so that a non-zero
return_code is logged as a process-exited-with-error (include the return_code
and ideally any stderr output) and do NOT call terminate() in that branch; add
an explicit except subprocess.TimeoutExpired handler around
process.wait(timeout=60) that logs a timeout and then calls process.terminate()
(and optionally process.kill() after a short grace period) to handle processes
that truly do not exit within the timeout.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/instructlab/training/main_ds.py (1)
667-683: The process exit handling logic remains incorrect. This issue was previously flagged: when `process.wait(timeout=60)` returns a non-zero `return_code`, the process has already exited with an error. The log message "has not exited yet" on line 672 is misleading, and the `terminate()` call on line 675 is unnecessary since the process is already dead.

The correct approach is to wrap `wait()` in a try-except for `TimeoutExpired` to handle actual timeout cases, and treat non-zero return codes as completed-but-failed processes.
🧹 Nitpick comments (1)
src/instructlab/training/main_ds.py (1)
885-896: Consider adding validation for `block_size` positivity. The validation correctly ensures `--document-column-name` requires `--block-size`. However, there's no check that `block_size` is a positive integer. A zero or negative block size would likely cause issues during training.

🔎 Proposed validation

```diff
 if args.document_column_name is not None and args.block_size is None:
     parser.error("--document-column-name requires --block-size to be specified.")
 if args.block_size is not None:
+    if args.block_size <= 0:
+        parser.error("--block-size must be a positive integer.")
     pretraining_kwargs = {}
     if args.document_column_name is not None:
         pretraining_kwargs["document_column_name"] = args.document_column_name
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/instructlab/training/main_ds.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/instructlab/training/main_ds.py (2)
- src/instructlab/training/config.py (1): PretrainingConfig (80-91)
- src/instructlab/training/data_process.py (2): process_documents_for_pretraining (1140-1226), process_data (1454-1497)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (5)
src/instructlab/training/main_ds.py (5)
46-52: LGTM! The `PretrainingConfig` import is correctly added alongside other config imports and is properly used throughout the file for the new pretraining functionality.

367-379: LGTM! The `pretraining_config` parameter is correctly passed to `get_data_loader` using `getattr` with a `None` default, ensuring backward compatibility when pretraining is not configured.

473-494: LGTM! The conditional branching correctly routes to `process_documents_for_pretraining` when `pretraining_config` is set, passing the appropriate `document_column_name`. The `block_size` is intentionally omitted here since blocking/chunking happens during training as noted in the `process_documents_for_pretraining` docstring.

551-556: LGTM! The pretraining configuration parameters are correctly appended to the torchrun command when `pretraining_config` is present, ensuring the subprocess receives the necessary configuration.

801-812: LGTM! The new CLI arguments `--block-size` and `--document-column-name` are correctly defined with appropriate types, defaults, and descriptive help text that aligns with the `PretrainingConfig` fields.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/instructlab/training/main_ds.py (1)
667-673: Incorrect error handling for subprocess exit. This issue was already identified in a previous review. The logic incorrectly treats a non-zero return code from `process.wait()` as "has not exited yet" when the process has actually already exited with an error. The subsequent `terminate()` call is unnecessary.

See the previous review comment for the detailed explanation and proposed fix.
🧹 Nitpick comments (1)
src/instructlab/training/main_ds.py (1)
574-575: Consider explicit check for mock_data_len > 0. The truthiness check `if train_args.mock_data_len:` will skip appending the flag when `mock_data_len` is `0`. While this is likely intentional (treating 0 as "not set"), an explicit comparison such as `if train_args.mock_data_len > 0:` would be clearer. This makes the intent more obvious and avoids potential confusion.

🔎 Proposed refactor

```diff
 if train_args.mock_data:
     command.append("--mock_data")
-    if train_args.mock_data_len:
+    if train_args.mock_data_len > 0:
         command.append(f"--mock_len={train_args.mock_data_len}")
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/instructlab/training/config.py
- src/instructlab/training/main_ds.py
- tox.ini
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: unit: 3.11 on ubuntu-latest
- GitHub Check: unit: 3.12 on ubuntu-latest
- GitHub Check: unit: 3.13 on ubuntu-latest
- GitHub Check: pylint
- GitHub Check: Summary
🔇 Additional comments (10)
tox.ini (1)
107-107: LGTM! Necessary dependency for type checking pydantic models. Adding `pydantic` to the mypy dependencies is required to properly type-check the new `PretrainingConfig` and other pydantic-based models introduced in this PR.

src/instructlab/training/config.py (3)

66-74: LGTM! Clean pretraining mode configuration. The addition of `is_pretraining` and `pretraining_column_name` to `DataProcessArgs` provides a clear toggle for pretraining workflows with sensible defaults.

80-92: LGTM! Well-defined pretraining configuration model. The `PretrainingConfig` class is properly structured with:

- Clear field descriptions
- Sensible default for `document_column_name`
- Required `block_size` to ensure explicit configuration

293-299: LGTM! Proper integration of pretraining config. The `pretraining_config` field is correctly defined as optional with a descriptive explanation of when and how it's used. This maintains backward compatibility while enabling the new pretraining mode.

src/instructlab/training/main_ds.py (6)

49-49: LGTM! Necessary import for pretraining support. Importing `PretrainingConfig` enables the pretraining configuration to flow through the training pipeline.

474-494: LGTM! Clean conditional data processing logic. The branching logic correctly routes to `process_documents_for_pretraining` when `pretraining_config` is present, passing the appropriate `document_column_name` from the config. The fallback to standard `process_data` maintains backward compatibility.

551-555: LGTM! Proper command construction for pretraining mode. The conditional appending of `--block-size` and `--document-column-name` correctly propagates the pretraining configuration to the subprocess training command.

801-812: LGTM! Well-documented CLI arguments for pretraining. The new `--block-size` and `--document-column-name` arguments are properly defined with clear help text explaining their purpose for continual pretraining.

885-896: LGTM! Correct CLI validation logic. The validation properly enforces that `--document-column-name` requires `--block-size` to be specified, while allowing `--block-size` alone (which uses the default document column name). The `PretrainingConfig` construction handles both cases correctly.

378-378: Remove unnecessary defensive `getattr()` access. The `getattr(args, "pretraining_config", None)` is defensive but unnecessary. The CLI parsing logic (lines 891-896) guarantees `pretraining_config` is always set on `args` before `main(args)` is called, either as a `PretrainingConfig` instance or `None`. Use direct attribute access `args.pretraining_config` instead for clarity.
This PR exposes a new API which allows the ingestion of datasets containing a `document` field, which then gets tokenized as `input_ids`.

Summary by CodeRabbit
New Features
Tests
Documentation
Bug Fixes