-
Notifications
You must be signed in to change notification settings - Fork 2
Fix: Checkpoint upgrade #334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| @param opts: A list of optional CLI overrides of the config file. | ||
| """ | ||
| create_model(config, opts, weights=weights).infer( | ||
| create_model(config, opts, weights=weights, debug_mode=True).infer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expected to have debug_mode=True?
| old_order = ckpt.get("execution_order") | ||
| new_order = get_model_execution_order(self) | ||
|
|
||
| for node_name, node in self.nodes.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to make this whole for loop a bit more readable and not so deeply nested? I realize there are several fallback steps but can we introduce some helper functions and add small comments on each fallback step so that this code is easier to understand and maintene?
Maybe we can try putting it into a LLM asking it clean it up and make it more readable - but we need to have some test cases to make sure we don't actually change any logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'll look into making this more readable
|
|
||
| for name, module in model.named_modules(): | ||
| if name and list(module.parameters()): | ||
| if list(module.parameters()) and not list(module.children()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are execution orders generated and saved so far in the checkpoints still generally correct or we can't use them because of the bug we had?
And what was the bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug here is that the module names are saved among the parameters like this:
class Foo:
conv1 = ...
conv2 = ...
class Bar:
conv3 = ...This would generate [Foo, Foo.conv1, Foo.conv2, Bar, Bar.conv3]
While this refactor:
class Foo:
conv1 = ...
class Bar:
conv2 = ...
conv3 = ...would generate [Foo, Foo.conv1, Bar, Bar.conv2, Bar.conv3].
This makes the order not match due to the inclusion of the parent module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand this example tbh. You show here 2 different networks no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I wasn't very detauled. These would be two implemenations of the same network using 2 modules where the layers are just run sequentially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so if I understand correct we have the parents in the checkpoints that are currently saved whereas now we'll only have leaf nodes?
What does that mean in context of backcompatibility, reusing current execution orders,etc? For loading older weights can we filter out these parent nodes first and then try to match them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compatibility between 0.3.11 and 0.4.2 and so on should still work.
So the process of re-exporting to 0.3.11 and then again re-exporting with 0.4.2 should work. (or using the luxonis_train upgrade ckpt in 0.4.2)
Purpose
Fixes various issues with upgrading checkpoints from
0.3.11to0.4.xSpecification
DebugLoaderfrom saved configupgradecommandDependencies & Potential Impact
None / not applicable
Deployment Plan
None / not applicable
Testing & Validation
None / not applicable