Context
I am experimenting with finetuning the 2B model using LoRA on a dataset of first-person POV videos. My goal is to achieve specific motion control, but I am encountering significant issues with controllability, even when testing against the training samples.
Dataset & Training Details
- Source Data: 151 POV videos.
- Augmentation: I mirrored the videos containing turning motions, bringing the total dataset to 206 videos (roughly as in the sketch after this list).
- Video Length: 121 frames per video.
- Captions/Annotations: The dataset uses sparse captions focusing strictly on motion primitives (e.g., "turning left," "moving forward") rather than dense, detailed visual descriptions. Example captions: "Head right towards the seat.", "Turn left and go straight."
- Training Method: LoRA using the default config file, trained for 10 epochs (a generic illustration of the hyperparameters follows below).
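For concreteness, the mirroring step is roughly equivalent to the sketch below (a minimal sketch, assuming clips are loaded as NumPy arrays of shape (T, H, W, C); the helper also swaps "left"/"right" tokens in the caption so the text still matches the mirrored motion; my actual pipeline differs only in I/O details):

```python
# Hypothetical sketch of the horizontal-flip augmentation described above.
# Assumes frames are a (T, H, W, C) uint8 array and captions are plain strings.
import numpy as np

def mirror_clip(frames: np.ndarray, caption: str) -> tuple[np.ndarray, str]:
    """Flip frames horizontally and swap left/right tokens in the caption."""
    flipped = frames[:, :, ::-1, :].copy()      # mirror along the width axis
    swapped = (
        caption.replace("left", "\0")           # temporary placeholder
               .replace("right", "left")
               .replace("\0", "right")
    )
    return flipped, swapped

# Example: the mirrored copy of a "Turn left and go straight." clip
# becomes a right turn with a matching caption:
# flipped, cap = mirror_clip(clip, "Turn left and go straight.")
# cap == "Turn right and go straight."
```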
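I used the repo's default LoRA config unchanged; for context, the hyperparameters I mean are of roughly this shape (a generic peft-style illustration, not the repo's actual config file; the rank, alpha, and target-module names here are assumptions):

```python
# Generic illustration of the LoRA hyperparameters in play, using the
# Hugging Face peft API. The concrete values and target module names are
# assumptions, not the repo's actual defaults.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # LoRA rank (adapter capacity)
    lora_alpha=32,             # scaling factor applied to the adapter output
    lora_dropout=0.05,         # dropout on the adapter input
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
```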
The Issue
After finetuning, the model exhibits weak controllability. This is particularly noticeable in turning motions, where the model fails to follow the prompt even when generating with captions from the training set (i.e., it fails to overfit).
Hypothesis & Questions
I suspect the issue might stem from the caption density or the dataset size, but I would value your insights on the following:
- Caption Granularity: My captions only describe motion primitives. Does this model require detailed visual descriptions (background, objects, lighting) in the prompt to learn the motion-text alignment effectively? (A concrete contrast is sketched after this list.)
- Dataset Size: Is ~200 videos generally considered insufficient for this specific model architecture to learn temporal dynamics like turning?
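To make the caption-granularity question concrete, here is the kind of contrast I have in mind (the dense variant is purely hypothetical, not taken from my dataset):

```python
# Sparse caption (what my dataset currently uses) vs. a hypothetical dense,
# visually grounded variant for the same kind of clip.
sparse_caption = "Turn left and go straight."
dense_caption = (
    "First-person view walking through a hallway with white walls and "
    "overhead lights; the camera turns left at the corner, then continues "
    "straight toward a row of chairs."
)
```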
Any advice on data preparation or hyperparameter tuning for this specific use case would be greatly appreciated. Thanks!