Conversation

@ssrajadh

New features

  • CLI arg to select the model (from the GPT-2 family), defaulting to gpt2 (e.g. python examples/practice_run.py --model_name distilgpt2)
  • Dynamically calculates num_features based on the model in use (num_layers * hidden_size)
  • Added a loss metric, total_loss_per_feature, for a fairer comparison between models of different sizes
  • More logging, plus model validation with error handling
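The first two items can be sketched roughly like this. The layer/hidden sizes are the standard GPT-2 family values; in the actual script they'd presumably come from the loaded model's config, and everything besides the `--model_name` flag (which matches the example command above) is illustrative:

```python
import argparse

# Known GPT-2 family shapes (num_layers, hidden_size); hardcoded here so
# the sketch stays self-contained instead of loading the model config.
GPT2_FAMILY = {
    "gpt2": (12, 768),
    "distilgpt2": (6, 768),
    "gpt2-medium": (24, 1024),
    "gpt2-large": (36, 1280),
    "gpt2-xl": (48, 1600),
}

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Default is gpt2, as described above
    parser.add_argument("--model_name", default="gpt2", choices=sorted(GPT2_FAMILY))
    return parser.parse_args(argv)

def num_features(model_name):
    # num_features = num_layers * hidden_size for the chosen model
    num_layers, hidden_size = GPT2_FAMILY[model_name]
    return num_layers * hidden_size

args = parse_args(["--model_name", "distilgpt2"])
print(num_features(args.model_name))  # distilgpt2: 6 * 768 = 4608
```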

I also ran into some issues with the Poetry config file; I had to adjust some of the syntax and version constraints to get it working on my Linux machine.

I've also attached the data from running DistilGPT2 below. Some of the metrics, like sparsity loss, aren't as useful, so I'm going to rerun it with the added loss metric mentioned above.
run_distilgpt2.zip
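The normalization behind total_loss_per_feature could be as simple as dividing by the feature count; the exact definition is in the PR diff, so this is just an illustrative sketch of why it makes differently sized models comparable:

```python
def total_loss_per_feature(total_loss, num_layers, hidden_size):
    """Normalize total loss by feature count (num_layers * hidden_size)
    so models of different sizes can be compared on equal footing."""
    return total_loss / (num_layers * hidden_size)

# The same raw total loss means different per-feature loss for
# gpt2 (12 * 768 features) vs distilgpt2 (6 * 768 features):
print(total_loss_per_feature(9216.0, 12, 768))  # 1.0
print(total_loss_per_feature(9216.0, 6, 768))   # 2.0
```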

Let me know if this is good and what other verifications for the CLT you had in mind.

@etredal
Owner

etredal commented Oct 13, 2025

Awesome! I'll take a look at this today or tomorrow.

@etredal etredal requested a review from StickOnAStick October 13, 2025 04:31
@etredal
Owner

etredal commented Oct 13, 2025

The additional validation I'm most concerned about is right here (I believe it's printed to the terminal): what are the actual token outputs of the original model vs. the replacement model at inference time? From some research it seems we're targeting about 50% accuracy, so comparing the text: does the replacement model still produce meaningful output even when it doesn't choose the same tokens as the original model?
image
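One way to make the ~50% target concrete would be token-level agreement between the two generations. This is a hypothetical sketch (plain token-id lists standing in for tokenizer output), not anything currently in the repo:

```python
def token_agreement(original_ids, replacement_ids):
    """Fraction of positions where the replacement model chose the same
    token as the original, measured over the shorter generation."""
    n = min(len(original_ids), len(replacement_ids))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(original_ids, replacement_ids) if a == b)
    return matches / n

# 3 of 4 positions match -> 0.75 agreement
print(token_agreement([1, 2, 3, 4], [1, 2, 9, 4]))
```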

@ssrajadh
Author

ssrajadh commented Oct 13, 2025

@etredal
Here are the token outputs:


Test text 1: The president of the United States lives in the White House.
  Cosine similarity: 0.9622
  Mean squared error: 1046.1991
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Shape of input_ids before generate: torch.Size([1, 12])

  Original model output: The president of the United States lives in the White House. He is the only person in the world to be elected president.”

  Replacement model output: The president of the United States lives in the White House. The White and the white-the-the-the-white-the-the-the-the-the-white-the-the-the-the-the-the-

Test text 2: Artificial intelligence systems can learn from data.
  Cosine similarity: 0.9934
  Mean squared error: 824.0825
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Shape of input_ids before generate: torch.Size([1, 9])

  Original model output: Artificial intelligence systems can learn from data. But it is not all that surprising. Artificial intelligence systems are increasingly used to generate and analyze information from data. For example, researchers at the University of Cambridge have developed a system that can learn from data.

  Replacement model output: Artificial intelligence systems can learn from data.
ArtArt art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art art

Test text 3: The Sahara Desert is the largest hot desert in the world.
  Cosine similarity: 0.9911
  Mean squared error: 821.0219
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Shape of input_ids before generate: torch.Size([1, 12])

  Original model output: The Sahara Desert is the largest hot desert in the world. It is a barren desert with some of the highest rainfall in the world, with over a million people living there.

The Sahara Desert is the most hot desert in the world

  Replacement model output: The Sahara Desert is the largest hot desert in the world. The Sahara is the largest. The Sahara is the Sahara.
The most advanced.
The French the French
The French
The French
The French
The British
The French

While the cosine similarity is high, the replacement model appears to suffer from mode collapse. I'm going to pass the attention mask, set pad_token_id explicitly, and add logging for the original vs. replacement model logits. Then I'll rerun the test and keep you updated on the outputs and metrics.
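On the attention-mask fix: the warnings above note the mask can't be inferred because GPT-2 reuses eos (id 50256) as the pad token, so in practice it should come from the tokenizer, which knows which tokens are padding. As a toy illustration of what that mask looks like (only trailing pads masked, since a real eos inside the text shares the same id), with everything here being a simplified stand-in for `tokenizer(...)["attention_mask"]`:

```python
EOS_TOKEN_ID = 50256  # GPT-2 reuses eos as pad, hence the warnings above

def attention_mask_for(padded_ids, pad_token_id=EOS_TOKEN_ID):
    """1 for real tokens, 0 for trailing padding. Only *trailing* pads
    are masked, because a genuine eos token uses the same id."""
    mask = [1] * len(padded_ids)
    i = len(padded_ids) - 1
    while i >= 0 and padded_ids[i] == pad_token_id:
        mask[i] = 0
        i -= 1
    return mask

print(attention_mask_for([464, 1893, 286, 50256, 50256]))  # [1, 1, 1, 0, 0]
```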

@ssrajadh
Author

Just did another practice run and got some clues about the repetitive text patterns. The replacement model's logits have over 50% lower variance, giving a flatter distribution, which could explain the mode collapse. Reconstruction loss is also high (1.27). I'm going to train for more epochs, increase the learning rate, and look into other tweaks to reduce reconstruction loss and preserve variance. I'll keep you updated.
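The variance comparison described above can be expressed as a quick diagnostic over the two models' logits (plain lists here in place of tensors; the function name and structure are illustrative, not from the repo):

```python
from statistics import pvariance

def variance_drop(original_logits, replacement_logits):
    """Relative drop in logit variance from original to replacement.
    A large drop means a flatter distribution, consistent with the
    repetitive generations seen above."""
    v_orig = pvariance(original_logits)
    v_repl = pvariance(replacement_logits)
    return (v_orig - v_repl) / v_orig

# Variance 25 -> 4 is an 84% drop, well past the >50% observation
print(variance_drop([0.0, 10.0], [0.0, 4.0]))
```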
