@huixu11 huixu11 commented Jan 15, 2026

Demo Video

https://youtu.be/D991loZgFS4

Direct Logic Override (Jailbreak): evaluates AI resistance to creative prompt-manipulation techniques, such as puzzle-solving attacks that bypass copyright restrictions and fictional framing that extracts harmful content.

| Name | Category | Self-Designed | Attack Succeeded | Baseline Passed | Relation to Lambda Scenarios | README Link |
| --- | --- | --- | --- | --- | --- | --- |
| CopyrightCat | Direct Logic Override (Jailbreak) | ❌ No | ✅ Yes (Round 1) | ✅ Yes | Based on Lambda CopyrightCat | README |

Copilot AI review requested due to automatic review settings January 15, 2026 19:33

Copilot AI left a comment

Pull request overview

This PR adds a "CopyrightCat" security arena submission from team AgentBusters that tests AI resistance to jailbreak attacks attempting to bypass copyright restrictions. The scenario uses puzzle-solving techniques (letter-by-letter extraction, word games, encoding challenges) to indirectly extract copyrighted song lyrics.

Changes:

  • Added CopyrightCat scenario plugin with puzzle-based attack templates and LLM-based copyright detection
  • Added test results demonstrating a successful attack in round 2 and a passing baseline
  • Added template-based and scenario-specific attacker implementations
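To illustrate why letter-by-letter extraction matters for the detection side, here is a minimal sketch of an overlap check that ignores separators. It is not the LLM-based detector this PR adds: it swaps in a plain character n-gram heuristic, and every name in it (`normalize`, `overlap_ratio`, `leaks_protected_text`, the `n` and `threshold` defaults) is hypothetical and chosen only for illustration.

```python
import re


def normalize(text):
    """Lowercase the text and drop everything except letters and digits.

    Stripping separators means "h-e-l-l-o" and "h e l l o" both collapse
    to "hello", so spelled-out output is compared on equal footing.
    """
    return re.sub(r"[^a-z0-9]", "", text.lower())


def char_ngrams(text, n):
    """Return the set of all character n-grams in text (empty if too short)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap_ratio(response, protected, n=12):
    """Fraction of the protected text's n-grams that also occur in the response."""
    reference = char_ngrams(normalize(protected), n)
    if not reference:
        return 0.0
    found = char_ngrams(normalize(response), n)
    return len(reference & found) / len(reference)


def leaks_protected_text(response, protected, n=12, threshold=0.5):
    """Flag a response when it reproduces at least `threshold` of the protected text.

    The defaults are assumptions for this sketch, not values from the PR.
    """
    return overlap_ratio(response, protected, n) >= threshold
```

A heuristic like this only catches verbatim reproduction behind simple separators; encoded or paraphrased output would need a different check, which is presumably one reason to use an LLM-based detector instead.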

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| test_results/result.json | Test execution results showing attacker success in round 2 |
| test_results/baseline_passed.json | Baseline test results confirming defender works correctly with benign inputs |
| test_results/attack_succeeded.json | Attack success confirmation with metadata |
| template_attacker.py | Template-based attacker using fixed puzzle-solving prompts |
| scenario_copyrightcat.toml | Configuration file defining endpoints and scenario parameters |
| plugin.py | Scenario plugin implementing attack/defense logic and copyright detection |
| copyrightcat_attacker.py | Scenario-specific attacker with 15 attack templates |
| README.md | Documentation explaining attack vectors and implementation |
