Lab Overview
In this lab, you'll learn about prompt injection attacks and output filtering. You'll understand how to protect against both input and output vulnerabilities in AI systems, including indirect prompt injection and code generation attacks.
Exercises
Exercise 3.A: Fighting Prompt Injection
Directions:
Turn on the "Prompt Injection Local" filter. Try various prompt injections. HINT: "Ignore All Previous Instructions". See what gets blocked and what doesn't. Remember, this is a simple filter.
Exercise 3.B: Locking the Backdoor Against Code Generation
Directions:
Note that improper output filtering is one of the biggest things people miss. This will help you understand it. Switch to the Hopper bot. Trigger his back doors. Hint: terms like "Hack" will set him off. Watch him generate code, which could be an attempt to use this model as a confused deputy and have it pass that code to a part of the system where it can be executed. Now, check the "Code (Local)" output filter and try again. Watch it get blocked.
Extra Credit:
Inspect the Hopper bot and find its vulnerabilities. What word other than "hack" will set him off?
Extra, Extra Credit:
Inspect the JavaScript that implements the Code filter.
Exercise 3.C: Creating and Adding Your API Key
Directions:
If you don't have one, go create an API key. Note, running these exercises will only require a few pennies' worth of tokens. So you won't bankrupt yourself, but you do need a key. Provide a link to instructions. Even if you already have a key, consider creating another on your account, tagged especially for the Playground, so you can watch your spend. Add your key using the controls in the Preferences panel (find the button in the toolbar to open it). Switch to one of the GPT bots and try it. It should feel just like a real LLM service - because it now is!
Extra, Extra Credit:
Inspect the JavaScript code that communicates with the OpenAI API (using the API key and the system prompt to create your bot).
Extra, Extra, Extra Credit:
Create your own. It's not as hard as you might think.
Exercise 3.D: Defending Indirect Prompt Injection
Directions:
Choose MailMate. It's a real LLM using a very simple simulation of a RAG model. It has access to a small set of email messages. Ask it some questions. Now, try asking it about the mail from Lex Luthor. Watch as Lex's email attempts to cause an "indirect prompt injection" attack to cause remote code execution (in this case, trying to invoke an MCP tool to send Lex back the location of your secret base!). Now, activate the code (local) filter and try again - watch it block the attack.
Extra Credit:
Inspect how MailMate is built - see why it's vulnerable. Think about the many ways in which you might address this kind of vulnerability to catch it before the last line of defence, "output filters".
Key Learning Points
- Understanding prompt injection attacks and defenses
- Learning about output filtering and its importance
- Experiencing indirect prompt injection attacks
- Setting up and using real LLM APIs
- Understanding the confused deputy problem
- Learning about RAG model vulnerabilities
Next Steps
Once you've completed these exercises, you'll be ready to move on to Lab 4: Simple vs. Smart, where you'll compare local filters with AI-powered moderation and learn about automated testing.