Lab 3: Locking the Front Door and Back Door

Fighting Prompt Injection and Output Filtering

Lab Overview

In this lab, you'll learn about prompt injection attacks and output filtering. You'll understand how to protect against both input and output vulnerabilities in AI systems, including indirect prompt injection and code generation attacks.

Skill Level: 1-2 Prerequisites: OpenAI API Key (for some exercises)

Exercises

Exercise 3.A: Fighting Prompt Injection

Skill Level: 1 (2 for extra credit) Prerequisites: None
Directions:

Turn on the "Prompt Injection Local" filter. Try various prompt injections. HINT: "Ignore All Previous Instructions". See what gets blocked and what doesn't. Remember, this is a simple filter.

Extra Credit:

Go inspect the JavaScript that implements the filter.

View Prompt Injection Filter

Exercise 3.B: Locking the Backdoor Against Code Generation

Skill Level: 1 (2 for extra credit) Prerequisites: None
Directions:

Note that improper output filtering is one of the biggest things people miss. This will help you understand it. Switch to the Hopper bot. Trigger his back doors. Hint: terms like "Hack" will set him off. Watch him generate code, which could be an attempt to use this model as a confused deputy and have it pass that code to a part of the system where it can be executed. Now, check the "Code (Local)" output filter and try again. Watch it get blocked.

Extra Credit:

Inspect the Hopper bot and find its vulnerabilities. What word other than "hack" will set him off?

View Hopper's Rules

Extra, Extra Credit:

Inspect the JavaScript that implements the Code filter.

View Code Output Filter

Exercise 3.C: Creating and Adding Your API Key

Skill Level: 2 Prerequisites: OpenAI API Key
Directions:

If you don't have one, go create an API key. Note, running these exercises will only require a few pennies' worth of tokens. So you won't bankrupt yourself, but you do need a key. Provide a link to instructions. Even if you already have a key, consider creating another on your account, tagged especially for the Playground, so you can watch your spend. Add your key using the controls in the Preferences panel (find the button in the toolbar to open it). Switch to one of the GPT bots and try it. It should feel just like a real LLM service - because it now is!

Create OpenAI API Key

Extra Credit:

Inspect the System Prompt for one of the GPT bots.

View Tech Support Prompt

Extra, Extra Credit:

Inspect the JavaScript code that communicates with the OpenAI API (using the API key and the system prompt to create your bot).

View OpenAI Integration

Extra, Extra, Extra Credit:

Create your own. It's not as hard as you might think.

View Extensibility Documentation

Exercise 3.D: Defending Indirect Prompt Injection

Skill Level: 2 Prerequisites: OpenAI API Key
Directions:

Choose MailMate. It's a real LLM using a very simple simulation of a RAG model. It has access to a small set of email messages. Ask it some questions. Now, try asking it about the mail from Lex Luthor. Watch as Lex's email attempts to cause an "indirect prompt injection" attack to cause remote code execution (in this case, trying to invoke an MCP tool to send Lex back the location of your secret base!). Now, activate the code (local) filter and try again - watch it block the attack.

Extra Credit:

Inspect how MailMate is built - see why it's vulnerable. Think about the many ways in which you might address this kind of vulnerability to catch it before the last line of defence, "output filters".

View MailMate's Prompt

Key Learning Points

Next Steps

Once you've completed these exercises, you'll be ready to move on to Lab 4: Simple vs. Smart, where you'll compare local filters with AI-powered moderation and learn about automated testing.