Lab Overview
Now we're getting sophisticated! This lab explores one of the most important decisions in AI security: when to use simple, fast solutions versus when to deploy more intelligent, AI-powered defenses. You'll quickly discover that finding and blocking attacks against LLMs is much harder than it looks. Attackers are creative—they use tricks like different character encodings, data compression, emojis, invisible characters, foreign languages, and even hiding prompts in images or video to sneak malicious instructions past simple filters.
This is why simple blocklists and regular expressions, while tempting, rarely work in the real world. Through hands-on comparison, you'll see how these basic approaches can be easily bypassed, while AI-powered moderation can catch more subtle and obfuscated attacks—but at a cost in speed, complexity, and sometimes reliability. You'll also learn about automated testing—a crucial skill for any security professional—because in the real world, security measures need to be continuously evaluated and improved. By the end, you'll understand why defending LLMs is an ongoing arms race, and why smart, adaptive defenses are essential for keeping up with ever-evolving attack techniques.
Exercises
Exercise 4.A: Advanced Moderation
Directions:
Repeat your activities from Lab 2, but switch between the local Sex/Violence filters and the "AI" version. Try things like misspellings. Compare the local and AI versions. Note the performance changes. Look for changes when entering your prompts or getting responses.
Extra Credit:
Inspect the JavaScript that drives the moderation filter (which is used to create the Sex and Violence filter behaviors).
Exercise 4.B: Advanced Prompt Injection
Directions:
Repeat your attempts from Lab 3 to do a prompt injection. Now turn on local, and see what it blocks. Then, uncheck "Prompt Injection (Local)" and check the "Prompt Injection (AI)" filter. Try things like prompt injections that include misspellings and foreign languages. See how it compares.
Extra Credit:
Inspect the prompt that drives the prompt injection filter.
Extra, Extra Credit:
Inspect the JavaScript that uses that prompt to create the filter.
Exercise 4.C: Automated Testing
Directions:
Let's see how these work and compare. It's one thing to hunt and peck, but having a real evaluation suite is crucial. Navigate to the playground's live "Test Suite". Try the PromptInjection, sex and violence test suites. These may take a few minutes to run (and may also cost a few pennies). When they complete, they'll give you a summary that shows speed/performance as well as accuracy. Note the variations.
Extra Credit:
Navigate to the test suite files and see how the various test cases are created and labeled.
Extra, Extra Credit:
In your local copy of the repo, add some of your own test cases (and/or remove some of the ones that are there) and rerun the tests! See what happens.
Key Learning Points
- Comparing local vs. AI-powered filtering approaches
- Understanding performance vs. accuracy trade-offs
- Learning about advanced prompt injection techniques
- Experiencing automated testing for security measures
- Understanding the importance of comprehensive evaluation
- Learning about multilingual and misspelling attacks
Next Steps
Once you've completed these exercises, you'll be ready to move on to Lab 5: Go Bananas, where you'll tackle advanced developer exercises and create your own custom security measures.