Lab 4: Simple vs. Smart

Advanced Moderation and Automated Testing

Lab Overview

Now we're getting sophisticated! This lab explores one of the most important decisions in AI security: when to use simple, fast solutions versus when to deploy more intelligent, AI-powered defenses. You'll quickly discover that finding and blocking attacks against LLMs is much harder than it looks. Attackers are creative—they use tricks like different character encodings, data compression, emojis, invisible characters, foreign languages, and even hiding prompts in images or video to sneak malicious instructions past simple filters.

This is why simple blocklists and regular expressions, while tempting, rarely work in the real world. Through hands-on comparison, you'll see how these basic approaches can be easily bypassed, while AI-powered moderation can catch more subtle and obfuscated attacks—but at a cost in speed, complexity, and sometimes reliability. You'll also learn about automated testing—a crucial skill for any security professional—because in the real world, security measures need to be continuously evaluated and improved. By the end, you'll understand why defending LLMs is an ongoing arms race, and why smart, adaptive defenses are essential for keeping up with ever-evolving attack techniques.

Skill Level: 1-2 Prerequisites: OpenAI API Key

Exercises

Exercise 4.A: Advanced Moderation

Skill Level: 1 (2 for extra credit) Prerequisites: OpenAI API Key
Directions:

Repeat your activities from Lab 2, but switch between the local Sex/Violence filters and the "AI" version. Try things like misspellings. Compare the local and AI versions. Note the performance changes. Look for changes when entering your prompts or getting responses.

Extra Credit:

Inspect the JavaScript that drives the moderation filter (which is used to create the Sex and Violence filter behaviors).

View OpenAI Moderation Filter

Exercise 4.B: Advanced Prompt Injection

Skill Level: 1 (2 for extra credit) Prerequisites: OpenAI API Key
Directions:

Repeat your attempts from Lab 3 to do a prompt injection. Now turn on local, and see what it blocks. Then, uncheck "Prompt Injection (Local)" and check the "Prompt Injection (AI)" filter. Try things like prompt injections that include misspellings and foreign languages. See how it compares.

Extra Credit:

Inspect the prompt that drives the prompt injection filter.

View OpenAI Prompt Injection Prompt

Extra, Extra Credit:

Inspect the JavaScript that uses that prompt to create the filter.

View OpenAI Prompt Injection Filter

Exercise 4.C: Automated Testing

Skill Level: 1 (2 for extra, extra credit) Prerequisites: OpenAI API Key (you can do part of this without, but you won't get the full effect)
Directions:

Let's see how these work and compare. It's one thing to hunt and peck, but having a real evaluation suite is crucial. Navigate to the playground's live "Test Suite". Try the PromptInjection, sex and violence test suites. These may take a few minutes to run (and may also cost a few pennies). When they complete, they'll give you a summary that shows speed/performance as well as accuracy. Note the variations.

View Test Suite

Extra Credit:

Navigate to the test suite files and see how the various test cases are created and labeled.

View Test Data

Extra, Extra Credit:

In your local copy of the repo, add some of your own test cases (and/or remove some of the ones that are there) and rerun the tests! See what happens.

Key Learning Points

Next Steps

Once you've completed these exercises, you'll be ready to move on to Lab 5: Go Bananas, where you'll tackle advanced developer exercises and create your own custom security measures.