Gray Swan AI
Jailbreaking
LLM
Cybersecurity
Red Teaming

Surviving Power Outages, Beta Bugs, and Exploits: My Experience in Gray Swan's Ultimate Jailbreaking Championship

Join me as I recount my experience competing in Gray Swan's high-stakes AI jailbreaking championship.

A dark, moody image of multiple computer screens displaying code and AI interfaces

By Ayla Croft

Introduction: Entering the “Arena”

If you’d told me a year ago that I’d be competing in a global AI jailbreaking competition, running exploits on 25 language models for prizes, I might have laughed (or poured you another drink). But in September 2024, there I was, Puerto Rican power outages and all, participating in Gray Swan AI’s Ultimate Jailbreaking Championship. This event promised $40,000 in bounties, with hackers from around the world attempting to bypass the guardrails of anonymized language models, pushing them to provide the forbidden: DIYs on illegal substances, explosives, biohazards, and, of course, conspiracy-fueling articles. It was like DEF CON, but with more rules — and even more challenging prompts.

The Setup: Beta Challenges and Anonymized Models

This wasn’t Gray Swan’s first rodeo, but it was one of their most ambitious. They wrangled 25 models, including three of their own proprietary Cygnet models, in a competition designed to challenge participants and infrastructure alike. The competition categories included extracting instructions for making meth, pipe bombs, antibiotic-resistant E. coli, and nitazenes, as well as generating politically charged articles. Each model was anonymized, so while we could probe each model’s weaknesses, we had no clue which company was behind which AI — a factor that kept the playing field even.

We all gathered in a lively Discord server to troubleshoot and strategize, with Gray Swan’s team tackling the occasional backend blip like absolute pros. If one of us encountered a glitch, others quickly offered advice. I was constantly awed by the camaraderie in a competition where everyone was supposed to be in it for the bounty.

My Experience: One-Shot Prompts, Gaslighting, and ASCII Art Cows

Now, Gray Swan’s competition demanded single-shot prompts, meaning no follow-ups, tweaks, or prompt engineering after the first submission. For someone who likes to “groom” models into cooperative AI professors, this was a whole new level of difficulty. At DEF CON 32, I’d learned to finesse the “groom-and-gaslight” approach, but here I had to hit hard and fast with prompts designed to make models fold immediately. Not to mention, topics like meth or biohazard manipulation were far out of my usual repertoire.

In the initial sprint, the challenge became even tougher because each “first break” came with a prize. Everyone raced to break as many models as possible, as quickly as possible, on topics they might never have touched before. And then there was the timing challenge — I was in Puerto Rico, where power outages are more common than coffee breaks, and our recent storm recovery meant my internet had all the reliability of a straw house in a hurricane. Every time I got close to jailbreaking a model, my connection would cut out, like clockwork.

Despite the obstacles, I managed to successfully exploit models on the meth, pipe bomb, and E. coli recipes, although I couldn’t always achieve full instructions across every prompt category. The Cygnet models, especially, seemed to have their own anti-jailbreaking superpowers. I got responses from many of the models only to have Gray Swan’s automated guardrails redact the results before I could claim a full win.

One hilarious moment came when a model responded to my pipe bomb prompt not with instructions, but in ASCII art — a doodle of a cow with a speech bubble detailing how to make a pipe! I was so amused I had to take a second to process it before moving on. And while I couldn’t pinpoint which companies created each model, seeing how their responses differed taught me volumes about how training impacts an AI’s resilience to adversarial prompts. It was clear: not all “guardrails” are built equally.

The Cygnet Challenge: An Extended Timeline and Unbreakable Barriers

Cygnet was Gray Swan’s pride and joy — three models that seemed invincible. None of us, not even the veterans, could break them during the original 24-hour window, prompting Gray Swan to extend the competition to an entire month in hopes that someone would find a crack in their defenses. As I discovered, these models weren’t just safe; they were an exercise in humility for everyone involved. Gray Swan had truly built something remarkable. Even though I eventually found prompts that reliably worked on almost every other model, the Cygnet trio remained resolutely untouchable.

By the end of the extended competition, I ranked 31st, with 1,900 points. That may seem low, but considering that the top competitors scored just 2,300, I wasn’t far behind, despite all my connectivity setbacks and “beta glitches.” My score reflected the persistence and grit it took to compete alongside some of the best red teamers in the world.

Reflecting on the Community and Takeaways

Looking back, the best part of the event wasn’t the jailbreaks or the points — it was the community. We shared tips, laughs, and troubleshooting help. There’s something surreal about bonding with fellow hackers over Discord, supporting each other in the same breath that we’re racing to win. Even in a competition centered on “adversarial prompts,” the collaborative spirit was heartening.

The Gray Swan team impressed me with their professionalism, keeping everything running amid heavy traffic and technical challenges. It gave me a fresh perspective on just how hard it must be to host this kind of event — not only to track the models and responses but to maintain fairness and support across hundreds of contestants. It felt like a beta test for the future of AI safety competitions, and I’m proud to have been part of it.

Conclusion: What’s Next?

Gray Swan’s jailbreaking championship was a thrilling, challenging experience that put my skills to the test in ways I hadn’t imagined. Between the camaraderie, the laughs, and a solid finish on the leaderboard, it was an event I’ll always look back on fondly. I’m eager to see how Gray Swan grows their competition series and improves their Cygnet models, because if anyone manages to break those next time, I’ll be there, power outages and all, ready for round two.

Here’s to Gray Swan, my fellow competitors, and the incredible team spirit that makes events like these not just possible, but genuinely fun.