Does the phrase “Mobile Audio” make you shiver with dread? When your target platform is a small, relatively underpowered device with a finite power source, no hardware acceleration, and not a lot of memory, face-melting HiDef audio doesn’t always feel within reach. Instead, it’s a battle to carry out our audio craft faithfully under the humblest of standards as departments ration already-tiny resource budgets. It shouldn’t come as a surprise, then, that the quality of sound design and music for mobile games tends to suffer. When it’s a struggle to play only a handful of sound effects and a single stream of music, it can feel like we’re edging away from compromise and toward giving up. When you download a Top 10 game and hear objectively poor sound coming out of the speakers, what message does that send about the necessity of high-quality sound design? If the competition can get away with a single sound effect for the entire UI and four bars of music, why should we shoot for anything more? These are difficult questions to answer at a time when the casual market is exploding, but after several years of working with “core audience” titles, I can tell you that mobile audio, though challenging, is an opportunity to learn, grow, and remember why we love sound.
While you’re reading, take a look at the audio demo reel of Dungeon Keeper here: http://youtu.be/Hfl6V1rYGS0
First Dig
Before we got too far into the prototyping process for Dungeon Keeper, my team and I decided to identify what we were shooting for. We agreed that we wouldn’t treat our audio as being destined for a mobile device; we were simply going to make the absolute best audio we could, and figure out how to cram it into a phone along the way. Our previous game, Ultima Forever, had been a case study in this endeavor. It was a PC MMO that later became a tablet game, and we managed to jam 4GB of audio and a two-hour soundtrack into a 200MB footprint, while still making the audio run smoothly on an iPad 2. We had learned a lot and gained confidence in that process, but Dungeon Keeper would pose an even greater challenge, with significantly smaller resource budgets despite our ambitious audio plans. If we were going to succeed, we needed to carve our goals in stone and stick to them no matter what. If we weren’t working on “mobile audio”, then we needed to aspire to “next-gen audio”.
So, what exactly does next-gen audio mean? Here’s what we determined:
- 1. Next-gen audio is detailed and dynamic. It’s full of organic layers that change over time and complement the game scene, with sound effects that inform us of the details of our surroundings.
- 2. Next-gen audio is also full-spectrum and crisp. It moves elegantly from quiet, calm moments to ferociously loud ones. It covers the range from crystalline highs to booming lows, and it avoids the dull, lo-fi sound that audio compression frequently creates. Finally, it guides us along an emotional spectrum, taking us from moments of introspection and curiosity all the way to bombastic, gripping violence.
- 3. Next-gen audio is impactful but intelligible. This is important with regard to spectrum, because without careful attention to dynamic mixing, we can end up with a mess of noise that doesn’t give us the opportunity to communicate clear messages to the player. Great audio can speak loudly, but only when the rest of the mix is capable of carving out space to accommodate it.
- 4. Finally, next-gen audio is supported with strong tools. In our case, we chose Audiokinetic Wwise, a combination audio engine and authoring tool. It’s a fast, powerful, and informative set of tools, but most of all, it’s flexible, something we knew would be essential for the development of an ambitious mobile title.
All In the Details
With the groundwork established, we began our work. Having previously played many hours of competitive city-building titles, I wanted to focus on making the dungeon in Dungeon Keeper seem as organic and lifelike as possible. In other titles, I would work so hard on my creations but never feel really connected with my world due to the lack of ambient detail. For Dungeon Keeper, we wanted everything to feel alive, as if you were truly “The Hand of Evil”, hearing detailed sounds of digging out tunnels, the shouts and grunts of summoned minions, and the satisfying noise of ripping structures out of the ground to drop them somewhere else. The first step was to give every building (or “room,” as we called them) a unique ambience. This included an interesting stereo loop, as well as a variety of one-shot accents played out at random, giving the room ambiences the feel of being longer and more complex than they actually were. These sounds would be persistent and carefully attenuated so that at a distance, they wouldn’t cost CPU or memory until you zoomed in to examine your creations. Even then, the audio ranges were shaped as upward-facing cones to more tightly control the voice count. Further, if you chose to zoom into a single room using the info button, the ambience would shift its HDR window and flip from a 3D voice to a 2D voice, allowing the sound to fill the entire soundstage while gracefully attenuating the rest of the music and ambience.
To further complement the feeling of life and presence, we turned our attention to the minions. We first listed all their various body sizes and methods of locomotion and determined the smallest number of soundsets we could get away with, while using resource-cheap voice effects to increase variety. For example, we made light barefoot sounds for the Imps, heavy metallic clunks for the Necromancers, and leathery flapping for the Dragon Whelps. We then applied the sounds to animation notifies and carefully controlled their rolloffs and instance limits to again provide fun amounts of detail when examined up close, but not peg the CPU when further away. For the minion voices, we used a combination of sound effects and voice actors, depending on how exotic the creature was. Importantly, we used professional voice talent for the units’ various grunts, shouts, and barks even though they didn’t speak any actual lines. We worked with actors that we had experience with, people that we knew could jump immediately into character and bring presence to their voices in even the shortest of samples (a special shout-out to Steve Blum, Michael Donovan, David Lodge, and April Stewart, our immensely talented creature actors, and Richard Ridings, the illustrious voice of Horny). We included not only the common voice samples like “spawn” and “death”, but also a variety of “pain” sounds that reflected what trap they were snagged by during combat, adding even more contextual reactivity and humor to their animations. I still love hearing the Trolls convulse uncontrollably when struck by a bug zapper or a Warlock screaming comically when he catches on fire.
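To make the upward-facing cone idea concrete, here is a minimal sketch of how such an attenuation shape could be modeled. It is not the Wwise attenuation editor or our shipped settings; the function name `cone_gain`, the 30-unit range, the 35-degree half-angle, and the linear rolloff are all assumptions chosen for illustration. The point is that a camera off to the side of a room, or too far above it, gets a gain of zero, so that room's ambience never needs a playing voice.

```python
import math

def cone_gain(emitter, listener, max_distance=30.0, half_angle_deg=35.0):
    """Return a 0..1 gain for a listener relative to an upward-facing cone.

    Positions are (x, y, z) tuples with y as "up". All default values are
    illustrative assumptions, not settings from the game.
    """
    dx = listener[0] - emitter[0]
    dy = listener[1] - emitter[1]
    dz = listener[2] - emitter[2]
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)

    # Listeners level with or below the emitter, or out of range, hear nothing.
    if dy <= 0.0 or distance >= max_distance:
        return 0.0

    # Angle between the cone axis (straight up) and the direction to the listener.
    angle = math.degrees(math.acos(min(1.0, dy / distance)))
    if angle > half_angle_deg:
        return 0.0

    # Simple linear rolloff inside the cone.
    return 1.0 - distance / max_distance
```

A gain of zero is the cue to keep the voice virtual (or not start it at all), which is where the CPU and memory savings come from.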
Music Fit for a Dungeon
Next, we focused on the music. Mobile game or not, “Interactive Soundtrack” was going to be our minimum specification, so we needed to find a way to make it both blend seamlessly with the action and speak up when something interesting was happening. This again made us reflect on what we learned from playing competing games and on what we could do to lessen players’ frustrations. During the core mechanic of combat, games of this genre tend to be a mess of simultaneous actions that leave the player scrambling to pull the screen in every direction trying to figure out what’s happening. To help with this, we looked at what messages the music and UI sounds needed to emphasize so that the audio would extend the sensory experience beyond the edges of the screen and empower players with battlefield information (telemetry, of sorts). Here’s what we came up with:
- 1. Combat has begun (after transitioning from a passive “scouting” state).
- 2. A room has been destroyed.
- 3. A spell is being cast.
- 4. A high-powered “immortal” unit has spawned or been knocked out.
- 5. The Dungeon Heart (the critical structure) is being attacked.
- 6. The player is on the brink of winning or dominating with seconds left on the clock.
- 7. The scenario was won or lost.
Through this sequence, we created an emotional cadence, where segments and stingers were strung together in a musical sentence that accurately informed the player about what they should feel and notice. The exciting, segmented messages move from Exposition (the Scouting sequence), to Rising Action (the Attack sequence), to Climactic Plateau or Peak (the Heart-Attacked and/or Brink segments), to Resolution (the Win/Loss cadences), with informative critical events sprinkled throughout (Destruction and Spell stingers). This all plays out through carefully timed segment transitions and tempo-sync’d stingers, giving the music a smooth, continuous feel. This same care also went into designing and writing the Build Mode music, filling players with joy and excitement as they construct, upgrade, and maintain their dungeons. Like the combat music, there are distinct musical cues for each action that gradually build up the excitement and reward players for working in their lairs.
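The combat music flow described above can be sketched as a small state machine plus a beat-quantizing helper. This is only an illustration of the idea: in the game, the transitions and stinger sync were authored in Wwise, and every name here (the state and event strings, `handle_event`, `next_grid_time`) is a hypothetical stand-in rather than anything from our project.

```python
import math

# Hypothetical music states and the game events that move between them.
TRANSITIONS = {
    "scouting": {"combat_started": "attack"},
    "attack": {"heart_attacked": "heart_attacked", "brink": "brink",
               "won": "win", "lost": "loss"},
    "heart_attacked": {"brink": "brink", "won": "win", "lost": "loss"},
    "brink": {"won": "win", "lost": "loss"},
}

# Events that play a stinger over the current segment instead of changing it.
STINGER_EVENTS = {"room_destroyed", "spell_cast",
                  "immortal_spawned", "immortal_knocked_out"}

def handle_event(state, event):
    """Return (next_state, stinger); stinger is None when the event has none."""
    if event in STINGER_EVENTS:
        return state, event
    return TRANSITIONS.get(state, {}).get(event, state), None

def next_grid_time(now_s, segment_start_s, bpm, beats_per_grid=1):
    """Return the next beat-grid boundary at or after now_s, in seconds."""
    grid_s = 60.0 / bpm * beats_per_grid
    grids_elapsed = math.ceil((now_s - segment_start_s) / grid_s)
    return segment_start_s + grids_elapsed * grid_s
```

For example, a spell cast 0.3 seconds into a 120 BPM segment would leave the state unchanged and schedule its stinger for the next beat, 0.2 seconds later, which is what keeps the result sounding like one continuous piece of music.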
Technical Considerations
Lastly, on the technical side, we focused on how to make this massive amount of sound work on phones and tablets. The raw size of the audio was a half-gig (and that’s 16-bit sounds, not 24!), which included a complex music system and 700 lines of dialog. We used every compression format and optimization Wwise makes available and managed to compress the audio package down to a mere 30MB. Obviously we made liberal use of the Vorbis codec for dialog and music, but we also used it for much of the UI and sound effects. This may sound surprising, since audio designers are frequently cautioned against using too many compressed voices, as they gulp CPU cycles greedily. However, we found that Vorbis could be our first choice and not our last resort if we grouped compressed voices into their own submix, whereupon we could programmatically throttle the number of simultaneous Vorbis voices without worrying about where the CPU ceiling was. Making liberal use of file streaming was another key choice, allowing us to use only 3.2MB of audio memory at runtime (we’re working with solid-state flash memory, which can of course take a lot more streaming punishment than a traditional hard disk without worrying about thrashing). Further, the HDR capabilities of Wwise were a critical component in making the whole thing work. By breaking up groups of sounds into HDR “bands”, we could paradoxically make the CPU usage drop when more voices tried to compete for audibility. Through this, we knew our CPU percentage would cap at under 12%, because even though as many as 80 voices would be playing virtually, only 10 to 15 would ever physically make it through to the speakers. This resulted in very predictable CPU loads while also providing an extremely tight mix that could speak clearly during the busiest of moments.
We extended the usability of this scheme even further by implementing a Level-of-Detail system that controlled the complexity of the audio scene with a series of RTPCs to do things like change voice limits, turn DSPs on or off, pause specific submixes to free up channels, and then adjust the overall mix to compensate for the difference. This was primarily driven by a hardware poll at app startup to trigger the appropriate “platform” event, but Wwise made it easy enough that we could even change the LoD parameters during low-memory situations, dynamically adjusting the audio if the device performance was suffering from external circumstances.
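To show why a louder scene can cost less CPU, here is a deliberately simplified model of HDR-style culling combined with a hard voice cap. It is not how Wwise implements HDR internally; the function `select_physical_voices`, the 24 dB window, and the cap of 15 are assumptions picked to mirror the numbers above. The loudest voice sets the top of the window, anything that falls below the bottom of the window stays virtual, and the cap bounds whatever is left.

```python
def select_physical_voices(voices, window_db=24.0, max_physical=15):
    """Pick which voices are actually mixed, given (name, level_db) pairs.

    A simplified model: the loudest voice defines the window top, voices
    below (top - window_db) stay virtual, and the rest are capped by count.
    """
    if not voices:
        return []
    top_db = max(level for _, level in voices)
    floor_db = top_db - window_db
    audible = [(name, level) for name, level in voices if level >= floor_db]
    audible.sort(key=lambda voice: voice[1], reverse=True)
    return audible[:max_physical]

# A quiet scene: both sounds fit inside the window and both play.
quiet = [("footstep", -30.0), ("ui_click", -10.0)]

# Add a loud explosion: the window slides up and the footstep drops out.
busy = quiet + [("explosion", 0.0)]
```

In the quiet scene both voices are mixed; once the explosion arrives, the footstep falls under the window floor and costs nothing, so the busiest moments are bounded rather than additive.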
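A Level-of-Detail scheme like this could be sketched roughly as follows. The tier names, thresholds, and parameter values are all hypothetical placeholders rather than our shipped numbers, and in the real project each field would be pushed to the audio engine as an RTPC or event instead of living in a Python table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AudioLod:
    max_voices: int               # voice limit for this tier
    reverb_enabled: bool          # example of a DSP that can be switched off
    ambience_submix_paused: bool  # example of a submix paused to free channels
    mix_trim_db: float            # overall mix compensation

# Hypothetical tiers, ordered from cheapest to richest.
TIER_ORDER = ["low", "medium", "high"]
LOD_TIERS = {
    "low": AudioLod(24, False, True, 2.5),
    "medium": AudioLod(48, True, False, 1.0),
    "high": AudioLod(64, True, False, 0.0),
}

def pick_tier(ram_mb, cpu_cores):
    """Choose a starting tier from a hardware poll (thresholds are assumptions)."""
    if ram_mb >= 1024 and cpu_cores >= 2:
        return "high"
    if ram_mb >= 512:
        return "medium"
    return "low"

def step_down(tier):
    """Drop one tier, for example when the OS reports a low-memory warning."""
    index = TIER_ORDER.index(tier)
    return TIER_ORDER[max(0, index - 1)]
```

At startup the game would call something like `pick_tier` once, then call `step_down` whenever the device signals trouble, applying the new tier's values to the running mix.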
The last thing we addressed on the technical side was a significant “first” for mobile games, and that was to add an output optimizer in the audio options panel. We provided three buttons to allow the user to select the best match to their output device:
- 1. System speakers. This mix had the narrowest HDR window and the most compression, along with an EQ that skewed heavily to the top end. Naturally, if a device can’t reproduce low frequencies, we shouldn’t use up precious digital headroom by leaving them in the mix. This allowed the game to be unusually loud and clear on even the most modest of device speakers.
- 2. Small earphones. This was our target mix since we expected most players to be using the earphones that came with their device (the classic Apple earphones, in particular). The HDR window had a reasonable width and a moderate amount of compression to allow a wider dynamic range, but still a bit of restriction to keep the audio at a consistent level since small earphones tend to have poor isolation against competing noise.
- 3. Large circumaural headphones. This was our most impressive mix, with a much wider HDR window and less-aggressive compression, and an EQ that gave a slight boost to the top and bottom ends. The result was a mix that had an enormous dynamic range and a sonic quality that was positively thrilling. This made it perfectly suited for reproduction on the more isolated output of circumaural headphones (or a HiFi surround system, for that matter).
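The three output options above boil down to a small table of mix presets. The sketch below shows one way to represent them; the field names and every number are illustrative assumptions that only preserve the relationships described in the list (narrowest window and heaviest compression for speakers, widest window and lightest compression for large headphones), not the values we shipped.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OutputProfile:
    hdr_window_db: float     # width of the HDR window
    compressor_ratio: float  # higher means more compression
    low_shelf_db: float      # EQ gain applied to the bottom end
    high_shelf_db: float     # EQ gain applied to the top end

# Placeholder values; only their ordering reflects the description above.
PROFILES = {
    "system_speakers": OutputProfile(12.0, 6.0, -12.0, 4.0),
    "small_earphones": OutputProfile(24.0, 3.0, 0.0, 0.0),
    "large_headphones": OutputProfile(40.0, 1.5, 2.0, 2.0),
}

def select_profile(choice):
    """Return the preset for a menu choice, defaulting to the target mix."""
    return PROFILES.get(choice, PROFILES["small_earphones"])
```

Falling back to the small-earphones preset matches the fact that it was our target mix, so a missing or unknown setting still lands on the version most players were expected to hear.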
The final result of these technical features was the accomplishment of something we felt very strongly about, which was the commitment to making Dungeon Keeper sound spectacular no matter what the capabilities of the player’s device were. New, old, big, small, it didn’t matter. We wanted everyone to take part in audio delight, and we feel we met that challenge thoroughly.
Closing Up Shop
In the end, the accomplishments of Dungeon Keeper are bittersweet: a technical marvel and an incredibly enjoyable game, hampered by an economy that was justifiably called out as overly aggressive, in a story that concluded with Mythic being shut down. We have to ask ourselves again, as we did in the beginning, was it all worth it? If best-in-class sound design isn’t required for a top-grossing game, and if it won’t save an otherwise struggling game, what good is it? Do we stop trying? Do we give up? My answer is still emphatically, “No.” Making audio for mobile games is hard. Damn hard. Even with traditional game hardware, audio has to fight for resource budgets and flounder at the butt-end of the development cycle. So when you’re making a mobile game and take away memory and hardware support, and try to do it all in less than a year, your mission as audio designers, engineers, and musicians gets exceedingly tough. There’s something to remember though… The systems we played on as kids and teens didn’t have the kind of capabilities that consoles and PCs have now, but we were still able to have emotional experiences with them that stuck with us, even today. When we’re clear and purposeful about the art we create, we can use audio to drive emotional experiences without feeling limited by the capabilities of the platform. No other discipline can reach players faster or more effectively, which shines a bright light on our true opportunity. Art is about audience, and when our mobile audience is tens of millions instead of hundreds of thousands, we have an incredible chance to show the casual gamers of today what real emotional experiences can be and why they should keep playing our games.
So keep trying. Do more. Shoot higher. Don’t settle for what’s good enough. Select the right tools, keep your passion high, and remember who you’re doing it all for. It’s the art, and the love of the craft, and mobile gaming is the clearest reminder of both the stakes and rewards of what’s ahead of us.
Onward, Keepers.