How to Beat the Five Trickiest Aspects of Recording the Naked Voice

Recording a naked speaking voice is tough. Here are five problems you'll encounter and need to solve to meet ever-increasing marketplace standards.
Recording voices is always a challenge, even when working with talents who know what they’re doing, like Stan Lee, seen here recording an audiobook for Learning Ally in 2012.

If most of your work is in music production, you probably think that recording vocals is easy. For the most part, it is: there’s (typically) only a single mic in play, there are no phase or polarity issues like those with drumkits, punch-ins aren’t terribly difficult, and there’s a lot of leeway in the gear you can use, as well as in the final sound you can achieve.

Rob Tavaglione

Recording voice is a lot tougher, and by voice, I mean speaking, not singing. Speaking (whether instruction, narration, recitation, dialogue, announcing, etc.) carries a whole different set of stricter demands and requirements—intelligibility is paramount, consistent clarity a necessity and tone has to be familiar and natural, with an absence of the artsy distortions and effects we so often rely on with musical arrangements.

Music beds, nat sound and ambiences are all frequently used to not only enhance content, but to also mask the inherent ugliness that is a part of capturing voice. You don’t realize just how much unwanted sound is part of voice capture until you work on an audiobook, or similar venture that is not aided by backing tracks or masking. Naked and exposed, the human voice creates a number of compelling problems for microphones and, subsequently, audio.

Here are the five problems I routinely encounter and solve to meet ever-increasing marketplace standards:

Plosives - Unless somebody has their mouth right in your ear, plosives aren’t typically a problem. That little burst of air that accompanies Ps, Bs and sometimes Ws defines the consonant and is a great sound to convey excitement (“pow” or “bam”). But get that burst of air hitting a mic’s diaphragm (especially with a small diaphragm condenser or ribbon) and you’ve suddenly got beatboxing, subwoofers jumping at the concert, or converters clipping in the studio and amplifiers out of headroom in either locale; not pretty stuff.

Accompanying music or music beds allow preventive high-pass filtering, but solo voice and the prevalence of big-deep-loud VOG (voice of God) styles necessitate only very low-frequency (or low-slope) high passing. The use of omni patterns helps, as do windscreens and pop filters, but thumpers still sneak through (especially with anything less than expert-level voice actors/artists), so sometimes signal processing is still needed to wipe out occasional offenders. For instance, a simple (logarithmic) fade-in can reduce the blast to appropriate levels, while sometimes a quickly automated HPF does the trick. Oftentimes a well-tuned multi-band compressor can get it done, but on occasion, it requires power tools like iZotope’s De-Plosive module in RX Advanced (I’m using version 5) that intelligently tames plosives.
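To make the fade-in idea concrete, here is a minimal sketch in plain Python. It assumes the audio is a list of float samples and that you have already found where the plosive starts. It also takes one reading of "logarithmic": the gain rises in a straight line in decibels, and DAWs may shape their fades differently. The function name `fade_in_db` and the default `floor_db` are made up for illustration, not taken from any particular tool.

```python
def fade_in_db(samples, start, length, floor_db=-24.0):
    """Return a copy of samples with a fade-in over [start, start + length).

    The gain climbs linearly in decibels from floor_db at the start of the
    fade toward 0 dB (unity) at its end. floor_db is an assumed value.
    """
    out = list(samples)
    end = min(start + length, len(out))
    span = end - start
    for i in range(start, end):
        # pos is 0.0 on the first faded sample and approaches 1.0 at the end
        pos = (i - start) / span
        gain_db = floor_db * (1.0 - pos)
        out[i] *= 10.0 ** (gain_db / 20.0)
    return out
```

Place the fade so it covers only the thump at the front of the consonant. A few milliseconds is a plausible starting point, but the right length depends on the voice and the mic, so trust your ears over any number here.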

Sibilance - Switching sides of the frequency spectrum, sibilance is a comparable issue. You gotta have it to define Ss, Cs (and Ks and Xs too), but present a microphone (again, especially a condenser) with too much sibilance and it sounds like your monitors are hurling mini-daggers at your ears with each sibilant event. Talk about a way to subconsciously repel a listener! Dynamic mics help a lot, and slightly swiveling the mic off-axis really helps; windscreens sometimes completely solve the problem (but typically not), and static EQ’ing isn’t really a solution. Thankfully, de-essers work quite well, whether analog or digital; de-esser plug-ins work even better; multi-band compressors and dynamic EQs can be easily steered into a de-essing direction; and you can even make your own de-esser with a compressor, its sidechain and an EQ.
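The build-your-own de-esser can be sketched in code as well. The version below is a rough illustration, not a model of any commercial unit: a one-pole high-pass filter plays the part of the EQ, its output feeds the sidechain, and a simple compressor turns the whole signal down whenever that sidechain gets loud. It is a broadband design (everything is ducked, not just the highs), and the cutoff, threshold, ratio and release values are all assumptions to be tuned by ear.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate):
    """One-pole high-pass filter: the 'EQ' that feeds the sidechain."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def de_ess(samples, sample_rate, cutoff_hz=5000.0, threshold_db=-30.0,
           ratio=4.0, release_ms=20.0):
    """Broadband de-esser: compress the full signal, keyed by its highs.

    All default values are illustrative assumptions.
    """
    sidechain = high_pass(samples, cutoff_hz, sample_rate)
    release = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    env = 0.0
    out = []
    for x, s in zip(samples, sidechain):
        # Envelope follower: instant attack, exponential release
        env = max(abs(s), env * release)
        env_db = 20.0 * math.log10(env) if env > 1e-9 else -180.0
        over_db = env_db - threshold_db
        # Above the threshold, reduce gain according to the ratio
        gain_db = -over_db * (1.0 - 1.0 / ratio) if over_db > 0.0 else 0.0
        out.append(x * 10.0 ** (gain_db / 20.0))
    return out
```

Low-frequency content barely registers in the sidechain, so vowels pass through untouched while a hot "S" pulls the gain down. A split-band design, which reduces only the high band, would be the natural next refinement.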

Breaths - You’ve gotta have breathing, or else voice would be…er, silent. But seriously, some audible breathing is just fine and even a way to engage a listener with intimacy…but too much breathing (or its ugly cousin, wheezing) and words like “windbag,” “pervert” and “desperate” start popping up. A lot of the responsibility for this goes to the talent themselves, who have to be in reasonable “lung shape,” know how to work a mic and show a modicum of control, but still we must step up.

Editing is pretty easy if you only seek to remove that pre-line sucking sound that’s before a sentence, but if you’re talking mid-sentence, things get tougher. A little volume automation might solve the problem, and removal and replacement with room tone might save you, but sometimes smart processing is needed. I’m hearing very good things about Waves’ De-Breath plug-in and the new Breath Control module within iZotope’s RX Advanced 6.
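The volume-automation approach amounts to turning the level down across each breath, with short ramps so the change itself is not audible. Here is a minimal sketch, assuming the breaths have already been marked by hand as (start, end) sample ranges; the function name `duck_regions`, the amount of reduction and the ramp length are all hypothetical choices.

```python
def duck_regions(samples, regions, reduction_db=-12.0, ramp=64):
    """Lower the level inside each (start, end) region of sample indices.

    The gain eases from near unity at the region edges to the full
    reduction over `ramp` samples, which avoids clicks at the boundaries.
    """
    target = 10.0 ** (reduction_db / 20.0)
    out = list(samples)
    for start, end in regions:
        start = max(0, start)
        end = min(len(out), end)
        for i in range(start, end):
            # Distance in samples from the nearest edge of the region
            edge = min(i - start, end - 1 - i)
            # 0..1 blend between unity gain and the target gain
            blend = min(1.0, (edge + 1) / (ramp + 1))
            out[i] *= 1.0 + (target - 1.0) * blend
    return out
```

Reducing a breath rather than deleting it keeps the timing and a hint of the performer's presence, which is often the more natural-sounding result in the middle of a sentence.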

Clicks and pops - Sometimes this is a problem in real life. You’ve heard it: certain people have all kinds of mouth noises that distract from conversation. Clicking, (pardon me) spit bubbles popping, smacks from lips opening and closing, swallowing, gulping…it’s bad enough over lunch, but it’s living hell in an audiobook! Certain aspects of such noise are easily controllable with a little discipline from our actors/talent, but sometimes the clicking/popping is out of control, so out of control that I’ve even heard directors ask, “Is there a clock ticking on-set?” I’m short on prevention solutions here, but I suspect the culprit is mild dehydration. Talent has to stay hydrated before the session; drinks like tea and clear liquids, honey, lemon and so on only help a little. Hydrated tissue is needed, not just a wet mouth.

Beyond that, the solutions I’ve found only mitigate the problem: dynamic mics, windscreens, careful mic placement and micro-editing. Between words, some clicks are removable without a noticeable loss of timeline (as would be seen if removing long breaths in this manner), but these little pests are often mid-word. If so, I use a combination of two RX Advanced modules, De-Click and De-Crackle. They sometimes require multiple passes, and occasionally small traces remain even after considerable effort.

Fricatives - Sometimes called spirants, these friction-based sounds (like the F and V in English) are usually not a problem, but sometimes can be distracting. If the talent has pronunciation trouble with fricatives, they can sometimes sound swooshy, grindy, windy or kind of like a low-frequency sibilant. Re-reads are the best solution here, but sometimes volume automation or upper-mid-frequency multi-band compression can naturally reduce the prominence.

Tip of the Iceberg

There are a lot of other things to be concerned with, too, if you need purity of solo voice recording: a complete lack of distortion, little or no room ambiance, style-appropriate proximity effect, a low noise floor, consistent levels and tasteful EQ (a single mid dip or boost can save the day), to name a few. But let’s save that for another day; I’ve got some vocals to track. Sung ones, thank goodness!

A Final Thought

I consider Amazon’s audiobook standards to be “the bar to clear” right now for voice production. Here’s a link if you’re curious as to their noise floor, average level, file type, naming conventions and “quality” requirements …
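Noise floor and average level are both measurable, so it can help to check a file before submitting it. The sketch below is one simple way to do that in plain Python, assuming the audio has already been decoded to float samples between -1.0 and 1.0. The function names and the window size are assumptions, and treating the quietest window as the noise floor is a rough estimate that only works if the recording contains a stretch of room tone. Compare the results against the figures in the published requirements; none are hard-coded here.

```python
import math


def to_db(value):
    """Convert a linear amplitude to decibels relative to full scale."""
    return 20.0 * math.log10(value) if value > 0.0 else float("-inf")


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def measure(samples, window=1024):
    """Return peak, average (RMS) and estimated noise floor, all in dB."""
    if not samples:
        raise ValueError("no samples to measure")
    # Split the audio into consecutive, non-overlapping full windows
    windows = [samples[i:i + window]
               for i in range(0, len(samples) - window + 1, window)]
    average_db = to_db(rms(samples))
    # Rough noise floor: the RMS level of the quietest window
    floor_db = min(to_db(rms(w)) for w in windows) if windows else average_db
    return {
        "peak_db": to_db(max(abs(x) for x in samples)),
        "rms_db": average_db,
        "noise_floor_db": floor_db,
    }
```

Running `measure` on a finished chapter gives three numbers to hold up against the spec, which is quicker than discovering a rejected upload after the fact.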