Theory of Operation

A technical deep-dive into the audio processing logic of WaxOn/WaxOff: the signal chain, the engineering rationale, and the decisions behind each stage.

This document assumes familiarity with basic audio concepts (dBFS, sample rate, dynamic range). It's written for audio engineers, curious users, and anyone who wants to understand why the app does what it does.

Signal Chains at a Glance

WaxOn

  RNNoise (ML noise reduction; stereo: per-channel split)
  → High-Pass Filter
  → Channel Select (mono only)
  → Phase Rotation (200 Hz allpass)
  → Resample
  → Loudnorm (two-pass EBU R128)
  → 2× Oversample → Limit → Resample

Some stages are optional and can be toggled off. For stereo output with NR enabled, channels are split and denoised independently before rejoining. Output: 24-bit WAV.

WaxOff

  Phase Rotation (150 Hz allpass)
  → Loudnorm (two-pass EBU R128)
  → 2× Oversample → Limit → Encode MP3

Phase rotation and MP3 encode are optional. Output: 24-bit WAV and/or MP3.

WaxOn Mix

Additional stages run before the standard WaxOn chain. Per-file leveling only applies when Loudness Norm is enabled.

  Per-file Loudnorm (two-pass, per input)
  → amix (normalize=0; normalize=1 when per-file leveling is off)
  → WaxOn chain above

EBU R128 Loudness

EBU R128 (and the underlying ITU-R BS.1770-4 algorithm) is the measurement standard used by virtually all broadcast and streaming platforms. Spotify, Apple Podcasts, YouTube, and broadcast television worldwide all normalize to a loudness target derived from this standard. Understanding it explains most of what WaxOff does.

Why This Standard Exists: The Loudness War

The adoption of perceptual loudness metering was not driven by the music industry. It was driven by television viewer complaints about excessively loud commercials. Advertisers discovered they could aggressively compress and brick-wall limit their spots while technically staying within legacy peak-level limits, creating 4–8 dB disparities between programming and advertisements. The FCC Consumer Call Center reported "loud commercials" as a sustained top consumer complaint starting in 2002.

That pressure produced legislation. The CALM Act (Commercial Advertisement Loudness Mitigation Act) was signed into law in December 2010, requiring US broadcasters to keep commercials at the loudness level of surrounding programming. The FCC began enforcement in December 2012. The European Broadcasting Union had published EBU R128 four months earlier, in August 2010, addressing the same problem for European broadcast. Both standards were built on ITU-R BS.1770, published in 2006 as a psychoacoustically weighted metering algorithm specifically designed to correlate with perceived loudness.

Streaming platforms adopted loudness normalization as a natural extension. If a platform normalizes playback, an arms race to master louder than competitors produces only a quieter relative result, not a louder one. Mastering engineer Bob Katz declared at the AES convention in 2013 that the loudness wars were over, citing the emergence of loudness normalization across streaming. Spotify formalized its −14 LUFS target in 2021; Apple Podcasts specifies −16 LKFS; YouTube introduced normalization in 2015–2016.

The practical consequence for podcast producers: since every major distribution platform normalizes loudness at playback, delivering an aggressively loud, heavily limited master provides no benefit to listeners and costs you dynamic range. The correct goal is accurate level, a clean true peak ceiling, and preserved dynamics.

Integrated Loudness (LUFS)

Integrated loudness (denoted I) is the time-averaged loudness of a complete program, measured in LUFS (Loudness Units relative to Full Scale). LUFS is numerically identical to LKFS; both refer to the same algorithm.

Unlike peak metering or RMS, integrated loudness is:

  1. Frequency-weighted to match human hearing (the K-weighting filters);
  2. Gated to exclude silence and quiet passages;
  3. Averaged over the entire program duration.

K-Weighting

K-weighting is a two-stage filter chain applied to each channel before energy summation. It was designed to approximate the frequency-dependent sensitivity of human hearing, particularly the acoustic effect of the head on sound arriving at the ears.

  1. Pre-filter (head-related high-shelf): A second-order shelf boost with a design frequency of approximately 1682 Hz and a gain of approximately +4 dB. This models the acoustic effect of the human head, which increases high-frequency energy at the ear canals relative to a free-field measurement. The boost rises gradually above the shelf frequency and reaches its full value by roughly 5 kHz, where it remains constant through the upper spectrum. The effect is that sibilance, consonant detail, and broadband hiss in the 2–10 kHz range are weighted more heavily in the loudness measurement, matching the ear's increased sensitivity in this region.
  2. RLB weighting (high-pass): A second-order high-pass filter with a design frequency of approximately 38 Hz. This reduces the contribution of sub-bass energy to the loudness measurement. Sub-bass content below about 50 Hz contributes little to perceived loudness under normal listening conditions (particularly on the earbuds and laptop speakers that dominate podcast consumption), and leaving it in the measurement would skew the result for files with DC offset, rumble, or proximity-effect bass buildup.

The ITU specification defines the filter coefficients for a reference sample rate of 48 kHz. WaxOn/WaxOff's analyzer computes the coefficients from first principles using the bilinear transform, so the filters are accurate at any sample rate the source file uses (44.1 kHz, 48 kHz, 96 kHz, etc.). The pre-filter uses the shelf design parameters f₀ = 1681.97 Hz, gain = 3.9998 dB; the high-pass uses f₀ = 38.14 Hz, Q = 0.5003. These values are taken from the ITU reference implementation and match the pyloudnorm reference used widely in audio research.
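The filters can be re-derived at any sample rate with standard biquad design. A minimal sketch of the high-pass half, using the RBJ Audio EQ Cookbook form rather than the app's actual derivation (function and variable names are illustrative):

```python
import numpy as np
from scipy.signal import freqz

def rlb_highpass(fs, f0=38.14, q=0.5003):
    """Second-order high-pass (RBJ cookbook biquad) at sample rate fs."""
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    cw = np.cos(w0)
    b = np.array([(1 + cw) / 2, -(1 + cw), (1 + cw) / 2])
    a = np.array([1 + alpha, -2 * cw, 1 - alpha])
    return b / a[0], a / a[0]

# Because the design prewarps f0, the gain at f0 is 20*log10(Q) ≈ -6 dB
# at every sample rate -- the property that makes the filter rate-agnostic.
for fs in (44100, 48000, 96000):
    b, a = rlb_highpass(fs)
    _, h = freqz(b, a, worN=[2 * np.pi * 38.14 / fs])
    gain_db = 20 * np.log10(abs(h[0]))
```

The −6 dB figure follows from |H(jω₀)| = Q for a second-order high-pass; the shelf pre-filter can be designed the same way from its f₀ and gain.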

Gating

Loudness is computed over overlapping 400 ms blocks. Adjacent blocks overlap by 75%, producing one new block every 100 ms. The 400 ms window was chosen because it corresponds closely to human short-term loudness perception. Psychoacoustic research by Zwicker and Fastl established that temporal integration of loudness occurs over approximately 200–400 ms, with 400 ms representing the time window over which the ear integrates energy to form a stable loudness impression. Shorter windows would capture transient fluctuations that don't correspond to perceived loudness; longer windows would smooth over meaningful changes in program level.

Two gating stages prevent silence and quiet passages from pulling the integrated value down:

  1. Absolute gate at −70 LUFS: Any block whose K-weighted energy falls below −70 LUFS is discarded. This removes silence, dead air, and extremely quiet room tone from the measurement. The threshold corresponds to a mean-square value of 10^((−70 + 0.691) / 10) ≈ 1.17 × 10⁻⁷.
  2. Relative gate at −10 LU: From the remaining blocks, compute an ungated mean (the "absolute-gated loudness"). Then discard any block more than 10 LU below that mean. This removes quiet passages that are above the noise floor but significantly below the average program level, such as soft breaths between sentences, quiet background music under narration, or distant room ambience during pauses.

The final integrated loudness is the mean of the blocks that survive both gates. For podcast speech, this means the measurement reflects the loudness of the spoken content, not the silence between sentences.

The Offset Constant: −0.691

The integrated loudness formula includes a constant offset of −0.691 dB:

Integrated Loudness (LUFS) = −0.691 + 10 · log₁₀(Σ Gᵢ · zᵢ)

where Gᵢ is the channel weight (1.0 for front channels, 1.41 for surround) and zᵢ is the gated mean-square of channel i after K-weighting. The −0.691 dB offset compensates for the gain of the K-weighting filters at 1 kHz (approximately +0.69 dB), calibrating the scale so that a full-scale 997 Hz sine reads −3.01 LUFS, exactly its unweighted mean-square level in dB. Without the offset, the same sine would read about 0.7 LU high. For mono and stereo speech content, the channel weights are all 1.0, so the formula simplifies to the mean-square of all channels after K-weighting and gating.

True Peak (TP)

True peak is the maximum reconstructed level when the digital signal is converted to analog. It differs from sample peak because the analog waveform between samples can exceed any individual sample value; these are inter-sample peaks. See the True Peak & Oversampling section for detail.

EBU R128 specifies a maximum true peak of −1.0 dBTP for most distribution. WaxOff defaults to this value.

Loudness Range (LRA)

LRA measures the spread between loud and quiet sections of a program (its macro-dynamics) in Loudness Units. It is computed as the difference between the 95th and 10th percentiles of the short-term loudness distribution (after gating). EBU R128 does not mandate a specific LRA target but recommends keeping it below 18 LU for broadcast.
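The percentile computation can be sketched over a series of short-term (3 s) loudness values (simplified: EBU Tech 3342 also applies a relative gate at −20 LU below the absolute-gated mean, omitted here):

```python
import numpy as np

def loudness_range(short_term_lufs):
    """LRA from a series of short-term loudness values (simplified gating)."""
    st = np.asarray(short_term_lufs, dtype=float)
    st = st[st > -70.0]  # absolute gate
    return np.percentile(st, 95) - np.percentile(st, 10)

# A program whose short-term loudness sweeps evenly from -40 to -20 LUFS:
# loudness_range(np.linspace(-40, -20, 201)) -> 17.0
```

Using percentiles rather than min/max is what makes LRA robust: a single shout or a single silent beat does not inflate the reported range.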

Both modes pass an LRA value to the loudnorm filter. WaxOn hardcodes LRA=20 (effectively unconstrained, no dynamic processing). WaxOff defaults to LRA=11, which allows the filter to apply gentle dynamic compression to constrain the macro-dynamic spread of a finished mix. Lower values compress more aggressively; higher values relax the constraint.

Two-Pass Normalization

EBU R128 integrated loudness requires the complete file to compute. It is a time-integrated measurement. You cannot know the correct gain adjustment until after you have read every sample. This makes single-pass normalization impossible for linear (non-dynamic) mode. Both WaxOn and WaxOff solve this with a two-pass approach.

Pass 1: Analysis

FFmpeg's loudnorm filter reads the entire file and prints a JSON block to stderr containing the measured integrated loudness (input_i), true peak (input_tp), loudness range (input_lra), gating threshold (input_thresh), and a target offset (target_offset).

The output is discarded (-f null); only the measurements matter. The app parses the JSON from stderr and stores the values.

Pass 2: Linear Normalization

The same filter runs again, this time with the measured values injected back in and linear=true set:

loudnorm=I={target}:TP={tp}:LRA={lra}
  :measured_I={inputI}:measured_TP={inputTP}
  :measured_LRA={inputLRA}:measured_thresh={inputThresh}
  :offset={targetOffset}:linear=true
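The glue between the passes is ordinary text processing. A minimal sketch (the JSON keys are loudnorm's own print_format=json field names; the sample values and function names are illustrative):

```python
import json
import re

def parse_loudnorm_stats(stderr_text):
    """Extract the JSON block loudnorm prints at the end of pass 1."""
    match = re.search(r"\{[^{}]*\}\s*$", stderr_text)
    return json.loads(match.group(0))

def pass2_filter(m, target_i=-16.0, target_tp=-1.0, target_lra=11.0):
    """Build the pass-2 loudnorm filter string from pass-1 measurements."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
        f":offset={m['target_offset']}:linear=true"
    )

# Example pass-1 stderr tail (values illustrative):
stderr = """
[Parsed_loudnorm_0 @ 0x55d]
{
    "input_i" : "-23.52",
    "input_tp" : "-4.12",
    "input_lra" : "9.30",
    "input_thresh" : "-34.10",
    "target_offset" : "0.25"
}
"""
stats = parse_loudnorm_stats(stderr)
# pass2_filter(stats) -> "loudnorm=I=-16.0:...:measured_I=-23.52:...:linear=true"
```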

Why linear=true Matters

The loudnorm filter has two modes: dynamic (the default), which applies a time-varying gain and behaves like a compressor, and linear, which applies one constant gain to the entire file.

Both WaxOn and WaxOff use linear mode for pass 2. For mastering and delivery, this is the only correct approach. The goal is level adjustment, not dynamics processing.

True Peak and Oversampled Limiting

The Inter-Sample Peak Problem

Digital audio stores the waveform as discrete samples: amplitude values at regular time intervals (44,100 or 48,000 per second). A sample peak meter reads the highest sample value, which is straightforward. But the analog waveform reconstructed by a DAC continuously interpolates between those samples, and the reconstructed waveform can peak significantly higher than any individual sample value.

These are inter-sample peaks (ISPs), and they become real, audible clipping whenever the waveform is actually reconstructed or resampled: in the DAC during playback, in a sample rate converter, or in a lossy encode/decode cycle.

A file with a sample peak of −1 dBFS can easily have a true peak above 0 dBFS, causing clipping that no sample-level meter would detect.

The Mathematics of Reconstruction

The Nyquist-Shannon sampling theorem guarantees that a band-limited signal sampled at twice its maximum frequency can be perfectly reconstructed. The reconstruction uses a sinc interpolation kernel:

x(t) = Σ x[n] · sinc((t − nT) / T)

where x[n] are the sample values, T is the sample period, and sinc(x) = sin(πx) / (πx). The key insight is that the sinc kernel oscillates. When adjacent samples have high energy and the right phase relationship, the interpolated waveform between them sums constructively and overshoots both sample values. This is not an artifact or an error; it is the mathematically correct reconstruction of the continuous signal. The samples were never the waveform; they are the minimum information needed to reconstruct it.

The worst case for inter-sample peaks occurs when consecutive samples approach full scale with alternating signs at frequencies near Nyquist (half the sample rate). At 44.1 kHz, high-frequency content near 22 kHz is especially prone. In practice, ISPs on real-world audio material are typically 0.5–3 dB above sample peak, though extreme cases can reach higher.
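The worst case is easy to reproduce numerically: sample a full-scale tone at a quarter of the sample rate with a 45° phase offset, and every sample lands at ±0.707 while the band-limited reconstruction still peaks at 1.0, a +3 dB inter-sample peak (a sketch using FFT resampling as the reconstructor):

```python
import numpy as np
from scipy.signal import resample

n = np.arange(4096)
# fs/4 tone with 45-degree phase: samples never land on the waveform's peaks
x = np.sin(np.pi * n / 2 + np.pi / 4)

sample_peak = np.max(np.abs(x))     # ~0.7071  (-3.01 dBFS sample peak)
x_8x = resample(x, 8 * len(x))      # band-limited 8x reconstruction
true_peak = np.max(np.abs(x_8x))    # ~1.0     (0 dBFS true peak)
```

A sample-peak meter reports this signal at −3 dBFS; the reconstructed waveform actually touches full scale.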

The Streaming Ingest Trap

When you upload audio to Spotify, the platform transcodes your file to Ogg/Vorbis or AAC for streaming delivery. If your uploaded file has true peaks near 0 dBFS, the transcode itself can clip: the decoded streaming copy is distorted before a listener ever plays it, and no subsequent gain adjustment will fix it. Spotify's own artist documentation warns: "Really loud modern masters can easily register True Peak levels of +1 or +2 dBTP, and often as much as +3 or +4 dBTP. These are virtually guaranteed to cause encoder clipping if processed as-is." Research measuring 128 kbps MP3 encoding has documented decoded true peaks rising by +1.7 dBTP above the source, and pathological cases as high as +10 dBTP. This is why the −1.0 dBTP true peak ceiling is a hard requirement, not a polite suggestion.

How Oversampled Limiting Solves This

True peak limiting works by upsampling the signal before the limiter so that inter-sample peaks become visible as actual samples, then limiting those samples, then downsampling back.

Input (44.1 kHz) → Upsample 2× (88.2 kHz) → alimiter → Downsample (44.1 kHz) → Output

At 2× the sample rate, new samples are interpolated midway between each original pair. These interpolated values approximate the continuous waveform reconstruction and capture most inter-sample peaks. The limiter can see and attenuate them. When downsampled back, the true peaks of the resulting file are controlled.

2× oversampling catches the vast majority of inter-sample peaks in practice. The ITU-R BS.1770-4 true peak measurement algorithm itself uses 4× oversampling for maximum accuracy, but for a limiter (which only needs to prevent peaks from exceeding a threshold), 2× provides sufficient control. 4× oversampling is used in some mastering workflows to catch pathological edge cases, but the returns diminish quickly: the additional ISPs caught between 2× and 4× are typically less than 0.2 dB on real-world program material. For voice content with limited high-frequency energy near Nyquist, 2× is more than adequate.

WaxOn Limiter Settings

WaxOn's alimiter is configured for transparent peak control:

WaxOff Pre-Encode Limiter

WaxOff's loudnorm filter targets true peak via its TP parameter, but this is a soft target. The filter's internal gain calculation accounts for it, but it does not guarantee a hard ceiling. In practice, the loudnorm output can exceed the TP target by up to ~0.5–1.0 dB in edge cases.

For WAV-only output this is acceptable: the file plays back through a DAC and any overshoot is minor. For MP3 output it is a real problem. The encoding process adds its own inter-sample peaks (typically +0.1–1.5 dB), so a file already approaching the ceiling will clip after decode.

WaxOff solves this with a dedicated pre-encode limiter applied only when producing MP3:

Why −2 dBTP and not −1 dBTP? The loudnorm filter targets −1.0 dBTP but can overshoot by up to ~1 dB. The MP3 codec can then add another 0.1–1.5 dB of inter-sample peaks. A hard limiter at −2.0 dBTP provides 1 dB of margin that reliably keeps the decoded MP3 below −1.0 dBTP under virtually all real-world conditions.

Phase Rotation and Crest Factor

Crest Factor

Crest factor is the ratio of a signal's peak level to its RMS level, expressed in dB:

Crest Factor (dB) = Peak (dBFS) − RMS (dBFS)

Typical speech has a crest factor of 15–25 dB. High crest factor has a practical consequence for loudness normalization: to reach a loudness target without exceeding the ceiling, the limiter must apply more gain reduction (limiting). More limiting means more audible artifacts: transient softening, pumping, coloration.

Reducing crest factor before normalization means the same LUFS target can be reached with less limiting and more transparency.
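Crest factor is trivial to compute, which makes it a useful before/after check around the phase-rotation stage (a sketch; not the app's internal metering):

```python
import numpy as np

def crest_factor_db(x):
    """Peak-to-RMS ratio in dB."""
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(np.square(x)))
    return 20 * np.log10(peak / rms)

t = np.linspace(0, 1, 48000, endpoint=False)
crest_factor_db(np.sin(2 * np.pi * 440 * t))            # sine: ~3.01 dB
crest_factor_db(np.sign(np.sin(2 * np.pi * 440 * t)))   # square: ~0 dB
```

Speech, with its sparse high-energy transients, sits far above both of these reference signals, which is exactly why it demands so much limiter headroom.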

How Allpass Filtering Reduces Crest Factor

A first-order allpass filter passes all frequencies at equal amplitude but shifts the phase of different frequencies by different amounts. It doesn't alter the frequency response; it only changes when different frequency components arrive relative to each other.

The transfer function of a first-order allpass is:

H(z) = (a₁ + z⁻¹) / (1 + a₁z⁻¹)

where a₁ is computed from the design frequency and sample rate. The magnitude response |H(z)| = 1 at all frequencies (unity gain). The phase response varies continuously from 0° at DC to −180° at Nyquist, with −90° at the design frequency. This means frequencies below the design frequency are shifted slightly; frequencies above it are shifted more. The relative timing of low-frequency and high-frequency components in the waveform changes, but their amplitudes do not.
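A sketch of the coefficient computation (via the bilinear transform) and a check of the two defining properties, unity magnitude everywhere and −90° at the design frequency:

```python
import numpy as np
from scipy.signal import freqz

def allpass1(fc, fs):
    """First-order allpass: H(z) = (a1 + z^-1) / (1 + a1*z^-1)."""
    t = np.tan(np.pi * fc / fs)
    a1 = (t - 1) / (t + 1)
    return [a1, 1.0], [1.0, a1]   # (b, a) coefficient lists

b, a = allpass1(200, 48000)
# Unity gain at every frequency...
w, h = freqz(b, a, worN=1024)
flat = np.allclose(np.abs(h), 1.0)
# ...and -90 degrees of phase shift at the design frequency
_, h_fc = freqz(b, a, worN=[2 * np.pi * 200 / 48000])
phase_at_fc = np.degrees(np.angle(h_fc[0]))   # ~ -90
```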

Much of the peak asymmetry in voice audio comes from low-frequency content: proximity effect from cardioid microphones, low-frequency resonances in recording spaces, and bass-heavy content in finished mixes. This energy tends to create asymmetric waveforms where one polarity consistently peaks higher than the other.

Proximity effect is worth understanding in detail because it affects nearly every podcast recording. Directional microphones (cardioids, supercardioids, figure-8 patterns) exhibit increasing bass boost as the sound source moves closer, beginning around 12 inches and growing progressively stronger below approximately 100–200 Hz. The boost can reach +20 dB at very close distances. Omnidirectional microphones do not exhibit proximity effect, but the cardioid pattern dominates consumer and prosumer podcast microphones (Shure SM7B, Audio-Technica ATR2100, most USB microphones), making this a near-universal issue. Podcasters without broadcast training tend to position themselves very close to their microphones to minimize room noise, an instinct that unfortunately triggers the strongest proximity effect and produces the most bass-heavy, asymmetric waveforms. The result lands squarely in the 150–250 Hz range that phase rotation is designed to address.

An allpass filter in the low-frequency range redistributes the phase relationships between bass components and midrange components, making peaks more symmetric. The result is a lower crest factor (peaks are shorter relative to average level) without any change to the frequency response or audible character of the audio.

The effect is genuinely inaudible. Human hearing is largely insensitive to absolute phase at audio frequencies. The cochlea performs a frequency decomposition that discards phase information. This is why polarity inversion (flipping the sign of every sample) and allpass filtering (frequency-dependent phase shift) are both perceptually transparent, despite being mathematically significant transformations of the waveform.

WaxOn vs. WaxOff Frequencies

WaxOn: 200 Hz, Q 0.707 (Butterworth); always on (not user-configurable). Raw recordings often have proximity effect and mic combination issues in the 150–250 Hz range; 200 Hz targets the low-mid region where these artifacts cause most of the crest factor problem.

WaxOff: 150 Hz, default Q; on by default (user can disable). Finished mixes are already edited and processed; the remaining crest factor issue is typically pure bass energy, so the filter is placed lower, at 150 Hz.

Quantifying the Effect

On typical podcast recordings with moderate proximity effect, allpass phase rotation at 200 Hz reduces crest factor by 1–4 dB. A 3 dB crest factor reduction means the limiter needs to apply 3 dB less gain reduction to stay below the same ceiling at the same loudness target. That translates directly to less audible limiting artifacts. On clean, well-recorded speech with minimal bass buildup, the crest factor reduction is smaller (0.5–1 dB), but the allpass has no downside: it costs nothing in audio quality and can only help.

Mix Summing

When two or more audio signals are summed, the combined level increases. How much depends on the correlation between the signals: identical (fully correlated) signals add +6 dB per doubling, while uncorrelated signals (different voices in different rooms) add roughly +3 dB per doubling:

2 tracks → ~+3 dB    4 tracks → ~+6 dB    8 tracks → ~+9 dB
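The correlation dependence is easy to verify with noise (a sketch; the dB values are approximate for finite samples):

```python
import numpy as np

rng = np.random.default_rng(42)
rms = lambda x: np.sqrt(np.mean(np.square(x)))
db = lambda r: 20 * np.log10(r)

a = rng.standard_normal(1_000_000)
b = rng.standard_normal(1_000_000)

# Two uncorrelated sources: powers add, so level rises ~+3 dB
uncorr_gain = db(rms(a + b) / rms(a))   # ~ 3.01
# Two identical (fully correlated) sources: amplitudes add, +6.02 dB
corr_gain = db(rms(a + a) / rms(a))     # = 6.02
```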

Without Loudness Normalization

When Loudness Norm is off, WaxOn's Mix stage uses FFmpeg's amix with normalize=1, which scales each input by 1/N before summing. This keeps the output level consistent with the individual inputs regardless of how many files are combined. The downstream filter and limiter chain receives appropriately leveled material and behaves predictably.

With Loudness Normalization

When Loudness Norm is on, WaxOn takes a more rigorous approach. Each input file is individually normalized to the configured LUFS target before mixing: a full two-pass EBU R128 analysis and linear gain applied per file. This addresses a fundamental limitation of normalize=1: simple 1/N scaling keeps levels consistent in absolute terms, but it does not account for files recorded at different levels. A quiet guest recording and a loud host recording, both scaled by 1/2 before mixing, still arrive at the mix at different perceived loudnesses.

With per-file pre-normalization, all inputs arrive at the mix at equal perceived loudness before they are blended. The amix call then uses normalize=0: files are already level-matched, so 1/N scaling would only dilute a well-calibrated blend. The summing process will increase the level (~+3 dB for two uncorrelated sources), but the final loudnorm pass on the combined output corrects it back to the configured target.

Why not pre-normalize when Loudnorm is off? Pre-normalization requires two additional FFmpeg passes per file (analysis + normalization). When Loudness Norm is disabled, the user has opted out of level processing. In that case the original normalize=1 behavior is preserved: simple 1/N scaling keeps the mix level predictable without touching individual file dynamics.

Quantization, Dithering, and Why It Doesn't Apply Here

The Quantization Problem

Digital audio stores amplitude values as integers. A 16-bit system divides the amplitude range into 2¹⁶ = 65,536 discrete steps; a 24-bit system uses 2²⁴ = 16,777,216. When a continuous floating-point value is rounded to its nearest representable integer, the difference is quantization error.

At high signal levels, quantization error is a negligible fraction of the signal amplitude. The problem surfaces at low levels: fade-outs, reverb tails, quiet passages, where the signal approaches the magnitude of a single quantization step. At that scale, the error is no longer random with respect to the signal; it becomes correlated. Correlated noise has harmonic structure. Harmonic noise is perceived as distortion.

The artifact is distinctive: as a 16-bit fade-out approaches silence, the smooth waveform begins to pixelate, crumbling into a grainy, granular texture. Engineers call it "going digital." It is most audible on sustained tones, piano decays, and reverb tails, anywhere a signal fades through the lower quantization steps rather than cutting abruptly.

The Classical Demonstration

The canonical test (sometimes called the "fade-to-black") is simple: record a tone at a moderate level and fade it gradually to silence. Without dithering, the transition through the last few quantization steps produces a sequence of audible steps, then silence where the waveform simply stops being representable. The signal doesn't fade; it falls off a cliff.

Bob Katz, in Mastering Audio: The Art and the Science, describes piano decay as one of the most revealing cases. A sustained piano note fading naturally into a quiet room exposes quantization distortion immediately when compared against a properly dithered version. The undithered note develops a gritty texture as the decay reaches the noise floor, a form of distortion introduced by the word-length reduction itself, present nowhere in the original recording. He uses this comparison in workshops and has remarked that once engineers hear the difference, the idea of shipping 16-bit masters without dithering becomes unthinkable.

A less scientific but widely replicated demonstration: open any 16-bit DAW session, generate a −40 dBFS sine wave, export at 16-bit with dithering off, then again with TPDF dither. Zoom into the waveform near the fade-out in a spectral editor. The quantized version shows visible stairstepping. The dithered version shows a smooth descent into low-level noise. This is not a subtle difference at the bit level, even when it is subtle or inaudible at normal listening levels with typical program material.

The Mathematical Fix: TPDF Dither

The solution is counterintuitive: add noise to the signal before truncating it to a lower bit depth. This noise (dither) must be added at a specific amplitude and with a specific probability distribution to be effective.

The mathematical foundation was established by Stanley Lipshitz, Robert Wannamaker, and John Vanderkooy at the University of Waterloo in a series of papers beginning in the late 1980s. Their central result: adding Triangular Probability Density Function (TPDF) noise (noise whose amplitude is distributed as a triangle between −1 and +1 LSB of the target word length) completely decorrelates the quantization error from the input signal. The quantization error becomes spectrally white and statistically independent of the audio. Harmonic distortion is eliminated; what remains is signal-independent white noise.

TPDF noise is generated by summing two independent rectangular (uniform) noise samples. The resulting amplitude distribution is triangular, hence the name. Its variance is exactly 1/6 LSB², which is the minimum required to whiten quantization error under the conditions relevant to audio. Subtractive dither (where the same dither signal is subtracted after truncation) can achieve perfect cancellation of quantization error in theory; non-subtractive TPDF (the practical, deployable form) achieves statistical decorrelation, which is sufficient for all real-world applications.

The practical consequence: a properly TPDF-dithered 16-bit file has a smooth, analog-like noise floor rather than correlated quantization distortion. The fade-to-black test produces a clean descent into white noise. The artifact is gone.

The dithering theorem in one sentence: Adding TPDF noise of variance 1/6 LSB² before truncation makes the quantization error a white, signal-independent noise process: the best possible outcome for a word-length reduction.
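The fade-to-black behavior can be reproduced numerically. A tone below half an LSB truncates to digital silence, while TPDF dither (two summed uniform noises, variance 1/6 LSB²) keeps the signal encoded in the noise (a sketch; amplitudes are in units of one LSB):

```python
import numpy as np

rng = np.random.default_rng(1)

def tpdf(n):
    """Triangular-PDF dither spanning +/-1 LSB (variance 1/6 LSB^2)."""
    return rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)

n = 200_000
x = 0.4 * np.sin(2 * np.pi * np.arange(n) / 101)   # tone below 0.5 LSB

plain = np.round(x)                # undithered: every sample rounds to 0
dithered = np.round(x + tpdf(n))   # dithered: tone survives inside the noise

silent = np.all(plain == 0)                 # True -- the tone vanished
corr = np.corrcoef(x, dithered)[0, 1]       # clearly positive correlation
```

The undithered output is literal silence (the cliff); the dithered output correlates strongly with the original tone, which the ear recovers as a quiet signal under a smooth noise floor.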

Noise Shaping

TPDF dithering trades correlated distortion for uncorrelated white noise. For archival files and intermediate stems, that trade is unconditionally correct. But the resulting noise floor sits at approximately −96 dBFS spread uniformly across the audio spectrum. Human hearing is not equally sensitive at all frequencies. Sensitivity peaks around 2–4 kHz and falls off substantially above ~15 kHz. Noise shaping exploits this asymmetry.

A noise-shaping filter feeds quantization error back into the system through a filter designed around a psychoacoustic model of human hearing. The filter pushes noise energy out of the 2–4 kHz sensitivity peak and into the 15–20 kHz region where hearing is least sensitive. Total noise energy is conserved (or slightly increased), but perceptually weighted noise (the noise you can actually hear) is reduced. A well-designed noise-shaped dither algorithm can achieve the perceived noise floor of a 20-bit system from a 16-bit word.

Apogee's UV22HR dithering, developed in the 1990s and built into Apogee converters and later into Logic Pro's Bounce dialog, uses an ultrasonic noise curve that concentrates dither energy above 20 kHz, technically increasing broadband noise while keeping in-band noise below the threshold of audibility. POW-r (Psychoacoustically Optimized Wordlength Reduction), developed by a consortium that included Waves and Prism Sound, offers three progressively aggressive noise-shaping modes. POW-r Type 3 is widely used in mastering for 16-bit delivery. Bob Katz has written about using POW-r in preference to flat TPDF for final 16-bit masters, on the basis that the psychoacoustic optimization is audible on critical material at moderate listening levels.

The Early CD Era

The first commercial CDs appeared in 1982. Digital mastering workflows were new territory, and understanding of quantization dithering was not yet widespread among recording engineers. Dithering had been described mathematically in signal processing literature (Lipshitz and Vanderkooy's most cited papers came in the years following), but its importance in audio mastering was not yet a settled professional consensus.

Several major remastering campaigns of the 1990s and 2000s revisited early digital recordings, and engineers working on them have commented on the difference between first-generation 16-bit masters (truncated without dithering) and properly dithered versions. Classical and acoustic jazz recordings, where genuine dynamic range, reverb tails, and instrument decays are central to the listening experience, are particularly revealing. The difference is most apparent on headphones with revealing source material: quiet passages and fade-outs in early digital releases can carry a subtle granularity that disappears in the remastered versions.

The issue extends into the digital audio workstation era. For years, some popular DAWs shipped with dithering off by default on export, or placed the dither option in a dialog that inexperienced users never opened. The result was that a significant volume of independently produced music from the 1990s and early 2000s was distributed as 16-bit audio truncated without dithering. The artifacts are often inaudible on typical program material at typical listening levels, but on a quiet room recording with a long reverb tail, they are there.

Why WaxOn/WaxOff Does Not Apply Dithering

The entire dithering question is contingent on one condition: word-length reduction. You dither when, and only when, you are truncating bits (converting from a higher to a lower bit depth). Dithering is not a general audio quality enhancement; it is a specific solution to a specific problem that arises at the moment of truncation.

WaxOn and WaxOff output 24-bit WAV. Neither mode reduces bit depth. The processing chain (loudness analysis, gain adjustment, filtering, limiting) runs in floating-point arithmetic internally. FFmpeg's audio processing pipeline operates in 32-bit or 64-bit float throughout. When the float result is written to a 24-bit integer PCM file, any truncation from float to int24 occurs at a level approximately 120+ dB below the signal, far below the audible noise floor of any recording. There is no perceptible quantization distortion to address, and no dithering is needed or appropriate.

If WaxOff produced 16-bit output, dithering would be mandatory, applied as the final stage, after all gain processing, immediately before the word-length reduction. (Dithering applied earlier would be modified by subsequent gain stages, defeating the purpose.) For 24-bit output, the question simply does not arise.

For MP3 output, dithering is inapplicable for a separate reason. MP3 encoding applies its own psychoacoustic quantization. The codec analyzes the signal using a masking model and allocates bits to spectral bands according to audibility thresholds. The quantization step sizes used by the MP3 encoder are orders of magnitude coarser than one LSB of 16-bit PCM. Any TPDF dither noise added before encoding would be completely absorbed into the codec's own quantization decisions. It contributes nothing and changes nothing. Adding dither before a lossy encoder is like whispering into a jackhammer.

RNNoise: ML Noise Reduction

Background and Origins

RNNoise was developed by Jean-Marc Valin at Mozilla in 2017–2018 and released as open source under the BSD license. Valin is also a principal author of the Opus audio codec, the codec used by WebRTC, Discord, Zoom, and virtually every real-time web audio application. His work on Opus included extensive research into perceptual audio coding and voice intelligibility under compression, which directly informed the approach taken in RNNoise.

The project grew from a practical problem in WebRTC: browser-based voice communication was plagued by background noise (keyboard clicks, HVAC, crowd noise, fan hum) that conventional noise suppression handled poorly, either leaving too much noise or introducing the characteristic warbling, underwater artifacts of aggressive spectral subtraction. Valin's hypothesis was that a machine learning approach trained specifically on speech could do better.

The original paper, A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement, was presented at the IEEE International Workshop on Multimedia Signal Processing (MMSP) in 2018. It has since been widely cited in the speech enhancement literature and influenced subsequent neural audio processing work at companies including Google, Microsoft, and Amazon.

Architecture: Gated Recurrent Units

RNNoise is a recurrent neural network using Gated Recurrent Units (GRUs), a variant of LSTM that uses fewer parameters and trains faster while retaining the ability to model temporal dependencies across variable-length sequences. The key difference from LSTM is that GRU combines the forget and input gates into a single "update gate" and merges the cell state with the hidden state, reducing the parameter count by roughly 25% per layer. The architecture is deliberately small: the network has roughly 100,000 parameters total, making real-time inference feasible on hardware as constrained as embedded processors with no dedicated GPU.

The network processes audio in the frequency domain using the Opus codec's Bark-scale filterbank: 22 critical bands that approximate the frequency resolution of human hearing. This is a key design choice. Rather than learning to operate on raw waveforms (which requires modeling extremely long-range sample dependencies) or on fixed FFT bins (which don't match perceptual resolution), RNNoise works on the same perceptual frequency representation that the ear itself uses. The Bark scale groups frequencies into bands of roughly equal perceptual width: narrow bands at low frequencies (where pitch discrimination is fine) and progressively wider bands at high frequencies (where the ear integrates more broadly).

For each 10 ms frame of audio, the network computes a set of spectral gains (one per band) between 0 and 1. A gain of 1.0 means that band is passed through unmodified. A gain of 0 means it is fully suppressed. Intermediate values attenuate partially. The gains are applied multiplicatively to the band energies, and the modified spectrum is reconstructed back to a waveform. The network never synthesizes audio; it only decides how much of each perceptual band to suppress in each frame.
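The per-band multiplicative gain step can be sketched as follows (illustrative only: the band edges and gain values here are stand-ins, not the real 22-band Opus filterbank):

```python
import numpy as np

def apply_band_gains(spectrum, band_edges, gains):
    """Scale each frequency bin by the gain of the band containing it.
    A gain of 1.0 passes the band unmodified; 0.0 suppresses it fully."""
    out = np.array(spectrum, dtype=float)
    for i, g in enumerate(gains):
        lo, hi = band_edges[i], band_edges[i + 1]
        out[lo:hi] *= g
    return out
```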

Training and the Model File

The bundled model (bd.rnnn from the rnnoise-models repository) was trained on a large corpus of speech (multiple speakers, multiple languages, multiple recording conditions) mixed with a wide variety of noise types: HVAC, traffic, crowd noise, fan hum, electrical interference, and broadband pink and white noise. The model learns to identify which spectral components correspond to voice and which correspond to noise, using temporal context (the GRU's hidden state) to distinguish steady-state noise from transient speech components.

Training required both clean speech recordings and noise-only recordings, which were artificially mixed at various signal-to-noise ratios. The network learned the difference between speech-shaped energy and noise-shaped energy across thousands of examples. Because the training data was multilingual and broad-spectrum, the resulting model generalizes well across different speakers, accents, and recording conditions without any per-speaker adaptation.

What It Suppresses Well, and Poorly

RNNoise excels at steady-state, spectrally diffuse noise: HVAC hum, room tone, computer fan noise, broadband electrical hiss, and low-level crowd ambience. These share a characteristic spectral profile that is relatively stable over time and distributes energy broadly, making them easy for the network to distinguish from voice. On clean recordings with consistent low-level background noise, suppression is typically very effective and inaudible.

It handles poorly, and can introduce artifacts with, noise that departs from the steady-state profile it was trained on: music and singing (which the model tends to treat as noise), competing background speech, and recordings where the noise level approaches the speech level.

The artifact profile when limits are exceeded is typically a subtle warbling or underwater quality, the same category of artifact produced by spectral subtraction noise gates, though usually less severe. On moderate-noise, clean-voice recordings, the algorithm is essentially transparent.

Why It's Off by Default

Noise reduction that works well on one recording can degrade another. The right call depends on the character and level of the noise, how much of it the high-pass filter already removes, and how much artifact risk is acceptable.

WaxOn's high-pass filter already removes a significant portion of low-frequency noise energy. For many podcast recordings, this is sufficient, and adding a noise reduction pass is unnecessary processing. Enabling RNNoise on an already-clean recording will not degrade it noticeably, but it also won't help, and it adds processing time.

The setting is off by default because "no unnecessary processing" is the conservative, correct baseline. Enable it when background noise is audible and distracting, and leave it off when the recording is already clean enough.

Placement in the WaxOn Pipeline

RNNoise runs as the first stage in the WaxOn chain, before the high-pass filter. This is intentional. Noise suppression should operate on the raw signal before any other DSP stages modify it, for two reasons:

  1. The network was trained on natural speech recordings, not on high-pass-filtered audio. Presenting it with the unaltered signal gives it the spectral context it expects and produces the most accurate gain estimates.
  2. Noise in the low-frequency range contributes to the loudnorm measurement and limiter behavior downstream. Removing it first means the subsequent filter, normalization, and limiting stages operate on cleaner material: better level estimates, less limiter engagement on non-useful content.

Stereo Handling: Per-Channel Split

RNNoise was designed and trained exclusively on mono 48 kHz speech. When FFmpeg's arnndn filter receives a stereo input, it creates separate denoiser instances per channel and processes them independently. In practice, this can produce unpredictable results: the per-channel recurrent states diverge, and one channel (typically the second) may be over-gated or heavily attenuated, even when both channels carry similar content and noise levels.

The root cause is that the model's internal gain computation is frame-by-frame and depends on its recurrent hidden state. With stereo input, slight differences between channels (different mic angles, room reflections, or even minor level offsets from recording) can cause the model to classify one channel as "more noisy" than the other and gate it more aggressively. The model has no concept of channel correlation or stereo coherence.

WaxOn solves this by splitting stereo audio into independent mono channels before applying RNNoise, then rejoining the denoised channels back into stereo. This uses FFmpeg's filter_complex graph:

[0:a]channelsplit=channel_layout=stereo[L][R];
[L]arnndn=m=/path/to/model[Lnr];
[R]arnndn=m=/path/to/model[Rnr];
[Lnr][Rnr]join=inputs=2:channel_layout=stereo

Each channel receives its own fully independent denoiser instance with its own recurrent state, initialized cleanly. The model processes each as a standard mono stream — the format it was trained on — and the results are predictable and balanced. The remaining filter chain (high-pass, phase rotation, resample) runs on the rejoined stereo signal.
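A sketch of how the graph string might be assembled programmatically (illustrative Python; WaxOn's actual implementation is Swift, and these names are hypothetical):

```python
def stereo_nr_graph(model_path: str) -> str:
    """Build the split -> denoise -> rejoin filter_complex graph.
    Note: a real path containing ':' would need FFmpeg filter escaping."""
    return (
        "[0:a]channelsplit=channel_layout=stereo[L][R];"
        f"[L]arnndn=m={model_path}[Lnr];"
        f"[R]arnndn=m={model_path}[Rnr];"
        "[Lnr][Rnr]join=inputs=2:channel_layout=stereo"
    )
```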

For mono output, the issue does not arise. When the user selects mono, a single channel is extracted via pan before (or at the same stage as) noise reduction, so arnndn always receives a mono signal. The simple -af chain is used in this case.

The same per-channel split is applied wherever arnndn touches stereo audio, including the NR-for-measurement analysis paths.

FFmpeg Implementation

The arnndn filter in FFmpeg wraps the RNNoise library. It requires an external model file provided via the m= parameter:

arnndn=m=/path/to/model

The model file is bundled in the app's resources directory. WaxOn locates it at runtime using Bundle.main.url(forResource:withExtension:) and passes the resolved path to FFmpeg. For mono output, the filter is prepended to the -af chain. For stereo output, WaxOn switches to a -filter_complex graph that splits, denoises, and rejoins the channels as described above. When Noise Reduction is off, the chain begins with the high-pass filter as usual.

Processing latency for arnndn is negligible for batch processing purposes. The network processes audio in 10 ms frames. For a 60-minute recording, the total added processing time is a few seconds on Apple Silicon.

Noise Floor Estimation

WaxOn/WaxOff estimates the noise floor of each loaded file and displays it as the FLOOR stat in the file stats panel. The estimate is computed during the same analysis pass that produces RMS, peak, crest factor, and LUFS, at no additional cost.

The Problem

Broadband background noise (HVAC, room tone, preamp hiss) occupies spectral space continuously, including during pauses between speech. This noise contributes to the integrated loudness measurement in two ways:

  1. K-weighting amplifies it. The pre-filter's ~4 dB high shelf boost above 1.7 kHz increases the measured energy of broadband hiss, which has significant energy in the 2–10 kHz range. The loudness measurement sees the noise as louder than it subjectively is.
  2. Noise fills gated blocks. The relative gate excludes blocks more than 10 LU below the ungated mean. In a clean recording, pauses between sentences fall below this threshold and are excluded. In a noisy recording, noise energy keeps those blocks above the gate threshold, and they contribute to the integrated loudness value.

The net effect: noisy files measure louder than their speech content actually is. When loudness normalization targets a specific LUFS value, the gain applied is less than the speech needs. The speech ends up under target.

Estimation Method

The analyzer divides the audio into non-overlapping 400 ms blocks (the same block size used for LUFS gating) and computes the mono RMS of each block. The noise floor estimate is the 10th percentile of these block RMS values, converted to dBFS.

The 10th percentile was chosen because it represents the quietest 10% of the file's blocks. For speech recordings, the quietest blocks are the pauses, breaths, and gaps where the microphone is capturing only the ambient environment. The 10th percentile is more robust than the absolute minimum (which might catch a single anomalously quiet block) while still reflecting the true background level rather than the speech level.

At least 5 blocks are required for a meaningful estimate (about 2 seconds of audio). Shorter files show no FLOOR stat.
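The estimator described above can be sketched like this (a NumPy sketch of the method, not the app's actual analyzer code):

```python
import numpy as np

def estimate_noise_floor(samples, sample_rate, block_ms=400, pct=10):
    """Return the 10th-percentile per-block RMS in dBFS, or None if the
    file is too short (fewer than 5 blocks, about 2 seconds)."""
    block = int(sample_rate * block_ms / 1000)
    n = len(samples) // block
    if n < 5:
        return None
    blocks = np.asarray(samples[:n * block], dtype=float).reshape(n, block)
    rms = np.sqrt((blocks ** 2).mean(axis=1))
    floor = np.percentile(rms, pct)
    return 20.0 * np.log10(max(float(floor), 1e-12))  # dBFS, full scale = 1.0
```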

Thresholds and Color Coding

The FLOOR stat is color-coded in the stats panel to indicate how likely the measured noise floor is to affect loudness accuracy.

Files with an orange or red noise floor also show a ⚠️ warning badge in the file list.

NR-for-Measurement

When Loudness Norm is enabled but Noise Reduction is off, WaxOn runs RNNoise on a temporary copy of the audio for the loudnorm analysis pass (pass 1) only. The normalization pass (pass 2) and all subsequent stages operate on the original, unmodified audio. This ensures that loudness measurements reflect the speech content rather than the noise floor, without altering the output.

Why This Works

The two-pass loudnorm process measures the file's integrated loudness in pass 1, then applies a single linear gain offset in pass 2. The gain offset is determined entirely by the pass 1 measurement. If pass 1 measures a noise-inflated loudness (file appears louder than the speech actually is), the computed gain will be too small, and speech will land under target.

By measuring the NR'd copy instead, the analysis reflects the loudness of the speech content with the noise floor suppressed. The computed gain offset is then applied to the original file. Because RNNoise primarily removes energy between and underneath words (not the speech itself), the speech content in the original and NR'd versions has approximately the same loudness. The gain derived from the clean measurement lands the speech close to the target.
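The arithmetic can be illustrated with toy numbers (the loudness values are made up for the example):

```python
target = -30.0           # LUFS target
speech = -24.0           # true loudness of the speech content
measured_noisy = -21.0   # pass-1 measurement inflated by broadband noise

gain = target - measured_noisy      # -9 dB of gain applied in pass 2
speech_after = speech + gain        # speech lands at -33 LUFS, 3 LU under target

gain_clean = target - speech        # -6 dB, derived from the NR'd measurement
speech_clean = speech + gain_clean  # speech lands at -30 LUFS, on target
```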

The noise floor in the original file does come along for the ride. It is amplified by the same gain as the speech. But the philosophy here is pragmatic: WaxOn is a prep tool for DAW editing. If the noise is bad enough to matter, it will be treated in the DAW (or in a dedicated NR tool like RX). Getting the speech to the right level for editing is the higher priority.

When It Activates

NR-for-measurement activates in three places: the standard WaxOn loudnorm analysis, and the per-file and final-mix analysis passes in Mix mode.

When Noise Reduction is already enabled, the audio reaching the loudnorm stage has already been noise-reduced. In that case, the NR-for-measurement step is unnecessary and does not run.

For stereo output, the NR-for-measurement paths use the same per-channel split as the main NR stage: stereo is split into independent mono channels, each denoised separately, then rejoined before the loudnorm analysis. This ensures consistent, balanced noise removal for accurate measurement regardless of channel layout.

Cost

NR-for-measurement adds one additional FFmpeg pass per loudnorm analysis (running RNNoise on the intermediate audio to a temporary file). For a typical podcast recording on Apple Silicon, this adds a few seconds. The temporary NR'd files are created in the working directory and deleted automatically after processing.

WaxOn Design Rationale WaxOn

Stage Order

The WaxOn pipeline stage order is deliberate:

  1. Noise reduction first (when enabled): The network was trained on unprocessed speech. Running it before any filtering gives it the spectral context it expects. Removing noise early also benefits every downstream stage: cleaner input to the high-pass filter, more accurate loudnorm measurements, less limiter engagement on non-useful content. For stereo output, channels are split and denoised independently to avoid the per-channel divergence issues inherent in RNNoise's mono-trained model (see RNNoise: Stereo Handling).
  2. High-pass filter second: Subsonic content below 80 Hz is removed before any gain stage processes it. Low-frequency energy carries disproportionate signal power relative to its perceived loudness; left in, it would cause loudnorm to underestimate the actual loudness of the content you care about and force the limiter to work harder than necessary on energy that isn't musically useful.
  3. Channel selection before phase rotation: If extracting mono from a stereo source, do it first so the allpass filter operates on the actual mono signal, not a wider stereo version of it. The loudnorm analysis then also measures the real output signal.
  4. Phase rotation before normalization: Reduces crest factor so that the loudnorm analysis measures a waveform that more accurately represents what the limiter will see after normalization.
  5. Limiter last: After any loudness normalization, with oversampling to catch true peaks.

Mix Stage Order

The Mix pipeline extends this logic with a pre-mix leveling step. When Loudness Norm is on, each input file is normalized to the target LUFS before the amix stage, so all sources arrive at the mix at equal perceived loudness. Using normalize=0 after pre-normalization lets the summed level rise naturally rather than being scaled down by 1/N; the final loudnorm pass on the combined output then corrects the level to target, the same way it would for a single-file job. The result is a mix that is balanced by loudness measurement, not by accident of recording level.
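Concretely, a three-input mix with pre-leveled sources would use an amix stage along these lines (the input labels are illustrative):

```
[a0][a1][a2]amix=inputs=3:normalize=0
```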

LRA=20 in WaxOn Loudnorm

WaxOn's loudnorm hardcodes LRA=20. The LRA parameter tells the loudnorm filter how aggressively to constrain the dynamic range; lower values apply more dynamic compression. At LRA=20, the filter applies essentially no dynamic processing. It acts as a pure linear gain offset.

This is intentional for ingest. WaxOn is a pre-editing tool. You want your recordings to arrive at your DAW at consistent levels, but with their original dynamic character intact. Any dynamic processing at this stage would fight against the compression and automation you'll apply during editing. LRA=20 ensures loudnorm does exactly one thing in WaxOn: level adjustment.
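Under these settings, the pass-2 loudnorm filter spec takes roughly this shape (the measured_* values come from pass 1; the numbers here are placeholders, not real measurements):

```
loudnorm=I=-30:LRA=20:TP=-1.0:linear=true:measured_I=-24.3:measured_TP=-3.2:measured_LRA=6.1:measured_thresh=-34.6
```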

Default Loudnorm Target: −30 LUFS

−30 LUFS is conservative by design. At this level, even a recording with significant dynamic range and a crest factor of 20 dB will have peaks well below −10 dBFS, giving the limiter ample headroom. The goal is to bring different recordings to a consistent level for editing, not to hit a delivery target. −30 LUFS leaves plenty of room for the final mix to breathe.

Loudnorm TP = limitDb

Both the loudnorm TP parameter and the alimiter limit are set to the same value (the configured ceiling, default −1.0 dB). These are not redundant: loudnorm's TP is a soft target that constrains the gain it computes, while the limiter is a hard, sample-level guarantee on the rendered output.

Setting both to the same value means loudnorm and the limiter are working toward the same goal. If loudnorm succeeds, the limiter barely engages. If loudnorm slightly overshoots, the limiter catches it. The two stages are complementary, not redundant.

WaxOff Design Rationale WaxOff

No High-Pass Filter

WaxOff doesn't include a high-pass filter. By the time a mix reaches WaxOff, it has presumably been edited and processed in a DAW. High-pass filtering, EQ, and cleanup are part of the editing workflow. WaxOff assumes the mix is already correct and applies only the normalization needed for delivery.

Hardcoded LRA=11

WaxOff hardcodes LRA=11 rather than exposing it as a setting. For delivery, some macro-dynamic constraint is appropriate. A podcast episode should have a consistent loudness profile throughout. 11 LU is a reasonable fixed value: it constrains macro-dynamics enough to ensure consistent perceived loudness across the episode without audibly squashing the mix's dynamics.

The loudnorm filter with LRA=11 applies gentle, program-level gain changes (not sample-by-sample compression). The effect is less aggressive than any compressor you would have used during editing.

Delivery Targets

Platform                Target LUFS              Max True Peak
Apple Podcasts          −16 LUFS (normalized)    −1.0 dBTP
Spotify                 −14 LUFS (normalized)    −1.0 dBTP
Buzzsprout              −19 LUFS recommended     −1.0 dBTP
YouTube                 −14 LUFS (normalized)    −1.0 dBTP
EBU R128 (broadcast)    −23 LUFS                 −1.0 dBTP

Most streaming platforms normalize incoming audio to their own target on playback, so delivering at −18 LUFS versus −16 LUFS won't make your episode sound quieter or louder to listeners (the platform adjusts). What matters most is staying below the true peak ceiling to avoid clipping during that normalization step.

One platform asymmetry is worth knowing: YouTube only normalizes downward. It will not boost content that is quieter than −14 LUFS. Spotify normalizes in both directions, and Apple Podcasts normalizes both ways as well. This means a mix delivered at −23 LUFS will sound quieter than expected on YouTube even though it is compliant, while on Spotify it will be boosted to −14 LUFS. For podcast delivery this is rarely a real-world issue, since vocal content at −18 LUFS will be boosted by both Apple Podcasts and Spotify, but it matters if you are optimizing for a single platform.

The Audio Engineering Society recommends −16 to −20 LUFS as the appropriate range for talk-based podcast content, with −18 LUFS as the practical center. The reasoning is threefold: mobile playback amplification is limited (content at −23 LUFS is difficult to hear in noisy environments like commuting), podcast consumption typically happens in ambient noise where higher average loudness aids intelligibility, and −18 LUFS sits safely between all the major platform targets. It will be boosted modestly by Apple and Spotify rather than aggressively attenuated by either. Delivering at −14 LUFS, for example, would be attenuated by Apple Podcasts and is right at Spotify's ceiling, leaving no safety margin. The conservative −18 LUFS leaves room for platforms to boost cleanly without any risk of triggering codec clipping.

WaxOff's default of −18 LUFS with −1.0 dBTP is a safe, widely accepted podcast delivery target.

Output Format Rationale

24-bit WAV

Both WaxOn and WaxOff output 24-bit WAV as the primary format.

MP3 CBR

WaxOff's MP3 output uses CBR (constant bit rate) rather than VBR (variable bit rate). For podcast delivery, CBR gives predictable file sizes and bandwidth costs, accurate seeking and duration display across the widest range of players, and no meaningful quality penalty for speech content at typical podcast bitrates.

Summary

Design decisions and their rationale:

RNNoise before HPF (WaxOn, optional): Noise reduction runs on the unaltered signal, the spectral context the network was trained on. Removing noise first also produces cleaner input to every downstream stage. For stereo, channels are split and denoised independently to avoid RNNoise's mono-model divergence on multi-channel audio.
HPF before all gain stages (WaxOn): Subsonic energy inflates loudness measurements and activates the limiter on content that isn't perceptually meaningful.
Phase rotation before normalization: Lower crest factor → loudnorm applies gain more accurately → limiter works less hard → more transparent output.
Two-pass loudnorm with linear=true: Single-pass normalization is inaccurate; linear mode applies a clean gain offset with no dynamic processing.
NR-for-measurement (WaxOn): When NR is off, loudnorm analysis runs on a temporary NR'd copy to prevent broadband noise from inflating the loudness measurement. Speech hits the target LUFS more accurately.
Noise floor estimation: 10th percentile of per-block RMS identifies background noise level. Color-coded warnings alert users when noise may affect loudness accuracy.
LRA=20 in WaxOn: Ingest preprocessing should not touch dynamics; loudnorm acts as pure level adjustment.
2× oversampled alimiter (WaxOn): Inter-sample peaks are invisible to a standard sample-rate limiter; oversampling makes them visible and catchable.
Pre-encode −2 dBTP limiter (WaxOff MP3): Loudnorm TP is a soft target; MP3 encoding adds further inter-sample peaks; −2 dBTP provides 1 dB headroom for decoded files to land at or below −1.0 dBTP.
Per-file pre-normalization (Mix + Loudnorm): All inputs arrive at the mix at equal perceived loudness; normalize=0 lets the mix sum naturally, with the final loudnorm correcting the output level to target.
amix normalize=1 (Mix, Loudnorm off): When Loudnorm is off, 1/N scaling prevents level from increasing with file count; downstream chain receives consistent levels regardless of how many files are mixed.
MP3 derived from WAV output: Normalization happens once; MP3 is a transcode of the normalized file, not a separate re-processing of the original.
24-bit WAV output: Headroom for further processing (WaxOn) and lossless archive quality (WaxOff); universally compatible.
No dithering applied: Dithering is only required at word-length reduction (e.g., 24→16-bit); WaxOn/WaxOff output 24-bit, so no bit-depth truncation occurs. For MP3, the codec's own psychoacoustic quantization is orders of magnitude coarser than any dither signal; dithering before a lossy encoder has no effect.

WaxOn/WaxOff is free software licensed under the GPL-3.0. Built by Seven Morris with AI collaboration.