Theory of Operation

WaxOn and WaxOff address opposite ends of the podcast production workflow.

WaxOn is an ingest preprocessor. It takes raw recordings — directly from a microphone or recording interface — and prepares them for editing: high-pass filtering, phase rotation, optional leveling, loudness normalization, and true-peak limiting. The goal is recordings that arrive in your DAW at consistent, manageable levels with their dynamic character intact. WaxOn is not a mastering tool; it is a staging tool.

WaxOff is a broadcast delivery tool. It takes a finished, edited mix and prepares it for distribution: loudness normalization to a delivery target and true-peak ceiling enforcement for safe streaming ingest. The goal is a delivery file that passes platform normalization cleanly and sounds consistent across episodes.

Both modes are built on the same core DSP stages — EBU R128 loudness measurement, two-pass linear normalization, and oversampled true-peak limiting — combined with mode-specific tools: dynamic normalization (dynaudnorm, WaxOn only) and a dedicated pre-encode limiter for WaxOff's MP3 path. WaxOn also uses RNNoise internally during the loudnorm analysis pass to improve measurement accuracy on noisy recordings — not as a user-facing stage, but as an implementation detail of the two-pass normalization. Every stage is deliberately ordered. The algorithms interact, and the ordering decisions are not arbitrary.

This document explains each stage: what it does, how it works, and why it is placed where it is. The Signal Chains section gives the processing order at a glance. The following sections cover each algorithm in depth — the shared normalization and limiting concepts first, then the WaxOn processing stages in signal-chain order, then the design rationale for each mode, and finally the output format decisions.

This document assumes familiarity with basic audio concepts (dBFS, sample rate, dynamic range). It is written for audio engineers, developers, and technically curious users who want to understand the why, not just the what.

Signal Chains at a Glance

WaxOn

  1. High-Pass Filter
  2. Channel Select (mono only)
  3. Phase Rotation (200 Hz allpass)
  4. Resample (to target rate)
  5. Dynamic Leveling (dynaudnorm, bidirectional)
  6. Loudnorm (two-pass EBU R128)
  7. 2× Oversample → Limit → Resample

Dashed border = optional stage. RNNoise is applied internally during the Loudnorm analysis pass only; it does not appear in the main signal chain. Output: 24-bit WAV.

WaxOff

  1. Phase Rotation (200 Hz allpass)
  2. Loudnorm (two-pass EBU R128)
  3. 2× Oversample Limit (safety backstop)
  4. Encode MP3 (2× oversample limiter)

Phase rotation runs in both the analysis and render passes so the loudnorm TP measurement matches the output waveform. The WAV-path limiter is a backstop for inter-sample peaks loudnorm's linear mode missed; on most material it doesn't engage. Dashed nodes are optional. Output: 24-bit WAV and/or MP3.

EBU R128 Loudness

EBU R128 (and the underlying ITU-R BS.1770-4 algorithm) is the measurement standard used by virtually all broadcast and streaming platforms. Spotify, Apple Podcasts, YouTube, and broadcast television worldwide all normalize to a loudness target derived from this standard. Understanding it explains most of what WaxOff does.

History & Context Why This Standard Exists: The Loudness War

The adoption of perceptual loudness metering was not driven by the music industry. It was driven by television viewer complaints about excessively loud commercials. Advertisers discovered they could aggressively compress and brick-wall limit their spots while technically staying within legacy peak-level limits, creating 4–8 dB disparities between programming and advertisements. The FCC Consumer Call Center reported "loud commercials" as a sustained top consumer complaint starting in 2002.

That pressure produced legislation. The CALM Act (Commercial Advertisement Loudness Mitigation Act) was signed into law in December 2010, requiring US broadcasters to keep commercials at the loudness level of surrounding programming. The FCC began enforcement in December 2012. The European Broadcasting Union had published EBU R128 four months earlier, in August 2010, addressing the same problem for European broadcast. Both standards were built on ITU-R BS.1770, published in 2006 as a psychoacoustically weighted metering algorithm specifically designed to correlate with perceived loudness.

Streaming platforms adopted loudness normalization as a natural extension. If a platform normalizes playback, an arms race to master louder than competitors produces only a quieter relative result, not a louder one. Mastering engineer Bob Katz declared at the AES convention in 2013 that the loudness wars were over, citing the emergence of loudness normalization across streaming. Spotify began loudness normalization at launch in 2014, standardizing its −14 LUFS target around 2017; Apple Podcasts specifies −16 LKFS with a ±1 dB tolerance; YouTube introduced normalization in 2015–2016.

The practical consequence for podcast producers: since every major distribution platform normalizes loudness at playback, delivering an aggressively loud, heavily limited master provides no benefit to listeners and costs you dynamic range. The correct goal is accurate level, a clean true peak ceiling, and preserved dynamics.

Integrated Loudness (LUFS)

Integrated loudness (denoted I) is the time-averaged loudness of a complete program, measured in LUFS (Loudness Units relative to Full Scale). LUFS is numerically identical to LKFS; both refer to the same algorithm.

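Unlike peak metering or RMS, integrated loudness is perceptually weighted (K-weighting) and gated, so that silence and quiet passages do not drag the average down. Both mechanisms are described below.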

K-Weighting

K-weighting is a two-stage filter chain applied to each channel before energy summation. It was designed to approximate the frequency-dependent sensitivity of human hearing, particularly the acoustic effect of the head on sound arriving at the ears.

  1. Pre-filter (head-related high-shelf): A second-order shelf boost with a design frequency of approximately 1682 Hz and a gain of approximately +4 dB. This models the acoustic effect of the human head, which increases high-frequency energy at the ear canals relative to a free-field measurement. The boost rises gradually above the shelf frequency and reaches its full value by roughly 5 kHz, where it remains constant through the upper spectrum. The effect is that sibilance, consonant detail, and broadband hiss in the 2–10 kHz range are weighted more heavily in the loudness measurement, matching the ear's increased sensitivity in this region.
  2. RLB weighting (high-pass): A second-order high-pass filter with a design frequency of approximately 38 Hz. This reduces the contribution of sub-bass energy to the loudness measurement. Sub-bass content below about 50 Hz contributes little to perceived loudness under normal listening conditions (particularly on the earbuds and laptop speakers that dominate podcast consumption), and leaving it in the measurement would skew the result for files with DC offset, rumble, or proximity-effect bass buildup.
Under the Hood Filter coefficient computation

The ITU specification defines the filter coefficients for a reference sample rate of 48 kHz. WaxOn/WaxOff's analyzer computes the coefficients from first principles using the bilinear transform, so the filters are accurate at any sample rate the source file uses (44.1 kHz, 48 kHz, 96 kHz, etc.). The pre-filter uses the shelf design parameters f₀ = 1681.97 Hz, gain = 3.9998 dB; the high-pass uses f₀ = 38.14 Hz, Q = 0.5003. These values are taken from the ITU reference implementation and match the pyloudnorm reference used widely in audio research.
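As a rough illustration of the computation described above, the sketch below derives both biquads for an arbitrary sample rate. It follows the formulation found in common open-source R128 meters, not WaxOn/WaxOff's actual source. The function name, the shelf Q (0.7072), and the band-gain exponent (0.4997) are assumptions that this document does not state.

```python
import math

def k_weighting_coefficients(sample_rate):
    """Return ((b, a) for the shelf, (b, a) for the high-pass), with a[0] == 1."""
    # Stage 1: high-shelf pre-filter. f0 and gain come from the text above;
    # shelf_q and the 0.4997 exponent are ASSUMED (not given in this document).
    f0, gain_db, shelf_q = 1681.97, 3.9998, 0.7072
    k = math.tan(math.pi * f0 / sample_rate)  # bilinear-transform frequency warp
    vh = 10.0 ** (gain_db / 20.0)
    vb = vh ** 0.4997
    a0 = 1.0 + k / shelf_q + k * k
    shelf_b = [(vh + vb * k / shelf_q + k * k) / a0,
               2.0 * (k * k - vh) / a0,
               (vh - vb * k / shelf_q + k * k) / a0]
    shelf_a = [1.0, 2.0 * (k * k - 1.0) / a0, (1.0 - k / shelf_q + k * k) / a0]

    # Stage 2: RLB high-pass. The numerator is left as [1, -2, 1],
    # the same form used in the published 48 kHz coefficient table.
    f0, q = 38.14, 0.5003
    k = math.tan(math.pi * f0 / sample_rate)
    a0 = 1.0 + k / q + k * k
    hp_b = [1.0, -2.0, 1.0]
    hp_a = [1.0, 2.0 * (k * k - 1.0) / a0, (1.0 - k / q + k * k) / a0]
    return (shelf_b, shelf_a), (hp_b, hp_a)

shelf_48k, highpass_48k = k_weighting_coefficients(48000)
shelf_44k, highpass_44k = k_weighting_coefficients(44100)
```

At 48 kHz the result should agree closely with the reference coefficient table in BS.1770; the small residual comes from the rounded design parameters. At any other rate, the same call yields filters with the same analog design frequencies.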

The current revision of the standard is ITU-R BS.1770-5 (November 2023). BS.1770-5 adds an annex defining loudness measurement for object-based audio formats (e.g., Dolby Atmos). The K-weighting filter, gating algorithm, and true-peak measurement algorithm used for stereo and mono program content are unchanged from BS.1770-4.

Gating

Loudness is computed over overlapping 400 ms blocks. Adjacent blocks overlap by 75%, producing one new block every 100 ms. The 400 ms window was chosen because it corresponds closely to human short-term loudness perception. Psychoacoustic research by Zwicker and Fastl established that temporal integration of loudness occurs over approximately 200–400 ms, with 400 ms representing the time window over which the ear integrates energy to form a stable loudness impression. Shorter windows would capture transient fluctuations that don't correspond to perceived loudness; longer windows would smooth over meaningful changes in program level.

Two gating stages prevent silence and quiet passages from pulling the integrated value down:

  1. Absolute gate at −70 LUFS: Any block whose K-weighted loudness falls below −70 LUFS is discarded. This removes silence, dead air, and extremely quiet room tone from the measurement. The threshold corresponds to a mean-square value of 10^((−70 + 0.691) / 10) ≈ 1.17 × 10⁻⁷.
  2. Relative gate at −10 LU: From the remaining blocks, compute the mean (the "absolute-gated loudness"). Then discard any block more than 10 LU below that mean. This removes quiet passages that are above the noise floor but significantly below the average program level, such as soft breaths between sentences, quiet background music under narration, or distant room ambience during pauses.

The final integrated loudness is the mean of the blocks that survive both gates. For podcast speech, this means the measurement reflects the loudness of the spoken content, not the silence between sentences.

The Math The Offset Constant: −0.691

The integrated loudness formula includes a constant offset of −0.691 dB:

Integrated Loudness (LUFS) = −0.691 + 10 · log₁₀(Σ Gᵢ · zᵢ)

where Gᵢ is the channel weight (1.0 for front channels, 1.41 for surround) and zᵢ is the gated mean-square of channel i after K-weighting. The −0.691 dB offset cancels the gain that K-weighting applies at 1 kHz (about +0.69 dB), so a 1 kHz sine wave reads the same as its unweighted mean-square level: a full-scale (0 dBFS peak) 1 kHz sine in a single channel reads −3.01 LUFS. Without the offset, the same sine would read about 0.69 dB too high. For mono and stereo speech content, the channel weights are all 1.0, so the argument of the logarithm simplifies to the sum of the channels' gated mean-squares after K-weighting.
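The two gating stages and the loudness formula can be sketched in a few lines. This is a minimal illustration, not the app's analyzer: it assumes the caller has already applied K-weighting and summed the channel-weighted mean-squares for each 400 ms block, and the function names are hypothetical.

```python
import math

ABSOLUTE_GATE_LUFS = -70.0
RELATIVE_GATE_LU = -10.0

def block_loudness(weighted_mean_square):
    """Loudness of one block from its summed G_i * z_i value."""
    return -0.691 + 10.0 * math.log10(weighted_mean_square)

def integrated_loudness(block_energies):
    """block_energies: one summed G_i * z_i value per 400 ms block.

    Returns integrated loudness in LUFS, or None if every block is gated out.
    """
    # Stage 1: absolute gate at -70 LUFS.
    stage1 = [e for e in block_energies
              if e > 0.0 and block_loudness(e) > ABSOLUTE_GATE_LUFS]
    if not stage1:
        return None
    # Stage 2: relative gate, 10 LU below the loudness of the stage-1 survivors.
    threshold = block_loudness(sum(stage1) / len(stage1)) + RELATIVE_GATE_LU
    stage2 = [e for e in stage1 if block_loudness(e) > threshold]
    return block_loudness(sum(stage2) / len(stage2))

# Ten speech blocks, five blocks of digital near-silence, five quiet breaths.
blocks = [0.01] * 10 + [1e-9] * 5 + [0.0005] * 5
result = integrated_loudness(blocks)
```

In this example the near-silent blocks fall to the absolute gate and the breath blocks to the relative gate, so the result reflects only the speech blocks.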

True Peak (TP)

True peak is the maximum reconstructed level when the digital signal is converted to analog. It differs from sample peak because the analog waveform between samples can exceed any individual sample value; these are inter-sample peaks. See the True Peak & Oversampling section for detail.

EBU R128 specifies a maximum true peak of −1.0 dBTP for most distribution. WaxOff defaults to this value.

Loudness Range (LRA)

LRA measures the spread between loud and quiet sections of a program (its macro-dynamics) in Loudness Units. It is computed as the difference between the 95th and 10th percentiles of the short-term loudness distribution (after gating). EBU R128 does not mandate a specific LRA target but recommends keeping it below 18 LU for broadcast.

Both modes pass an LRA value to the loudnorm filter. WaxOn hardcodes LRA=20 (effectively unconstrained, no dynamic processing). WaxOff defaults to LRA=9, which sits in the typical podcast delivery range (6–10 LU) — loose enough to preserve a well-balanced mix but tight enough to constrain occasional overly-dynamic material for car and headphone listening. Lower values compress more aggressively; higher values relax the constraint.

Two-Pass Normalization

EBU R128 integrated loudness requires the complete file to compute. It is a time-integrated measurement. You cannot know the correct gain adjustment until after you have read every sample. This makes single-pass normalization impossible for linear (non-dynamic) mode. Both WaxOn and WaxOff solve this with a two-pass approach.

Pass 1: Analysis

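FFmpeg's loudnorm filter reads the entire file and prints a JSON block to stderr containing the measured integrated loudness, true peak, loudness range, and gating threshold of the input, along with the target offset that pass 2 needs.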

The output is discarded (-f null); only the measurements matter. The app parses the JSON from stderr and stores the values.

Pass 2: Linear Normalization

The same filter runs again, this time with the measured values injected back in and linear=true set:

loudnorm=I={target}:TP={tp}:LRA={lra}
  :measured_I={inputI}:measured_TP={inputTP}
  :measured_LRA={inputLRA}:measured_thresh={inputThresh}
  :offset={targetOffset}:linear=true
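The hand-off between the two passes can be sketched as follows. This is an illustrative Python version, not the app's code: the function names are hypothetical, and the JSON keys (input_i, input_tp, input_lra, input_thresh, target_offset) are the ones FFmpeg's loudnorm filter prints when print_format=json is set.

```python
import json

# Pass 1 would be run along these lines, with the audio output discarded:
#   ffmpeg -i in.wav -af loudnorm=I=-16:TP=-1.0:LRA=9:print_format=json -f null -

def parse_loudnorm_stats(stderr_text):
    """Extract the flat JSON object that loudnorm prints at the end of stderr."""
    start = stderr_text.rindex("{")
    end = stderr_text.rindex("}") + 1
    return json.loads(stderr_text[start:end])

def pass2_filter(stats, target_i, target_tp, target_lra):
    """Build the pass-2 filter string from the pass-1 measurements."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={stats['input_i']}:measured_TP={stats['input_tp']}"
        f":measured_LRA={stats['input_lra']}"
        f":measured_thresh={stats['input_thresh']}"
        f":offset={stats['target_offset']}:linear=true"
    )

# Hypothetical pass-1 output, trimmed to the fields pass 2 uses.
sample_stderr = """[Parsed_loudnorm_0 @ 0x0]
{
    "input_i" : "-23.50",
    "input_tp" : "-4.10",
    "input_lra" : "6.20",
    "input_thresh" : "-33.90",
    "target_offset" : "0.30"
}
"""
filter_string = pass2_filter(parse_loudnorm_stats(sample_stderr), -16, -1.0, 9)
```

FFmpeg reports the measurements as strings, so they can be substituted into the filter without any numeric round trip.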

Why linear=true Matters

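The loudnorm filter has two modes: linear, which applies one fixed gain to the entire file, and dynamic, which varies the gain over time to steer the signal toward the target.

Both WaxOn and WaxOff use linear mode for pass 2. For staging and delivery alike, this is the only correct approach. The goal is level adjustment, not dynamics processing.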

True Peak and Oversampled Limiting

The Inter-Sample Peak Problem

Digital audio stores the waveform as discrete samples: amplitude values at regular time intervals (44,100 or 48,000 per second). A sample peak meter reads the highest sample value, which is straightforward. But the analog waveform reconstructed by a DAC continuously interpolates between those samples, and the reconstructed waveform can peak significantly higher than any individual sample value.

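These are inter-sample peaks (ISPs), and they become real, audible clipping when a DAC reconstructs the waveform, or when the file is resampled or transcoded to a lossy codec and the reconstructed peaks land on actual samples.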

A file with a sample peak of −1 dBFS can easily have a true peak above 0 dBFS, causing clipping that no sample-level meter would detect.

The Math The Mathematics of Reconstruction

The Nyquist-Shannon sampling theorem guarantees that a band-limited signal sampled at more than twice its maximum frequency can be perfectly reconstructed. The reconstruction uses a sinc interpolation kernel:

x(t) = Σ x[n] · sinc((t − nT) / T)

where x[n] are the sample values, T is the sample period, and sinc(x) = sin(πx) / (πx). The key insight is that the sinc kernel oscillates. When adjacent samples have high energy and the right phase relationship, the interpolated waveform between them sums constructively and overshoots both sample values. This is not an artifact or an error; it is the mathematically correct reconstruction of the continuous signal. The samples were never the waveform; they are the minimum information needed to reconstruct it.
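A small numerical check makes the overshoot concrete. The sketch below evaluates a truncated version of the reconstruction sum for one well-known test case: a full-scale sine at one quarter of the sample rate, sampled 45° away from its crests. The helper names and the truncation width are illustrative choices, not part of any library.

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(sample_at, t, half_width=2000):
    """Truncated sinc reconstruction at time t, in sample periods (T = 1)."""
    centre = round(t)
    return sum(sample_at(n) * sinc(t - n)
               for n in range(centre - half_width, centre + half_width + 1))

def sample_at(n):
    # Full-scale sine at fs/4 with a 45 degree phase offset:
    # every stored sample has magnitude sin(45 deg), about 0.7071.
    return math.sin(math.pi * n / 2 + math.pi / 4)

sample_peak = max(abs(sample_at(n)) for n in range(4))
true_peak = abs(reconstruct(sample_at, 0.5))  # the crest lies midway between samples
overshoot_db = 20 * math.log10(true_peak / sample_peak)
```

Every sample sits about 3 dB below full scale, yet the reconstructed waveform reaches full scale between samples. The midpoint evaluated here is also the position a 2× oversampler adds.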

The worst case for inter-sample peaks occurs when consecutive samples approach full scale with alternating signs at frequencies near Nyquist (half the sample rate). At 44.1 kHz, high-frequency content near 22 kHz is especially prone. In practice, ISPs on real-world audio material are typically 0.5–3 dB above sample peak, though extreme cases can reach higher.

History & Context The Streaming Ingest Trap

When you upload audio to Spotify, the platform transcodes your file to Ogg/Vorbis or AAC for streaming delivery. If your uploaded file has true peaks near 0 dBFS, the transcode itself can clip: the decoded streaming copy is distorted before a listener ever plays it, and no subsequent gain adjustment will fix it. Spotify's own artist documentation warns: "Really loud modern masters can easily register True Peak levels of +1 or +2 dBTP, and often as much as +3 or +4 dBTP. These are virtually guaranteed to cause encoder clipping if processed as-is." Research on lossy codec encoding has documented decoded true peaks rising 1–3 dBTP above the source in typical cases, with pathological cases reaching considerably higher. This is why the −1.0 dBTP true peak ceiling is a hard requirement, not a polite suggestion.

How Oversampled Limiting Solves This

True peak limiting works by upsampling the signal before the limiter so that inter-sample peaks become visible as actual samples, then limiting those samples, then downsampling back.

Input (44.1 kHz) → Upsample 2× (88.2 kHz) → alimiter → Downsample (44.1 kHz) → Output

At 2× the sample rate, new samples are interpolated midway between each original pair. These interpolated values approximate the continuous waveform reconstruction and capture most inter-sample peaks. The limiter can see and attenuate them. When downsampled back, the true peaks of the resulting file are controlled.

Under the Hood Why 2× and not 4×

2× oversampling catches the vast majority of inter-sample peaks in practice. The ITU-R BS.1770-4 true peak measurement algorithm itself uses 4× oversampling for maximum accuracy, but for a limiter (which only needs to prevent peaks from exceeding a threshold), 2× provides sufficient control. 4× oversampling is used in some mastering workflows to catch pathological edge cases, but the returns diminish quickly: the additional ISPs caught between 2× and 4× are typically less than 0.2 dB on real-world program material. For voice content with limited high-frequency energy near Nyquist, 2× is more than adequate.

WaxOn Limiter Settings

Under the Hood alimiter parameters

WaxOn's alimiter is configured for transparent peak control:

  • limit: 0.891251 — the linear amplitude equivalent of −1.0 dBTP, hardcoded as the EBU R128 standard ceiling
  • attack=5 ms: the limiter begins attenuating 5 ms before the peak. Fast enough to prevent transient overshoot, slow enough to avoid pre-ringing artifacts on voice. For reference, a typical plosive consonant (p, b, t) has a voice onset time of roughly 10–30 ms in English speech; 5 ms catches the transient before it peaks without introducing audible artifacts on voiced content.
  • release=50 ms: gain recovers in 50 ms after the peak passes. On typical voice material, this is fast enough to be inaudible as pumping while still recovering promptly between words. For comparison, inter-word pauses in conversational speech are typically 150–300 ms, so a 50 ms release completes well before the next word begins.
  • level=disabled: prevents the limiter from applying makeup gain. Without this, alimiter compensates for gain reduction, undoing the ceiling control. With it disabled, the limiter only attenuates, never amplifies.
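For reference, the ceiling conversion and the upsample, limit, downsample chain could be assembled as shown below. This is a sketch under assumptions: the function names are hypothetical, and using aresample for the 2× stages is a guess at one reasonable way to build the chain, not a statement of what WaxOn runs internally.

```python
def dbtp_to_linear(dbtp):
    """Convert a dBTP ceiling to the linear amplitude alimiter expects."""
    return 10.0 ** (dbtp / 20.0)

def limiter_chain(ceiling_dbtp, sample_rate, attack_ms, release_ms):
    """Return an FFmpeg -af string: 2x upsample, limit, resample back."""
    limit = dbtp_to_linear(ceiling_dbtp)
    return (
        f"aresample={2 * sample_rate},"
        f"alimiter=limit={limit:.6f}:attack={attack_ms}"
        f":release={release_ms}:level=disabled,"
        f"aresample={sample_rate}"
    )

waxon_chain = limiter_chain(-1.0, 44100, attack_ms=5, release_ms=50)
```

With a ceiling of −1.0 dBTP the computed limit matches the 0.891251 value listed above.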

WaxOff Limiters

WaxOff's loudnorm filter targets true peak via its TP parameter, but this is a soft target. The filter's internal gain calculation accounts for it, but it does not guarantee a hard ceiling. In practice, the loudnorm output can exceed the TP target by up to ~0.5–1.0 dB in edge cases — particularly on dynamic material where inter-sample peaks slip past loudnorm's analysis. WaxOff applies a brick-wall limiter on both output paths to enforce the user's true-peak target reliably.

Under the Hood WAV-path limiter

A 2× oversampled true-peak limiter runs after loudnorm in the WAV render filter chain:

  • Ceiling: matches the user's true-peak setting exactly. Loudnorm's linear-mode TP analysis aims to hit the same target via gain reduction; the limiter only acts on what slips past it.
  • attack=5 ms, release=50 ms: same parameters as WaxOn's limiter — fast enough to catch transients, slow enough to stay transparent on voice. On a well-mixed source the limiter doesn't engage, so these parameters rarely matter in practice.
  • 2× oversampling: the limiter runs at twice the output sample rate for true-peak accuracy.

The limiter is a safety backstop, not a loudness maximizer. WaxOff doesn't push gain above what loudnorm calculated; the limiter exists so the user can trust that the rendered WAV will not exceed the configured ceiling.

Under the Hood MP3-path limiter

The MP3 path applies a separate pre-encode limiter on top of the already-rendered WAV. Lossy decoding adds 0.5–1.5 dB of inter-sample peak overshoot, so a WAV that landed exactly at the user's TP ceiling would clip after MP3 decode.

  • Ceiling: TP − 1.0 dB. With the default TP of −1.0 dBTP, the MP3 limiter sits at −2.0 dBTP, leaving 1 dB of margin to absorb decoder overshoot so that the decoded MP3 typically lands at or below the user's target. If the user sets TP to −0.5 dBTP, the MP3 limiter scales to −1.5 dBTP automatically.
  • attack=1 ms, release=20 ms: faster than the WAV-path limiter because this stage is pure safety — it has no creative role and only catches the residual overshoot the codec introduces.
  • 2× oversampling: the limiter runs at 88.2 kHz (2 × 44.1 kHz). MP3 is always encoded at 44.1 kHz regardless of the WAV sample rate setting.
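The relationship between the user's true-peak setting and the two limiter ceilings is simple arithmetic. A minimal sketch, with hypothetical names:

```python
MP3_MARGIN_DB = 1.0  # headroom reserved for decoder overshoot, per the text above

def limiter_ceilings(user_tp_dbtp):
    """Return the WAV-path and MP3-path ceilings in dBTP and linear amplitude."""
    wav_db = user_tp_dbtp
    mp3_db = user_tp_dbtp - MP3_MARGIN_DB
    return {
        "wav_dbtp": wav_db,
        "mp3_dbtp": mp3_db,
        "wav_linear": 10.0 ** (wav_db / 20.0),
        "mp3_linear": 10.0 ** (mp3_db / 20.0),
    }

default_ceilings = limiter_ceilings(-1.0)
relaxed_ceilings = limiter_ceilings(-0.5)
```

Because the MP3 ceiling is derived from the user's setting rather than fixed, changing TP moves both limiters together.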

RNNoise: Internal Measurement Tool

RNNoise is not a user-facing processing stage in WaxOn. It runs internally during the loudnorm analysis pass on a temporary copy of the audio to improve measurement accuracy on noisy recordings. The output signal is never touched by RNNoise. See NR-for-Measurement for how and why this works.

The following covers the algorithm's design and behavior for readers who want to understand what is running under the hood.

History & Context Background and Origins

RNNoise was developed by Jean-Marc Valin at Mozilla in 2017–2018 and released as open source under the BSD license. Valin is also a principal author of the Opus audio codec, the codec used by WebRTC, Discord, Zoom, and virtually every real-time web audio application. His work on Opus included extensive research into perceptual audio coding and voice intelligibility under compression, which directly informed the approach taken in RNNoise.

The project grew from a practical problem in WebRTC: browser-based voice communication was plagued by background noise (keyboard clicks, HVAC, crowd noise, fan hum) that conventional noise suppression handled poorly, either leaving too much noise or introducing the characteristic warbling, underwater artifacts of aggressive spectral subtraction. Valin's hypothesis was that a machine learning approach trained specifically on speech could do better.

The original paper, A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement (arXiv:1709.08243), was presented at the IEEE Multimedia Signal Processing Workshop in 2018. It has since been widely cited in the speech enhancement literature and influenced subsequent neural audio processing work at companies including Google, Microsoft, and Amazon.

Under the Hood Architecture: Gated Recurrent Units

RNNoise is a recurrent neural network using Gated Recurrent Units (GRUs), a variant of LSTM that uses fewer parameters and trains faster while retaining the ability to model temporal dependencies across variable-length sequences. The key difference from LSTM is that GRU combines the forget and input gates into a single "update gate" and merges the cell state with the hidden state, reducing the parameter count by roughly 25% per layer. The architecture is deliberately small: the network has roughly 100,000 parameters total, making real-time inference feasible on hardware as constrained as embedded processors with no dedicated GPU.
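The 25% figure follows from counting weight blocks: a textbook LSTM layer has four (three gates plus the candidate), a GRU layer has three. The sketch below counts parameters for one layer of each; the layer sizes are hypothetical and chosen only for illustration, and some implementations carry an extra bias vector that this count ignores.

```python
def gated_rnn_params(input_size, hidden_size, weight_blocks):
    """Parameters in one recurrent layer with the given number of weight blocks."""
    # Each block has input weights, recurrent weights, and one bias vector.
    per_block = (hidden_size * input_size
                 + hidden_size * hidden_size
                 + hidden_size)
    return weight_blocks * per_block

lstm_params = gated_rnn_params(42, 96, weight_blocks=4)
gru_params = gated_rnn_params(42, 96, weight_blocks=3)
saving = 1 - gru_params / lstm_params
```

The saving is independent of the layer sizes, since both counts scale with the same per-block term.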

The network processes audio in the frequency domain using the Opus codec's Bark-scale filterbank: 22 critical bands that approximate the frequency resolution of human hearing. This is a key design choice. Rather than learning to operate on raw waveforms (which requires modeling extremely long-range sample dependencies) or on fixed FFT bins (which don't match perceptual resolution), RNNoise works on the same perceptual frequency representation that the ear itself uses. The Bark scale groups frequencies into bands of roughly equal perceptual width: narrow bands at low frequencies (where pitch discrimination is fine) and progressively wider bands at high frequencies (where the ear integrates more broadly).

For each 10 ms frame of audio, the network computes a set of spectral gains (one per band) between 0 and 1. A gain of 1.0 means that band is passed through unmodified. A gain of 0 means it is fully suppressed. Intermediate values attenuate partially. The gains are applied multiplicatively to the band energies, and the modified spectrum is reconstructed back to a waveform. The network never synthesizes audio; it only decides how much of each perceptual band to suppress in each frame.

Under the Hood Training and the Model File

The bundled model (bd.rnnn from the rnnoise-models repository) was trained on a large corpus of speech (multiple speakers, multiple languages, multiple recording conditions) mixed with a wide variety of noise types: HVAC, traffic, crowd noise, fan hum, electrical interference, and broadband pink and white noise. The model learns to identify which spectral components correspond to voice and which correspond to noise, using temporal context (the GRU's hidden state) to distinguish steady-state noise from transient speech components.

Training required both clean speech recordings and noise-only recordings, which were artificially mixed at various signal-to-noise ratios. The network learned the difference between speech-shaped energy and noise-shaped energy across thousands of examples. Because the training data was multilingual and broad-spectrum, the resulting model generalizes well across different speakers, accents, and recording conditions without any per-speaker adaptation.

What It Suppresses Well, and Poorly

RNNoise excels at steady-state, spectrally diffuse noise: HVAC hum, room tone, computer fan noise, broadband electrical hiss, and low-level crowd ambience. These share a characteristic spectral profile that is relatively stable over time and distributes energy broadly, making them easy for the network to distinguish from voice. On clean recordings with consistent low-level background noise, suppression is typically very effective and inaudible.

It handles poorly, and can introduce artifacts with, the following:

The artifact profile when limits are exceeded is typically a subtle warbling or underwater quality, the same category of artifact produced by spectral subtraction noise gates, though usually less severe. On moderate-noise, clean-voice recordings, the algorithm is essentially transparent.

Stereo Handling: Per-Channel Split

RNNoise was designed and trained exclusively on mono 48 kHz speech. When FFmpeg's arnndn filter receives a stereo input, it creates separate denoiser instances per channel and processes them independently. In practice, this can produce unpredictable results: the per-channel recurrent states diverge, and one channel (typically the second) may be over-gated or heavily attenuated, even when both channels carry similar content and noise levels.

The root cause is that the model's internal gain computation is frame-by-frame and depends on its recurrent hidden state. With stereo input, slight differences between channels (different mic angles, room reflections, or even minor level offsets from recording) can cause the model to classify one channel as "more noisy" than the other and gate it more aggressively. The model has no concept of channel correlation or stereo coherence.

WaxOn solves this by splitting stereo audio into independent mono channels before applying RNNoise, then rejoining the denoised channels back into stereo.

Under the Hood filter_complex graph

This uses FFmpeg's filter_complex graph:

[0:a]channelsplit=channel_layout=stereo[L][R];
[L]arnndn=m=/path/to/model[Lnr];
[R]arnndn=m=/path/to/model[Rnr];
[Lnr][Rnr]join=inputs=2:channel_layout=stereo

Each channel receives its own fully independent denoiser instance with its own recurrent state, initialized cleanly. The model processes each as a standard mono stream — the format it was trained on — and the results are predictable and balanced. The remaining filter chain (high-pass, phase rotation, resample) runs on the rejoined stereo signal.

For mono output, the issue does not arise. The pipeline's first FFmpeg stage already extracts the selected channel via pan (alongside the high-pass and phase-rotation filters), so the audio reaching the loudnorm analysis pass is already a single channel. The simple -af "arnndn=m=…" chain runs over that mono signal — one denoiser instance, one stream — with no per-channel state divergence to worry about.
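The choice between the two invocations could be expressed as a small helper. This is a Python illustration with a hypothetical function name; the graph itself is the one shown above, and the model path is the same placeholder.

```python
def arnndn_args(model_path, stereo):
    """Return the FFmpeg filter arguments for the denoise step."""
    if not stereo:
        # Mono: one denoiser instance on a single stream.
        return ["-af", f"arnndn=m={model_path}"]
    # Stereo: split, denoise each channel independently, then rejoin.
    graph = (
        "[0:a]channelsplit=channel_layout=stereo[L][R];"
        f"[L]arnndn=m={model_path}[Lnr];"
        f"[R]arnndn=m={model_path}[Rnr];"
        "[Lnr][Rnr]join=inputs=2:channel_layout=stereo"
    )
    return ["-filter_complex", graph]

mono_args = arnndn_args("/path/to/model", stereo=False)
stereo_args = arnndn_args("/path/to/model", stereo=True)
```

A real model path containing colons, commas, or other characters that are special in FFmpeg filter syntax would need escaping, which this sketch omits.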

The same per-channel split is applied in the NR-for-measurement analysis paths.

Under the Hood FFmpeg Implementation

The arnndn filter in FFmpeg wraps the RNNoise library. It requires an external model file provided via the m= parameter:

arnndn=m=/path/to/model

The model file is bundled in the app's resources directory. WaxOn locates it at runtime using Bundle.main.url(forResource:withExtension:) and passes the resolved path to FFmpeg. In the analysis pass, for mono output the filter runs on the temp copy via a simple -af chain. For stereo output, a -filter_complex graph splits, denoises, and rejoins the channels as described above.

Processing latency for arnndn is negligible for batch processing purposes. The network processes audio in 10 ms frames. For a 60-minute recording, the total added processing time is a few seconds on Apple Silicon.

Noise Floor Estimation

WaxOn/WaxOff estimates the noise floor of each loaded file and displays it as the FLOOR stat in the file stats panel. The estimate is computed during the same analysis pass that produces RMS, peak, crest factor, and LUFS, at no additional cost.

The Problem

Broadband background noise (HVAC, room tone, preamp hiss) occupies spectral space continuously, including during pauses between speech. This noise contributes to the integrated loudness measurement in two ways:

  1. K-weighting amplifies it. The pre-filter's ~4 dB high shelf boost above 1.7 kHz increases the measured energy of broadband hiss, which has significant energy in the 2–10 kHz range. The loudness measurement sees the noise as louder than it subjectively is.
  2. Noise fills gated blocks. The relative gate excludes blocks more than 10 LU below the absolute-gated mean. In a clean recording, pauses between sentences fall below this threshold and are excluded. In a noisy recording, noise energy keeps those blocks above the gate threshold, and they contribute to the integrated loudness value.

The net effect: noisy files measure louder than their speech content actually is. When loudness normalization targets a specific LUFS value, the gain applied is less than the speech needs. The speech ends up under target.

Estimation Method

The analyzer divides the audio into non-overlapping 400 ms blocks (the same block size used for LUFS gating) and computes the mono RMS of each block. The noise floor estimate is the 10th percentile of these block RMS values, converted to dBFS.

The 10th percentile was chosen because it represents the quietest 10% of the file's blocks. For speech recordings, the quietest blocks are the pauses, breaths, and gaps where the microphone is capturing only the ambient environment. The 10th percentile is more robust than the absolute minimum (which might catch a single anomalously quiet block) while still reflecting the true background level rather than the speech level.

At least 5 blocks are required for a meaningful estimate (about 2 seconds of audio). Shorter files show no FLOOR stat.
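The estimation method can be sketched as follows. This is an illustrative Python version, not the analyzer's source: the function name is hypothetical, and the nearest-rank percentile is an assumption, since the exact percentile method is not stated here.

```python
import math

def noise_floor_dbfs(samples, sample_rate, block_seconds=0.4, min_blocks=5):
    """samples: mono floats in [-1, 1]. Returns dBFS, or None if too short."""
    block_len = int(round(sample_rate * block_seconds))
    n_blocks = len(samples) // block_len  # non-overlapping; drop the remainder
    if n_blocks < min_blocks:
        return None
    rms_values = []
    for i in range(n_blocks):
        block = samples[i * block_len:(i + 1) * block_len]
        rms_values.append(math.sqrt(sum(s * s for s in block) / block_len))
    rms_values.sort()
    # Nearest-rank 10th percentile (ASSUMED method).
    rank = max(1, math.ceil(0.10 * n_blocks))
    floor = rms_values[rank - 1]
    return 20.0 * math.log10(floor) if floor > 0 else float("-inf")

# Toy signal at a 1 kHz sample rate: three quiet blocks, seventeen loud ones.
toy = [0.001] * 1200 + [0.1] * 6800
floor_db = noise_floor_dbfs(toy, sample_rate=1000)
```

In the toy signal the quiet blocks sit at −60 dBFS and the loud ones at −20 dBFS; the estimate tracks the quiet blocks, as the method intends.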

Thresholds and Color Coding

The FLOOR stat is color-coded in the stats panel:

Files with an orange or red noise floor also show a ⚠️ warning badge in the file list.

NR-for-Measurement

When Loudness Norm is enabled, WaxOn runs RNNoise on a temporary copy of the audio for the loudnorm analysis pass (pass 1) only. The normalization pass (pass 2) and all subsequent stages operate on the original, unmodified audio. This ensures that loudness measurements reflect the speech content rather than the noise floor, without altering the output.

Why This Works

The two-pass loudnorm process measures the file's integrated loudness in pass 1, then applies a single linear gain offset in pass 2. The gain offset is determined entirely by the pass 1 measurement. If pass 1 measures a noise-inflated loudness (file appears louder than the speech actually is), the computed gain will be too small, and speech will land under target.

By measuring the NR'd copy instead, the analysis reflects the loudness of the speech content with the noise floor suppressed. The computed gain offset is then applied to the original file. Because RNNoise primarily removes energy between and underneath words (not the speech itself), the speech content in the original and NR'd versions has approximately the same loudness. The gain derived from the clean measurement lands the speech close to the target.
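A worked example with made-up numbers shows the size of the effect. The loudness values below are hypothetical, and the sketch treats normalization as a pure gain offset, ignoring any true-peak constraint.

```python
def linear_gain_db(target_lufs, measured_lufs):
    """Gain that linear normalization applies for a given pass-1 measurement."""
    return target_lufs - measured_lufs

target = -16.0
speech_loudness = -24.0     # hypothetical: what the denoised copy measures
noisy_measurement = -22.5   # hypothetical: same file measured with its noise

gain_from_noisy = linear_gain_db(target, noisy_measurement)
gain_from_clean = linear_gain_db(target, speech_loudness)

# Where the speech itself lands once each gain is applied to the original file.
speech_after_noisy = speech_loudness + gain_from_noisy
speech_after_clean = speech_loudness + gain_from_clean
```

With these numbers, the noise-inflated measurement leaves the speech 1.5 LU under target, while the measurement taken on the denoised copy lands it on target.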

The noise floor in the original file does come along for the ride. It is amplified by the same gain as the speech. But the philosophy here is pragmatic: WaxOn is a prep tool for DAW editing. If the noise is bad enough to matter, it will be treated in the DAW (or in a dedicated NR tool like RX). Getting the speech to the right level for editing is the higher priority.

For stereo output, the NR-for-measurement paths use the same per-channel split: stereo is split into independent mono channels, each denoised separately, then rejoined before the loudnorm analysis. This ensures consistent, balanced noise removal for accurate measurement regardless of channel layout.

Cost

NR-for-measurement adds one additional FFmpeg pass per loudnorm analysis (running RNNoise on the intermediate audio to a temporary file). For a typical podcast recording on Apple Silicon, this adds a few seconds. The temporary NR'd files are created in the working directory and deleted automatically after processing.

Phase Rotation and Crest Factor

Crest Factor

Crest factor is the ratio of a signal's peak level to its RMS level, expressed in dB:

Crest Factor (dB) = Peak (dBFS) − RMS (dBFS)

Typical unprocessed speech has a crest factor of 18–25 dB, with 20–23 dB most commonly cited in speech processing literature. High crest factor has a practical consequence for loudness normalization: to reach a loudness target without exceeding the ceiling, the limiter must apply more gain reduction (limiting). More limiting means more audible artifacts: transient softening, pumping, coloration.

Reducing crest factor before normalization means the same LUFS target can be reached with less limiting and more transparency.
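As a quick illustration of the definition, the sketch below computes crest factor for two reference waveforms. It uses sample peak rather than true peak, and the function name is illustrative only.

```python
import numpy as np

def crest_factor_db(x):
    """Crest factor in dB: peak level minus RMS level (sample peak, not true peak)."""
    x = np.asarray(x, dtype=np.float64)
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(peak / rms)

n = np.arange(1000)
sine = np.sin(2 * np.pi * n / 100)                      # 10 full cycles
square = np.sign(np.sin(2 * np.pi * (n + 0.5) / 100))   # offset avoids zero samples

sine_cf = crest_factor_db(sine)       # about 3.01 dB
square_cf = crest_factor_db(square)   # 0 dB: peak equals RMS
```

A sine wave sits at about 3 dB and a square wave at 0 dB, which puts the 18–25 dB figure for unprocessed speech in perspective.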

How Allpass Filtering Reduces Crest Factor

An allpass filter passes all frequencies at equal amplitude but shifts the phase of different frequencies by different amounts. It doesn't alter the frequency response; it only changes when different frequency components arrive relative to each other.

Under the Hood: Filter implementation

Both modes use FFmpeg's allpass filter at f=200, t=q, w=0.707 — a second-order (biquad) allpass with a Butterworth-Q response. The magnitude |H(z)| = 1 at all frequencies (unity gain). The phase response varies continuously from 0° at DC to −360° at Nyquist, passing through −180° at the design frequency. Frequencies below the design frequency are shifted slightly; frequencies above it are shifted more. The relative timing of low-frequency and mid-frequency components in the waveform changes, but their amplitudes do not.
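The unity-gain and phase claims can be checked numerically. The sketch below uses the widely published Audio EQ Cookbook (RBJ) second-order allpass form; that FFmpeg's filter uses exactly these coefficients is an assumption here, as is the 48 kHz sample rate.

```python
import numpy as np

def allpass_biquad(f0, q, fs):
    """Second-order allpass coefficients (Audio EQ Cookbook form), normalized by a0."""
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 - alpha, -2 * np.cos(w0), 1 + alpha])
    a = np.array([1 + alpha, -2 * np.cos(w0), 1 - alpha])
    return b / a[0], a / a[0]

def freq_response(b, a, freqs, fs):
    """Evaluate H(z) on the unit circle at the given frequencies in Hz."""
    zi = np.exp(-1j * 2 * np.pi * np.asarray(freqs) / fs)  # z^-1
    num = b[0] + b[1] * zi + b[2] * zi ** 2
    den = a[0] + a[1] * zi + a[2] * zi ** 2
    return num / den

fs = 48000  # assumed sample rate for this sketch
b, a = allpass_biquad(200.0, 0.707, fs)
freqs = np.array([20.0, 100.0, 200.0, 1000.0, 10000.0])
h = freq_response(b, a, freqs, fs)
magnitude = np.abs(h)                     # 1.0 at every frequency
phase_deg = np.degrees(np.angle(h))       # wrapped to (-180, 180]
```

At the design frequency the response evaluates to −1, which is a 180° phase shift at unity gain; `np.angle` wraps phase, so the values above 200 Hz appear as positive angles rather than values below −180°.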

Much of the peak asymmetry in voice audio comes from low-frequency content: proximity effect from cardioid microphones, low-frequency resonances in recording spaces, and bass-heavy content in finished mixes. This energy tends to create asymmetric waveforms where one polarity consistently peaks higher than the other.

Proximity effect is worth understanding in detail because it affects nearly every podcast recording. Directional microphones (cardioids, supercardioids, figure-8 patterns) exhibit increasing bass boost as the sound source moves closer, beginning around 12 inches and growing progressively stronger below approximately 100–200 Hz. For cardioids at typical close-mic distances, proximity effect typically adds 6–12 dB of bass boost; figure-8 patterns can reach 20 dB or more at the same distances due to their stronger gradient response. Omnidirectional microphones do not exhibit proximity effect, but the cardioid pattern dominates consumer and prosumer podcast microphones (Shure SM7B, Audio-Technica ATR2100, most USB microphones), making this a near-universal issue. Podcasters without broadcast training tend to position themselves very close to their microphones to minimize room noise, an instinct that unfortunately triggers the strongest proximity effect and produces the most bass-heavy, asymmetric waveforms. The result lands squarely in the 150–250 Hz range that phase rotation is designed to address.

An allpass filter in the low-frequency range redistributes the phase relationships between bass components and midrange components, making peaks more symmetric. The result is a lower crest factor (peaks are shorter relative to average level) without any change to the frequency response or audible character of the audio.

The effect is genuinely inaudible. Human hearing is largely insensitive to absolute phase at audio frequencies. The cochlea performs a frequency decomposition that discards phase information. This is why polarity inversion (flipping the sign of every sample) and allpass filtering (frequency-dependent phase shift) are both perceptually transparent, despite being mathematically significant transformations of the waveform.

Quantifying the Effect

On typical podcast recordings with moderate proximity effect, allpass phase rotation at 200 Hz reduces crest factor by 1–4 dB. A 3 dB crest factor reduction means the limiter needs to apply 3 dB less gain reduction to stay below the same ceiling at the same loudness target. That translates directly to fewer audible limiting artifacts. On clean, well-recorded speech with minimal bass buildup, the crest factor reduction is smaller (0.5–1 dB). On a single-microphone recording the allpass has no audible downside: it costs nothing in audio quality and can only help. The caveats below cover the exceptions, principally multi-microphone configurations in shared acoustic space.

Multiple Microphones in the Same Room

The following is a workflow consideration, not an implementation limitation. WaxOn processes each file correctly. The question is whether applying phase rotation per-track — before the tracks are mixed — is the right place in the production chain to apply it, when those tracks were captured simultaneously in the same room.

The "no downside" claim above assumes a single microphone capturing a single source. In multi-mic recordings — two podcasters at separate mics in the same room, a host-and-guest setup, a roundtable, or any configuration where one acoustic source reaches multiple mics — the picture changes.

Consider two speakers, A and B, each on their own cardioid microphone. Speaker A's voice reaches both mics: strongly into A's mic, faintly into B's. This faint copy is called bleed or spill. The bleed and the main signal are acoustically coherent — the same waveform captured at slightly different times and levels — so when the tracks are summed in the mix, the bleed adds to the main signal in a phase-related way. The result has some comb filtering already (any time a signal sums with a delayed copy of itself, it does), but the comb is fixed and a property of the room.

Apply phase rotation independently to each track and that coherence is broken. Each track receives the same allpass filter, but the bleed on track B has different spectral content than the main signal on track A — it is highpass-shaped by the off-axis response of B's mic, attenuated by distance, and colored by reflections — so it emerges from the filter in a different phase state. Summing the tracks now combines a phase-shifted version of A's voice with a differently-phase-shifted version of A's voice. Comb filtering deepens, and the notches shift in frequency relative to what the room produced naturally. The audible result is a thinness or hollowness in the low-mid range, sometimes described as "phasey."

How bad it gets depends on bleed level. Close cardioid placement at typical podcast distances usually keeps bleed 20 to 30 dB below the direct signal, where the additional comb filtering from independent phase rotation is below the threshold of audibility. Loose placement, omnidirectional mics, or talkers who lean back from the mic put bleed only 10 to 15 dB below the direct signal, where artifacts become noticeable.

The 3:1 rule. A useful guideline from broadcast practice: the distance between two microphones should be at least three times the distance from each microphone to its source. With talkers six inches from their mics, the mics themselves should be eighteen inches apart or more. This keeps bleed roughly 9–10 dB below the direct signal, low enough that downstream processing — phase rotation, EQ, compression — does not interact destructively across tracks.
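The 9–10 dB figure follows from the inverse-distance law. The sketch below assumes free-field propagation (level falls 6 dB per doubling of distance) and ignores microphone polar pattern and room reflections; the function name is illustrative.

```python
import math

def bleed_below_direct_db(direct_distance, bleed_distance):
    """Level difference in dB between the direct path and the bleed path,
    assuming free-field inverse-distance attenuation only."""
    return 20.0 * math.log10(bleed_distance / direct_distance)

# Talker 6 inches from their own mic, 18 inches from the other mic.
three_to_one = bleed_below_direct_db(6.0, 18.0)   # about 9.5 dB
```

In practice a cardioid's off-axis rejection adds further attenuation on top of this distance-only figure.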

Working around it. If you suspect significant bleed in a multi-track session:

Phase Rotation at the Delivery Stage

WaxOff applies the same allpass filter at the same frequency, but the context is different. By the time audio reaches WaxOff it is a finished, summed mix — every track has been combined, leveled, and committed to a single stereo file — so the per-track bleed concerns above no longer apply. The filter operates on the summed program, exactly the recommended approach when multi-mic bleed is a concern at the WaxOn stage.

The case for phase rotation is actually stronger at the delivery stage than at ingest. Music beds, stingers, archival clips, and broadcast inserts in a podcast mix tend to be more asymmetric than solo voice — bass-heavy program material, sustained low notes, and pre-mastered loud sources all introduce waveform asymmetry that the WaxOff limiter would otherwise have to absorb. Reducing crest factor before loudnorm's TP measurement means the pass-1 measurement reflects the actual peak budget of the rendered output, so the pass-2 linear gain correction is accurate. Without it, loudnorm measures the peaks of the un-rotated signal and constrains the gain to that (lower) peak budget, producing an output that fails to hit the target LUFS by the headroom phase rotation would have recovered. WaxOff applies phase rotation unconditionally for this reason — there is no toggle, since at the delivery stage there is no scenario in which skipping it produces a better result.

Stereo Recordings and Channel Coherence

WaxOn applies phase rotation as a single allpass filter across the full signal — both channels receive exactly the same coefficients simultaneously, so the inter-channel phase relationship that defines the stereo image is preserved. Unlike the RNNoise stage (which requires an explicit per-channel split because the model was trained on mono), the allpass filter operates correctly on stereo input by design and needs no special handling.

The pitfall is in pre-processing outside the app: if a stereo recording is split into two mono files, processed individually through different tools or with different settings, and then recombined, the stereo image can shift unpredictably. Mid information (L + R) and side information (L − R) both depend on phase agreement between channels. A 90° shift on one channel and not the other moves energy from the mid component into the side component at that frequency, widening or hollowing the center image and degrading mono compatibility.

This matters most for stereo room recordings (XY, ORTF, spaced pair) where the image is built from genuine inter-channel arrival-time and level differences. It matters less for stereo created by panning mono sources, where both channels carry the same waveform with only a level difference; in that case an identical allpass on both leaves the image untouched.

What Phase Rotation Cannot Fix

Allpass phase rotation addresses asymmetric peaks caused by phase relationships between low-frequency and mid-frequency content. Several other sources of waveform asymmetry look similar on a meter but are immune to phase manipulation:

Dynamic Leveling

Dynamic Leveling is an optional WaxOn processing stage that uses FFmpeg's dynaudnorm filter in bidirectional mode to even out level differences across a recording. It can both attenuate loud sections and boost quiet ones. It runs after filtering and before loudness normalization.

The Problem It Solves

Dynamic Leveling handles recordings where level variation is structural — a panel discussion where audience questions are half the volume of the speakers, a live Q&A where the presenter is near the mic and questioners are across the room, or a multi-person recording where mic placement was inconsistent. You need to lift the quiet material and tame the loud material simultaneously.

dynaudnorm is inherently a two-pass, lookahead-style leveler — it lifts quiet sections toward the target peak and attenuates loud ones. This is the right behavior for uneven source material, but it produces audible pumping on solo voice recordings with natural pauses: as the filter adjusts gain across the silence between sentences, the noise floor breathes in and out. WaxOn does not gate near-silent frames via dynaudnorm's silence threshold (t) parameter, because an active threshold causes severe attenuation at speech-to-silence transitions: the Gaussian smoothing window interpolates between gated (unity-gain) and ungated frames, producing audible fade-outs on the trailing edge of every utterance. The trade-off is that the noise floor between words is boosted by up to the maximum gain factor — acceptable on the clean source material Dynamic Leveling targets. Dynamic Leveling is a specialty tool for multi-voice sources, not a general-purpose enhancement for solo recordings.

The Aggressiveness Slider

The Aggressiveness slider is a 0–1 control that maps to three dynaudnorm parameters simultaneously:

One parameter is fixed across the slider range:

Gentle is the right starting point for most material. Aggressive is appropriate when level differences are extreme and the source already has significant background noise that boosting won't make materially worse.

Placement in the Pipeline

Dynamic Leveling runs before loudness normalization. This means the loudnorm analysis pass sees a more consistent signal — bringing the overall level balance closer to even before measurement means the integrated loudness reading reflects typical content level rather than being skewed by a structural imbalance between speakers.

Mirror-Padding Boundary Fix

The Gaussian-weighted gain computation in dynaudnorm produces boundary artifacts at both ends of the file. The naive workaround of padding with silence does not solve this: dynaudnorm assigns silent frames a gain of 1.0 (silence-threshold behavior), and those unity-gain values get averaged into the smoothing window at the audio/silence boundary, pulling the smoothed gain down and producing an audible ramp into the real audio.

The actual fix is mirror padding: prepend a reversed copy of the first 16 seconds of audio, append a reversed copy of the last 16 seconds. The smoothing window now sees real audio with matching gain values on both sides of the boundary, so the smoothed gain at the edge matches what it would be in the middle of the file. After processing, the padding is trimmed off with atrim and the output matches the original duration. For clips shorter than 16 seconds, the pad length is capped at the clip's length.
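The pad-process-trim pattern can be sketched with NumPy arrays. This is an illustration of the idea, not WaxOn's FFmpeg filter graph: the function names are hypothetical, and `process` stands in for any length-preserving, zero-latency leveler.

```python
import numpy as np

def mirror_pad(x, sample_rate, pad_seconds=16.0):
    """Prepend a reversed copy of the head and append a reversed copy of the tail."""
    pad = min(int(pad_seconds * sample_rate), len(x))  # cap at clip length
    head = x[:pad][::-1]
    tail = x[-pad:][::-1]
    return np.concatenate([head, x, tail]), pad

def process_with_mirror_padding(x, sample_rate, process):
    """Run `process` on the padded signal, then trim back to the original span."""
    padded, pad = mirror_pad(x, sample_rate)
    y = process(padded)            # assumed to return the same number of samples
    return y[pad:pad + len(x)]

x = np.arange(100.0)
padded, pad = mirror_pad(x, sample_rate=2)            # 16 s at 2 Hz = 32 samples
short_padded, short_pad = mirror_pad(np.arange(10.0), sample_rate=1)
out = process_with_mirror_padding(x, 2, lambda s: s * 2.0)
```

Because the padding is a mirror image, the samples on either side of each boundary are identical, so a smoothing window centered on the edge sees material at the same level on both sides.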

WaxOn Design Rationale

Stage Order

The WaxOn pipeline stage order is deliberate:

  1. High-pass filter first: Low-frequency content below 80 Hz is removed before any gain stage processes it. This energy carries little speech information but inflates the loudness measurement, so loudnorm would apply too little gain to the content you care about, and it makes the limiter work harder than necessary on energy that isn't useful.
  2. Channel selection before phase rotation: If extracting mono from a stereo source, do it first so the allpass filter operates on the actual mono signal, not a wider stereo version of it. The loudnorm analysis then also measures the real output signal.
  3. Phase rotation before normalization: Reduces crest factor so that the loudnorm analysis measures a waveform that more accurately represents what the limiter will see after normalization.
  4. Dynamic leveling before normalization: Bidirectional leveling runs after filtering. Bringing the overall level balance closer to even before the loudnorm analysis pass means the integrated loudness measurement reflects the typical level of the material rather than being skewed by large quiet sections or loud outliers.
  5. Limiter last: After any loudness normalization, with oversampling to catch true peaks.

LRA=20 in WaxOn Loudnorm

WaxOn's loudnorm hardcodes LRA=20. The LRA parameter tells the loudnorm filter how aggressively to constrain the dynamic range; lower values apply more dynamic compression. At LRA=20, the filter applies essentially no dynamic processing. It acts as a pure linear gain offset.

This is intentional for ingest. WaxOn is a pre-editing tool. You want your recordings to arrive at your DAW at consistent levels, but with their original dynamic character intact. Any dynamic processing at this stage would fight against the compression and automation you'll apply during editing. LRA=20 ensures loudnorm does exactly one thing in WaxOn: level adjustment.
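For orientation, a pass-2 filter string with these settings might be assembled as below. This is a sketch, not WaxOn's actual command construction: the helper name is hypothetical and the measurement values are made up. The option names and the JSON keys are those of FFmpeg's loudnorm filter.

```python
def loudnorm_pass2_filter(measured, target_i=-30.0, target_tp=-1.0, target_lra=20.0):
    """Build a loudnorm filter string for the second (linear) pass.

    `measured` holds values from the pass-1 JSON report
    (input_i, input_tp, input_lra, input_thresh, target_offset).
    """
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={measured['input_i']}"
        f":measured_TP={measured['input_tp']}"
        f":measured_LRA={measured['input_lra']}"
        f":measured_thresh={measured['input_thresh']}"
        f":offset={measured['target_offset']}"
        ":linear=true"
    )

# Hypothetical pass-1 results, for illustration only.
example_measurement = {
    "input_i": "-27.61",
    "input_tp": "-4.47",
    "input_lra": "18.06",
    "input_thresh": "-39.20",
    "target_offset": "0.58",
}
filter_string = loudnorm_pass2_filter(example_measurement)
```

The resulting string would be passed to FFmpeg's `-af` option for the normalization pass.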

Default Loudnorm Target: −30 LUFS

−30 LUFS is conservative by design. At this level, even a recording with significant dynamic range and a crest factor of 20 dB will have peaks around −10 dBFS, well under the −1.0 dBTP ceiling, giving the limiter ample headroom. The goal is to bring different recordings to a consistent level for editing, not to hit a delivery target. −30 LUFS leaves plenty of room for the final mix to breathe.

Loudnorm TP and Limiter Ceiling

Both the loudnorm TP parameter and the alimiter limit are set to −1.0 dBTP. These are not redundant:

Setting both to the same value means loudnorm and the limiter are working toward the same goal. If loudnorm succeeds, the limiter barely engages. If loudnorm slightly overshoots, the limiter catches it. The two stages are complementary, not redundant.

WaxOff Design Rationale

Always Stereo Output

WaxOff always outputs 2-channel stereo regardless of the input channel count. If the source is mono, FFmpeg upmixes it to dual-mono — identical left and right channels. This is correct behavior for podcast delivery: most platforms and players expect stereo files, some handle mono inconsistently, and the file size difference is negligible at typical podcast bitrates. A dual-mono stereo file sounds completely natural — the content is centered as expected, with no audible difference from a true stereo file for voice content.

No High-Pass Filter

WaxOff doesn't include a high-pass filter. By the time a mix reaches WaxOff, it has presumably been edited and processed in a DAW. High-pass filtering, EQ, and cleanup are part of the editing workflow. WaxOff assumes the mix is already correct and applies only the normalization needed for delivery.

Fixed LRA=9 (with hidden override)

WaxOff sets LRA=9 on the loudnorm filter. The value is configurable in code — WaxOffSettings.lra, default 9.0 — but no UI control exposes it, so in normal use it's a fixed value. For delivery, some macro-dynamic constraint is appropriate: a podcast episode should have a consistent loudness profile throughout, and most podcasts are consumed on earbuds, laptop speakers, or in cars where wide macro-dynamics push quiet passages below ambient noise and loud passages over the listening comfort threshold. 9 LU sits in the typical podcast delivery range (6–10 LU) — loose enough to preserve a well-balanced mix's natural dynamics, tight enough to constrain occasional overly-dynamic material.

The loudnorm filter with LRA=9 applies gentle, program-level gain changes (not sample-by-sample compression). The effect is less aggressive than any compressor you would have used during editing.

Delivery Targets

| Platform | Target LUFS | Max True Peak |
| --- | --- | --- |
| Apple Podcasts | −16 LUFS (normalized) | −1.0 dBTP |
| Spotify | −14 LUFS (normalized) | −1.0 dBTP |
| Buzzsprout | −19 LUFS (mono) / −16 LUFS (stereo) | −1.0 dBTP |
| YouTube | −14 LUFS (normalized) | −1.0 dBTP |
| EBU R128 (broadcast) | −23 LUFS | −1.0 dBTP |

Most streaming platforms normalize incoming audio to their own target on playback, so delivering at −18 LUFS versus −16 LUFS won't make your episode sound quieter or louder to listeners (the platform adjusts). What matters most is staying below the true peak ceiling to avoid clipping during that normalization step.

One platform asymmetry is worth knowing: YouTube only normalizes downward. It will not boost content that is quieter than −14 LUFS. Spotify normalizes in both directions. Apple Podcasts normalizes both ways as well. This means a mix delivered at −23 LUFS will sound quieter than expected on YouTube even though it is compliant, while on Spotify it will be boosted to match −14 LUFS. For podcast delivery, this is rarely a real-world issue since vocal content at −18 LUFS will be boosted on both Spotify and Apple Podcasts, but it matters if you are optimizing for a single platform.
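The asymmetry can be expressed as a simple model. This is a simplification under stated assumptions: real platforms may cap boosts by available peak headroom, and the function name and boolean flag are hypothetical.

```python
def playback_gain_db(delivered_lufs, platform_target_lufs, boosts_quiet_content):
    """Gain a platform would apply on playback under a simple model:
    move the program to the platform target, unless the platform only
    normalizes downward and the program is already quieter than target."""
    gain = platform_target_lufs - delivered_lufs
    if gain > 0 and not boosts_quiet_content:
        return 0.0
    return gain

downward_only_quiet = playback_gain_db(-23.0, -14.0, boosts_quiet_content=False)  # no boost
both_ways_quiet = playback_gain_db(-23.0, -14.0, boosts_quiet_content=True)       # +9 dB
downward_only_loud = playback_gain_db(-10.0, -14.0, boosts_quiet_content=False)   # -4 dB
```

Under this model a −23 LUFS file stays at −23 LUFS on a downward-only platform and is raised 9 dB on a platform that normalizes in both directions.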

The Audio Engineering Society recommends −16 to −20 LUFS as the appropriate range for talk-based podcast content, with −18 LUFS as the practical center. The reasoning is threefold: mobile playback amplification is limited (content at −23 LUFS is difficult to hear in noisy environments like commuting), podcast consumption typically happens in ambient noise where higher average loudness aids intelligibility, and −18 LUFS sits safely between all the major platform targets. It will be boosted modestly by Apple and Spotify rather than aggressively attenuated by either. Delivering at −14 LUFS, for example, would be attenuated by Apple Podcasts and is right at Spotify's ceiling, leaving no safety margin. The conservative −18 LUFS leaves room for platforms to boost cleanly without any risk of triggering codec clipping.

WaxOff's default of −18 LUFS with −1.0 dBTP is a safe, widely accepted podcast delivery target.

Output Format Rationale

24-bit WAV

Both WaxOn and WaxOff output 24-bit WAV as the primary format.

MP3 CBR

WaxOff's MP3 output uses CBR (constant bit rate) rather than VBR (variable bit rate). For podcast delivery:

Quantization, Dithering, and Why It Doesn't Apply Here

The Quantization Problem

Digital audio stores amplitude values as integers. A 16-bit system divides the amplitude range into 2¹⁶ = 65,536 discrete steps; a 24-bit system uses 2²⁴ = 16,777,216. When a continuous floating-point value is rounded to its nearest representable integer, the difference is quantization error.
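The step counts, and how far one step sits below full scale, follow directly from the word length. A small sketch (function names are illustrative):

```python
import math

def quantization_steps(bits):
    """Number of discrete amplitude values for a given word length."""
    return 2 ** bits

def step_below_full_scale_db(bits):
    """Ratio of full scale to one quantization step, in dB (about 6.02 dB per bit)."""
    return 20.0 * math.log10(2 ** bits)

steps_16 = quantization_steps(16)           # 65536
steps_24 = quantization_steps(24)           # 16777216
range_16 = step_below_full_scale_db(16)     # about 96.3 dB
range_24 = step_below_full_scale_db(24)     # about 144.5 dB
```

Each additional bit halves the step size, which is why 24-bit output places quantization error roughly 48 dB lower than 16-bit output.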

At high signal levels, quantization error is a negligible fraction of the signal amplitude. The problem surfaces at low levels: fade-outs, reverb tails, quiet passages, where the signal approaches the magnitude of a single quantization step. At that scale, the error is no longer random with respect to the signal; it becomes correlated. Correlated noise has harmonic structure. Harmonic noise is perceived as distortion.

The artifact is distinctive: as a 16-bit fade-out approaches silence, the smooth waveform begins to pixelate, crumbling into a grainy, granular texture. Engineers call it "going digital." It is most audible on sustained tones, piano decays, and reverb tails, anywhere a signal fades through the lower quantization steps rather than cutting abruptly.

History & Context: The Classical Demonstration

The canonical test (sometimes called the "fade-to-black") is simple: record a tone at a moderate level and fade it gradually to silence. Without dithering, the transition through the last few quantization steps produces a sequence of audible steps, then silence where the waveform simply stops being representable. The signal doesn't fade; it falls off a cliff.

Bob Katz, in Mastering Audio: The Art and the Science, describes piano decay as one of the most revealing cases. A sustained piano note fading naturally into a quiet room exposes quantization distortion immediately when compared against a properly dithered version. The undithered note develops a gritty texture as the decay reaches the noise floor, a form of distortion introduced by the word-length reduction itself, present nowhere in the original recording. He uses this comparison in workshops and has remarked that once engineers hear the difference, the idea of shipping 16-bit masters without dithering becomes unthinkable.

A less scientific but widely replicated demonstration: open any 16-bit DAW session, generate a −40 dBFS sine wave, export at 16-bit with dithering off, then again with TPDF dither. Zoom into the waveform near the fade-out in a spectral editor. The quantized version shows visible stairstepping. The dithered version shows a smooth descent into low-level noise. This is not a subtle difference at the bit level, even when it is subtle or inaudible at normal listening levels with typical program material.

The Mathematical Fix: TPDF Dither

The solution is counterintuitive: add noise to the signal before truncating it to a lower bit depth. This noise (dither) must be added at a specific amplitude and with a specific probability distribution to be effective.

The mathematical foundation was established by Stanley Lipshitz, Robert Wannamaker, and John Vanderkooy at the University of Waterloo in a series of papers beginning in 1984. Their central result: adding Triangular Probability Density Function (TPDF) noise (noise whose amplitude is distributed as a triangle between −1 and +1 LSB of the target word length) completely decorrelates the quantization error from the input signal. The quantization error becomes spectrally white and statistically independent of the audio. Harmonic distortion is eliminated; what remains is signal-independent white noise.

The Math: TPDF generation, variance, and subtractive dither

TPDF noise is generated by summing two independent rectangular (uniform) noise samples. The resulting amplitude distribution is triangular, hence the name. Its variance is exactly 1/6 LSB², which is the minimum required to whiten quantization error under the conditions relevant to audio. Subtractive dither (where the same dither signal is subtracted after truncation) can achieve perfect cancellation of quantization error in theory; non-subtractive TPDF (the practical, deployable form) achieves statistical decorrelation, which is sufficient for all real-world applications.

The practical consequence: a properly TPDF-dithered 16-bit file has a smooth, analog-like noise floor rather than correlated quantization distortion. The fade-to-black test produces a clean descent into white noise. The artifact is gone.
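The effect can be reproduced numerically. The sketch below works in units of one LSB and uses a sine whose peak (0.4 LSB, an illustrative choice) is too small to survive plain rounding; the function names are hypothetical.

```python
import numpy as np

def tpdf_dither(n, rng):
    """Sum of two independent uniform samples: triangular PDF spanning -1..+1 LSB."""
    return rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)

def quantize(x_lsb, dither=None):
    """Round to the nearest integer step, optionally adding dither first."""
    if dither is not None:
        x_lsb = x_lsb + dither
    return np.round(x_lsb)

rng = np.random.default_rng(0)
n = 200_000
signal = 0.4 * np.sin(2 * np.pi * np.arange(n) / 100.0)  # peak below half a step

plain = quantize(signal)                        # every sample rounds to zero
dither = tpdf_dither(n, rng)
dithered = quantize(signal, dither)             # noisy, but the signal survives

dither_variance = np.var(dither)                # close to 1/6 LSB^2
# Least-squares estimate of how much of the original sine is present in the output.
recovered = np.mean(dithered * signal) / np.mean(signal ** 2)   # close to 1.0
```

Without dither the tone disappears entirely, the numerical equivalent of the signal falling off a cliff. With TPDF dither the output is noisy, but on average it still carries the full sine.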

The Math: The dithering theorem in one sentence

Adding TPDF noise of variance 1/6 LSB² before truncation makes the quantization error a white, signal-independent noise process: the best possible outcome for a word-length reduction.

Noise Shaping

TPDF dithering trades correlated distortion for uncorrelated white noise. For archival files and intermediate stems, that trade is unconditionally correct. But the resulting noise floor sits at approximately −96 dBFS spread uniformly across the audio spectrum. Human hearing is not equally sensitive at all frequencies. Sensitivity peaks around 2–4 kHz and falls off substantially above ~15 kHz. Noise shaping exploits this asymmetry.

A noise-shaping filter feeds quantization error back into the system through a filter designed around a psychoacoustic model of human hearing. The filter pushes noise energy out of the 2–4 kHz sensitivity peak and into the 15–20 kHz region where hearing is least sensitive. Total noise energy is conserved (or slightly increased), but perceptually weighted noise (the noise you can actually hear) is reduced. A well-designed noise-shaped dither algorithm can achieve the perceived noise floor of a 20-bit system from a 16-bit word.
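The error-feedback mechanism can be illustrated with the simplest possible shaper. This sketch is first-order and not psychoacoustic: it only moves noise from low frequencies toward high frequencies, whereas commercial shapers use higher-order filters fitted to hearing thresholds. All names are hypothetical.

```python
import numpy as np

def tpdf(rng, n):
    """TPDF dither spanning -1..+1 LSB."""
    return rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)

def quantize_flat(x_lsb, rng):
    """Plain TPDF-dithered rounding: white error spectrum."""
    return np.round(x_lsb + tpdf(rng, len(x_lsb)))

def quantize_shaped(x_lsb, rng):
    """First-order error feedback: each sample's error is subtracted from the next input."""
    out = np.empty(len(x_lsb))
    dither = tpdf(rng, len(x_lsb))
    prev_err = 0.0
    for i in range(len(x_lsb)):
        v = x_lsb[i] - prev_err           # feed back the previous error
        out[i] = np.round(v + dither[i])
        prev_err = out[i] - v             # error made on this sample
    return out

def band_power(err, lo_frac, hi_frac):
    """Error energy between two fractions of the Nyquist band."""
    spec = np.abs(np.fft.rfft(err)) ** 2
    n = len(spec)
    return float(spec[int(lo_frac * n):int(hi_frac * n)].sum())

rng = np.random.default_rng(1)
n = 20_000
x = 100.0 * np.sin(2 * np.pi * np.arange(n) / 50.0)   # signal in LSB units

err_flat = quantize_flat(x, rng) - x
err_shaped = quantize_shaped(x, rng) - x

low_flat, low_shaped = band_power(err_flat, 0.0, 0.125), band_power(err_shaped, 0.0, 0.125)
total_flat, total_shaped = band_power(err_flat, 0.0, 1.0), band_power(err_shaped, 0.0, 1.0)
```

The shaped error has far less energy in the lowest eighth of the band but more energy overall. This simple first-order shaper raises total noise considerably; it shows only the direction of the trade, not the magnitude a well-designed shaper achieves.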

History & Context: UV22HR and POW-r

Apogee's UV22HR dithering, developed in the 1990s and built into Apogee converters and later into Logic Pro's Bounce dialog, uses an ultrasonic noise curve that concentrates dither energy above 20 kHz, technically increasing broadband noise while keeping in-band noise below the threshold of audibility. POW-r (Psychoacoustically Optimized Wordlength Reduction), developed by the POW-r Consortium (Lake Technology, Weiss Engineering, Millennia Media, and Z-Systems), offers three progressively aggressive noise-shaping modes. POW-r Type 3 is widely used in mastering for 16-bit delivery. Bob Katz has written about using POW-r in preference to flat TPDF for final 16-bit masters, on the basis that the psychoacoustic optimization is audible on critical material at moderate listening levels.

History & Context: The Early CD Era

The first commercial CDs appeared in 1982. Digital mastering workflows were new territory, and understanding of quantization dithering was not yet widespread among recording engineers. Dithering had been described mathematically in signal processing literature (Lipshitz and Vanderkooy's most cited papers came in the years following), but its importance in audio mastering was not yet a settled professional consensus.

Several major remastering campaigns of the 1990s and 2000s revisited early digital recordings, and engineers working on them have commented on the difference between first-generation 16-bit masters (truncated without dithering) and properly dithered versions. Classical and acoustic jazz recordings, where genuine dynamic range, reverb tails, and instrument decays are central to the listening experience, are particularly revealing. The difference is most apparent on headphones with revealing source material: quiet passages and fade-outs in early digital releases can carry a subtle granularity that disappears in the remastered versions.

The issue extends into the digital audio workstation era. For years, some popular DAWs shipped with dithering off by default on export, or placed the dither option in a dialog that inexperienced users never opened. The result was that a significant volume of independently produced music from the 1990s and early 2000s was distributed as 16-bit audio truncated without dithering. The artifacts are often inaudible on typical program material at typical listening levels, but on a quiet room recording with a long reverb tail, they are there.

Why WaxOn/WaxOff Does Not Apply Dithering

The entire dithering question is contingent on one condition: word-length reduction. You dither when, and only when, you are truncating bits (converting from a higher to a lower bit depth). Dithering is not a general audio quality enhancement; it is a specific solution to a specific problem that arises at the moment of truncation.

WaxOn and WaxOff output 24-bit WAV. Neither mode reduces bit depth. The processing chain (loudness analysis, gain adjustment, filtering, limiting) runs in floating-point arithmetic internally. FFmpeg's audio processing pipeline operates in 32-bit or 64-bit float throughout. When the float result is written to a 24-bit integer PCM file, the rounding from float to int24 occurs roughly 144 dB below full scale, far below the noise floor of any recording. There is no perceptible quantization distortion to address, and no dithering is needed or appropriate.

If WaxOff produced 16-bit output, dithering would be mandatory, applied as the final stage, after all gain processing, immediately before the word-length reduction. (Dithering applied earlier would be modified by subsequent gain stages, defeating the purpose.) For 24-bit output, the question simply does not arise.

For MP3 output, dithering is inapplicable for a separate reason. MP3 encoding applies its own psychoacoustic quantization. The codec analyzes the signal using a masking model and allocates bits to spectral bands according to audibility thresholds. The quantization step sizes used by the MP3 encoder are orders of magnitude coarser than one LSB of 16-bit PCM. Any TPDF dither noise added before encoding would be completely absorbed into the codec's own quantization decisions. It contributes nothing and changes nothing. Adding dither before a lossy encoder is like whispering into a jackhammer.

Summary

| Design Decision | Rationale |
| --- | --- |
| HPF before all gain stages (WaxOn) | Subsonic energy inflates loudness measurements and activates the limiter on content that isn't perceptually meaningful |
| Phase rotation before normalization | Lower crest factor → loudnorm applies gain more accurately → limiter works less hard → more transparent output |
| Two-pass loudnorm with linear=true | Single-pass normalization is inaccurate; linear mode applies a clean gain offset with no dynamic processing |
| NR-for-measurement (WaxOn) | Loudnorm analysis always runs on a temporary RNNoise-processed copy to prevent broadband noise from inflating the loudness measurement. The output signal is unaffected. Speech hits the target LUFS more accurately. |
| Noise floor estimation | 10th percentile of per-block RMS identifies background noise level. Color-coded warnings alert users when noise may affect loudness accuracy. |
| LRA=20 in WaxOn | Ingest preprocessing should not touch dynamics; loudnorm acts as pure level adjustment |
| 2× oversampled alimiter (WaxOn) | Inter-sample peaks are invisible to a standard sample-rate limiter; oversampling makes them visible and catchable |
| Brick-wall limiter on WAV path (WaxOff) | Loudnorm linear-mode TP is a soft target; inter-sample peaks can slip past its analysis. A 2× oversampled limiter at the user's TP target enforces the ceiling reliably so the rendered WAV honors the configured true peak |
| MP3 limiter at TP − 1 dB (WaxOff) | Lossy decode adds 0.5–1.5 dB of inter-sample peak overshoot; placing the MP3 limiter 1 dB below the WAV ceiling means the decoded MP3 lands at or below the user's effective TP target |
| MP3 derived from WAV output | Normalization happens once; MP3 is a transcode of the normalized file, not a separate re-processing of the original |
| 24-bit WAV output | Headroom for further processing (WaxOn) and lossless archive quality (WaxOff); universally compatible |
| No dithering applied | Dithering is only required at word-length reduction (e.g., 24→16-bit); WaxOn/WaxOff output 24-bit, so no bit-depth truncation occurs. For MP3, the codec's own psychoacoustic quantization is orders of magnitude coarser than any dither signal; dithering before a lossy encoder has no effect. |

WaxOn/WaxOff is free software licensed under the GPL-3.0. Built by Seven Morris with AI collaboration.