Theory of Operation

A technical deep-dive into the audio processing logic of WaxOn/WaxOff: the signal chain, the engineering rationale, and the decisions behind each stage.

This document assumes familiarity with basic audio concepts (dBFS, sample rate, dynamic range). It's written for audio engineers, curious users, and anyone who wants to understand why the app does what it does.

Signal Chains at a Glance

WaxOn

  RNNoise (ML noise reduction; stereo: per-channel split)
  → High-Pass Filter
  → Channel Select (mono only)
  → Phase Rotation (200 Hz allpass)
  → Resample
  → Loudnorm (two-pass EBU R128)
  → 2× Oversample → Limit → Resample

Some stages are optional and can be toggled off. For stereo output with NR enabled, channels are split and denoised independently before rejoining. Output: 24-bit WAV.

WaxOff

  Phase Rotation (150 Hz allpass)
  → Loudnorm (two-pass EBU R128)
  → 2× Oversample → Limit → Encode MP3

Phase rotation and MP3 encode are optional. Output: 24-bit WAV and/or MP3.

WaxOn Mix

Additional stages run before the standard WaxOn chain. Per-file leveling only applies when Loudness Norm is enabled.

  Per-file Loudnorm (two-pass, per input)
  → amix (normalize=0; normalize=1 when per-file leveling is off)
  → WaxOn chain above

EBU R128 Loudness

EBU R128 (and the underlying ITU-R BS.1770-4 algorithm) is the measurement standard used by virtually all broadcast and streaming platforms. Spotify, Apple Podcasts, YouTube, and broadcast television worldwide all normalize to a loudness target derived from this standard. Understanding it explains most of what WaxOff does.

Why This Standard Exists: The Loudness War

The adoption of perceptual loudness metering was not driven by the music industry. It was driven by television viewer complaints about excessively loud commercials. Advertisers discovered they could aggressively compress and brick-wall limit their spots while technically staying within legacy peak-level limits, creating 4–8 dB disparities between programming and advertisements. The FCC Consumer Call Center reported "loud commercials" as a sustained top consumer complaint starting in 2002.

That pressure produced legislation. The CALM Act (Commercial Advertisement Loudness Mitigation Act) was signed into law in December 2010, requiring US broadcasters to keep commercials at the loudness level of surrounding programming. The FCC began enforcement in December 2012. The European Broadcasting Union had published EBU R128 four months earlier, in August 2010, addressing the same problem for European broadcast. Both standards were built on ITU-R BS.1770, published in 2006 as a psychoacoustically weighted metering algorithm specifically designed to correlate with perceived loudness.

Streaming platforms adopted loudness normalization as a natural extension. If a platform normalizes playback, an arms race to master louder than competitors produces only a quieter relative result, not a louder one. Mastering engineer Bob Katz declared at the AES convention in 2013 that the loudness wars were over, citing the emergence of loudness normalization across streaming. Spotify formalized its −14 LUFS target in 2021; Apple Podcasts specifies −16 LKFS; YouTube introduced normalization in 2015–2016.

The practical consequence for podcast producers: since every major distribution platform normalizes loudness at playback, delivering an aggressively loud, heavily limited master provides no benefit to listeners and costs you dynamic range. The correct goal is accurate level, a clean true peak ceiling, and preserved dynamics.

Integrated Loudness (LUFS)

Integrated loudness (denoted I) is the time-averaged loudness of a complete program, measured in LUFS (Loudness Units relative to Full Scale). LUFS is numerically identical to LKFS; both refer to the same algorithm.

Unlike peak metering or RMS, integrated loudness is:

  1. Frequency-weighted to match human hearing (the K-weighting filters);
  2. Gated to exclude silence and quiet passages;
  3. Averaged over the entire program duration.

K-Weighting

K-weighting is a two-stage filter chain applied to each channel before energy summation. It was designed to approximate the frequency-dependent sensitivity of human hearing, particularly the acoustic effect of the head on sound arriving at the ears.

  1. Pre-filter (head-related high-shelf): A second-order shelf boost with a design frequency of approximately 1682 Hz and a gain of approximately +4 dB. This models the acoustic effect of the human head, which increases high-frequency energy at the ear canals relative to a free-field measurement. The boost rises gradually above the shelf frequency and reaches its full value by roughly 5 kHz, where it remains constant through the upper spectrum. The effect is that sibilance, consonant detail, and broadband hiss in the 2–10 kHz range are weighted more heavily in the loudness measurement, matching the ear's increased sensitivity in this region.
  2. RLB weighting (high-pass): A second-order high-pass filter with a design frequency of approximately 38 Hz. This reduces the contribution of sub-bass energy to the loudness measurement. Sub-bass content below about 50 Hz contributes little to perceived loudness under normal listening conditions (particularly on the earbuds and laptop speakers that dominate podcast consumption), and leaving it in the measurement would skew the result for files with DC offset, rumble, or proximity-effect bass buildup.

The ITU specification defines the filter coefficients for a reference sample rate of 48 kHz. WaxOn/WaxOff's analyzer computes the coefficients from first principles using the bilinear transform, so the filters are accurate at any sample rate the source file uses (44.1 kHz, 48 kHz, 96 kHz, etc.). The pre-filter uses the shelf design parameters f₀ = 1681.97 Hz, gain = 3.9998 dB; the high-pass uses f₀ = 38.14 Hz, Q = 0.5003. These values are taken from the ITU reference implementation and match the pyloudnorm reference used widely in audio research.
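The filters can be re-derived at any sample rate with standard biquad design. A minimal sketch of the high-pass half, using the RBJ Audio EQ Cookbook form rather than the app's actual derivation (function and variable names are illustrative):

```python
import numpy as np
from scipy.signal import freqz

def rlb_highpass(fs, f0=38.14, q=0.5003):
    """Second-order high-pass (RBJ cookbook biquad) at sample rate fs."""
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    cw = np.cos(w0)
    b = np.array([(1 + cw) / 2, -(1 + cw), (1 + cw) / 2])
    a = np.array([1 + alpha, -2 * cw, 1 - alpha])
    return b / a[0], a / a[0]

# Because the design prewarps f0, the gain at f0 is 20*log10(Q) ≈ -6 dB
# at every sample rate -- the property that makes the filter rate-agnostic.
for fs in (44100, 48000, 96000):
    b, a = rlb_highpass(fs)
    _, h = freqz(b, a, worN=[2 * np.pi * 38.14 / fs])
    gain_db = 20 * np.log10(abs(h[0]))
```

The −6 dB figure follows from |H(jω₀)| = Q for a second-order high-pass; the shelf pre-filter can be designed the same way from its f₀ and gain.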

Gating

Loudness is computed over overlapping 400 ms blocks. Adjacent blocks overlap by 75%, producing one new block every 100 ms. The 400 ms window was chosen because it corresponds closely to human short-term loudness perception. Psychoacoustic research by Zwicker and Fastl established that temporal integration of loudness occurs over approximately 200–400 ms, with 400 ms representing the time window over which the ear integrates energy to form a stable loudness impression. Shorter windows would capture transient fluctuations that don't correspond to perceived loudness; longer windows would smooth over meaningful changes in program level.

Two gating stages prevent silence and quiet passages from pulling the integrated value down:

  1. Absolute gate at −70 LUFS: Any block whose K-weighted energy falls below −70 LUFS is discarded. This removes silence, dead air, and extremely quiet room tone from the measurement. The threshold corresponds to a mean-square value of 10^((−70 + 0.691) / 10) ≈ 1.17 × 10⁻⁷.
  2. Relative gate at −10 LU: From the remaining blocks, compute an ungated mean (the "absolute-gated loudness"). Then discard any block more than 10 LU below that mean. This removes quiet passages that are above the noise floor but significantly below the average program level, such as soft breaths between sentences, quiet background music under narration, or distant room ambience during pauses.

The final integrated loudness is the mean of the blocks that survive both gates. For podcast speech, this means the measurement reflects the loudness of the spoken content, not the silence between sentences.

The Offset Constant: −0.691

The integrated loudness formula includes a constant offset of −0.691 dB:

Integrated Loudness (LUFS) = −0.691 + 10 · log₁₀(Σ Gᵢ · zᵢ)

where Gᵢ is the channel weight (1.0 for front channels, 1.41 for surround) and zᵢ is the gated mean-square of channel i after K-weighting. The −0.691 dB offset compensates for the gain of the K-weighting filters at 1 kHz (approximately +0.69 dB), calibrating the scale so that a full-scale 997 Hz sine reads −3.01 LUFS, exactly its unweighted mean-square level in dB. Without the offset, the same sine would read about 0.7 LU high. For mono and stereo speech content, the channel weights are all 1.0, so the formula simplifies to the mean-square of all channels after K-weighting and gating.

True Peak (TP)

True peak is the maximum reconstructed level when the digital signal is converted to analog. It differs from sample peak because the analog waveform between samples can exceed any individual sample value; these are inter-sample peaks. See the True Peak & Oversampling section for detail.

EBU R128 specifies a maximum true peak of −1.0 dBTP for most distribution. WaxOff defaults to this value.

Loudness Range (LRA)

LRA measures the spread between loud and quiet sections of a program (its macro-dynamics) in Loudness Units. It is computed as the difference between the 95th and 10th percentiles of the short-term loudness distribution (after gating). EBU R128 does not mandate a specific LRA target but recommends keeping it below 18 LU for broadcast.
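The percentile computation can be sketched over a series of short-term (3 s) loudness values (simplified: EBU Tech 3342 also applies a relative gate at −20 LU below the absolute-gated mean, omitted here):

```python
import numpy as np

def loudness_range(short_term_lufs):
    """LRA from a series of short-term loudness values (simplified gating)."""
    st = np.asarray(short_term_lufs, dtype=float)
    st = st[st > -70.0]  # absolute gate
    return np.percentile(st, 95) - np.percentile(st, 10)

# A program whose short-term loudness sweeps evenly from -40 to -20 LUFS:
# loudness_range(np.linspace(-40, -20, 201)) -> 17.0
```

Using percentiles rather than min/max is what makes LRA robust: a single shout or a single silent beat does not inflate the reported range.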

Both modes pass an LRA value to the loudnorm filter. WaxOn hardcodes LRA=20 (effectively unconstrained, no dynamic processing). WaxOff defaults to LRA=11, which allows the filter to apply gentle dynamic compression to constrain the macro-dynamic spread of a finished mix. Lower values compress more aggressively; higher values relax the constraint.

Two-Pass Normalization

EBU R128 integrated loudness requires the complete file to compute. It is a time-integrated measurement. You cannot know the correct gain adjustment until after you have read every sample. This makes single-pass normalization impossible for linear (non-dynamic) mode. Both WaxOn and WaxOff solve this with a two-pass approach.

Pass 1: Analysis

FFmpeg's loudnorm filter reads the entire file and prints a JSON block to stderr containing the measured integrated loudness (input_i), true peak (input_tp), loudness range (input_lra), gating threshold (input_thresh), and a target offset (target_offset).

The output is discarded (-f null); only the measurements matter. The app parses the JSON from stderr and stores the values.

Pass 2: Linear Normalization

The same filter runs again, this time with the measured values injected back in and linear=true set:

loudnorm=I={target}:TP={tp}:LRA={lra}
  :measured_I={inputI}:measured_TP={inputTP}
  :measured_LRA={inputLRA}:measured_thresh={inputThresh}
  :offset={targetOffset}:linear=true
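The glue between the passes is ordinary text processing. A minimal sketch (the JSON keys are loudnorm's own print_format=json field names; the sample values and function names are illustrative):

```python
import json
import re

def parse_loudnorm_stats(stderr_text):
    """Extract the JSON block loudnorm prints at the end of pass 1."""
    match = re.search(r"\{[^{}]*\}\s*$", stderr_text)
    return json.loads(match.group(0))

def pass2_filter(m, target_i=-16.0, target_tp=-1.0, target_lra=11.0):
    """Build the pass-2 loudnorm filter string from pass-1 measurements."""
    return (
        f"loudnorm=I={target_i}:TP={target_tp}:LRA={target_lra}"
        f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
        f":offset={m['target_offset']}:linear=true"
    )

# Example pass-1 stderr tail (values illustrative):
stderr = """
[Parsed_loudnorm_0 @ 0x55d]
{
    "input_i" : "-23.52",
    "input_tp" : "-4.12",
    "input_lra" : "9.30",
    "input_thresh" : "-34.10",
    "target_offset" : "0.25"
}
"""
stats = parse_loudnorm_stats(stderr)
# pass2_filter(stats) -> "loudnorm=I=-16.0:...:measured_I=-23.52:...:linear=true"
```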

Why linear=true Matters

The loudnorm filter has two modes: dynamic (the default), which applies a time-varying gain and behaves like a compressor, and linear, which applies one constant gain to the entire file.

Both WaxOn and WaxOff use linear mode for pass 2. For mastering and delivery, this is the only correct approach. The goal is level adjustment, not dynamics processing.

True Peak and Oversampled Limiting

The Inter-Sample Peak Problem

Digital audio stores the waveform as discrete samples: amplitude values at regular time intervals (44,100 or 48,000 per second). A sample peak meter reads the highest sample value, which is straightforward. But the analog waveform reconstructed by a DAC continuously interpolates between those samples, and the reconstructed waveform can peak significantly higher than any individual sample value.

These are inter-sample peaks (ISPs), and they become real, audible clipping whenever the waveform is actually reconstructed or resampled: in the DAC during playback, in a sample rate converter, or in a lossy encode/decode cycle.

A file with a sample peak of −1 dBFS can easily have a true peak above 0 dBFS, causing clipping that no sample-level meter would detect.

The Mathematics of Reconstruction

The Nyquist-Shannon sampling theorem guarantees that a band-limited signal sampled at twice its maximum frequency can be perfectly reconstructed. The reconstruction uses a sinc interpolation kernel:

x(t) = Σ x[n] · sinc((t − nT) / T)

where x[n] are the sample values, T is the sample period, and sinc(x) = sin(πx) / (πx). The key insight is that the sinc kernel oscillates. When adjacent samples have high energy and the right phase relationship, the interpolated waveform between them sums constructively and overshoots both sample values. This is not an artifact or an error; it is the mathematically correct reconstruction of the continuous signal. The samples were never the waveform; they are the minimum information needed to reconstruct it.

The worst case for inter-sample peaks occurs when consecutive samples approach full scale with alternating signs at frequencies near Nyquist (half the sample rate). At 44.1 kHz, high-frequency content near 22 kHz is especially prone. In practice, ISPs on real-world audio material are typically 0.5–3 dB above sample peak, though extreme cases can reach higher.
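The worst case is easy to reproduce numerically: sample a full-scale tone at a quarter of the sample rate with a 45° phase offset, and every sample lands at ±0.707 while the band-limited reconstruction still peaks at 1.0, a +3 dB inter-sample peak (a sketch using FFT resampling as the reconstructor):

```python
import numpy as np
from scipy.signal import resample

n = np.arange(4096)
# fs/4 tone with 45-degree phase: samples never land on the waveform's peaks
x = np.sin(np.pi * n / 2 + np.pi / 4)

sample_peak = np.max(np.abs(x))     # ~0.7071  (-3.01 dBFS sample peak)
x_8x = resample(x, 8 * len(x))      # band-limited 8x reconstruction
true_peak = np.max(np.abs(x_8x))    # ~1.0     (0 dBFS true peak)
```

A sample-peak meter reports this signal at −3 dBFS; the reconstructed waveform actually touches full scale.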

The Streaming Ingest Trap

When you upload audio to Spotify, the platform transcodes your file to Ogg/Vorbis or AAC for streaming delivery. If your uploaded file has true peaks near 0 dBFS, the transcode itself can clip: the decoded streaming copy is distorted before a listener ever plays it, and no subsequent gain adjustment will fix it. Spotify's own artist documentation warns: "Really loud modern masters can easily register True Peak levels of +1 or +2 dBTP, and often as much as +3 or +4 dBTP. These are virtually guaranteed to cause encoder clipping if processed as-is." Research measuring 128 kbps MP3 encoding has documented decoded true peaks rising by +1.7 dBTP above the source, and pathological cases as high as +10 dBTP. This is why the −1.0 dBTP true peak ceiling is a hard requirement, not a polite suggestion.

How Oversampled Limiting Solves This

True peak limiting works by upsampling the signal before the limiter so that inter-sample peaks become visible as actual samples, then limiting those samples, then downsampling back.

Input (44.1 kHz) → Upsample 2× (88.2 kHz) → alimiter → Downsample (44.1 kHz) → Output

At 2× the sample rate, new samples are interpolated midway between each original pair. These interpolated values approximate the continuous waveform reconstruction and capture most inter-sample peaks. The limiter can see and attenuate them. When downsampled back, the true peaks of the resulting file are controlled.

2× oversampling catches the vast majority of inter-sample peaks in practice. The ITU-R BS.1770-4 true peak measurement algorithm itself uses 4× oversampling for maximum accuracy, but for a limiter (which only needs to prevent peaks from exceeding a threshold), 2× provides sufficient control. 4× oversampling is used in some mastering workflows to catch pathological edge cases, but the returns diminish quickly: the additional ISPs caught between 2× and 4× are typically less than 0.2 dB on real-world program material. For voice content with limited high-frequency energy near Nyquist, 2× is more than adequate.

WaxOn Limiter Settings

WaxOn's alimiter is configured for transparent peak control:

WaxOff Pre-Encode Limiter

WaxOff's loudnorm filter targets true peak via its TP parameter, but this is a soft target. The filter's internal gain calculation accounts for it, but it does not guarantee a hard ceiling. In practice, the loudnorm output can exceed the TP target by up to ~0.5–1.0 dB in edge cases.

For WAV-only output this is acceptable: the file plays back through a DAC and any overshoot is minor. For MP3 output it is a real problem. The encoding process adds its own inter-sample peaks (typically +0.1–1.5 dB), so a file already approaching the ceiling will clip after decode.

WaxOff solves this with a dedicated pre-encode limiter applied only when producing MP3:

Why −2 dBTP and not −1 dBTP? The loudnorm filter targets −1.0 dBTP but can overshoot by up to ~1 dB. The MP3 codec can then add another 0.1–1.5 dB of inter-sample peaks. A hard limiter at −2.0 dBTP provides 1 dB of margin that reliably keeps the decoded MP3 below −1.0 dBTP under virtually all real-world conditions.

Phase Rotation and Crest Factor

Crest Factor

Crest factor is the ratio of a signal's peak level to its RMS level, expressed in dB:

Crest Factor (dB) = Peak (dBFS) − RMS (dBFS)

Typical speech has a crest factor of 15–25 dB. High crest factor has a practical consequence for loudness normalization: to reach a loudness target without exceeding the ceiling, the limiter must apply more gain reduction (limiting). More limiting means more audible artifacts: transient softening, pumping, coloration.

Reducing crest factor before normalization means the same LUFS target can be reached with less limiting and more transparency.
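Crest factor is trivial to compute, which makes it a useful before/after check around the phase-rotation stage (a sketch; not the app's internal metering):

```python
import numpy as np

def crest_factor_db(x):
    """Peak-to-RMS ratio in dB."""
    peak = np.max(np.abs(x))
    rms = np.sqrt(np.mean(np.square(x)))
    return 20 * np.log10(peak / rms)

t = np.linspace(0, 1, 48000, endpoint=False)
crest_factor_db(np.sin(2 * np.pi * 440 * t))            # sine: ~3.01 dB
crest_factor_db(np.sign(np.sin(2 * np.pi * 440 * t)))   # square: ~0 dB
```

Speech, with its sparse high-energy transients, sits far above both of these reference signals, which is exactly why it demands so much limiter headroom.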

How Allpass Filtering Reduces Crest Factor

A first-order allpass filter passes all frequencies at equal amplitude but shifts the phase of different frequencies by different amounts. It doesn't alter the frequency response; it only changes when different frequency components arrive relative to each other.

The transfer function of a first-order allpass is:

H(z) = (a₁ + z⁻¹) / (1 + a₁z⁻¹)

where a₁ is computed from the design frequency and sample rate. The magnitude response |H(z)| = 1 at all frequencies (unity gain). The phase response varies continuously from 0° at DC to −180° at Nyquist, with −90° at the design frequency. This means frequencies below the design frequency are shifted slightly; frequencies above it are shifted more. The relative timing of low-frequency and high-frequency components in the waveform changes, but their amplitudes do not.
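A sketch of the coefficient computation (via the bilinear transform) and a check of the two defining properties, unity magnitude everywhere and −90° at the design frequency:

```python
import numpy as np
from scipy.signal import freqz

def allpass1(fc, fs):
    """First-order allpass: H(z) = (a1 + z^-1) / (1 + a1*z^-1)."""
    t = np.tan(np.pi * fc / fs)
    a1 = (t - 1) / (t + 1)
    return [a1, 1.0], [1.0, a1]   # (b, a) coefficient lists

b, a = allpass1(200, 48000)
# Unity gain at every frequency...
w, h = freqz(b, a, worN=1024)
flat = np.allclose(np.abs(h), 1.0)
# ...and -90 degrees of phase shift at the design frequency
_, h_fc = freqz(b, a, worN=[2 * np.pi * 200 / 48000])
phase_at_fc = np.degrees(np.angle(h_fc[0]))   # ~ -90
```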

Much of the peak asymmetry in voice audio comes from low-frequency content: proximity effect from cardioid microphones, low-frequency resonances in recording spaces, and bass-heavy content in finished mixes. This energy tends to create asymmetric waveforms where one polarity consistently peaks higher than the other.

Proximity effect is worth understanding in detail because it affects nearly every podcast recording. Directional microphones (cardioids, supercardioids, figure-8 patterns) exhibit increasing bass boost as the sound source moves closer, beginning around 12 inches and growing progressively stronger below approximately 100–200 Hz. The boost can reach +20 dB at very close distances. Omnidirectional microphones do not exhibit proximity effect, but the cardioid pattern dominates consumer and prosumer podcast microphones (Shure SM7B, Audio-Technica ATR2100, most USB microphones), making this a near-universal issue. Podcasters without broadcast training tend to position themselves very close to their microphones to minimize room noise, an instinct that unfortunately triggers the strongest proximity effect and produces the most bass-heavy, asymmetric waveforms. The result lands squarely in the 150–250 Hz range that phase rotation is designed to address.

An allpass filter in the low-frequency range redistributes the phase relationships between bass components and midrange components, making peaks more symmetric. The result is a lower crest factor (peaks are shorter relative to average level) without any change to the frequency response or audible character of the audio.

The effect is genuinely inaudible. Human hearing is largely insensitive to absolute phase at audio frequencies. The cochlea performs a frequency decomposition that discards phase information. This is why polarity inversion (flipping the sign of every sample) and allpass filtering (frequency-dependent phase shift) are both perceptually transparent, despite being mathematically significant transformations of the waveform.

WaxOn vs. WaxOff Frequencies

WaxOn: 200 Hz, Q 0.707 (Butterworth); always on (not user-configurable). Raw recordings often have proximity effect and mic combination issues in the 150–250 Hz range; 200 Hz targets the low-mid region where these artifacts cause most of the crest factor problem.

WaxOff: 150 Hz, default Q; on by default (user can disable). Finished mixes are already edited and processed; the remaining crest factor issue is typically pure bass energy, so the filter is placed lower, at 150 Hz.

Quantifying the Effect

On typical podcast recordings with moderate proximity effect, allpass phase rotation at 200 Hz reduces crest factor by 1–4 dB. A 3 dB crest factor reduction means the limiter needs to apply 3 dB less gain reduction to stay below the same ceiling at the same loudness target. That translates directly to less audible limiting artifacts. On clean, well-recorded speech with minimal bass buildup, the crest factor reduction is smaller (0.5–1 dB), but the allpass has no downside: it costs nothing in audio quality and can only help.

Mix Summing

When two or more audio signals are summed, the combined level increases. How much depends on the correlation between the signals: identical (fully correlated) signals add +6 dB per doubling, while uncorrelated signals (different voices in different rooms) add roughly +3 dB per doubling:

2 tracks → ~+3 dB    4 tracks → ~+6 dB    8 tracks → ~+9 dB
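The correlation dependence is easy to verify with noise (a sketch; the dB values are approximate for finite samples):

```python
import numpy as np

rng = np.random.default_rng(42)
rms = lambda x: np.sqrt(np.mean(np.square(x)))
db = lambda r: 20 * np.log10(r)

a = rng.standard_normal(1_000_000)
b = rng.standard_normal(1_000_000)

# Two uncorrelated sources: powers add, so level rises ~+3 dB
uncorr_gain = db(rms(a + b) / rms(a))   # ~ 3.01
# Two identical (fully correlated) sources: amplitudes add, +6.02 dB
corr_gain = db(rms(a + a) / rms(a))     # = 6.02
```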

Without Loudness Normalization

When Loudness Norm is off, WaxOn's Mix stage uses FFmpeg's amix with normalize=1, which scales each input by 1/N before summing. This keeps the output level consistent with the individual inputs regardless of how many files are combined. The downstream filter and limiter chain receives appropriately leveled material and behaves predictably.

With Loudness Normalization

When Loudness Norm is on, WaxOn takes a more rigorous approach. Each input file is individually normalized to the configured LUFS target before mixing: a full two-pass EBU R128 analysis and linear gain applied per file. This addresses a fundamental limitation of normalize=1: simple 1/N scaling keeps levels consistent in absolute terms, but it does not account for files recorded at different levels. A quiet guest recording and a loud host recording, both scaled by 1/2 before mixing, still arrive at the mix at different perceived loudnesses.

With per-file pre-normalization, all inputs arrive at the mix at equal perceived loudness before they are blended. The amix call then uses normalize=0: files are already level-matched, so 1/N scaling would only dilute a well-calibrated blend. The summing process will increase the level (~+3 dB for two uncorrelated sources), but the final loudnorm pass on the combined output corrects it back to the configured target.

Why not pre-normalize when Loudnorm is off? Pre-normalization requires two additional FFmpeg passes per file (analysis + normalization). When Loudness Norm is disabled, the user has opted out of level processing. In that case the original normalize=1 behavior is preserved: simple 1/N scaling keeps the mix level predictable without touching individual file dynamics.

Quantization, Dithering, and Why It Doesn't Apply Here

The Quantization Problem

Digital audio stores amplitude values as integers. A 16-bit system divides the amplitude range into 2¹⁶ = 65,536 discrete steps; a 24-bit system uses 2²⁴ = 16,777,216. When a continuous floating-point value is rounded to its nearest representable integer, the difference is quantization error.

At high signal levels, quantization error is a negligible fraction of the signal amplitude. The problem surfaces at low levels: fade-outs, reverb tails, quiet passages, where the signal approaches the magnitude of a single quantization step. At that scale, the error is no longer random with respect to the signal; it becomes correlated. Correlated noise has harmonic structure. Harmonic noise is perceived as distortion.

The artifact is distinctive: as a 16-bit fade-out approaches silence, the smooth waveform begins to pixelate, crumbling into a grainy, granular texture. Engineers call it "going digital." It is most audible on sustained tones, piano decays, and reverb tails, anywhere a signal fades through the lower quantization steps rather than cutting abruptly.

The Classical Demonstration

The canonical test (sometimes called the "fade-to-black") is simple: record a tone at a moderate level and fade it gradually to silence. Without dithering, the transition through the last few quantization steps produces a sequence of audible steps, then silence where the waveform simply stops being representable. The signal doesn't fade; it falls off a cliff.

Bob Katz, in Mastering Audio: The Art and the Science, describes piano decay as one of the most revealing cases. A sustained piano note fading naturally into a quiet room exposes quantization distortion immediately when compared against a properly dithered version. The undithered note develops a gritty texture as the decay reaches the noise floor, a form of distortion introduced by the word-length reduction itself, present nowhere in the original recording. He uses this comparison in workshops and has remarked that once engineers hear the difference, the idea of shipping 16-bit masters without dithering becomes unthinkable.

A less scientific but widely replicated demonstration: open any 16-bit DAW session, generate a −40 dBFS sine wave, export at 16-bit with dithering off, then again with TPDF dither. Zoom into the waveform near the fade-out in a spectral editor. The quantized version shows visible stairstepping. The dithered version shows a smooth descent into low-level noise. This is not a subtle difference at the bit level, even when it is subtle or inaudible at normal listening levels with typical program material.

The Mathematical Fix: TPDF Dither

The solution is counterintuitive: add noise to the signal before truncating it to a lower bit depth. This noise (dither) must be added at a specific amplitude and with a specific probability distribution to be effective.

The mathematical foundation was established by Stanley Lipshitz, Robert Wannamaker, and John Vanderkooy at the University of Waterloo in a series of papers beginning in the late 1980s. Their central result: adding Triangular Probability Density Function (TPDF) noise (noise whose amplitude is distributed as a triangle between −1 and +1 LSB of the target word length) completely decorrelates the quantization error from the input signal. The quantization error becomes spectrally white and statistically independent of the audio. Harmonic distortion is eliminated; what remains is signal-independent white noise.

TPDF noise is generated by summing two independent rectangular (uniform) noise samples. The resulting amplitude distribution is triangular, hence the name. Its variance is exactly 1/6 LSB², which is the minimum required to whiten quantization error under the conditions relevant to audio. Subtractive dither (where the same dither signal is subtracted after truncation) can achieve perfect cancellation of quantization error in theory; non-subtractive TPDF (the practical, deployable form) achieves statistical decorrelation, which is sufficient for all real-world applications.

The practical consequence: a properly TPDF-dithered 16-bit file has a smooth, analog-like noise floor rather than correlated quantization distortion. The fade-to-black test produces a clean descent into white noise. The artifact is gone.

The dithering theorem in one sentence: Adding TPDF noise of variance 1/6 LSB² before truncation makes the quantization error a white, signal-independent noise process: the best possible outcome for a word-length reduction.
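The fade-to-black behavior can be reproduced numerically. A tone below half an LSB truncates to digital silence, while TPDF dither (two summed uniform noises, variance 1/6 LSB²) keeps the signal encoded in the noise (a sketch; amplitudes are in units of one LSB):

```python
import numpy as np

rng = np.random.default_rng(1)

def tpdf(n):
    """Triangular-PDF dither spanning +/-1 LSB (variance 1/6 LSB^2)."""
    return rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)

n = 200_000
x = 0.4 * np.sin(2 * np.pi * np.arange(n) / 101)   # tone below 0.5 LSB

plain = np.round(x)                # undithered: every sample rounds to 0
dithered = np.round(x + tpdf(n))   # dithered: tone survives inside the noise

silent = np.all(plain == 0)                 # True -- the tone vanished
corr = np.corrcoef(x, dithered)[0, 1]       # clearly positive correlation
```

The undithered output is literal silence (the cliff); the dithered output correlates strongly with the original tone, which the ear recovers as a quiet signal under a smooth noise floor.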

Noise Shaping

TPDF dithering trades correlated distortion for uncorrelated white noise. For archival files and intermediate stems, that trade is unconditionally correct. But the resulting noise floor sits at approximately −96 dBFS spread uniformly across the audio spectrum. Human hearing is not equally sensitive at all frequencies. Sensitivity peaks around 2–4 kHz and falls off substantially above ~15 kHz. Noise shaping exploits this asymmetry.

A noise-shaping filter feeds quantization error back into the system through a filter designed around a psychoacoustic model of human hearing. The filter pushes noise energy out of the 2–4 kHz sensitivity peak and into the 15–20 kHz region where hearing is least sensitive. Total noise energy is conserved (or slightly increased), but perceptually weighted noise (the noise you can actually hear) is reduced. A well-designed noise-shaped dither algorithm can achieve the perceived noise floor of a 20-bit system from a 16-bit word.

Apogee's UV22HR dithering, developed in the 1990s and built into Apogee converters and later into Logic Pro's Bounce dialog, uses an ultrasonic noise curve that concentrates dither energy above 20 kHz, technically increasing broadband noise while keeping in-band noise below the threshold of audibility. POW-r (Psychoacoustically Optimized Wordlength Reduction), developed by a consortium that included Waves and Prism Sound, offers three progressively aggressive noise-shaping modes. POW-r Type 3 is widely used in mastering for 16-bit delivery. Bob Katz has written about using POW-r in preference to flat TPDF for final 16-bit masters, on the basis that the psychoacoustic optimization is audible on critical material at moderate listening levels.

The Early CD Era

The first commercial CDs appeared in 1982. Digital mastering workflows were new territory, and understanding of quantization dithering was not yet widespread among recording engineers. Dithering had been described mathematically in signal processing literature (Lipshitz and Vanderkooy's most cited papers came in the years following), but its importance in audio mastering was not yet a settled professional consensus.

Several major remastering campaigns of the 1990s and 2000s revisited early digital recordings, and engineers working on them have commented on the difference between first-generation 16-bit masters (truncated without dithering) and properly dithered versions. Classical and acoustic jazz recordings, where genuine dynamic range, reverb tails, and instrument decays are central to the listening experience, are particularly revealing. The difference is most apparent on headphones with revealing source material: quiet passages and fade-outs in early digital releases can carry a subtle granularity that disappears in the remastered versions.

The issue extends into the digital audio workstation era. For years, some popular DAWs shipped with dithering off by default on export, or placed the dither option in a dialog that inexperienced users never opened. The result was that a significant volume of independently produced music from the 1990s and early 2000s was distributed as 16-bit audio truncated without dithering. The artifacts are often inaudible on typical program material at typical listening levels, but on a quiet room recording with a long reverb tail, they are there.

Why WaxOn/WaxOff Does Not Apply Dithering

The entire dithering question is contingent on one condition: word-length reduction. You dither when, and only when, you are truncating bits (converting from a higher to a lower bit depth). Dithering is not a general audio quality enhancement; it is a specific solution to a specific problem that arises at the moment of truncation.

WaxOn and WaxOff output 24-bit WAV. Neither mode reduces bit depth. The processing chain (loudness analysis, gain adjustment, filtering, limiting) runs in floating-point arithmetic internally. FFmpeg's audio processing pipeline operates in 32-bit or 64-bit float throughout. When the float result is written to a 24-bit integer PCM file, any truncation from float to int24 occurs at a level approximately 120+ dB below the signal, far below the audible noise floor of any recording. There is no perceptible quantization distortion to address, and no dithering is needed or appropriate.

If WaxOff produced 16-bit output, dithering would be mandatory, applied as the final stage, after all gain processing, immediately before the word-length reduction. (Dithering applied earlier would be modified by subsequent gain stages, defeating the purpose.) For 24-bit output, the question simply does not arise.

For MP3 output, dithering is inapplicable for a separate reason. MP3 encoding applies its own psychoacoustic quantization. The codec analyzes the signal using a masking model and allocates bits to spectral bands according to audibility thresholds. The quantization step sizes used by the MP3 encoder are orders of magnitude coarser than one LSB of 16-bit PCM. Any TPDF dither noise added before encoding would be completely absorbed into the codec's own quantization decisions. It contributes nothing and changes nothing. Adding dither before a lossy encoder is like whispering into a jackhammer.

RNNoise: ML Noise Reduction

Background and Origins

RNNoise was developed by Jean-Marc Valin at Mozilla in 2017–2018 and released as open source under the BSD license. Valin is also a principal author of the Opus audio codec, the codec used by WebRTC, Discord, Zoom, and virtually every real-time web audio application. His work on Opus included extensive research into perceptual audio coding and voice intelligibility under compression, which directly informed the approach taken in RNNoise.

The project grew from a practical problem in WebRTC: browser-based voice communication was plagued by background noise (keyboard clicks, HVAC, crowd noise, fan hum) that conventional noise suppression handled poorly, either leaving too much noise or introducing the characteristic warbling, underwater artifacts of aggressive spectral subtraction. Valin's hypothesis was that a machine learning approach trained specifically on speech could do better.

The original paper, A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement, was presented at the IEEE International Workshop on Multimedia Signal Processing (MMSP) in 2018. It has since been widely cited in the speech enhancement literature and influenced subsequent neural audio processing work at companies including Google, Microsoft, and Amazon.

Architecture: Gated Recurrent Units

RNNoise is a recurrent neural network using Gated Recurrent Units (GRUs), a variant of LSTM that uses fewer parameters and trains faster while retaining the ability to model temporal dependencies across variable-length sequences. The key difference from LSTM is that GRU combines the forget and input gates into a single "update gate" and merges the cell state with the hidden state, reducing the parameter count by roughly 25% per layer. The architecture is deliberately small: the network has roughly 100,000 parameters total, making real-time inference feasible on hardware as constrained as embedded processors with no dedicated GPU.

The network processes audio in the frequency domain using the Opus codec's Bark-scale filterbank: 22 critical bands that approximate the frequency resolution of human hearing. This is a key design choice. Rather than learning to operate on raw waveforms (which requires modeling extremely long-range sample dependencies) or on fixed FFT bins (which don't match perceptual resolution), RNNoise works on the same perceptual frequency representation that the ear itself uses. The Bark scale groups frequencies into bands of roughly equal perceptual width: narrow bands at low frequencies (where pitch discrimination is fine) and progressively wider bands at high frequencies (where the ear integrates more broadly).

For each 10 ms frame of audio, the network computes a set of spectral gains (one per band) between 0 and 1. A gain of 1.0 means that band is passed through unmodified. A gain of 0 means it is fully suppressed. Intermediate values attenuate partially. The gains are applied multiplicatively to the band energies, and the modified spectrum is reconstructed back to a waveform. The network never synthesizes audio; it only decides how much of each perceptual band to suppress in each frame.
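The per-band multiplicative gain step can be sketched as follows (illustrative only: the band edges and gain values here are stand-ins, not the real 22-band Opus filterbank):

```python
import numpy as np

def apply_band_gains(spectrum, band_edges, gains):
    """Scale each frequency bin by the gain of the band containing it.
    A gain of 1.0 passes the band unmodified; 0.0 suppresses it fully."""
    out = np.array(spectrum, dtype=float)
    for i, g in enumerate(gains):
        lo, hi = band_edges[i], band_edges[i + 1]
        out[lo:hi] *= g
    return out
```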

Training and the Model File

The bundled model (bd.rnnn from the rnnoise-models repository) was trained on a large corpus of speech (multiple speakers, multiple languages, multiple recording conditions) mixed with a wide variety of noise types: HVAC, traffic, crowd noise, fan hum, electrical interference, and broadband pink and white noise. The model learns to identify which spectral components correspond to voice and which correspond to noise, using temporal context (the GRU's hidden state) to distinguish steady-state noise from transient speech components.

Training required both clean speech recordings and noise-only recordings, which were artificially mixed at various signal-to-noise ratios. The network learned the difference between speech-shaped energy and noise-shaped energy across thousands of examples. Because the training data was multilingual and broad-spectrum, the resulting model generalizes well across different speakers, accents, and recording conditions without any per-speaker adaptation.

What It Suppresses Well, and Poorly

RNNoise excels at steady-state, spectrally diffuse noise: HVAC hum, room tone, computer fan noise, broadband electrical hiss, and low-level crowd ambience. These share a characteristic spectral profile that is relatively stable over time and distributes energy broadly, making them easy for the network to distinguish from voice. On clean recordings with consistent low-level background noise, suppression is typically very effective and inaudible.

It handles poorly, and can introduce artifacts with, noise that departs from the steady-state profile it was trained on: music and singing (which the model tends to treat as noise), competing background speech, and recordings where the noise level approaches the speech level.

The artifact profile when limits are exceeded is typically a subtle warbling or underwater quality, the same category of artifact produced by spectral subtraction noise gates, though usually less severe. On moderate-noise, clean-voice recordings, the algorithm is essentially transparent.

Why It's Off by Default

Noise reduction that works well on one recording can degrade another. The right call depends on the character and level of the noise, how much of it the high-pass filter already removes, and how much artifact risk is acceptable.

WaxOn's high-pass filter already removes a significant portion of low-frequency noise energy. For many podcast recordings, this is sufficient, and adding a noise reduction pass is unnecessary processing. Enabling RNNoise on an already-clean recording will not degrade it noticeably, but it also won't help, and it adds processing time.

The setting is off by default because "no unnecessary processing" is the conservative, correct baseline. Enable it when background noise is audible and distracting, and leave it off when the recording is already clean enough.

Placement in the WaxOn Pipeline

RNNoise runs as the first stage in the WaxOn chain, before the high-pass filter. This is intentional. Noise suppression should operate on the raw signal before any other DSP stages modify it, for two reasons:

  1. The network was trained on natural speech recordings, not on high-pass-filtered audio. Presenting it with the unaltered signal gives it the spectral context it expects and produces the most accurate gain estimates.
  2. Noise in the low-frequency range contributes to the loudnorm measurement and limiter behavior downstream. Removing it first means the subsequent filter, normalization, and limiting stages operate on cleaner material: better level estimates, less limiter engagement on non-useful content.

Stereo Handling: Per-Channel Split

RNNoise was designed and trained exclusively on mono 48 kHz speech. When FFmpeg's arnndn filter receives a stereo input, it creates separate denoiser instances per channel and processes them independently. In practice, this can produce unpredictable results: the per-channel recurrent states diverge, and one channel (typically the second) may be over-gated or heavily attenuated, even when both channels carry similar content and noise levels.

The root cause is that the model's internal gain computation is frame-by-frame and depends on its recurrent hidden state. With stereo input, slight differences between channels (different mic angles, room reflections, or even minor level offsets from recording) can cause the model to classify one channel as "more noisy" than the other and gate it more aggressively. The model has no concept of channel correlation or stereo coherence.

WaxOn solves this by splitting stereo audio into independent mono channels before applying RNNoise, then rejoining the denoised channels back into stereo. This uses FFmpeg's filter_complex graph:

[0:a]channelsplit=channel_layout=stereo[L][R];
[L]arnndn=m=/path/to/model[Lnr];
[R]arnndn=m=/path/to/model[Rnr];
[Lnr][Rnr]join=inputs=2:channel_layout=stereo

Each channel receives its own fully independent denoiser instance with its own recurrent state, initialized cleanly. The model processes each as a standard mono stream — the format it was trained on — and the results are predictable and balanced. The remaining filter chain (high-pass, phase rotation, resample) runs on the rejoined stereo signal.
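A sketch of how the graph string might be assembled programmatically (illustrative Python; WaxOn's actual implementation is Swift, and these names are hypothetical):

```python
def stereo_nr_graph(model_path: str) -> str:
    """Build the split -> denoise -> rejoin filter_complex graph.
    Note: a real path containing ':' would need FFmpeg filter escaping."""
    return (
        "[0:a]channelsplit=channel_layout=stereo[L][R];"
        f"[L]arnndn=m={model_path}[Lnr];"
        f"[R]arnndn=m={model_path}[Rnr];"
        "[Lnr][Rnr]join=inputs=2:channel_layout=stereo"
    )
```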

For mono output, the issue does not arise. When the user selects mono, a single channel is extracted via pan before (or at the same stage as) noise reduction, so arnndn always receives a mono signal. The simple -af chain is used in this case.

The same per-channel split is applied wherever arnndn touches stereo audio, including the NR-for-measurement analysis paths.

FFmpeg Implementation

The arnndn filter in FFmpeg wraps the RNNoise library. It requires an external model file provided via the m= parameter:

arnndn=m=/path/to/model

The model file is bundled in the app's resources directory. WaxOn locates it at runtime using Bundle.main.url(forResource:withExtension:) and passes the resolved path to FFmpeg. For mono output, the filter is prepended to the -af chain. For stereo output, WaxOn switches to a -filter_complex graph that splits, denoises, and rejoins the channels as described above. When Noise Reduction is off, the chain begins with the high-pass filter as usual.

Processing latency for arnndn is negligible for batch processing purposes. The network processes audio in 10 ms frames. For a 60-minute recording, the total added processing time is a few seconds on Apple Silicon.

Noise Floor Estimation

WaxOn/WaxOff estimates the noise floor of each loaded file and displays it as the FLOOR stat in the file stats panel. The estimate is computed during the same analysis pass that produces RMS, peak, crest factor, and LUFS, at no additional cost.

The Problem

Broadband background noise (HVAC, room tone, preamp hiss) occupies spectral space continuously, including during pauses between speech. This noise contributes to the integrated loudness measurement in two ways:

  1. K-weighting amplifies it. The pre-filter's ~4 dB high shelf boost above 1.7 kHz increases the measured energy of broadband hiss, which has significant energy in the 2–10 kHz range. The loudness measurement sees the noise as louder than it subjectively is.
  2. Noise fills gated blocks. The relative gate excludes blocks more than 10 LU below the ungated mean. In a clean recording, pauses between sentences fall below this threshold and are excluded. In a noisy recording, noise energy keeps those blocks above the gate threshold, and they contribute to the integrated loudness value.

The net effect: noisy files measure louder than their speech content actually is. When loudness normalization targets a specific LUFS value, the gain applied is less than the speech needs. The speech ends up under target.

Estimation Method

The analyzer divides the audio into non-overlapping 400 ms blocks (the same block size used for LUFS gating) and computes the mono RMS of each block. The noise floor estimate is the 10th percentile of these block RMS values, converted to dBFS.

The 10th percentile was chosen because it represents the quietest 10% of the file's blocks. For speech recordings, the quietest blocks are the pauses, breaths, and gaps where the microphone is capturing only the ambient environment. The 10th percentile is more robust than the absolute minimum (which might catch a single anomalously quiet block) while still reflecting the true background level rather than the speech level.

At least 5 blocks are required for a meaningful estimate (about 2 seconds of audio). Shorter files show no FLOOR stat.
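The estimator described above can be sketched like this (a NumPy sketch of the method, not the app's actual analyzer code):

```python
import numpy as np

def estimate_noise_floor(samples, sample_rate, block_ms=400, pct=10):
    """Return the 10th-percentile per-block RMS in dBFS, or None if the
    file is too short (fewer than 5 blocks, about 2 seconds)."""
    block = int(sample_rate * block_ms / 1000)
    n = len(samples) // block
    if n < 5:
        return None
    blocks = np.asarray(samples[:n * block], dtype=float).reshape(n, block)
    rms = np.sqrt((blocks ** 2).mean(axis=1))
    floor = np.percentile(rms, pct)
    return 20.0 * np.log10(max(float(floor), 1e-12))  # dBFS, full scale = 1.0
```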

Thresholds and Color Coding

The FLOOR stat is color-coded in the stats panel to indicate how likely the measured noise floor is to affect loudness accuracy.

Files with an orange or red noise floor also show a ⚠️ warning badge in the file list.

NR-for-Measurement

When Loudness Norm is enabled but Noise Reduction is off, WaxOn runs RNNoise on a temporary copy of the audio for the loudnorm analysis pass (pass 1) only. The normalization pass (pass 2) and all subsequent stages operate on the original, unmodified audio. This ensures that loudness measurements reflect the speech content rather than the noise floor, without altering the output.

Why This Works

The two-pass loudnorm process measures the file's integrated loudness in pass 1, then applies a single linear gain offset in pass 2. The gain offset is determined entirely by the pass 1 measurement. If pass 1 measures a noise-inflated loudness (file appears louder than the speech actually is), the computed gain will be too small, and speech will land under target.

By measuring the NR'd copy instead, the analysis reflects the loudness of the speech content with the noise floor suppressed. The computed gain offset is then applied to the original file. Because RNNoise primarily removes energy between and underneath words (not the speech itself), the speech content in the original and NR'd versions has approximately the same loudness. The gain derived from the clean measurement lands the speech close to the target.
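The arithmetic can be illustrated with toy numbers (the loudness values are made up for the example):

```python
target = -30.0           # LUFS target
speech = -24.0           # true loudness of the speech content
measured_noisy = -21.0   # pass-1 measurement inflated by broadband noise

gain = target - measured_noisy      # -9 dB of gain applied in pass 2
speech_after = speech + gain        # speech lands at -33 LUFS, 3 LU under target

gain_clean = target - speech        # -6 dB, derived from the NR'd measurement
speech_clean = speech + gain_clean  # speech lands at -30 LUFS, on target
```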

The noise floor in the original file does come along for the ride. It is amplified by the same gain as the speech. But the philosophy here is pragmatic: WaxOn is a prep tool for DAW editing. If the noise is bad enough to matter, it will be treated in the DAW (or in a dedicated NR tool like RX). Getting the speech to the right level for editing is the higher priority.

When It Activates

NR-for-measurement activates in three places: the standard WaxOn loudnorm analysis, and the per-file and final-mix analysis passes in Mix mode.

When Noise Reduction is already enabled, the audio reaching the loudnorm stage has already been noise-reduced. In that case, the NR-for-measurement step is unnecessary and does not run.

For stereo output, the NR-for-measurement paths use the same per-channel split as the main NR stage: stereo is split into independent mono channels, each denoised separately, then rejoined before the loudnorm analysis. This ensures consistent, balanced noise removal for accurate measurement regardless of channel layout.

Cost

NR-for-measurement adds one additional FFmpeg pass per loudnorm analysis (running RNNoise on the intermediate audio to a temporary file). For a typical podcast recording on Apple Silicon, this adds a few seconds. The temporary NR'd files are created in the working directory and deleted automatically after processing.

WaxOn Design Rationale WaxOn

Stage Order

The WaxOn pipeline stage order is deliberate:

  1. Noise reduction first (when enabled): The network was trained on unprocessed speech. Running it before any filtering gives it the spectral context it expects. Removing noise early also benefits every downstream stage: cleaner input to the high-pass filter, more accurate loudnorm measurements, less limiter engagement on non-useful content. For stereo output, channels are split and denoised independently to avoid the per-channel divergence issues inherent in RNNoise's mono-trained model (see RNNoise: Stereo Handling).
  2. High-pass filter second: Subsonic content below 80 Hz is removed before any gain stage processes it. Low-frequency energy carries disproportionate signal power relative to its perceived loudness; left in, it would cause loudnorm to underestimate the actual loudness of the content you care about and force the limiter to work harder than necessary on energy that isn't musically useful.
  3. Channel selection before phase rotation: If extracting mono from a stereo source, do it first so the allpass filter operates on the actual mono signal, not a wider stereo version of it. The loudnorm analysis then also measures the real output signal.
  4. Phase rotation before normalization: Reduces crest factor so that the loudnorm analysis measures a waveform that more accurately represents what the limiter will see after normalization.
  5. Limiter last: After any loudness normalization, with oversampling to catch true peaks.

Mix Stage Order

The Mix pipeline extends this logic with a pre-mix leveling step. When Loudness Norm is on, each input file is normalized to the target LUFS before the amix stage, so all sources arrive at the mix at equal perceived loudness. Using normalize=0 after pre-normalization lets the summed level rise naturally rather than being scaled down by 1/N; the final loudnorm pass on the combined output then corrects the level to target, the same way it would for a single-file job. The result is a mix that is balanced by loudness measurement, not by accident of recording level.
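Concretely, a three-input mix with pre-leveled sources would use an amix stage along these lines (the input labels are illustrative):

```
[a0][a1][a2]amix=inputs=3:normalize=0
```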

LRA=20 in WaxOn Loudnorm

WaxOn's loudnorm hardcodes LRA=20. The LRA parameter tells the loudnorm filter how aggressively to constrain the dynamic range; lower values apply more dynamic compression. At LRA=20, the filter applies essentially no dynamic processing. It acts as a pure linear gain offset.

This is intentional for ingest. WaxOn is a pre-editing tool. You want your recordings to arrive at your DAW at consistent levels, but with their original dynamic character intact. Any dynamic processing at this stage would fight against the compression and automation you'll apply during editing. LRA=20 ensures loudnorm does exactly one thing in WaxOn: level adjustment.
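Under these settings, the pass-2 loudnorm filter spec takes roughly this shape (the measured_* values come from pass 1; the numbers here are placeholders, not real measurements):

```
loudnorm=I=-30:LRA=20:TP=-1.0:linear=true:measured_I=-24.3:measured_TP=-3.2:measured_LRA=6.1:measured_thresh=-34.6
```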

Default Loudnorm Target: −30 LUFS

−30 LUFS is conservative by design. At this level, even a recording with significant dynamic range and a crest factor of 20 dB will have peaks well below −10 dBFS, giving the limiter ample headroom. The goal is to bring different recordings to a consistent level for editing, not to hit a delivery target. −30 LUFS leaves plenty of room for the final mix to breathe.

Loudnorm TP = limitDb

Both the loudnorm TP parameter and the alimiter limit are set to the same value (the configured ceiling, default −1.0 dB). These are not redundant: loudnorm's TP is a soft target that constrains the gain it computes, while the limiter is a hard, sample-level guarantee on the rendered output.

Setting both to the same value means loudnorm and the limiter are working toward the same goal. If loudnorm succeeds, the limiter barely engages. If loudnorm slightly overshoots, the limiter catches it. The two stages are complementary, not redundant.

WaxOff Design Rationale WaxOff

No High-Pass Filter

WaxOff doesn't include a high-pass filter. By the time a mix reaches WaxOff, it has presumably been edited and processed in a DAW. High-pass filtering, EQ, and cleanup are part of the editing workflow. WaxOff assumes the mix is already correct and applies only the normalization needed for delivery.

Hardcoded LRA=11

WaxOff hardcodes LRA=11 rather than exposing it as a setting. For delivery, some macro-dynamic constraint is appropriate. A podcast episode should have a consistent loudness profile throughout. 11 LU is a reasonable fixed value: it constrains macro-dynamics enough to ensure consistent perceived loudness across the episode without audibly squashing the mix's dynamics.

The loudnorm filter with LRA=11 applies gentle, program-level gain changes (not sample-by-sample compression). The effect is less aggressive than any compressor you would have used during editing.

Delivery Targets

Platform                Target LUFS              Max True Peak
Apple Podcasts          −16 LUFS (normalized)    −1.0 dBTP
Spotify                 −14 LUFS (normalized)    −1.0 dBTP
Buzzsprout              −19 LUFS recommended     −1.0 dBTP
YouTube                 −14 LUFS (normalized)    −1.0 dBTP
EBU R128 (broadcast)    −23 LUFS                 −1.0 dBTP

Most streaming platforms normalize incoming audio to their own target on playback, so delivering at −18 LUFS versus −16 LUFS won't make your episode sound quieter or louder to listeners (the platform adjusts). What matters most is staying below the true peak ceiling to avoid clipping during that normalization step.

One platform asymmetry is worth knowing: YouTube only normalizes downward. It will not boost content that is quieter than −14 LUFS. Spotify normalizes in both directions, and Apple Podcasts normalizes both ways as well. This means a mix delivered at −23 LUFS will sound quieter than expected on YouTube even though it is compliant, while on Spotify it will be boosted to −14 LUFS. For podcast delivery this is rarely a real-world issue, since vocal content at −18 LUFS will be boosted by both Apple Podcasts and Spotify, but it matters if you are optimizing for a single platform.

The Audio Engineering Society recommends −16 to −20 LUFS as the appropriate range for talk-based podcast content, with −18 LUFS as the practical center. The reasoning is threefold: mobile playback amplification is limited (content at −23 LUFS is difficult to hear in noisy environments like commuting), podcast consumption typically happens in ambient noise where higher average loudness aids intelligibility, and −18 LUFS sits safely between all the major platform targets. It will be boosted modestly by Apple and Spotify rather than aggressively attenuated by either. Delivering at −14 LUFS, for example, would be attenuated by Apple Podcasts and is right at Spotify's ceiling, leaving no safety margin. The conservative −18 LUFS leaves room for platforms to boost cleanly without any risk of triggering codec clipping.

WaxOff's default of −18 LUFS with −1.0 dBTP is a safe, widely accepted podcast delivery target.

Output Format Rationale

24-bit WAV

Both WaxOn and WaxOff output 24-bit WAV as the primary format.

MP3 CBR

WaxOff's MP3 output uses CBR (constant bit rate) rather than VBR (variable bit rate). For podcast delivery, CBR gives predictable file sizes and bandwidth costs, accurate seeking and duration display across the widest range of players, and no meaningful quality penalty for speech content at typical podcast bitrates.

Summary

Design decisions and their rationale:

RNNoise before HPF (WaxOn, optional): Noise reduction runs on the unaltered signal, the spectral context the network was trained on. Removing noise first also produces cleaner input to every downstream stage. For stereo, channels are split and denoised independently to avoid RNNoise's mono-model divergence on multi-channel audio.
HPF before all gain stages (WaxOn): Subsonic energy inflates loudness measurements and activates the limiter on content that isn't perceptually meaningful.
Phase rotation before normalization: Lower crest factor → loudnorm applies gain more accurately → limiter works less hard → more transparent output.
Two-pass loudnorm with linear=true: Single-pass normalization is inaccurate; linear mode applies a clean gain offset with no dynamic processing.
NR-for-measurement (WaxOn): When NR is off, loudnorm analysis runs on a temporary NR'd copy to prevent broadband noise from inflating the loudness measurement. Speech hits the target LUFS more accurately.
Noise floor estimation: 10th percentile of per-block RMS identifies background noise level. Color-coded warnings alert users when noise may affect loudness accuracy.
LRA=20 in WaxOn: Ingest preprocessing should not touch dynamics; loudnorm acts as pure level adjustment.
2× oversampled alimiter (WaxOn): Inter-sample peaks are invisible to a standard sample-rate limiter; oversampling makes them visible and catchable.
Pre-encode −2 dBTP limiter (WaxOff MP3): Loudnorm TP is a soft target; MP3 encoding adds further inter-sample peaks; −2 dBTP provides 1 dB headroom for decoded files to land at or below −1.0 dBTP.
Per-file pre-normalization (Mix + Loudnorm): All inputs arrive at the mix at equal perceived loudness; normalize=0 lets the mix sum naturally, with the final loudnorm correcting the output level to target.
amix normalize=1 (Mix, Loudnorm off): When Loudnorm is off, 1/N scaling prevents level from increasing with file count; downstream chain receives consistent levels regardless of how many files are mixed.
MP3 derived from WAV output: Normalization happens once; MP3 is a transcode of the normalized file, not a separate re-processing of the original.
24-bit WAV output: Headroom for further processing (WaxOn) and lossless archive quality (WaxOff); universally compatible.
No dithering applied: Dithering is only required at word-length reduction (e.g., 24→16-bit); WaxOn/WaxOff output 24-bit, so no bit-depth truncation occurs. For MP3, the codec's own psychoacoustic quantization is orders of magnitude coarser than any dither signal; dithering before a lossy encoder has no effect.

WaxOn/WaxOff is free software licensed under the GPL-3.0. Built by Seven Morris with AI collaboration.