Theory of Operation

What each processing stage actually does, and why it's designed the way it is.

Why ClipHack Exists

I'm a freelance audio engineer, not a software developer. I record remote podcasts via Zoom, and part of that job involves playing broadcast clips — news stories, political ads, promo inserts — live into the call while it's being recorded. The clips come from broadcast sources: TV stations, networks, political campaigns. They sound fine on air, but they're not prepared for injection into a Zoom session through a virtual audio chain.

The problem is that broadcast clips are wildly inconsistent. A local news package might peak at −1 dBFS; the next clip might average 12 dB quieter. Some are heavily compressed, some are barely processed at all. Dropped raw into a Zoom call, they're either too loud, too quiet, or dynamically all over the place relative to the hosts' voices. On the receiving end — especially when the final output is a podcast — those level jumps are jarring and unprofessional.

My workflow before ClipHack was to prep clips manually in a DAW before the show: normalize, limit, export. It worked, but it was tedious for a batch of eight clips the morning of a recording. I wanted a dedicated tool that understood the specific problem: not mastering, not dialog prep, but broadcast clip conforming — getting a pile of inconsistently leveled clips to a consistent playback level as quickly as possible.

The routing is: clips play from Farrago (a soundboard app), through Loopback (a virtual audio cable), into Zoom as a second audio input. ClipHack sits at the beginning of that chain — it's the step that prepares the clips before they ever enter Farrago. By the time a clip is in the soundboard, it should already be leveled, limited, and ready to play at the right level without any further adjustment.

Why these specific processing stages?

Leveling (dynaudnorm) tames clips that are internally inconsistent — a reporter who starts quiet and builds, or a clip that cuts between two sources at very different levels. It's the stage that does the most to make broadcast clips behave predictably.

Loudness normalization (EBU R128) sets the overall integrated loudness to a consistent target. After leveling, normalization ensures every clip lands at approximately the same perceived loudness regardless of how the source was produced.

Brick-wall limiting catches any peaks that survived the previous stages and prevents them from clipping the codec or causing sudden loudness spikes on the receiving end.

Noise reduction (RNNoise) is optional and situation-specific. Some clips — particularly local TV packages — have audible room noise or HVAC hum that becomes more noticeable when the clip is leveled up. RNNoise handles this reasonably well on voice content. It's off by default because it can introduce artifacts on heavily processed broadcast material.

De-esser is useful for clips that will pass through Zoom's codec, which can exaggerate sibilance. A gentle pass at 7.5 kHz takes the edge off without affecting the overall character of the clip.

Zoom & Audio Quality

Since ClipHack is specifically designed for a Zoom injection workflow, it's worth understanding what Zoom actually does to audio — and what that means for how you should prepare your clips.

The signal chain

Audio routed into Zoom via Loopback passes through several conversion stages before it reaches the other end of the call:

  1. Core Audio (macOS) — all audio on macOS is converted to 32-bit float by Core Audio before any application sees it. This happens transparently. Whether your source file is 16-bit, 24-bit, or 32-bit float, Core Audio hands Zoom a 32-bit float stream.
  2. ZoomAudioDevice — Zoom's virtual audio driver operates at a fixed 48 kHz. Audio at any other sample rate is resampled at this stage.
  3. Zoom's audio engine — applies echo cancellation, noise suppression, and AGC (automatic gain control) unless you've disabled them via High Fidelity Music Mode.
  4. Opus encoder — this is the real quality ceiling. Zoom uses the Opus codec for audio transmission, confirmed by Zoom's own developer documentation. Standard Zoom audio runs at 32–64 kbps. With High Fidelity Music Mode enabled, Zoom can reach up to 96 kbps mono or 192 kbps stereo, with echo cancellation and noise suppression disabled.
  5. Network transmission, decode, and playback — the listener receives an Opus-decoded signal, not your original file.

Why bit depth doesn't matter

A question that came up during ClipHack's development was whether to offer a 16-bit output option alongside 24-bit, on the theory that 16-bit might be "more compatible" with Zoom. The answer, after research, is that it makes no difference.

Here's why: by the time your audio reaches the Opus encoder, it has already been converted to 32-bit float by Core Audio — regardless of the source bit depth. A 16-bit WAV and a 24-bit WAV are both converted to the same 32-bit float representation at the OS level. The 24-bit signal maps losslessly into 32-bit float (which has a 24-bit effective mantissa: 23 explicitly stored fraction bits plus 1 implicit leading bit); the 16-bit signal maps losslessly with the lower 8 bits zeroed. Zoom receives the same quality signal in both cases.
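
A quick standalone check (Python, not ClipHack code) makes the lossless mapping concrete: IEEE-754 single precision carries 24 significant bits, so every 16-bit and 24-bit PCM value survives the round trip exactly, and the first loss appears at 25 bits.

```python
import struct

def to_f32(x: int) -> float:
    # Force a value through IEEE-754 single precision and back.
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

assert to_f32(2**23 - 1) == 2**23 - 1   # largest 24-bit PCM value: exact
assert to_f32(-(2**15)) == -(2**15)     # 16-bit PCM extreme: exact
assert to_f32(2**24) == 2**24           # still exact at 24 significant bits
assert to_f32(2**24 + 1) == 2**24       # 25 bits: the first rounding loss
```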

More importantly: the Opus codec is lossy. Even at its highest supported bitrate (192 kbps stereo with High Fidelity Music Mode), Opus permanently discards audio information as part of the encoding process. The distortion introduced by Opus compression is orders of magnitude larger than any theoretical quality difference between 16-bit and 24-bit source material. Zoom itself only describes audio quality in terms of bitrate — there is no mention of bit depth anywhere in Zoom's audio documentation.

ClipHack outputs 24-bit WAV not because 24-bit matters to Zoom, but because 24-bit preserves full quality through the processing chain itself. All internal stages operate at 24-bit PCM; keeping the output at 24-bit means the file can be used elsewhere (in a DAW, in WaxOff, in any other tool) without any loss.

What actually matters for Zoom audio quality

If you care about how your clips sound through Zoom, these are the things worth paying attention to:

Bitrate. Enable High Fidelity Music Mode: it raises the Opus bitrate (up to 96 kbps mono or 192 kbps stereo) and turns off Zoom's echo cancellation, noise suppression, and AGC, all of which would otherwise reprocess your already-processed clips.

Levels. Clips that arrive leveled and peak-limited won't trigger sudden loudness jumps or codec clipping on the receiving end; that is exactly what the ClipHack pipeline produces.

Not bit depth or sample rate. Everything is resampled to 48 kHz and re-encoded by Opus regardless of the source format, so there is nothing to gain from delivering "higher quality" files past that ceiling.

Overview

ClipHack is a thin orchestration layer over FFmpeg. It builds filter chains, runs FFmpeg subprocesses, and manages temp files. There is no custom audio DSP in the app — all signal processing is done by FFmpeg and the bundled RNNoise model.

The pipeline is strictly sequential. Each optional stage writes an intermediate WAV to a temp directory, which becomes the input for the next stage. Every intermediate is 24-bit PCM, so the quantization added between stages is far below audibility, and each stage's output can be inspected on its own when a result sounds wrong.
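
As a sketch of the orchestration, here is roughly what the per-stage chaining could look like, written in Python rather than the app's Swift; the stage names and filter arguments are illustrative defaults, not ClipHack's actual settings.

```python
from pathlib import Path

# Hypothetical sketch of ClipHack-style orchestration: one ffmpeg subprocess
# per enabled stage, chained through 24-bit PCM intermediates. Filter
# arguments here are illustrative, not ClipHack's actual settings.
STAGE_FILTERS = {
    "leveling": "dynaudnorm",
    "loudnorm": "loudnorm=I=-16:TP=-1:LRA=11",
    "limiter":  "alimiter",
}

def build_commands(src, dst, stages, tmp):
    cmds, current = [], src
    for i, stage in enumerate(stages):
        out = dst if i == len(stages) - 1 else str(Path(tmp) / f"stage{i}.wav")
        cmds.append(["ffmpeg", "-y", "-i", current,
                     "-af", STAGE_FILTERS[stage],
                     "-c:a", "pcm_s24le", out])   # 24-bit intermediates
        current = out                              # next stage reads this file
    return cmds

cmds = build_commands("clip.wav", "clip-out.wav",
                      ["leveling", "loudnorm", "limiter"], "/tmp/cliphack")
```

Each command list would be handed to a subprocess runner; only the last stage writes to the final destination.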

RNNoise / arnndn

RNNoise is a noise suppression algorithm developed at Mozilla Research, published in 2018. It uses a recurrent neural network (GRU-based) trained on speech and noise samples. FFmpeg exposes it as the arnndn filter, which requires a model file.

What it does

arnndn processes audio in short frames (~10 ms), estimating a voice-activity probability and per-band gains that attenuate non-speech energy. It's designed for voice — it suppresses broadband noise (HVAC, room hiss, fan noise) while preserving speech intelligibility.

It is not a general-purpose noise suppressor. It works poorly on music, non-speech content, or intermittent transient noise (clicks, pops). For broadcast voice clips, it's typically effective — for anything else, leave it off.

Stereo handling

arnndn is a mono filter. For stereo files, ClipHack splits the channels with channelsplit, runs arnndn independently on L and R, then recombines them with join:

[0:a]channelsplit=channel_layout=stereo[L][R];
[L]arnndn=m=rnnoise[Lnr];
[R]arnndn=m=rnnoise[Rnr];
[Lnr][Rnr]join=inputs=2:channel_layout=stereo

Processing channels independently preserves the stereo image — a stereo NR pass on the downmixed signal would collapse phase relationships.
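
For context, a complete invocation around that filtergraph might look like the following sketch; the function name and model path are illustrative, not ClipHack's internals.

```python
# Sketch: embed the stereo-split arnndn filtergraph from above in a full
# ffmpeg command. The model path is illustrative, not the bundled one.
def stereo_arnndn_cmd(src, dst, model="rnnoise.model"):
    graph = (
        "[0:a]channelsplit=channel_layout=stereo[L][R];"
        f"[L]arnndn=m={model}[Lnr];"
        f"[R]arnndn=m={model}[Rnr];"
        "[Lnr][Rnr]join=inputs=2:channel_layout=stereo[out]"
    )
    return ["ffmpeg", "-y", "-i", src,
            "-filter_complex", graph, "-map", "[out]",
            "-c:a", "pcm_s24le", dst]

cmd = stereo_arnndn_cmd("in.wav", "denoised.wav")
```

Because the graph has labeled outputs, -filter_complex with an explicit -map is used rather than -af.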

Model file

The RNNoise model is bundled inside the app as a binary resource file. It's the standard model distributed with FFmpeg's arnndn filter. The model is loaded at runtime via its file path — it's not compiled into FFmpeg itself.

dynaudnorm

dynaudnorm (Dynamic Audio Normalizer) is an FFmpeg filter that applies time-varying gain normalization. Unlike static normalization (which applies a single gain to the whole file), dynaudnorm adjusts gain frame-by-frame to even out level variation over time.

What it does

The filter divides the signal into overlapping frames, determines a local gain for each frame (from the frame's peak magnitude by default, or its RMS when RMS-based normalization is enabled), and smooths the gain values with a Gaussian window to avoid abrupt changes. The result is a clip where loud and quiet sections are brought closer together in level — without the pumping or breathing of a conventional compressor.
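
The idea can be sketched in a few lines of Python. This toy version uses per-frame RMS and arbitrary target, frame, and window sizes; it illustrates the smoothing concept and is not FFmpeg's implementation.

```python
import math

# Toy illustration: per-frame levels become per-frame gains toward a target,
# then a Gaussian window smooths the gain curve so changes stay gradual.
def frame_rms(frame):
    return math.sqrt(sum(v * v for v in frame) / len(frame))

def smoothed_gains(samples, frame_len=500, target=0.25, win=5, max_gain=10.0):
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    raw = [min(target / max(frame_rms(f), 1e-9), max_gain) for f in frames]
    sigma = win / 2.0
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in range(-win, win + 1)]
    gains = []
    for i in range(len(raw)):
        num = den = 0.0
        for k, w in zip(range(-win, win + 1), kernel):
            j = min(max(i + k, 0), len(raw) - 1)   # clamp at the edges
            num += w * raw[j]
            den += w
        gains.append(num / den)
    return gains

# A clip that starts quiet and gets loud ends up with more gain at the start,
# and the transition between the two gain regions is gradual, not a step.
gains = smoothed_gains([0.05] * 2000 + [0.5] * 2000)
```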

When to use it

dynaudnorm works well on broadcast clips with inconsistent levels — a reporter clip that starts quiet and gets louder, or a clip with multiple speakers at different distances. It is not recommended for music beds or clips whose dynamic contrast is intentional (leveling flattens that intent), or for material that is already heavily compressed, where further leveling mostly raises the noise floor between phrases.

Aggressiveness parameters

ClipHack exposes three dynaudnorm parameters via the Aggressiveness slider. All three scale together:

Three parameters are always fixed regardless of the aggressiveness setting:

EBU R128 / loudnorm

EBU R128 is the European Broadcasting Union's standard for loudness normalization in broadcast. It defines loudness in LUFS (Loudness Units relative to Full Scale), measured using ITU-R BS.1770 gating — a perceptually weighted measurement that ignores silence and very quiet passages.

Two-pass normalization

ClipHack uses FFmpeg's loudnorm filter in two-pass mode:

  1. Analysis pass — measures the integrated loudness, true peak, and loudness range of the file
  2. Normalization pass — applies a linear gain to bring the integrated loudness to the target LUFS

This is linear normalization — the gain applied is a single constant value calculated from the analysis. No dynamic processing occurs. The stereo image and dynamics are completely unchanged; only the overall level shifts.

Single-pass loudnorm (FFmpeg's default) uses an estimation algorithm that can introduce subtle dynamic artifacts. Two-pass avoids this entirely.
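
The two passes map to two FFmpeg invocations, sketched here in Python. The target values are illustrative; the measured_* keys mirror the JSON field names that loudnorm prints after the analysis pass.

```python
# Sketch of two-pass loudnorm commands (targets illustrative).
def loudnorm_passes(src, dst, I=-16.0, TP=-1.0, LRA=11.0, measured=None):
    if measured is None:
        # Pass 1: analysis only. Audio is discarded (-f null); the JSON
        # loudness stats appear in FFmpeg's log output.
        af = f"loudnorm=I={I}:TP={TP}:LRA={LRA}:print_format=json"
        return ["ffmpeg", "-i", src, "-af", af, "-f", "null", "-"]
    # Pass 2: feed the measured values back and request linear (static) gain.
    af = (f"loudnorm=I={I}:TP={TP}:LRA={LRA}"
          f":measured_I={measured['input_i']}"
          f":measured_TP={measured['input_tp']}"
          f":measured_LRA={measured['input_lra']}"
          f":measured_thresh={measured['input_thresh']}"
          ":linear=true")
    return ["ffmpeg", "-y", "-i", src, "-af", af, "-c:a", "pcm_s24le", dst]

analysis = loudnorm_passes("clip.wav", "clip-out.wav")
```

With linear=true and all four measured values supplied, loudnorm applies a single constant gain, which is the behavior described above.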

Interaction with the limiter

Loudness normalization runs before the limiter. If the normalization gain pushes any peaks above the limiter ceiling, the limiter will catch them. In practice, for most broadcast clips at sensible targets (−16 to −23 LUFS), this is rare — but it's the correct order: normalize loudness, then constrain peaks.

Target selection

−16 LUFS is a sensible default for podcast delivery and is in line with common podcast platform recommendations. −23 LUFS is the EBU R128 broadcast target, appropriate when the output feeds a chain that expects R128-conformed material. Whichever target you choose, keep it consistent across a batch: matching clips to each other matters more than the absolute number.

Brick-Wall Limiter

The limiter is the final stage and cannot be disabled. It prevents any sample from exceeding the configured ceiling, regardless of what the upstream stages produced.

True peak vs. sample peak

Digital audio is stored as discrete samples. When the samples are converted back to a continuous signal (D/A conversion), the reconstructed waveform can exceed the highest sample value; this is called an inter-sample peak or true peak. A signal constrained only by a sample-peak limiter can read 0 dBFS on a meter while its reconstructed analog waveform clips.
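
A toy example makes this concrete: a sine at one quarter of the sample rate, offset by 45 degrees, never places a sample on its crest, so its sample peak is 0.707 while its true peak is 1.0. Windowed-sinc interpolation, a crude stand-in for a limiter's oversampling, reveals the difference. (Standalone Python, not ClipHack code.)

```python
import math

def sample_peak(x):
    return max(abs(v) for v in x)

def oversampled_peak(x, factor=4, taps=32):
    # Band-limited interpolation via a truncated sinc kernel.
    peak = 0.0
    # Skip the edges so every interpolation point has a full set of taps.
    for i in range(taps * factor, (len(x) - taps) * factor):
        t = i / factor                     # fractional sample position
        acc = 0.0
        for n in range(int(t) - taps, int(t) + taps + 1):
            d = t - n
            w = 1.0 if d == 0 else math.sin(math.pi * d) / (math.pi * d)
            acc += x[n] * w
        peak = max(peak, abs(acc))
    return peak

# Sine at fs/4 with a 45-degree phase offset: samples are all +/-0.707.
x = [math.sin(math.pi * n / 2 + math.pi / 4) for n in range(256)]
print(round(sample_peak(x), 3))       # 0.707  (what a sample-peak meter sees)
print(round(oversampled_peak(x), 3))  # ~1.0   (the true, inter-sample peak)
```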

ClipHack uses alimiter with 2× oversampling. Upsampling to 2× the sample rate reveals inter-sample peaks that aren't visible at the original rate. The limiter constrains the signal at the oversampled rate before downsampling back — so the output's true peak (not just sample peak) is bounded by the ceiling. Note that ITU-R BS.1770-4 requires 4× oversampling at 44.1/48 kHz to keep true peak measurement error below ±0.1 dB; 2× is a practical engineering compromise that catches the majority of inter-sample peaks and is sufficient for Zoom delivery, where the downstream Opus encoder is lossy regardless.

Ceiling selection

−1 dBTP is the most common setting. Podcast platforms typically require true peak at −1 or −2 dBTP. For clips being dropped into a broadcast chain that will do its own normalization, −3 dBTP gives more headroom. Lower settings (−6 dBTP) are rarely needed and will begin to reduce average loudness noticeably.

LUFS Measurement

ClipHack measures the loudness of each file before processing. This tells you what you're starting with, and lets you compare the original to the output.

ITU-R BS.1770 gating

Integrated loudness is measured using the BS.1770 algorithm: the signal is K-weighted (a frequency weighting that models human loudness perception), then measured in 400 ms blocks with 75% overlap. Blocks below −70 LUFS are excluded as silence. Blocks more than 10 LU below the loudness of the blocks that survive the absolute gate are also excluded; this relative gate prevents quiet passages from pulling the integrated loudness down.

The result is a single number in LUFS representing the perceived loudness of the audible program content.
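
A toy sketch of the gating logic follows (mono, K-weighting omitted for brevity, so absolute readings differ from a real meter by the filter's contribution; the −0.691 offset, block size, overlap, and both gate thresholds follow BS.1770):

```python
import math

def integrated_lufs(samples, rate=48000):
    blk = int(0.4 * rate)                  # 400 ms measurement blocks
    hop = blk // 4                         # 75% overlap
    powers = [sum(v * v for v in samples[i:i + blk]) / blk
              for i in range(0, len(samples) - blk + 1, hop)]
    def lk(p):                             # block power -> loudness (LUFS)
        return -0.691 + 10 * math.log10(max(p, 1e-12))
    abs_gated = [p for p in powers if lk(p) > -70.0]      # absolute gate
    if not abs_gated:
        return float("-inf")
    thresh = lk(sum(abs_gated) / len(abs_gated)) - 10.0   # relative gate
    final = [p for p in abs_gated if lk(p) > thresh] or abs_gated
    return lk(sum(final) / len(final))

# Full-scale 1 kHz sine, one second at 48 kHz.
tone = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(48000)]
print(round(integrated_lufs(tone), 2))  # -3.7 (a K-weighted meter reads about -3.0)
```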

Noise floor warning

When the noise floor is high relative to the speech content, the gating algorithm may include noise frames in the loudness measurement, making the result artificially high. ClipHack detects this condition and shows a warning badge on the file. The measurement is still displayed — it's just flagged as potentially unreliable.

FFmpeg Pipeline

ClipHack bundles a full FFmpeg binary inside the app. Each processing stage runs FFmpeg as a subprocess, reading the output of the previous stage. Intermediate files are 24-bit PCM WAV to preserve full quality through the chain.

Why separate subprocesses?

The alternative would be a single FFmpeg invocation with one long filter chain. This is technically possible, but separate subprocesses make failures attributable to a specific stage, allow per-stage progress reporting, and keep each stage's FFmpeg arguments simple enough to log and debug.

Temp files

Intermediate files are written to a temporary directory that's cleaned up after processing. If processing fails, ClipHack attempts to clean up temp files but may leave them in /tmp if the process is interrupted.

Concurrency

Files in the queue are processed concurrently using Swift's structured concurrency (async/await with a task group). Each file gets its own FFmpeg subprocess chain. The number of concurrent files is not artificially limited — it scales with the system's CPU and I/O capacity.