Theory of Operation

Film audio is engineered for a specific environment — a calibrated theatrical space at 85 dB SPL — that has nothing in common with headphones, laptop speakers, or a living room couch. This page explains the full signal chain FilmStrip uses to bridge that gap, and the decades of audio engineering history behind every step.

This document assumes familiarity with basic audio concepts (dBFS, sample rate, dynamic range). It's written for audio engineers, curious users, and anyone who wants to understand why the app does what it does.

Signal Chain at a Glance

1. Source File
2. Dialog Guard (optional; 5.1/7.1 only): center channel norm
3. High Pass Filter: 80 Hz, 24 dB/oct
4. Downmix + Resample: 44.1 kHz stereo
5. Level Riding (optional): dynaudnorm on stereo
6. Peak Limit: −1 dBTP
7. Loudnorm (optional): two-pass EBU R128
8. Output: WAV / M4A

Optional stages (Dialog Guard, Level Riding, and Loudnorm) run only when enabled in settings. The source file is never modified.

A Brief History of Film Sound

The audio embedded in a modern Blu-ray is the product of seventy years of competing standards, corporate rivalries, and engineering breakthroughs. Understanding the format tells you a great deal about why the audio sounds the way it does — and why it needs processing to work outside a cinema.

1927

The Jazz Singer — Synchronized Sound

Warner Bros. Vitaphone

Al Jolson's improvised ad libs in The Jazz Singer marked the end of the silent era. The technology, Warner's Vitaphone, carried the sound on 16-inch shellac discs synchronized to the projector. It was cumbersome and fragile, but it worked well enough to make silence commercially untenable within three years.

By the early 1930s, optical sound tracks — a photographed waveform printed along the edge of the 35mm film — had replaced disc-based systems. That optical mono track, essentially unchanged in principle, remained the universal delivery format for decades. A projector lamp, a slit, and a photocell. The frequency response was narrow, the dynamic range limited, and the whole thing was exactly one channel.

1975

Dolby Stereo — Four Channels from Two Tracks

Dolby Laboratories

Lisztomania (1975) was the first theatrical film to use Dolby's matrix-encoded stereo format, followed by the full four-channel LCRS system on A Star Is Born (1976). The engineering trick: encode Left, Center, Right, and Surround information onto just two optical tracks on the existing 35mm print using phase relationships, then decode in the theater. No new film format required.

Star Wars (1977) is the landmark most people cite — and for good reason. The combination of Ben Burtt's sound design and Dolby's format made the theater experience radically different from anything that came before. But the audio was still matrix-encoded, meaning the channels were mathematically derived from two tracks rather than discrete. The surround channel, in particular, was mono and limited in bandwidth.

1992

Dolby Digital — Discrete 5.1 Between the Sprocket Holes

AC-3 · 5.1 Discrete

Batman Returns (1992) was the first wide theatrical release with Dolby Digital, also known by its codec name AC-3. The six discrete audio channels — Left, Right, Center, Left Surround, Right Surround, and a dedicated subwoofer channel designated LFE (Low Frequency Effects) — were encoded digitally and printed between the sprocket holes of the 35mm film.

The "5.1" naming convention comes from this: five full-range channels plus one band-limited LFE channel. The .1 channel is only 120 Hz wide and exists specifically for bass effects — explosions, impacts, rumbles — where extreme SPL is needed but directional information is irrelevant.

1993

DTS & SDDS — Three Digital Systems, One Print

DTS Digital Theater Systems · Sony SDDS

Steven Spielberg was among the early investors in Digital Theater Systems and deliberately chose DTS for Jurassic Park (1993) as its commercial debut. Unlike Dolby Digital — which encoded audio directly onto the film — DTS stored the audio on a separate CD-ROM that theaters received alongside the print. A timecode track on the optical strip kept the two synchronized.

For roughly a decade, major studio prints carried all three digital formats simultaneously: Dolby Digital between the sprocket holes, a DTS timecode stripe, and SDDS data on the outer edges — along with the original analog optical track as a fallback. A projector would try each format in priority order and drop back to the next if it failed.

2006

Blu-ray — Lossless Audio in the Home

Dolby TrueHD · DTS-HD Master Audio

Blu-ray (and briefly HD DVD) introduced lossless audio to home video. Dolby TrueHD and DTS-HD Master Audio are both mathematically lossless — they produce bit-for-bit identical output to the studio master. Previous home formats (DVD's Dolby Digital and DTS) were lossy, discarding audio information to fit in the available data budget.

This matters for FilmStrip: a Blu-ray rip in MKV format will typically contain a TrueHD or DTS-HD Master Audio track at the source sample rate and channel count of the original master. That's usually 48 kHz and 7.1 channels — eight discrete channels at studio quality, compressed losslessly. When FilmStrip extracts and downmixes this to 44.1 kHz stereo WAV, it's working with the best possible source material.

2012

Dolby Atmos — Objects in Space

Object-Based Audio · Pixar Brave

Pixar's Brave (June 2012) was the first feature mixed entirely for Dolby Atmos, premiering at the Dolby Theatre in Los Angeles. Atmos represents a fundamental shift in how cinema audio is authored: rather than assigning sounds to fixed channels, a mixer positions sounds as objects in three-dimensional space with X/Y/Z coordinates and metadata. The theatrical renderer interprets those coordinates and routes audio to whatever speaker configuration the specific room has — up to 64 channels.

On Blu-ray, Atmos is delivered as a TrueHD track with an additional metadata layer for height information. A compatible receiver unpacks the objects; an incompatible player simply decodes the core TrueHD 7.1 mix. For FilmStrip's purposes, an Atmos track is a TrueHD track at 48 kHz, and the downmix is handled identically.

Why Film Audio Sounds Wrong at Home

Film audio is mixed in a room calibrated to a specific playback level. When you take that mix out of its intended environment, the dynamic range that was designed to create dramatic impact becomes an obstacle to simple comprehension.

The 85 dB SPL Reference

SMPTE RP200, the standard for theatrical sound calibration, specifies that pink noise at −20 dBFS should measure 85 dB SPL (C-weighted, slow) at the primary listening position — every channel, independently calibrated to the same level. This is the reference level at which theatrical mixers do their work.

With 20 dB of headroom above that reference, a theatrical system can produce 105 dB peak SPL per channel — louder than a rock concert, louder than a jet at 300 feet. The dynamic range between a whispered line of dialogue and a full-scale action sequence is typically 20–25 dB. By design.

Home Listening Reality

Home theater receivers are typically calibrated to 79–82 dB SPL at the listening position — 3 to 6 dB below theatrical reference. Headphone and laptop listening happens at 70–78 dB or lower. That gap isn't just about volume: the human ear's frequency and dynamic sensitivity changes with absolute playback level (the Fletcher-Munson effect), which means the perceived dynamic range shrinks as you turn down.

At 78 dB, a scene's dynamic range effectively compresses by about 7 dB. What read as a whisper and a shout in the theater becomes a murmur and a raised voice at home. The problem compounds with ambient noise — traffic, HVAC, household activity — masking the quiet end of the range entirely.

The Dialogue Problem

Film dialogue lives in the center channel, which in a 5.1 or 7.1 mix carries the primary voice frequencies (roughly 200 Hz – 4 kHz). Small speakers, earbuds, and laptop transducers have poor response in this midrange. Music and effects tend to occupy lower and higher frequencies where small speakers often perform better, or at least differently, making dialogue feel recessed even when it's not.

The theatrical mix assumes you can hear at −31 LUFS because your room is quiet, your speakers are large, and you're calibrated. None of those assumptions hold on headphones during your commute.

The Numbers

A typical theatrical mix integrates around −31 LUFS, and Blu-ray and streaming film deliveries sit near −27 LUFS, which is Netflix's delivery standard. Spotify normalizes music to −14 LUFS. Apple Music targets −16 LUFS. That difference, up to 13 LU between a home-video film mix and a streaming pop track, corresponds to a factor of roughly 4.5× in signal amplitude (about 2.5× in perceived loudness, taking 10 LU as a doubling). The film is simply much quieter by design, intended for a much louder room.

Typical Integrated Loudness Targets

Theatrical film | −31 LUFS
Blu-ray / streaming film | −27 LUFS
Broadcast (EBU R128) | −23 LUFS
Apple Music | −16 LUFS
Spotify | −14 LUFS

Loudness normalization in FilmStrip bridges this gap: it brings the integrated loudness of the extracted audio up to a target that matches your listening context.
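The size of these gaps can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the "10 LU is roughly a doubling of perceived loudness" rule is a common rule of thumb assumed here, not something defined by EBU R128.

```python
def loudness_gap(source_lufs: float, target_lufs: float) -> dict:
    """Describe the gain needed to move a programme from source to target loudness."""
    gap_lu = target_lufs - source_lufs        # 1 LU is the same step size as 1 dB
    amplitude_ratio = 10 ** (gap_lu / 20)     # linear gain applied to the samples
    # Assumption (rule of thumb): +10 LU sounds about twice as loud.
    perceived_ratio = 2 ** (gap_lu / 10)
    return {
        "gap_lu": gap_lu,
        "amplitude_ratio": amplitude_ratio,
        "perceived_ratio": perceived_ratio,
    }


# A film delivered near -27 LUFS compared with music normalized to -14 LUFS.
gap = loudness_gap(-27.0, -14.0)
print(f"{gap['gap_lu']:.0f} LU, x{gap['amplitude_ratio']:.2f} amplitude, "
      f"about x{gap['perceived_ratio']:.1f} perceived")
```

The same function applied to a −31 LUFS theatrical mix gives a 17 LU gap, which is why an unprocessed film track feels so quiet next to a music library.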

From Surround to Stereo

Extracting a 7.1 audio track from an MKV and rendering it as stereo requires folding eight channels into two. The math is defined by international standards and is not lossless in any perceptual sense — but it is principled.

The Channel Layout

A 7.1 mix contains eight discrete channels arranged around the listener:

Channel | Abbrev. | Position | Primary Content
Front Left | FL | Front-left speaker | Music (left), sound effects, ambience
Front Right | FR | Front-right speaker | Music (right), sound effects, ambience
Center | FC | Center screen speaker | Dialogue, almost exclusively
LFE | LFE | Subwoofer (any location) | Bass effects: explosions, impacts (≤120 Hz)
Back Left | BL | Rear-left speaker | Ambient surround, rear effects
Back Right | BR | Rear-right speaker | Ambient surround, rear effects
Side Left | SL | Side-left speaker | Wide surround, room tone, passing sounds
Side Right | SR | Side-right speaker | Wide surround, room tone, passing sounds

The LoRo Downmix Matrix

FilmStrip uses a LoRo (Left-Only/Right-Only) downmix, the standard defined by ITU-R BS.775 and ATSC A/52. It is called "Left-Only/Right-Only" because the output channels are independently derived — there is no phase encoding between them. The result plays correctly on any stereo system without special decoding.

The alternative is LtRt (Left Total/Right Total), which uses phase manipulation to encode surround information in a way that a Dolby Pro Logic decoder can later extract. LtRt produces a wider apparent soundstage on a compatible system but degrades on simple stereo playback. For a file you intend to listen to on headphones or a stereo player, LoRo is correct.

Implementation note: rather than relying on FFmpeg's default downmix matrix, FilmStrip applies an explicit pan filter so the center-channel weight can be set deliberately. The coefficients are:

Lout = 1.000 × FC + 0.707 × FL + 0.707 × BL + 0.500 × SL
Rout = 1.000 × FC + 0.707 × FR + 0.707 × BR + 0.500 × SR

Gains: 1.000 = unity, 0.707 = −3.01 dB, 0.500 = −6.02 dB. LFE is omitted entirely from the stereo output.
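To make the matrix concrete, here is a minimal sketch that applies the same coefficients to a single sample frame. The dictionary-based frame representation and the function name are assumptions made for illustration; the actual work in FilmStrip is done by ffmpeg's pan filter, not by code like this.

```python
# Downmix coefficients taken from the matrix above; LFE is deliberately absent.
LEFT = {"FC": 1.000, "FL": 0.707, "BL": 0.707, "SL": 0.500}
RIGHT = {"FC": 1.000, "FR": 0.707, "BR": 0.707, "SR": 0.500}


def downmix_frame(frame: dict) -> tuple:
    """Fold one 7.1 sample frame (channel name -> sample value) to (left, right)."""
    left = sum(gain * frame.get(ch, 0.0) for ch, gain in LEFT.items())
    right = sum(gain * frame.get(ch, 0.0) for ch, gain in RIGHT.items())
    return left, right


# Dialogue only in the center: both outputs receive it at unity gain.
print(downmix_frame({"FC": 0.25}))
# An effect panned hard front-left plus LFE rumble: the LFE contributes nothing.
print(downmix_frame({"FL": 0.5, "LFE": 0.8}))
```

Channels missing from a frame are treated as silent, so the same function also covers a 5.1 source that has no side channels.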

Why These Gain Values?

The textbook LoRo matrix mixes the center channel at 0.707 (−3 dB) — the equal-power sum point. The argument: if FC were summed at full gain alongside content in FL, the sum at the left output could exceed either source alone by 3 dB, risking a louder and spatially confused image. Mixing FC at 0.707 keeps the front soundstage balanced.

FilmStrip deviates from this convention and folds FC at unity gain. Films, unlike music, place dialog almost exclusively in the center channel (see Dialog Guard, below), and a 3 dB attenuation makes voices feel recessed in stereo playback. The brick-wall limiter at the end of the chain catches any inter-sample peaks that result from summing FC at full level.

The side channels get an additional −3 dB reduction (0.5, or −6 dB total) because they carry primarily ambient content — room tone, diffuse reverb, passing sounds. Folding them in at full surround level would fill the stereo image with diffuse content and reduce its clarity.
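The decibel figures quoted for each coefficient, and the reason a limiter is needed after summing, can be verified numerically. This is a small sketch; the worst-case scenario it computes (every contributing channel at full scale and in phase at the same instant) is a deliberately pessimistic assumption, not a measurement of real film content.

```python
import math


def to_db(gain: float) -> float:
    """Convert a linear amplitude gain to decibels."""
    return 20 * math.log10(gain)


# Per-output coefficients from the downmix matrix.
coefficients = {"FC": 1.000, "FL/FR": 0.707, "BL/BR": 0.707, "SL/SR": 0.500}
for name, gain in coefficients.items():
    print(f"{name:6s} {gain:.3f} -> {to_db(gain):+.2f} dB")

# Pessimistic case: all four contributing channels hit full scale together.
worst_case = sum(coefficients.values())
print(f"worst-case sum {worst_case:.3f} -> {to_db(worst_case):+.2f} dB over full scale")
```

Real mixes rarely come close to this bound, but any sum above 1.0 would clip, which is the case the limiter at the end of the chain exists to catch.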

Why LFE is dropped: The LFE channel is band-limited to approximately 120 Hz and carries content specifically mixed for a subwoofer running at cinema reference levels. Adding it to the stereo output at any significant gain would cause uncontrolled low-frequency energy on systems that cannot handle it (small speakers, earbuds, laptop transducers), risking distortion or damage. ITU explicitly specifies that LFE should not be included in a stereo downmix. The full-range channels already contain the low-frequency content of the music and ambient effects.

What You Actually Hear

After the downmix, dialogue arrives equally in both channels at unity gain, placing it as a centered phantom image between the speakers and at the same level it had in the original mix. Music and scored elements retain their left/right panning from the front channels. Surround ambience folds into both channels at reduced level, adding a sense of space without dominating the image. Sub-bass effects disappear — explosions and impacts remain audible through the full-range channels but lose their low-frequency weight.

The result is a serviceable, intelligible stereo mix that prioritizes the most important element — dialogue — and preserves the music and primary effects in a form that translates well to headphones.

The Processing Pipeline

Each step runs in sequence, writing intermediate results to a temporary directory. The source file is never modified.

Pass 1: Extraction

ffmpeg decodes the selected audio stream. If Dialog Guard is enabled on a 5.1 or 7.1 source, it first splits out the center channel, normalizes it independently, and reassembles the multichannel signal before the downmix. For stereo or mono sources, Stereo Dialog Assist performs the equivalent operation on the mid (L+R) signal. If the High Pass Filter is enabled, two cascaded 80 Hz biquads (highpass=f=80,highpass=f=80) run next, rolling off low-frequency content below 80 Hz at 24 dB/oct before the downmix.

The filter chain then downmixes to stereo via an explicit pan filter using the matrix described in From Surround to Stereo, resamples to 44.1 kHz, and applies Level Riding (dynaudnorm) on the stereo mix when enabled. A brick-wall limiter (alimiter=limit=0.99:attack=5:release=50:level=false) runs last to catch any inter-sample peaks from channel summation. The result is a temporary 24-bit PCM WAV file.

24-bit depth is chosen rather than 16-bit to preserve headroom for subsequent processing steps. Loudness normalization may apply a gain of several dB; 24 bits provides 144 dB of dynamic range so there is no meaningful resolution loss at any gain setting.
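The stage order described for the extraction pass can be sketched as a function that assembles an ffmpeg -af chain for a 7.1 source without Dialog Guard. The function name and flags are hypothetical, and the pan expression is one plausible way to write the documented matrix; the exact string FilmStrip passes to ffmpeg is not shown in this document.

```python
# One way to express the documented downmix matrix in ffmpeg's pan syntax.
PAN = (
    "pan=stereo"
    "|FL=1.0*FC+0.707*FL+0.707*BL+0.5*SL"
    "|FR=1.0*FC+0.707*FR+0.707*BR+0.5*SR"
)


def extraction_chain(high_pass: bool = True, level_riding: bool = True) -> str:
    """Assemble a comma-separated -af chain in the order described above."""
    stages = []
    if high_pass:
        stages += ["highpass=f=80", "highpass=f=80"]   # two cascaded biquads
    stages.append(PAN)                                  # 7.1 to stereo fold-down
    stages.append("aresample=44100")                    # resample to 44.1 kHz
    if level_riding:
        stages.append("dynaudnorm=p=0.90:m=1.5:g=31")   # runs on the stereo mix
    # The limiter always runs last.
    stages.append("alimiter=limit=0.99:attack=5:release=50:level=false")
    return ",".join(stages)


print(extraction_chain())
```

When the chain is passed as a single argument in a subprocess argument list, the | characters in the pan expression need no shell quoting. Writing the 24-bit intermediate would additionally need an output codec such as pcm_s24le, which is assumed here rather than taken from the source.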

Pass 2 (optional): Loudness Analysis

The loudness normalization process requires two ffmpeg passes. The first pass runs ffmpeg with the loudnorm filter in analysis mode and sends the audio output to /dev/null, so no audio is written. At the end of the pass, the filter prints a JSON block to stderr containing the measured integrated loudness (LUFS), true peak (dBTP), loudness range (LRA), and threshold values.

This analysis runs against the already-downmixed and (optionally) level-ridden WAV from Pass 1 — so the loudness measurement reflects the final dynamics of the audio, not the original theatrical mix.
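Reading the analysis results back is mostly a matter of locating the JSON block in ffmpeg's stderr. The sketch below assumes the block is the last brace-delimited object in the captured text and that values arrive as strings, which matches loudnorm's print_format=json output; the sample text and its numbers are invented for illustration.

```python
import json


def parse_loudnorm_report(stderr_text: str) -> dict:
    """Extract the last {...} block from ffmpeg's stderr and decode it."""
    start = stderr_text.rindex("{")
    end = stderr_text.rindex("}") + 1
    raw = json.loads(stderr_text[start:end])
    # loudnorm reports numbers as strings, so convert the ones needed later.
    keys = ("input_i", "input_tp", "input_lra", "input_thresh", "target_offset")
    return {key: float(raw[key]) for key in keys}


# Illustrative stderr tail; the numbers are made up.
SAMPLE_STDERR = """\
[Parsed_loudnorm_0 @ 0x0]
{
    "input_i" : "-27.61",
    "input_tp" : "-4.47",
    "input_lra" : "18.06",
    "input_thresh" : "-39.20",
    "output_i" : "-18.02",
    "output_tp" : "-1.00",
    "output_lra" : "14.50",
    "output_thresh" : "-29.10",
    "normalization_type" : "dynamic",
    "target_offset" : "0.02"
}
"""

report = parse_loudnorm_report(SAMPLE_STDERR)
print(report)
```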

Pass 3 (optional): Loudness Normalize

The second pass provides the measured values back to the loudnorm filter along with the target LUFS. With linear=true, the filter computes a single static gain value: the difference between the measured integrated loudness and the target, accounting for true peak ceiling. That gain is applied uniformly across the entire file.

Linear mode preserves dynamics exactly. The output has the same waveform shape as the input — only louder or quieter by a fixed amount. This is ideal after level riding, which has already done the dynamic work.
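Feeding the measurements back can be sketched as building the second-pass filter string. The option names (measured_I, measured_TP, measured_LRA, measured_thresh, offset, linear) are loudnorm's own; the function name, the dictionary keys, and the sample measurements are assumptions for illustration, with the −18 LUFS target and −1 dBTP ceiling taken from the defaults described elsewhere on this page.

```python
def second_pass_filter(measured: dict, target_i: float = -18.0,
                       true_peak: float = -1.0) -> str:
    """Build a loudnorm filter string that reuses first-pass measurements."""
    return (
        f"loudnorm=I={target_i}:TP={true_peak}"
        f":measured_I={measured['input_i']}"
        f":measured_TP={measured['input_tp']}"
        f":measured_LRA={measured['input_lra']}"
        f":measured_thresh={measured['input_thresh']}"
        f":offset={measured['target_offset']}"
        ":linear=true"
    )


# Made-up measurements standing in for a real first-pass report.
measured = {
    "input_i": -27.61,
    "input_tp": -4.47,
    "input_lra": 18.06,
    "input_thresh": -39.2,
    "target_offset": 0.02,
}
print(second_pass_filter(measured))
```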

Why level riding before loudness normalization? The order is intentional. Level riding reduces the dynamic range of the audio. Loudness normalization then measures the resulting signal and sets the overall level. If the order were reversed — normalize first, then ride levels — the final output level would be unpredictable, because dynaudnorm's gain changes would shift the integrated loudness away from the target. The current order guarantees the output hits the target LUFS regardless of what level riding does to the dynamics.

Level Riding: Dynamic Audio Normalizer

Before automated systems existed, a broadcast engineer sat at a console and manually rode the fader — turning down peaks, pulling up quiet passages — to keep audio levels consistent for home listeners. The goal was the same as it is today: tame the dynamic range of content mixed for a different context. dynaudnorm is a modern, principled version of that process.

Frame Analysis

The filter divides the audio into short frames, 500 milliseconds by default. For each frame, it computes the peak magnitude. This is not RMS (average energy) but the sample peak: the highest instantaneous sample level in the frame. The gain needed to bring that peak to the target level p is then gain = p / peak.

If the result exceeds the maximum allowed gain m, the gain is clamped to m. This prevents very quiet frames (near-silence, room noise) from being boosted so aggressively that the noise floor becomes intrusive.
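The per-frame rule described above (gain = p / peak, clamped to m) fits in a few lines. This is a sketch of the rule as stated, not of dynaudnorm's internals; the function name is hypothetical, and the handling of a completely silent frame is an assumption.

```python
def frame_gain(peak: float, p: float = 0.95, m: float = 10.0) -> float:
    """Raw gain for one frame: bring the peak to p, but never exceed m."""
    if peak <= 0.0:
        return m  # assumption: digital silence simply receives the capped gain
    return min(p / peak, m)


print(frame_gain(0.5))                   # moderate frame: boosted toward the target
print(frame_gain(0.01))                  # near-silence: clamped at the maximum gain
print(frame_gain(1.0, p=0.90, m=1.5))    # full-scale frame: attenuated
print(frame_gain(0.1, p=0.90, m=1.5))    # -20 dBFS peak: would need 9x, capped at 1.5x
```

The last call shows why a low m keeps quiet material close to where the mix put it: the clamp, not the peak target, decides the result.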

Gaussian Smoothing

A raw per-frame gain would cause audible pumping — volume jumping every half-second in sync with the analysis window. dynaudnorm prevents this by computing a Gaussian-weighted average across a neighborhood of frames (the window size g) and using the smoothed gain value instead of the raw one.

The Gaussian weighting means frames near the current position in time have more influence than frames far away. With the default window of 31 frames (about 15.5 seconds in total, split between lookahead and lookbehind), gain transitions are spread over several seconds and are imperceptible as discrete steps. The filter effectively "knows" what's coming and begins adjusting early.
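The smoothing step can be illustrated with a plain Gaussian-weighted moving average over per-frame gains. This is a generic sketch under stated assumptions: the sigma choice (window / 6) and the edge handling (repeating the first and last frame) are illustrative, and other details of the real filter's smoothing are not reproduced.

```python
import math


def gaussian_weights(window: int, sigma: float) -> list:
    """Normalized Gaussian weights for an odd-length window."""
    half = window // 2
    raw = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def smooth_gains(gains: list, window: int = 31) -> list:
    """Replace each frame's gain with a Gaussian-weighted neighborhood average."""
    sigma = window / 6.0  # assumption: the window spans roughly +/- 3 sigma
    weights = gaussian_weights(window, sigma)
    half = window // 2
    smoothed = []
    for i in range(len(gains)):
        acc = 0.0
        for k, w in enumerate(weights):
            j = min(max(i + k - half, 0), len(gains) - 1)  # clamp at the edges
            acc += w * gains[j]
        smoothed.append(acc)
    return smoothed


# A sudden jump from gain 1.0 to gain 3.0 becomes a gradual ramp.
step = [1.0] * 40 + [3.0] * 40
ramp = smooth_gains(step)
print([round(g, 2) for g in ramp[30:50]])
```

Because the average looks both backward and forward, the ramp starts before the step arrives, which is the lookahead behavior described above.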

The p Parameter: Peak Target

The peak target controls how aggressively loud frames are attenuated. A value of p = 0.95 means the loudest frame in each window will be brought to 95% of full scale (about −0.45 dBFS): very gentle, leaving most of the original dynamic range intact. At p = 0.55, loud frames are pulled down to 55% of full scale, about −5.2 dBFS, which is noticeably more aggressive.

The m Parameter: Maximum Gain

This is the parameter that actually closes dynamic range rather than just attenuating peaks. m sets the maximum gain that can be applied to a frame. At m = 1.0, gain can never exceed 1.0 — the filter can only attenuate loud frames, never boost quiet ones. Quiet dialogue stays quiet.

At m = 10.0, the filter can amplify a quiet frame by up to 10× (roughly +20 dB). Scenes where a character whispers, or where the only content is ambient room tone, get boosted significantly — narrowing the perceived gap between quiet and loud. This is the effective dynamic range compression that makes film audio comfortable to listen to without a home theater.

Headphone-Tuned Settings

FilmStrip ships a single, fixed tuning chosen for headphone listening:

dynaudnorm=p=0.90:m=1.5:g=31
(p = peak target, m = max gain, g = smoothing window)

Parameter | Value | Effect
p | 0.90 | Loud frames are pulled toward 90% of full scale (~−0.92 dBFS). Light peak compression; most of the original dynamics survive.
m | 1.5× | Maximum upward gain capped at ~+3.5 dB. Quiet scenes get a small, polite lift; the noise floor and room tone don't get dragged into the mix.
g | 31 | About 15.5 seconds of Gaussian-weighted context, split between lookahead and lookbehind. Gain transitions are too slow to perceive as pumping.

This tuning is effectively downward-only. With m = 1.5, the filter still passes the dynamics of a typical scene through largely intact: dialogue peaking at −20 dBFS (0.1 of full scale) would need a gain of 9× to reach the 0.90 target, far beyond the 1.5× cap, so it is lifted by at most 3.5 dB and stays close to where the mix put it. The few decibels of available upward gain rescue brief whispers and quiet exchanges without amplifying ambient hiss or HVAC rumble during a silent intro. The wider work of bringing quiet film dialog forward in the mix is handled separately by Dialog Guard (5.1/7.1) and Stereo Dialog Assist (stereo/mono), which act on the center / mid signal where dialog actually lives.

Why not pure downward-only (m = 1.0)? With m = 1.0, dynaudnorm becomes a one-way valve — it can only attenuate. A whispered line at −35 dBFS in an otherwise quiet scene receives no gain at all and stays buried. Setting m = 1.5 gives the filter just enough authority to rescue these moments (about 3.5 dB of headroom) without the side effect that higher m values cause: lifting the noise floor during the silence before the audio starts. It's the smallest amount of upward gain that still does useful work.

Dialog Guard: Center Channel Normalization

Film dialog lives almost exclusively in the center channel, a deliberate mixing decision that dates to the earliest days of surround sound. When a 5.1 or 7.1 mix is downmixed to stereo, the textbook matrix attenuates the center channel by 3 dB before summing it into both output channels. FilmStrip folds it at unity gain instead (see From Surround to Stereo), but even then the center must compete with every other channel folded into the same two outputs. If a scene's dialog was already mixed quietly relative to the action, it can fall below the threshold of comfortable comprehension. Dialog Guard addresses this at the source, before the downmix happens.

Why the Center Channel Specifically

In any standard 5.1 or 7.1 theatrical mix, the center channel (FC) carries dialog with essentially no other content. Explosions are in the LFE. Music is spread across FL and FR. Ambience and effects fill BL, BR, SL, and SR. The center is reserved for voices — and in the channel layout for both 5.1 and 7.1, FC is always index 2, making it unambiguously targetable regardless of the specific surround configuration.

Full-mix normalization tools like dynaudnorm with a wide window can miss brief quiet passages — a character whispering for a few seconds in an otherwise loud scene, for instance. The window average is dominated by the loud content surrounding the passage, so the gain assigned to that window remains low. The whisper stays quiet. Targeting the center channel independently, with a shorter window, allows these brief dips to be caught and corrected before they get buried further by the downmix.

How It Works

When Dialog Guard is enabled on a 5.1 or 7.1 source, the entire extraction pass switches from a simple -af filter chain to a -filter_complex graph that processes channels as parallel streams:

[0:a:N] channelsplit=channel_layout=5.1 [c0][c1][c2][c3][c4][c5]
    ↑ split all six channels into separate streams
[c2] dynaudnorm=p=0.88:m=3:g=15 [c2n]
    ↑ normalize center channel (FC) independently
[c0][c1][c2n][c3][c4][c5] amerge=inputs=6
    ↑ reassemble with normalized center, all other channels unchanged

The reassembled multichannel signal then continues through the normal filter chain: optional high-pass filter, stereo downmix, resample to 44.1 kHz, optional level riding on the stereo mix, and peak limiting. The center channel enters the downmix already normalized, so the fold-down operates on a signal that has already been leveled rather than on one that may have been quietly mixed to begin with.
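The three-step graph shown above can be generated for either layout with a small helper. This is a sketch: the function name and the [guarded] output label are hypothetical, and the semicolon-joined form is simply how separate chains are written in a single -filter_complex argument.

```python
def dialog_guard_graph(stream_index: int, channels: int) -> str:
    """Build a filter_complex graph that normalizes only the center channel."""
    layouts = {6: "5.1", 8: "7.1"}
    if channels not in layouts:
        raise ValueError("Dialog Guard applies to 5.1 and 7.1 sources only")
    pads = [f"c{i}" for i in range(channels)]
    split_out = "".join(f"[{p}]" for p in pads)
    pads[2] = "c2n"  # FC is index 2 in both layouts
    merge_in = "".join(f"[{p}]" for p in pads)
    return ";".join([
        f"[0:a:{stream_index}]channelsplit=channel_layout={layouts[channels]}{split_out}",
        "[c2]dynaudnorm=p=0.88:m=3:g=15[c2n]",
        f"{merge_in}amerge=inputs={channels}[guarded]",  # hypothetical output label
    ])


print(dialog_guard_graph(0, 6))
print(dialog_guard_graph(1, 8))
```

Only the pad list changes between 5.1 and 7.1; the center channel is addressed by the same index in both.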

Window Size: g=15

The Gaussian window of 15 frames covers approximately 7.5 seconds at the default 500 ms frame length, about half the 31-frame window used by Level Riding. This shorter window allows the filter to react to brief quiet passages that a wider window would smooth over. The tradeoff is slightly more audible gain transitions, but on a mono channel carrying dialog the effect is rarely perceptible.

Maximum Gain: m=3×

The maximum gain of 3× (~+9.5 dB) is applied only when the center channel's peak is very low — near-silence or extremely quiet dialog. In practice, the filter boosts passages by a few dB in most cases. The ceiling is set low enough that ambient room tone or background noise on the center channel doesn't become audible hiss during pauses in dialog, but high enough to lift a whispered line into intelligibility.

Stereo Dialog Assist

Stereo and mono sources have no separable center channel, but dialog still concentrates in the centered (mono) component of the mix. Stereo Dialog Assist performs an equivalent operation by decomposing the stereo signal into mid (sum) and side (difference) components, applying the same dynaudnorm=p=0.88:m=3:g=15 filter to the mid signal only, and reassembling. The side signal — which carries the stereo width: ambience, music spread, panned effects — is untouched, so the stereo image stays intact while the dialog (which sits in the mid) gets normalized. Mono sources are treated as pure-mid and run the same filter across the entire signal.
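The mid/side decomposition can be sketched on a single stereo sample pair. The halving in the encode step is one common convention and is an assumption here (the text describes mid simply as the sum); the scalar mid_gain stands in for whatever gain dynaudnorm would choose at that moment.

```python
def to_mid_side(left: float, right: float) -> tuple:
    """Encode L/R as mid (common content) and side (difference content)."""
    return (left + right) / 2.0, (left - right) / 2.0


def to_left_right(mid: float, side: float) -> tuple:
    """Decode mid/side back to L/R."""
    return mid + side, mid - side


def assist(left: float, right: float, mid_gain: float) -> tuple:
    """Apply a gain to the mid signal only, leaving the side signal untouched."""
    mid, side = to_mid_side(left, right)
    return to_left_right(mid * mid_gain, side)


left, right = 0.5, 0.25
boosted = assist(left, right, mid_gain=2.0)
print(boosted)
print(to_mid_side(*boosted))   # mid doubled, side unchanged
```

With mid_gain = 1.0 the round trip returns the original pair, which is why the stereo image survives when only the mid is processed.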

Works With Level Riding

Using Dialog Guard (or Stereo Dialog Assist) alongside Level Riding is the intended combination. The dialog-side filter normalizes a single mono signal where dialog lives, at a fast window optimized for short passages. Level Riding then normalizes the stereo mix after the downmix, at a slower window that smooths transitions across longer scenes. They operate at different stages, on different signals, for different purposes.

Loudness Normalization: EBU R128

For most of the digital audio era, engineers chased loudness by bricking signals against 0 dBFS with limiters. The result was audio that was measurably distorted, perceptually fatiguing, and impossible to compare across sources. EBU R128 changed the unit of measurement from peak level to perceived loudness — and eventually ended the war.

The Loudness War

The problem began with CD mastering in the 1990s. Peak normalization ensured that no sample exceeded 0 dBFS, and for a brief period, that was sufficient. Then labels began competing for shelf presence: a song that sounded louder in a record store — even by a few dB — was perceived as more powerful, more professional, more worth buying.

Engineers responded by compressing dynamic range and limiting peaks with increasing aggression, allowing average levels to rise without exceeding 0 dBFS. By 2008, some releases averaged −9 dBFS — saturated, clipping, often visibly distorted in a waveform editor. Metallica's Death Magnetic (2008) became the canonical example of extreme loudness war damage: the album version famously measured worse on dynamic range tests than the Guitar Hero video game version of the same recording.

The problem was the measurement. Peak level tells you nothing about how loud something sounds to a human listener. Two signals can share the same peak level and differ by 10 dB in perceived loudness. A new measurement standard was needed.

ITU-R BS.1770 and K-Weighting

The International Telecommunication Union published BS.1770 as a method for measuring the subjective loudness of audio programs. The core of the algorithm is K-weighting — a frequency filter that approximates how the human auditory system responds to sound at normal listening levels.

K-weighting applies two stages of filtering before the RMS measurement:

Stage 1: High-shelf filter (+4 dB above ~2 kHz). Models the acoustic effect of the head: the skull reflects high frequencies toward the ear, boosting their perceived level.
Stage 2: High-pass filter (−3 dB at 80 Hz). Models reduced sensitivity to very low frequencies at normal listening levels (per Fletcher-Munson curves).

After K-weighting, the algorithm measures mean square energy across overlapping 400 ms gating blocks, applies a two-stage threshold to exclude silence and near-silence, and reports the result in LUFS — Loudness Units relative to Full Scale. LUFS and LKFS (the ITU acronym: Loudness, K-weighted, relative to Full Scale) are identical measurements; the naming convention differs between the ITU and EBU standards but the values are interchangeable.
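The two-stage gating can be sketched from per-block energies. The constants (−0.691 offset, −70 LUFS absolute gate, relative gate 10 LU below the ungated-by-relative average) follow BS.1770; the sketch assumes each input value is already the K-weighted, channel-summed mean square of one 400 ms block, so the filtering and block overlap are not shown.

```python
import math


def block_loudness(mean_square: float) -> float:
    """Loudness in LUFS of one K-weighted mean-square value."""
    return -0.691 + 10 * math.log10(mean_square)


def integrated_loudness(block_mean_squares: list) -> float:
    """Gated integrated loudness from per-block mean-square energies."""
    # Stage 1: absolute gate drops blocks at or below -70 LUFS (silence).
    stage1 = [z for z in block_mean_squares if z > 0 and block_loudness(z) > -70.0]
    if not stage1:
        return float("-inf")
    # Stage 2: relative gate sits 10 LU below the loudness of the survivors.
    relative_gate = block_loudness(sum(stage1) / len(stage1)) - 10.0
    stage2 = [z for z in stage1 if block_loudness(z) > relative_gate]
    return block_loudness(sum(stage2) / len(stage2))


dialogue = [0.01] * 10    # blocks near -20.7 LUFS
room_tone = [1e-4] * 10   # blocks near -40.7 LUFS, removed by the relative gate
print(round(integrated_loudness(dialogue + room_tone), 2))
```

The quiet room-tone blocks do not pull the result down, which is the point of gating: long pauses should not make a programme measure quieter than its dialogue sounds.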

EBU R128: Loudness by Law

The European Broadcasting Union's Recommendation R128, first published in 2010, applied the ITU-R BS.1770 measurement algorithm to broadcast and mandated a target of −23 LUFS integrated loudness for all content delivered to European broadcasters. For the first time, a regulatory body had replaced peak normalization with loudness normalization as the delivery standard.

The immediate effect was that a heavily compressed pop record and a lightly processed spoken-word program, if both mastered to −23 LUFS, would play at the same perceived loudness on broadcast. The incentive to compress was removed: you could not get louder on air than a competitor by bricking your track harder, because the broadcaster's normalizer would bring both to the same level.

Streaming platforms followed. Spotify, Apple Music, YouTube, and others all adopted loudness normalization with targets ranging from −14 to −23 LUFS depending on their playback philosophy. The loudness war, at least for the formats that carry normalization metadata, was over.

Two-Pass Normalization

Accurate loudness normalization requires knowing the measured loudness before applying the correction — you cannot normalize in a single real-time pass without reading the entire file first. FilmStrip uses ffmpeg's loudnorm filter in two-pass mode.

Pass 1 — Analysis: ffmpeg processes the full audio file with the loudnorm filter in measurement mode, writing output to /dev/null. At the end, the filter reports the measured integrated loudness (I), true peak (TP), loudness range (LRA), and gating threshold. For a two-hour film this pass takes time proportional to the duration.

Pass 2 — Normalize: The second pass provides the measured values back to the loudnorm filter along with the target LUFS. With linear=true, the filter computes a single static gain value. That gain is applied uniformly across the entire file. Linear mode preserves dynamics exactly — the output has the same waveform shape as the input, only louder or quieter by a fixed amount.

Choosing a Target

The LUFS target determines how loud the output sits relative to other audio on your device. The right value depends on your listening context:

Target | Context | Notes
−23 LUFS | Broadcast standard (EBU R128) | Matches European broadcast TV. Feels quiet next to streaming music.
−18 LUFS | FilmStrip default | Louder than broadcast, quieter than streaming. Preserves more dynamic headroom than streaming-music targets while remaining comfortable alongside other content.
−16 LUFS | Apple Music / Tidal | Matches streaming music normalization on Apple platforms. Dialogue feels natural in a playlist context.
−14 LUFS | Spotify standard | Loudest major streaming target. Best if you listen to music on Spotify with normalization enabled.

If the file's true peak stays below the ceiling (−1 dBTP by default) after the gain is applied, the normalization is a simple gain change. If the source peaks would exceed the ceiling at the target loudness, the gain is reduced slightly to keep the output within it, so the integrated loudness may land slightly below target on very dynamic material. This is correct behavior: it prevents digital clipping at the cost of fractionally missing the loudness target.
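The trade-off between hitting the target and respecting the ceiling comes down to taking the smaller of two gains. This is a sketch of the behavior as described in this section, not of loudnorm's internal logic; the function name and the sample measurements are hypothetical.

```python
def static_gain_db(measured_i: float, measured_tp: float,
                   target_i: float = -18.0, ceiling_tp: float = -1.0) -> float:
    """Largest single gain that reaches the target without breaking the ceiling."""
    wanted = target_i - measured_i      # gain that would hit the loudness target
    allowed = ceiling_tp - measured_tp  # gain that would put the peak at the ceiling
    return min(wanted, allowed)


# Plenty of peak headroom: the full 6 dB is applied and the target is met.
print(static_gain_db(measured_i=-24.0, measured_tp=-10.0))
# Very dynamic material: only 5 dB fits, so the output lands at -22 LUFS.
print(static_gain_db(measured_i=-27.0, measured_tp=-6.0))
```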

Settings Reference

Output Format

Format | Specs | Notes
WAV | 24-bit PCM, 44.1 kHz | Uncompressed linear PCM. Every sample is stored exactly as computed, with no encoding loss. 24 bits provides 144 dB of theoretical dynamic range. File size: approximately 15 MB per minute. A two-hour film produces roughly 1.8 GB.
M4A | AAC, 128 / 192 / 256 kbps | Lossy compression using a psychoacoustic model. At 256 kbps, artifacts are essentially inaudible on normal program material. File size: approximately 1 MB/min (128 kbps) to 2 MB/min (256 kbps).
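File sizes follow directly from the data rates. The sketch below assumes stereo output, binary megabytes (1,048,576 bytes), and ignores container overhead; the function names are hypothetical.

```python
MIB = 1024 * 1024  # assumption: sizes are quoted in binary megabytes


def pcm_mb_per_minute(sample_rate: int = 44100, bits: int = 24, channels: int = 2) -> float:
    """Size of one minute of uncompressed PCM audio."""
    bytes_per_second = sample_rate * (bits // 8) * channels
    return bytes_per_second * 60 / MIB


def aac_mb_per_minute(kbps: int) -> float:
    """Size of one minute of audio at a constant bitrate in kilobits per second."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * 60 / MIB


print(f"WAV: {pcm_mb_per_minute():.1f} MB/min, "
      f"{pcm_mb_per_minute() * 120 / 1024:.2f} GB for two hours")
for rate in (128, 192, 256):
    print(f"AAC {rate} kbps: {aac_mb_per_minute(rate):.2f} MB/min")
```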

Dialog Guard

Setting | Parameter | Effect
Enabled | filter_complex + dynaudnorm on FC | For 5.1 and 7.1 source tracks, normalizes the center channel independently before the stereo downmix using a fast-reacting dynaudnorm (p=0.88:m=3:g=15). Silently skipped for stereo and mono sources. Runs in the same ffmpeg pass as extraction with no additional processing time.

Stereo Dialog Assist

| Setting | Parameter | Effect |
| --- | --- | --- |
| Enabled | mid/side split + dynaudnorm on mid | For stereo sources, decomposes the signal into mid (L+R) and side (L−R), runs dynaudnorm (p=0.88:m=3:g=15) on the mid only, and reassembles. Mono sources receive the same filter applied across the whole signal. The stereo image is preserved because the side signal is untouched. No effect on 5.1 / 7.1 sources; Dialog Guard handles those. |
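The mid/side round trip is simple enough to show on plain sample lists. This is a minimal sketch using the unscaled definitions given above (mid = L+R, side = L−R), with the factor of one half applied on decode; the function names are illustrative, and a plain gain stands in for the dynaudnorm stage.

```python
def ms_encode(left, right):
    """Split stereo into mid (L+R) and side (L-R)."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Rebuild left and right from mid and side."""
    left = [(m + s) / 2 for m, s in zip(mid, side)]
    right = [(m - s) / 2 for m, s in zip(mid, side)]
    return left, right


def process_mid_only(left, right, process):
    """Apply `process` to the mid signal and leave the side untouched."""
    mid, side = ms_encode(left, right)
    return ms_decode(process(mid), side)


left, right = [0.5, -0.25], [0.25, 0.75]
# Stand-in for the normalizer: double the mid signal.
new_left, new_right = process_mid_only(left, right, lambda m: [2 * x for x in m])
# The difference L-R (the side signal) is the same before and after.
side_before = [l - r for l, r in zip(left, right)]
side_after = [l - r for l, r in zip(new_left, new_right)]
```

Because only the sum of the two channels changes, whatever is common to both (typically dialog) is affected while the difference between them is not.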

High Pass Filter

| Setting | Parameter | Effect |
| --- | --- | --- |
| Enabled | highpass=f=80,highpass=f=80 | Two cascaded 2-pole Butterworth biquad filters at 80 Hz, producing a 24 dB/oct rolloff. Runs before the stereo downmix on multichannel sources (after Dialog Guard when active). Removes subwoofer rumble and LFE content before fold-down. On by default. |
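The 24 dB/oct figure can be checked from the textbook magnitude response of a 2-pole Butterworth high-pass, doubled in dB for the two cascaded sections. The sketch uses the analog prototype as an approximation; at 80 Hz the difference from the digital biquad should be negligible at typical sample rates, but that is an assumption rather than a measurement.

```python
import math


def butterworth2_highpass_db(freq_hz, cutoff_hz=80.0):
    """Magnitude (dB) of one 2-pole Butterworth high-pass section."""
    ratio4 = (freq_hz / cutoff_hz) ** 4
    return 10 * math.log10(ratio4 / (1 + ratio4))


def cascaded_highpass_db(freq_hz, cutoff_hz=80.0, sections=2):
    """Identical sections in series add their attenuation in dB."""
    return sections * butterworth2_highpass_db(freq_hz, cutoff_hz)


at_cutoff = cascaded_highpass_db(80.0)  # about -6 dB (two times -3 dB)
at_40 = cascaded_highpass_db(40.0)      # about -24.6 dB
at_20 = cascaded_highpass_db(20.0)      # about -48.2 dB
slope_per_octave = at_40 - at_20        # approaches 24 dB per octave
```

Note that two identical Butterworth sections in series are 6 dB down at the corner frequency, not 3 dB, which makes the filter slightly more aggressive around 80 Hz than a single 4-pole Butterworth design would be.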

Level Riding

| Setting | Parameter | Effect |
| --- | --- | --- |
| Enabled | dynaudnorm=p=0.90:m=1.5:g=31 | Adds dynamic normalization to the extraction pass with a fixed, headphone-tuned setting. Applies light peak compression with at most about +3.5 dB of upward gain (m=1.5), so it acts almost entirely as downward level control. On surround sources, runs on the stereo mix after downmix. Processing time is unchanged because it runs in the same pass as extraction. |
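The dynaudnorm parameters are linear values, so converting them to decibels makes the tuning easier to read. In ffmpeg's dynaudnorm, m is the maximum gain factor and p is the target peak as a fraction of full scale; the helper names below are illustrative.

```python
import math


def max_gain_db(m):
    """Convert dynaudnorm's maximum gain factor (m) to decibels."""
    return 20 * math.log10(m)


def peak_target_dbfs(p):
    """Convert dynaudnorm's target peak (p, 0.0 to 1.0) to dBFS."""
    return 20 * math.log10(p)


level_riding_gain = max_gain_db(1.5)      # about +3.5 dB
level_riding_peak = peak_target_dbfs(0.90)  # about -0.9 dBFS
dialog_gain = max_gain_db(3)              # about +9.5 dB (m=3 settings)
dialog_peak = peak_target_dbfs(0.88)      # about -1.1 dBFS
```

The comparison shows why the two stages behave differently: the m=3 dialog settings can lift quiet passages by roughly 9.5 dB, while Level Riding is capped at about 3.5 dB.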

Loudness Normalization

| Setting | Parameter | Effect |
| --- | --- | --- |
| Enabled | loudnorm two-pass | Adds two ffmpeg passes after extraction. Pass 1 analyzes the full file (takes additional time proportional to duration). Pass 2 applies a linear gain. Total processing time approximately doubles for a full-length film. |
| Target: −23 LUFS | I=-23 | Broadcast standard. Appropriate if the output will play alongside other film audio or in a calibrated home theater context. |
| Target: −18 LUFS | I=-18 | Default. Louder than broadcast, quieter than streaming music. Preserves more dynamic headroom than streaming-music targets while remaining comfortable alongside other content. |
| Target: −16 LUFS | I=-16 | Matches Apple Music and Tidal. Dialogue feels natural in a mixed playlist context on most devices. |
| Target: −14 LUFS | I=-14 | Matches Spotify's normalization target. Setting FilmStrip to −14 LUFS will make film audio play at the same perceived level as your Spotify music library. |

Default settings: Dialog Guard, Stereo Dialog Assist, High Pass, Level Riding, and Loudness Normalization at −18 LUFS are all enabled by default. The dynamics tuning is fixed and was chosen for headphone listening, so dialog stays intelligible, peaks are tamed, and the noise floor isn't lifted into the mix. Raise the loudness target to −16 or −14 LUFS if you want film audio to match streaming-music levels.
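The two-pass flow can be sketched as two filter strings: the first asks loudnorm to print its measurements as JSON, the second feeds those measurements back with linear=true. The option and JSON key names are ffmpeg's own; the function names are illustrative, and the LRA value of 11 is an assumption, since this page does not state which loudness range FilmStrip requests.

```python
import json


def loudnorm_pass1_filter(target_lufs=-18.0, ceiling_dbtp=-1.0, lra=11.0):
    """Analysis pass: measure the file and print the results as JSON."""
    return (f"loudnorm=I={target_lufs}:TP={ceiling_dbtp}:LRA={lra}"
            ":print_format=json")


def loudnorm_pass2_filter(pass1_json, target_lufs=-18.0, ceiling_dbtp=-1.0,
                          lra=11.0):
    """Apply pass: reuse the pass-1 measurements and request linear gain."""
    m = json.loads(pass1_json)  # the JSON block loudnorm printed in pass 1
    return (
        f"loudnorm=I={target_lufs}:TP={ceiling_dbtp}:LRA={lra}"
        f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
        f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
        f":offset={m['target_offset']}:linear=true"
    )
```

Each string would be passed to ffmpeg's -af option, with the first pass writing to a null output. According to ffmpeg's documentation, loudnorm reverts to its dynamic mode when a linear gain cannot meet the requested targets, for example when the source loudness range exceeds the target LRA.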

Choosing a Source Codec

Not all audio codecs are equal inputs for FilmStrip. When you have a choice of which track to use, the codec embedded in the source determines whether processing introduces an extra lossy transcode or not. The four you'll encounter most often on Blu-ray rips are AAC, E-AC3, AC3, and DTS.

AAC — Best Choice

AAC is the same codec FilmStrip produces for M4A output. If you're exporting to M4A, selecting an AAC source track means no transcode at all for that format — the file stays within the AAC family the entire time. WAV output decodes and re-encodes regardless of source, so the advantage is most significant for M4A output. When an AAC track is available, always prefer it.

E-AC3 — Second Choice

Dolby Digital Plus (E-AC3) is the dominant format on modern Blu-ray rips. It supports up to 1.5 Mbps and carries full 7.1 surround. Quality is high at typical encode bitrates. The decode → process → re-encode chain introduces a single generation of lossy conversion, which is audibly transparent at 192+ kbps M4A output. E-AC3 at 640 kbps is an excellent source.

AC3 — Acceptable

Dolby Digital (AC3) is the older 5.1 format with a ceiling of 640 kbps. It's perceptually lossier than E-AC3 at comparable bitrates, but in practice most AC3 tracks at 448–640 kbps sound fine after extraction. The extra generation loss is the same as E-AC3 — one decode and one re-encode. If AC3 is the only option, it will work well.

DTS — No Practical Advantage

DTS-HD Master Audio carries lossless audio on Blu-ray, but MKV rips typically demux to lossy DTS Core at 1.5 Mbps. The decode → process → re-encode chain is identical to E-AC3 — FilmStrip produces the same output quality regardless. There is no reason to seek out DTS specifically; E-AC3 or AAC are better or equal in every case.

Priority Order

When a file contains both AAC and E-AC3 tracks (common in remuxes that preserve all streams), select the AAC track in FilmStrip. You'll skip one decode/re-encode cycle for M4A output and get the same WAV quality for no extra cost.
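The preference order can be written as a small selection routine over the audio streams of a file. This is a sketch, not FilmStrip's track picker: the codec names match what ffprobe reports in its codec_name field, but ranking AC3 ahead of DTS simply follows the order of the sections above and is an assumption, as are the function and variable names.

```python
# Most preferred first, following the section order above.
CODEC_PRIORITY = ["aac", "eac3", "ac3", "dts"]


def pick_source_track(tracks):
    """Return the track with the most preferred codec, or None.

    `tracks` is a list of dicts with at least a "codec_name" key,
    shaped like the stream entries in ffprobe's JSON output.
    """
    ranked = [t for t in tracks if t["codec_name"] in CODEC_PRIORITY]
    if not ranked:
        return None
    # min() keeps the first track on ties, so file order breaks ties.
    return min(ranked, key=lambda t: CODEC_PRIORITY.index(t["codec_name"]))


remux = [
    {"index": 1, "codec_name": "dts"},
    {"index": 2, "codec_name": "eac3"},
    {"index": 3, "codec_name": "aac"},
]
best = pick_source_track(remux)  # the AAC track at index 3
```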