CMAF: What is the Common Media Application Format and what do we use it for?

The **Common Media Application Format (CMAF)** is an extensible standard for encoding, packaging and decoding segmented media objects. In other words, it is a standard streaming media format that uses a single set of files for all target platforms and devices. Its goal is to consolidate the media formats used by competing protocols and platforms into a single format. Doing so removes the need to provide multiple copies of the same content, which improves efficiency and can, in theory, cut encoding, packaging and storage demand by over 70%.

Figure: Streaming using current protocols.

CMAF uses a fragmented MP4 file format, which allows a single set of video and audio files to be used along with a lightweight manifest that maps video, audio and other metadata together for presentation and rights management. Online Video Platforms (OVPs) can therefore encode or transcode, package and store the video only once, saving on storage and processing. Because media is encoded and transferred in small fragments, CMAF also enables low-latency streaming and efficient caching.
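
To make the fragmented MP4 layout concrete, the following Python sketch walks the top-level ISO BMFF boxes in a byte buffer. Each box begins with a 32-bit big-endian size and a four-character type; a size of 1 means a 64-bit size follows the type, and a size of 0 means the box runs to the end of the file. The helper names iter_boxes and box are hypothetical, and the buffer is synthetic (empty boxes rather than real media), so it only illustrates the header-then-fragments layout.

```python
import struct
from typing import Iterator, Tuple


def iter_boxes(data: bytes) -> Iterator[Tuple[str, int, int]]:
    """Yield (box type, offset, size) for each top-level ISO BMFF box in data."""
    offset = 0
    while offset + 8 <= len(data):
        # Every box header starts with a 32-bit big-endian size and a 4-character type.
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header_len = 8
        if size == 1:  # a 64-bit "largesize" follows the type field
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header_len = 16
        elif size == 0:  # the box extends to the end of the data
            size = len(data) - offset
        if size < header_len or offset + size > len(data):
            raise ValueError(f"malformed box at offset {offset}")
        yield box_type.decode("latin-1"), offset, size
        offset += size


def box(box_type: str, payload: bytes = b"") -> bytes:
    """Build a minimal box with a 32-bit size (synthetic example data only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


# Synthetic file: a header (ftyp + moov) followed by two moof/mdat pairs.
data = (box("ftyp") + box("moov")
        + box("moof") + box("mdat", b"\x00" * 4)
        + box("moof") + box("mdat"))
print([t for t, _, _ in iter_boxes(data)])
# ['ftyp', 'moov', 'moof', 'mdat', 'moof', 'mdat']
```

In a real CMAF Track, ftyp and moov form the CMAF Header, and each moof/mdat pair carries a run of media samples.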

Figure: Streaming using CMAF.

CMAF also supports encryption for Digital Rights Management (DRM) of streamed content. It allows two schemes, both defined by MPEG Common Encryption (CENC): AES-CTR (Counter mode, scheme 'cenc', used by Widevine and PlayReady) and AES-CBC (Cipher Block Chaining, scheme 'cbcs', used by Apple FairPlay).
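
The practical difference between the two schemes can be sketched in a few lines of Python. In 'cenc' (AES-CTR) with an 8-byte IV, each 16-byte counter block is the IV followed by a 64-bit block counter; in 'cbcs' (AES-CBC), video is commonly encrypted with a 1:9 pattern, so only one 16-byte block in every ten is encrypted and a trailing partial block stays in the clear. This sketch performs no AES itself; it only computes counter blocks and encrypted byte ranges within one protected range. The function names are hypothetical, and ISO/IEC 23001-7 remains the normative reference.

```python
from typing import List, Tuple

AES_BLOCK = 16  # AES block size in bytes


def ctr_counter_block(iv: bytes, block_index: int) -> bytes:
    """'cenc' (AES-CTR): an 8-byte IV followed by a 64-bit big-endian block counter."""
    if len(iv) != 8:
        raise ValueError("this sketch only handles 8-byte IVs")
    return iv + (block_index % 2**64).to_bytes(8, "big")


def cbcs_encrypted_ranges(protected_len: int, crypt: int = 1, skip: int = 9) -> List[Tuple[int, int]]:
    """'cbcs' (AES-CBC pattern): (start, end) offsets of the encrypted bytes.

    Encrypts `crypt` full 16-byte blocks, leaves `skip` blocks in the clear,
    and repeats; a trailing partial block is never encrypted.
    """
    if crypt < 1 or skip < 0:
        raise ValueError("this sketch only models patterns with crypt >= 1")
    full_blocks = protected_len // AES_BLOCK
    ranges = []
    block = 0
    while block < full_blocks:
        n = min(crypt, full_blocks - block)
        ranges.append((block * AES_BLOCK, (block + n) * AES_BLOCK))
        block += crypt + skip
    return ranges


# A 200-byte protected range holds 12 full blocks plus 8 clear trailing bytes.
print(cbcs_encrypted_ranges(200))  # [(0, 16), (160, 176)]
```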

The standard allows a wide range of implementations, including HTTP Live Streaming (HLS) and MPEG Dynamic Adaptive Streaming over HTTP (MPEG-DASH). The CMAF specification defines the following media objects:

  • CMAF Track contains encoded media samples of a single media type, such as audio, video or subtitles. It consists of a CMAF Header and one or more CMAF Fragments. Media samples are stored in a CMAF-specific container derived from the ISO Base Media File Format (ISO BMFF) and can optionally be protected by MPEG Common Encryption.
  • CMAF Switching Set is a set of alternative tracks that can be switched and spliced at CMAF Fragment boundaries to adaptively stream the same content at different bit rates and resolutions.
  • Aligned CMAF Switching Set is a special case of two or more CMAF Switching Sets that are encoded from the same source with alternative encodings (for example, different codecs) and are time-aligned to each other.
  • CMAF Selection Set is a group of CMAF Switching Sets of the same media type that may include alternative content or alternative encodings.
  • CMAF Presentation is one or more time-synchronized CMAF Selection Sets. A presentation is the first level at which different media types can be combined.
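
One way to picture how these objects nest is as a small data model. The Python sketch below is an illustration rather than a normative schema: the class and field names are hypothetical, the bit rates are made up, and the only rule it enforces is that every Switching Set and Selection Set holds a single media type. Aligned Switching Sets and time synchronization are not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Track:
    media_type: str  # "video", "audio" or "subtitle"
    codec: str       # codec string, e.g. "avc1.64001f"
    bitrate: int     # bits per second


@dataclass
class SwitchingSet:
    tracks: List[Track]  # alternative encodings of the same content

    def __post_init__(self) -> None:
        if len({t.media_type for t in self.tracks}) != 1:
            raise ValueError("a Switching Set needs tracks of exactly one media type")

    @property
    def media_type(self) -> str:
        return self.tracks[0].media_type


@dataclass
class SelectionSet:
    switching_sets: List[SwitchingSet]

    def __post_init__(self) -> None:
        if len({s.media_type for s in self.switching_sets}) != 1:
            raise ValueError("a Selection Set needs Switching Sets of exactly one media type")

    @property
    def media_type(self) -> str:
        return self.switching_sets[0].media_type


@dataclass
class Presentation:
    selection_sets: List[SelectionSet] = field(default_factory=list)

    def media_types(self) -> Set[str]:
        # A Presentation is where different media types are combined.
        return {s.media_type for s in self.selection_sets}


avc = SwitchingSet([Track("video", "avc1.64001f", 1_000_000), Track("video", "avc1.640028", 8_000_000)])
hevc = SwitchingSet([Track("video", "hvc1.1.6.L120.B0", 600_000), Track("video", "hvc1.1.6.L120.B0", 4_000_000)])
audio = SwitchingSet([Track("audio", "mp4a.40.2", 64_000)])
presentation = Presentation([SelectionSet([avc, hevc]), SelectionSet([audio])])
print(sorted(presentation.media_types()))  # ['audio', 'video']
```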

The CMAF Hypothetical Reference Model defines how tracks can be delivered, combined and synchronized in CMAF Presentations. Because it is only a hypothetical reference model, any compatible implementation can be used. Different implementations can share the same resources, the CMAF Addressable Media Objects, which allows efficient caching even when delivering to multiple platforms. CMAF Addressable Media Objects consist of:

  • CMAF Header contains information for initializing a track.
  • CMAF Segment is a sequence of one or more consecutive fragments from the same track.
  • CMAF Chunk contains a sequential subset of samples from a fragment.
  • CMAF Track File is a complete track in one CMAF-specific container derived from ISO BMFF.
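
Because every addressable object is a run of whole ISO BMFF boxes, a packager or cache can locate their boundaries from the top-level box types. The Python sketch below splits a list of box types into the CMAF Header (ftyp and moov) and CMAF Chunks. It rests on simplifying assumptions: a chunk is modelled as a moof/mdat pair, optionally preceded by boxes such as styp, and the sketch cannot tell where one Fragment ends and the next begins, since that requires inspecting sample information inside moof. The function name is hypothetical.

```python
from typing import List, Tuple

HEADER_BOXES = {"ftyp", "moov"}


def split_header_and_chunks(box_types: List[str]) -> Tuple[List[str], List[List[str]]]:
    """Split top-level box types into (header, chunks); each chunk ends with an mdat."""
    header: List[str] = []
    i = 0
    while i < len(box_types) and box_types[i] in HEADER_BOXES:
        header.append(box_types[i])
        i += 1
    chunks: List[List[str]] = []
    current: List[str] = []
    for box_type in box_types[i:]:
        current.append(box_type)
        if box_type == "mdat":  # an mdat closes the current moof/mdat pair
            if "moof" not in current:
                raise ValueError("mdat without a preceding moof")
            chunks.append(current)
            current = []
    if current:
        raise ValueError(f"trailing boxes without an mdat: {current}")
    return header, chunks


header, chunks = split_header_and_chunks(
    ["ftyp", "moov", "styp", "moof", "mdat", "moof", "mdat", "moof", "mdat"]
)
print(header)       # ['ftyp', 'moov']
print(len(chunks))  # 3
```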

Figure: CMAF Addressable Media Objects.

Manifests, Resources and CMAF Presentations

In HLS, the manifest consists of a Multivariant Playlist and Media Playlists, which describe a single CMAF Presentation or a sequence of CMAF Presentations. The Multivariant Playlist defines the tiers of the presentation using EXT-X-STREAM-INF tags. Tiers differ in bit rate, required codecs, resolution and other attributes, and each specifies its own HLS Media Playlist. Each tier may also have additional HLS Renditions, which are Media Playlists described by EXT-X-MEDIA tags. An HLS Rendition can carry video, audio or subtitles. EXT-X-MEDIA tags are grouped by their GROUP-ID attribute, and an EXT-X-STREAM-INF tag refers to a group (for example, through its AUDIO or SUBTITLES attribute) to associate video, audio and subtitles for playback. This lets several EXT-X-STREAM-INF tags share the same HLS Rendition.
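
The association between tiers and renditions is a string match on GROUP-ID. The Python sketch below parses HLS attribute lists (quoted values may contain commas, as in CODECS) and resolves which audio Renditions each EXT-X-STREAM-INF tier uses through its AUDIO attribute. It handles only the subset of the syntax used in this article and assumes each variant URI sits on the line right after its EXT-X-STREAM-INF tag; the helper names and the short playlist are hypothetical.

```python
import re
from typing import Dict, List

# NAME=value pairs; quoted values may contain commas (as in CODECS).
ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(attr_list: str) -> Dict[str, str]:
    """Parse an HLS attribute list, removing the quotes around quoted strings."""
    return {name: value.strip('"') for name, value in ATTR_RE.findall(attr_list)}


def renditions_per_variant(playlist: str) -> Dict[str, List[str]]:
    """Map each variant URI to the NAMEs of the audio Renditions in its AUDIO group."""
    lines = [line.strip() for line in playlist.splitlines() if line.strip()]
    groups: Dict[str, List[str]] = {}
    for line in lines:  # first pass: collect EXT-X-MEDIA audio groups
        if line.startswith("#EXT-X-MEDIA:"):
            attrs = parse_attributes(line.split(":", 1)[1])
            if attrs.get("TYPE") == "AUDIO":
                groups.setdefault(attrs["GROUP-ID"], []).append(attrs["NAME"])
    variants: Dict[str, List[str]] = {}
    for i, line in enumerate(lines):  # second pass: resolve each tier's AUDIO group
        if line.startswith("#EXT-X-STREAM-INF:") and i + 1 < len(lines):
            attrs = parse_attributes(line.split(":", 1)[1])
            variants[lines[i + 1]] = list(groups.get(attrs.get("AUDIO", ""), []))
    return variants


example = """#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-lo",NAME="English",LANGUAGE="en",URI="en-lo.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aud-lo",NAME="Slovene",LANGUAGE="sl",URI="sl-lo.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=1123000,CODECS="avc1.64001f,mp4a.40.2",AUDIO="aud-lo"
video.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=623000,CODECS="hvc1.1.6.L120.B0,mp4a.40.2",AUDIO="aud-lo"
hevc-video.m3u8
"""
print(renditions_per_variant(example))
# {'video.m3u8': ['English', 'Slovene'], 'hevc-video.m3u8': ['English', 'Slovene']}
```

Both tiers resolve to the same two Renditions, which is exactly the sharing that GROUP-ID makes possible.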

CMAF Tracks, Segments, Headers and Fragments

Each CMAF Track has its own HLS Media Playlist, whose Media Segments are CMAF Segments. An EXT-X-MAP tag references the CMAF Header that applies to the CMAF Segments following it, and each segment URI references a CMAF Segment made of one or more CMAF Fragments. The EXT-X-INDEPENDENT-SEGMENTS tag should be included in the Media Playlist, since every CMAF Fragment is independently decodable. If the content is encrypted, the EXT-X-SESSION-KEY tag should be included in the Multivariant Playlist so that keys can be prefetched. The EXT-X-BYTERANGE tag indicates that a CMAF Segment is a byte range within a larger resource. The EXT-X-I-FRAMES-ONLY tag indicates that CMAF Segments start on a CMAF Fragment boundary. The EXT-X-DISCONTINUITY tag is used to concatenate multiple CMAF Tracks of the same media type in one Media Playlist.
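
These tag rules translate directly into a playlist writer. The Python sketch below emits a VOD Media Playlist for one CMAF Track: an EXT-X-MAP tag pointing at the CMAF Header, an EXTINF and URI per CMAF Segment, and EXT-X-BYTERANGE lines when segments are byte ranges of a single track file. It sets EXT-X-TARGETDURATION to the longest segment duration rounded to the nearest integer (halves rounded up), the smallest value RFC 8216 permits. The Segment type, the function name and the file names are hypothetical, and encryption keys and discontinuities are left out.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    uri: str
    duration: float                               # seconds
    byte_range: Optional[Tuple[int, int]] = None  # (length, offset) within uri


def media_playlist(header_uri: str, segments: List[Segment],
                   header_range: Optional[Tuple[int, int]] = None) -> str:
    """Write a VOD HLS Media Playlist for a single CMAF Track."""
    if not segments:
        raise ValueError("a Media Playlist needs at least one segment")
    target = max(math.floor(s.duration + 0.5) for s in segments)
    lines = ["#EXTM3U", f"#EXT-X-TARGETDURATION:{target}", "#EXT-X-VERSION:6",
             "#EXT-X-PLAYLIST-TYPE:VOD", "#EXT-X-INDEPENDENT-SEGMENTS"]
    map_tag = f'#EXT-X-MAP:URI="{header_uri}"'  # the CMAF Header
    if header_range:
        map_tag += f',BYTERANGE="{header_range[0]}@{header_range[1]}"'
    lines.append(map_tag)
    for seg in segments:
        lines.append(f"#EXTINF:{seg.duration:.3f},")
        if seg.byte_range:  # the CMAF Segment is a byte range of a larger resource
            lines.append(f"#EXT-X-BYTERANGE:{seg.byte_range[0]}@{seg.byte_range[1]}")
        lines.append(seg.uri)
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


print(media_playlist("VIDEO-HEADER", [Segment(f"VF{i}", 2.0) for i in (1, 2, 3)]))
```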

CMAF Switching Sets

Each Track in a video CMAF Switching Set should appear in the Multivariant Playlist as a Media Playlist URI. The URI is preceded by an EXT-X-STREAM-INF tag that describes the Track and, through attributes such as AUDIO, names the group of EXT-X-MEDIA renditions intended to play with the video.

Each Track in an audio CMAF Switching Set should be represented in the Multivariant Playlist by an EXT-X-MEDIA tag whose URI attribute references that Track's Media Playlist.
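
The two mapping rules above can be sketched as a small Python writer: each audio Track becomes an EXT-X-MEDIA tag in one group, and each video Track becomes an EXT-X-STREAM-INF tag, followed by its Media Playlist URI, that names that group in its AUDIO attribute. The TrackRef type, the function name, the group ID and the bit rates are hypothetical. As a simplifying assumption, BANDWIDTH is computed as the video bit rate plus the highest audio bit rate in the group, and attributes such as RESOLUTION are omitted.

```python
from typing import List, NamedTuple


class TrackRef(NamedTuple):
    uri: str            # the Track's Media Playlist
    codecs: str         # codec string, e.g. "avc1.64001f"
    bandwidth: int      # bits per second
    name: str = ""      # Rendition name (audio only)
    language: str = ""  # language tag (audio only)


def switching_set_tags(video: List[TrackRef], audio: List[TrackRef], group_id: str) -> List[str]:
    """Render a video and an audio Switching Set as Multivariant Playlist lines."""
    lines: List[str] = []
    for i, track in enumerate(audio):  # audio Tracks -> EXT-X-MEDIA tags
        default = "YES" if i == 0 else "NO"
        lines.append(
            f'#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="{group_id}",NAME="{track.name}",'
            f'LANGUAGE="{track.language}",DEFAULT={default},AUTOSELECT=YES,URI="{track.uri}"'
        )
    audio_codec = audio[0].codecs if audio else ""
    audio_peak = max((a.bandwidth for a in audio), default=0)
    for track in video:  # video Tracks -> EXT-X-STREAM-INF tag + URI
        codecs = ",".join(c for c in (track.codecs, audio_codec) if c)
        audio_attr = f',AUDIO="{group_id}"' if audio else ""
        lines.append(f'#EXT-X-STREAM-INF:BANDWIDTH={track.bandwidth + audio_peak},'
                     f'CODECS="{codecs}"{audio_attr}')
        lines.append(track.uri)
    return lines


video = [TrackRef("video.m3u8", "avc1.64001f", 1_000_000),
         TrackRef("video-hq.m3u8", "avc1.640028", 8_000_000)]
audio = [TrackRef("english.m3u8", "mp4a.40.2", 64_000, "English", "en")]
print("\n".join(switching_set_tags(video, audio, "audio-stereo-64")))
```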

CMAF Selection Sets

CMAF Selection Sets can offer either alternative encodings of the same source content or homogeneous encodings of different versions of the source content. In the first case, each Switching Set in the Selection Set appears as a set of EXT-X-STREAM-INF tags (for video) or a set of EXT-X-MEDIA tags (for other media types). In the second case, each Track of a member Switching Set should appear as an EXT-X-MEDIA tag.

HLS Example

An example HLS Media Playlist (video.m3u8) for a video CMAF Track built from three 2-second CMAF Fragments (VF1, VF2, VF3):

#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-VERSION:6
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="VIDEO-HEADER"
#EXTINF:2.0,
VF1
#EXTINF:2.0,
VF2
#EXTINF:2.0,
VF3
#EXT-X-ENDLIST

The video Media Playlist is accompanied by a Media Playlist (audio.m3u8) for an audio CMAF Track, also built from three 2-second CMAF Fragments (AF1, AF2, AF3):

#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-VERSION:6
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="AUDIO-HEADER"
#EXTINF:2.0,
AF1
#EXTINF:2.0,
AF2
#EXTINF:2.0,
AF3
#EXT-X-ENDLIST
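
Playlists like these can be sanity-checked by comparing each EXTINF duration with EXT-X-TARGETDURATION; RFC 8216 requires every EXTINF duration, rounded to the nearest integer, to be no greater than the target duration. The Python sketch below (a hypothetical helper that understands only the tags used here) checks the video playlist and adds up its duration. A target duration of 2 would also be valid for these 2-second fragments.

```python
import math
from typing import List, Tuple


def durations(playlist: str) -> Tuple[int, List[float]]:
    """Return (target duration, EXTINF durations) from a Media Playlist."""
    target = None
    extinf: List[float] = []
    for line in playlist.splitlines():
        if line.startswith("#EXT-X-TARGETDURATION:"):
            target = int(line.split(":", 1)[1])
        elif line.startswith("#EXTINF:"):
            # "#EXTINF:<duration>,[<title>]"
            extinf.append(float(line.split(":", 1)[1].split(",", 1)[0]))
    if target is None:
        raise ValueError("missing #EXT-X-TARGETDURATION")
    return target, extinf


video_m3u8 = """#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-VERSION:6
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="VIDEO-HEADER"
#EXTINF:2.0,
VF1
#EXTINF:2.0,
VF2
#EXTINF:2.0,
VF3
#EXT-X-ENDLIST
"""

target, segs = durations(video_m3u8)
# Each EXTINF, rounded to the nearest integer, must not exceed the target duration.
assert all(math.floor(d + 0.5) <= target for d in segs)
print(sum(segs))  # 6.0
```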

Media Playlists are grouped into a Multivariant Playlist, from which the player selects an appropriate configuration. The example below shows that CMAF Selection Sets can appear either as separate Renditions (english.m3u8 and slovene.m3u8) or as separate sets of tiers that use different codecs: video.m3u8 and video-hq.m3u8 together form the AVC Switching Set, while hevc-video.m3u8 and hevc-video-hq.m3u8 together form the HEVC Switching Set. Together, the two Switching Sets form a Selection Set that lets the player choose the codec.

#EXTM3U
#EXT-X-VERSION:6
#EXT-X-INDEPENDENT-SEGMENTS

#EXT-X-MEDIA:NAME="English",TYPE=AUDIO,GROUP-ID="audio-stereo-64",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,URI="english.m3u8"
#EXT-X-MEDIA:NAME="Slovene",TYPE=AUDIO,GROUP-ID="audio-stereo-64",LANGUAGE="sl",DEFAULT=NO,AUTOSELECT=YES,URI="slovene.m3u8"

#EXT-X-MEDIA:NAME="English",TYPE=AUDIO,GROUP-ID="audio-stereo-128",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,URI="english-hi.m3u8"
#EXT-X-MEDIA:NAME="Slovene",TYPE=AUDIO,GROUP-ID="audio-stereo-128",LANGUAGE="sl",DEFAULT=NO,AUTOSELECT=YES,URI="slovene-hi.m3u8"

#EXT-X-STREAM-INF:BANDWIDTH=1123000,CODECS="avc1.64001f,mp4a.40.2",AUDIO="audio-stereo-64",RESOLUTION=620x334
video.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=8187000,CODECS="avc1.640028,mp4a.40.2",AUDIO="audio-stereo-128",RESOLUTION=1916x1032
video-hq.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=623000,CODECS="hvc1.1.6.L120.B0,mp4a.40.2",AUDIO="audio-stereo-64",RESOLUTION=620x334
hevc-video.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=4187000,CODECS="hvc1.1.6.L120.B0,mp4a.40.2",AUDIO="audio-stereo-128",RESOLUTION=1916x1032
hevc-video-hq.m3u8
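
On the client side, this Multivariant Playlist lets a player choose among tiers from both Switching Sets of the codec Selection Set. The Python sketch below shows one simple initial-selection policy, assumed here for illustration rather than taken from any particular player: drop tiers whose CODECS the device cannot decode, then take the highest BANDWIDTH that fits the measured throughput, falling back to the lowest decodable tier. The tier values are copied from the playlist above; the Tier type, the function name, the supported-codec sets and the throughput figure are hypothetical.

```python
from typing import List, NamedTuple, Optional, Set


class Tier(NamedTuple):
    uri: str
    bandwidth: int  # BANDWIDTH attribute, bits per second
    codecs: str     # CODECS attribute


TIERS = [
    Tier("video.m3u8", 1_123_000, "avc1.64001f,mp4a.40.2"),
    Tier("video-hq.m3u8", 8_187_000, "avc1.640028,mp4a.40.2"),
    Tier("hevc-video.m3u8", 623_000, "hvc1.1.6.L120.B0,mp4a.40.2"),
    Tier("hevc-video-hq.m3u8", 4_187_000, "hvc1.1.6.L120.B0,mp4a.40.2"),
]


def pick_tier(tiers: List[Tier], supported: Set[str], throughput_bps: int) -> Optional[Tier]:
    """Choose the highest-bandwidth decodable tier that fits the available throughput."""
    def decodable(tier: Tier) -> bool:
        # Compare only the four-character codec prefix, e.g. "hvc1" or "mp4a".
        return all(c.split(".", 1)[0] in supported for c in tier.codecs.split(","))

    candidates = sorted((t for t in tiers if decodable(t)), key=lambda t: t.bandwidth)
    if not candidates:
        return None
    fitting = [t for t in candidates if t.bandwidth <= throughput_bps]
    return fitting[-1] if fitting else candidates[0]


# A device without HEVC support and about 5 Mbit/s of throughput.
print(pick_tier(TIERS, {"avc1", "mp4a"}, 5_000_000).uri)  # video.m3u8
# The same throughput on an HEVC-capable device.
print(pick_tier(TIERS, {"avc1", "hvc1", "mp4a"}, 5_000_000).uri)  # hevc-video-hq.m3u8
```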

Glossary

CDN

A CDN, or "Content Delivery Network," is a network of servers (typically distributed around the world) used to deliver content such as videos, photos and CSS files.