April 6, 2026 · 9 min read

Supported File Formats: Which Audio and Video Files You Can Transcribe

One of the most common questions before the first upload: “Is my file format supported?” The short answer: almost certainly, yes. scryp accepts over 50 audio formats and over 50 video formats – from common standards like MP3 and MP4 all the way to professional formats such as FLAC, AC3 or MKV. This article lists the supported formats, explains how automatic conversion works, and covers what to bear in mind regarding recording quality.

Why so many formats? scryp’s conversion engine

scryp has its own conversion engine that can process practically any common audio and video format. Before transcription, every uploaded file is automatically converted into an optimized format – regardless of the source format.

The process in detail: you upload a file in any supported format. Our engine automatically extracts the audio track (for videos, the video track is discarded) and converts it into a standardized WAV format with a 16 kHz sample rate and a mono channel. This format is optimal for AI speech recognition. So you never have to think about codecs, sample rates or channel counts – it all happens fully automatically.
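To make the target format concrete, here is a minimal Python sketch of the mono downmix and of writing a 16 kHz WAV file, using only the standard library. It is not scryp's engine: the function names are made up for illustration, 16-bit samples are assumed, and resampling to 16 kHz is left out, so the input is assumed to be at that rate already.

```python
import array
import io
import wave

TARGET_RATE_HZ = 16_000  # sample rate named in the article
SAMPLE_WIDTH_BYTES = 2   # 16-bit PCM (assumption for this sketch)


def downmix_to_mono(interleaved: array.array, channels: int) -> array.array:
    """Average the channels of each frame into one mono sample."""
    mono = array.array("h")
    for start in range(0, len(interleaved) - channels + 1, channels):
        frame = interleaved[start:start + channels]
        mono.append(sum(frame) // channels)
    return mono


def write_mono_wav(samples: array.array) -> bytes:
    """Serialize mono 16-bit samples as a 16 kHz WAV file in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(TARGET_RATE_HZ)
        wav.writeframes(samples.tobytes())
    return buffer.getvalue()
```

Calling downmix_to_mono with interleaved stereo samples (left, right, left, right, ...) and channels=2 returns one averaged sample per frame, which write_mono_wav then wraps in a WAV header.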

Supported audio formats (50+)

The following list shows the most common audio formats that scryp accepts directly. If your format is here, you can upload the file without any preparation:

MP3 (.mp3) – The most widespread audio format. Compressed, a good balance between file size and quality. Ideal for recordings from smartphones and dictation devices.
WAV (.wav) – Uncompressed format with full audio quality. The standard in professional audio production. Larger files, but the best transcription accuracy.
M4A / AAC (.m4a, .aac) – The default audio format on Apple devices. Used by iPhones, iPads and macOS. AAC usually delivers better quality than MP3 at the same file size.
OGG / Vorbis (.ogg, .oga) – Open-source format with good compression. Common on Linux systems and in web applications.
FLAC (.flac) – Losslessly compressed format. Full audio quality at roughly 50–60% of the WAV file size. Popular with audiophiles and in music production.
Opus (.opus) – Modern codec with excellent quality at low bitrates. The standard for VoIP and WebRTC – often used by browsers for audio recordings.
AIFF (.aiff, .aif) – Apple’s uncompressed audio format. The equivalent of WAV in the macOS world. A standard in music production on Apple devices.
WMA (.wma) – Windows Media Audio. Microsoft’s proprietary audio format. Used by older Windows dictation devices and recording programs.
AMR (.amr) – Adaptive Multi-Rate. A compact speech format used by many mobile phones for voice recordings. Low bitrate, but optimized for speech.
AC3 (.ac3) – Dolby Digital. A surround-sound format often used on DVDs, Blu-rays and in TV recordings. scryp extracts and mixes the channels down to mono automatically.
DTS (.dts) – Digital Theater Systems. A high-quality surround format from cinema and home cinema. It is automatically converted into a format optimized for speech recognition.
WebM Audio (.webm) – A container format for web audio. The standard for browser recordings, such as via scryp’s built-in recording feature.

A further 38 supported audio formats:

Container & web: WebM Audio (.webm), CAF (.caf) – Core Audio Format, MKA (.mka) – Matroska Audio, MP2 (.mp2), SPX (.spx) – Speex, 3GP (.3gp).

Lossless & audiophile: APE (.ape) – Monkey's Audio, WavPack (.wv), TTA (.tta) – True Audio, TAK (.tak), Shorten (.shn), DSF (.dsf) – DSD Stream File, Musepack (.mpc).

Surround & cinema: EAC3 (.eac3) – Dolby Digital Plus, DTS-HD (.dtshd), TrueHD (.thd) – Dolby TrueHD, MLP (.mlp).

Telephony & VoIP: GSM (.gsm), iLBC (.lbc), QCP (.qcp), SBC (.sbc) – Bluetooth Audio, G.722 (.g722), G.723 (.g723), G.726 (.g726), G.729 (.g729).

Dictation devices: DSS (.dss) – Digital Speech Standard (Olympus, Philips), ACT (.act) – ACT Voice.

Archive & legacy: AU (.au) – Sun/Unix Audio, W64 (.w64) – Sony Wave64, VOC (.voc) – Creative Voice, OMA (.oma) – Sony OpenMG, PVF (.pvf) – Portable Voice Format, SOX (.sox) – Sound eXchange, VQF (.vqf) – TwinVQ, MMF (.mmf) – Yamaha SMAF, IRCAM (.sf) – Berkeley/IRCAM, AVR (.avr) – Audio Visual Research, SLN (.sln) – Asterisk PCM.

Supported video formats (50+)

scryp also transcribes video files directly – the audio track is extracted automatically. You do not have to separate the audio manually beforehand:

MP4 (.mp4) – The universal video standard. Produced by practically all cameras, smartphones and video conferencing tools (Zoom, Teams, Google Meet).
MOV (.mov) – Apple’s QuickTime format. The standard for iPhone videos and macOS screen recordings.
MKV (.mkv) – The Matroska container. A flexible open-source format that supports multiple audio and subtitle tracks. Common for screencasts and video archives.
AVI (.avi) – The classic Windows video format. Used by older cameras and Windows applications. Large files, but universally compatible.
WebM (.webm) – Google’s open web video format. The standard for browser-based video recordings and YouTube downloads.
MPEG / MPG (.mpeg, .mpg) – A classic video format. The standard for DVDs and older video archives.
WMV (.wmv) – Windows Media Video. Microsoft’s video format, occasionally found in training videos and older conference recordings.
FLV (.flv) – Flash Video. Originating from the Flash era, it can still be found in older video archives.
M4V (.m4v) – Apple’s video variant of MP4. Used by iTunes and Apple TV.
TS / MTS (.ts, .mts) – MPEG Transport Stream. The standard for camcorders (AVCHD) and TV recordings.
3GP / 3G2 (.3gp, .3g2) – Mobile video formats. Produced by older smartphones and tablets for video recordings.
VOB (.vob) – DVD Video Object. The file format on DVD discs. Relevant for digitizing DVD archives.

A further 44 supported video formats:

Professional & broadcast: MXF (.mxf) – Material eXchange Format, GXF (.gxf) – General eXchange Format, DV (.dv) – Digital Video, R3D (.r3d) – RED Raw, LXF (.lxf) – VR Native Stream, Y4M (.y4m) – YUV4MPEG, MLV (.mlv) – Magic Lantern Video, MJ2 (.mj2) – Motion JPEG 2000, IVF (.ivf).

Web & streaming: OGV (.ogv) – Ogg Video, ASF (.asf) – Advanced Streaming Format, F4V (.f4v) – Flash MP4, SWF (.swf) – ShockWave Flash, NSV (.nsv) – Nullsoft Streaming Video, ISM/ISMV (.ism, .ismv) – Smooth Streaming.

TV recordings & surveillance: WTV (.wtv) – Windows Television, TY (.ty) – TiVo, DAV (.dav) – CCTV DVR, EVO (.evo) – HD-DVD.

Archive & legacy: RM/RMVB (.rm, .rmvb) – RealMedia, NUT (.nut), NUV (.nuv) – NuppelVideo, DivX (.divx), FLC/FLI (.flc, .fli) – Autodesk Animator, PSP (.psp), CDXL (.cdxl) – Amiga CDXL.

Gaming & multimedia: BIK (.bik) – Bink Video, SMK (.smk) – Smacker, ROQ (.roq) – id Software, THP (.thp) – Nintendo, VMD (.vmd) – Sierra, VIV (.viv) – Vividas, XMV (.xmv) – Microsoft XMV, PMP (.pmp) – PSP Media, CPK (.cpk) – Sega FILM, RL2 (.rl2), RPL (.rpl), MTV (.mtv), PDV (.pdv) – PlayDate, IV8 (.iv8) – IndigoVision, BMV (.bmv) – Discworld, TMV (.tmv), YOP (.yop) – Psygnosis, WC3 (.wc3) – Wing Commander.

How automatic conversion works

The entire conversion process runs fully automatically on the server. Regardless of the source format, every file passes through the same three core stages: audio track extraction, conversion to 16 kHz mono WAV for AI recognition, and transcription with our SX-3 speech recognition model.

The technical process:

1. Upload: Your file is encrypted in the browser and uploaded in its original format.
2. Extraction: Our conversion engine extracts the audio track. For pure audio files, this step is skipped. For videos, only the sound is used – the visual material is not stored.
3. Normalization: The audio is converted into a standardized WAV format: 16 kHz sample rate, 16-bit PCM, mono. These parameters are optimal for our SX-3 speech recognition model.
4. Transcription: The normalized audio is processed by SX-3. In parallel, speaker diarization takes place, distinguishing the different voices.
5. Playback version: Additionally, a compressed MP3 version is created for playback in the browser, so you can listen along directly while proofreading.
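Steps 2, 3 and 5 can be approximated with the open-source ffmpeg command-line tool. The article does not say which software scryp's engine uses, so the following is only a sketch of an equivalent conversion. The function names, file names and the 128k playback bitrate are assumptions, and the MP3 step requires an ffmpeg build that includes the libmp3lame encoder.

```python
def normalize_command(source: str, target_wav: str) -> list[str]:
    """Build an ffmpeg call that yields 16 kHz, 16-bit PCM, mono WAV."""
    return [
        "ffmpeg",
        "-i", source,         # input file, audio or video
        "-vn",                # ignore any video stream
        "-ac", "1",           # downmix to one channel
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        target_wav,
    ]


def playback_command(source: str, target_mp3: str, bitrate: str = "128k") -> list[str]:
    """Build an ffmpeg call for a compressed MP3 playback copy."""
    return [
        "ffmpeg",
        "-i", source,
        "-vn",                  # ignore any video stream
        "-c:a", "libmp3lame",   # MP3 encoder
        "-b:a", bitrate,        # audio bitrate (assumed value)
        target_mp3,
    ]
```

Each function only returns the argument list. Passing it to subprocess.run(cmd, check=True) would run the conversion on a machine where ffmpeg is installed.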

Tips for optimal transcription quality

scryp accepts almost any format – but the quality of the result depends heavily on the quality of the recording. A few recommendations:

Prefer lossless formats: WAV (uncompressed) and FLAC (losslessly compressed) deliver the best results, because no lossy-compression artifacts interfere with speech recognition. If storage space is not an issue, record in WAV.
High bitrate for compressed formats: With MP3, the bitrate should be at least 128 kbps, better still 192 or 256 kbps. MP3 files at 64 kbps or below can noticeably degrade recognition accuracy.
Upload videos directly: You do not have to extract the audio track manually. Upload the video file directly – scryp handles the extraction automatically. This saves a step and avoids quality loss from double conversion.
Surround formats work: Multi-channel formats like AC3 (Dolby Digital) and DTS are automatically downmixed to mono. You do not have to convert the sound manually.
Mind the recording environment: Whatever the format, a quiet room and a good microphone influence transcription accuracy more than the choice between MP3 and WAV.
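If you are unsure whether an existing MP3 meets the bitrate recommendation, the average bitrate can be estimated from file size and duration. The sketch below uses the thresholds from the tips above; the function names and category labels are made up for illustration, and the result is slightly overestimated because the file size also includes metadata.

```python
def average_bitrate_kbps(size_bytes: int, duration_seconds: float) -> float:
    """Estimate the average bitrate in kbps from file size and duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return size_bytes * 8 / duration_seconds / 1000


def bitrate_advice(kbps: float) -> str:
    """Classify a bitrate against the article's MP3 recommendations."""
    if kbps <= 64:
        return "low: recognition accuracy may suffer noticeably"
    if kbps < 128:
        return "below the recommended minimum of 128 kbps"
    return "ok"
```

For example, a one-hour recording of 57,600,000 bytes works out to 57,600,000 × 8 / 3,600 / 1,000 = 128 kbps, which is just at the recommended minimum.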

Frequently asked questions

Do I have to convert my files beforehand? No. Upload the file in its original format. Conversion happens automatically on the server.

What if my format is not in the list? Just try it. scryp accepts any file with an audio or video MIME type. The formats listed here are the most common ones – in practice, our engine handles considerably more.
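For a rough local check before uploading, Python's standard mimetypes module can guess the MIME type from the file name. This is only an approximation of the rule above: it looks at the extension rather than the file content, the function name is made up, and how scryp itself determines the MIME type is not described here.

```python
import mimetypes


def looks_like_media(filename: str) -> bool:
    """Return True if the guessed MIME type is audio/* or video/*."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type is not None and mime_type.startswith(("audio/", "video/"))
```

Rare extensions may be missing from the local MIME table, so a False result for an unusual media file does not mean the upload would be rejected.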

Do very large video files work too? Yes. Depending on your subscription plan, files of up to 1 GB (Nano), 5 GB (Pro) or 10 GB (Ultra) can be uploaded. For large files, a multipart upload is used, which works reliably even on an unstable connection.
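The plan limits and the idea behind a multipart upload can be sketched as follows. The limits come from the answer above; everything else is an assumption for illustration: that 1 GB means 1,000,000,000 bytes, the 8 MiB part size, and the function names. The actual upload protocol is not documented in this article.

```python
GB = 1000 ** 3  # assumption: decimal gigabytes
PLAN_LIMITS_BYTES = {"Nano": 1 * GB, "Pro": 5 * GB, "Ultra": 10 * GB}
PART_SIZE_BYTES = 8 * 1024 * 1024  # hypothetical part size (8 MiB)


def fits_plan(size_bytes: int, plan: str) -> bool:
    """Check a file size against the upload limit of a plan."""
    return size_bytes <= PLAN_LIMITS_BYTES[plan]


def part_ranges(size_bytes: int, part_size: int = PART_SIZE_BYTES) -> list[tuple[int, int]]:
    """Split a file into (start, end) byte ranges, end exclusive."""
    return [
        (start, min(start + part_size, size_bytes))
        for start in range(0, size_bytes, part_size)
    ]
```

Because each range is sent as its own request, a dropped connection generally means only the affected part has to be retried, not the whole file.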

What about audio tracks in foreign languages? The file format is independent of the language. scryp automatically recognizes over 90 languages. If you want to speed up recognition, you can specify a language hint during upload.

Are my files deleted after transcription? Yes. The encrypted original files are automatically deleted after processing. Only an encrypted playback version (MP3) and the encrypted transcript remain on the server.

Conclusion

scryp supports over 100 audio and video formats – from everyday standards like MP3 and MP4, through professional surround formats such as AC3 and DTS, to specialized formats like VOB, MXF or MTS. Our conversion engine takes care of codecs and format compatibility. Simply upload your file in its original format, and the rest happens automatically. For the best possible transcription quality, a lossless or high-bitrate format is recommended, and, more importantly, a good recording environment.