Open Whisper-style Speech Model V4 from CMU WAVLab
OWSM (pronounced "awesome") is a series of Open Whisper-style Speech Models from CMU WAVLab. We reproduce Whisper-style training using publicly available data and the open-source toolkit ESPnet. For more details, please check our website.
Language of input speech. Select 'Unknown' (1st option) to detect it automatically.
Task to perform on input speech.
OWSM V4 model to use for recognition.
Perform long-form decoding. If an exception occurs, it falls back to standard decoding on the initial 30 seconds.
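The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the demo's actual code: `decode_long` and `decode_short` are hypothetical stand-ins for the real ESPnet decoding calls.

```python
def decode_with_fallback(speech, sample_rate, decode_long, decode_short,
                         window_sec=30):
    """Try long-form decoding; on any exception, fall back to standard
    decoding on only the initial 30-second window of the input."""
    try:
        return decode_long(speech)
    except Exception:
        # Keep only the first `window_sec` seconds of samples.
        return decode_short(speech[: window_sec * sample_rate])
```

The broad `except Exception` mirrors the demo's description: any failure in long-form decoding (e.g., running out of memory on a long input) degrades gracefully instead of erroring out.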
The latest demo uses OWSM v4 based on E-Branchformer. The OWSM v4 medium model has 1.02B parameters and is trained on 320k hours of labelled data (290k hours for ASR and 30k hours for ST). The OWSM v4 CTC model has 1.01B parameters and is trained on the same dataset as the medium model. They support various speech-to-text tasks:
- Speech recognition in 151 languages
- Any-to-any language speech translation
- Utterance-level timestamp prediction
- Long-form transcription
- Language identification
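In ESPnet's OWSM interface, the desired task and source language are conveyed to the model through special prompt tokens. The sketch below shows how such symbols could be assembled; the exact token formats (`<eng>`, `<asr>`, `<st_eng>`, etc.) are assumptions based on ESPnet's OWSM recipes and should be checked against the model's actual vocabulary.

```python
def owsm_symbols(src_lang: str, task: str, tgt_lang: str = "eng"):
    """Build (language, task) prompt symbols for an OWSM-style model.

    Assumed conventions: ISO 639-3 language codes wrapped in angle
    brackets; "<asr>" for speech recognition; "<st_xxx>" for speech
    translation into language xxx.
    """
    lang_sym = f"<{src_lang}>"
    task_sym = "<asr>" if task == "asr" else f"<st_{tgt_lang}>"
    return lang_sym, task_sym
```

For example, transcribing Japanese would use `("<jpn>", "<asr>")`, while translating Japanese speech into English would use `("<jpn>", "<st_eng>")`.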
Additionally, OWSM v4 applies 8x subsampling (instead of 4x in OWSM v3.1) to the log Mel features, leading to a final time resolution of 80 ms in the encoder. When running inference, we recommend setting maxlenratio=1.0 (the default) instead of smaller values.
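The 80 ms figure follows directly from the feature frame shift times the subsampling factor. The 10 ms log-Mel frame shift below is an assumption based on common Whisper-style front ends, not stated in this page:

```python
frame_shift_ms = 10   # typical log-Mel hop size (assumed)
subsampling = 8       # OWSM v4 encoder subsampling factor

# Time covered by one encoder frame after subsampling.
resolution_ms = frame_shift_ms * subsampling  # 80 ms

# Encoder frames produced for one 30-second decoding window.
window_ms = 30_000
encoder_frames = window_ms // resolution_ms   # 375 frames
```

Halving the encoder sequence length relative to OWSM v3.1 (40 ms resolution) is what makes the larger model practical at inference time.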
As a demo, the input speech should not exceed 2 minutes. We also limit the maximum number of tokens to be generated. Please try our Colab demo if you want to explore more features.
Disclaimer: OWSM has not been thoroughly evaluated in all tasks. Due to limited training data, it may not perform well for certain languages.
Please consider citing the following papers if you find our work helpful.
@inproceedings{owsm-v4,
title={{OWSM} v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning},
author={Yifan Peng and Shakeel Muhammad and Yui Sudo and William Chen and Jinchuan Tian and Chyi-Jiunn Lin and Shinji Watanabe},
booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
year={2025},
}
@inproceedings{peng2024owsm31,
title={{OWSM} v3.1: Better and Faster Open Whisper-Style Speech Models based on {E-Branchformer}},
author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
booktitle={Proc. INTERSPEECH},
year={2024}
}
@inproceedings{peng2023owsm,
title={Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data},
author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
booktitle={Proc. ASRU},
year={2023}
}
@inproceedings{owsm-ctc,
title={{OWSM-CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification},
author={Yifan Peng and Yui Sudo and Muhammad Shakeel and Shinji Watanabe},
booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2024},
url={https://aclanthology.org/2024.acl-long.549},
}