Yes @Amirjab21, all the code is open-sourced :)
Training script: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py
Streaming config: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/conf/fastconformer/cache_aware_streaming/fastconformer_ctc_bpe_streaming.yaml
Inference script: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
Kunal Dhawan
Deploy Streaming nemotron speech model
Thanks for raising this, @Amirjab21. As discussed and confirmed in the Hugging Face model page thread, the model's forward pass maintains a fixed-size encoder cache and a fixed-size RNN-T decoder hidden state, both of which are independent of the total audio duration and do not grow with input length.
After retesting, we’re glad to see that you no longer observe a degradation in inference speed as audio length increases. This aligns with the intended design and expected performance characteristics of the cache-aware streaming architecture.
Thanks again for taking the time to investigate and share your findings, and please feel free to reach out if you encounter any other issues or have additional questions.
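To make the fixed-size-state point concrete, here is a toy sketch (not the NeMo implementation; `CACHE_FRAMES` is a hypothetical context length): modeling the encoder cache as a bounded window shows why its size stays constant no matter how much audio has been consumed.

```python
from collections import deque

# Toy illustration only: a cache-aware streaming encoder keeps a
# fixed-length context window, so its state size is independent of
# how much audio has already been processed.
CACHE_FRAMES = 70  # hypothetical fixed left-context length, in frames

# A bounded deque models the cache: appending past maxlen silently
# evicts the oldest frames.
cache = deque(maxlen=CACHE_FRAMES)

# Stream 10,000 frames ("long audio") and confirm the state never grows.
for frame_index in range(10_000):
    cache.append(frame_index)

print(len(cache))  # 70, regardless of total audio length
```

Because the cache is bounded, per-step memory and compute stay flat, which is why no slowdown should appear as audio length increases.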
Does decoding efficiency decrease as the audio length increases?
Smaller model planned?
Can we expect an ONNX quant?
Multilingual version planned?
Thank you for the question, @Amirjab21! This is one of the key advantages of a native streaming model. The audio is not processed in a single pass over the full input; instead, it is consumed incrementally in small chunks as they arrive, with relevant contextual information preserved in the model's cache. This design allows the model to handle arbitrarily long audio streams without an explicit duration limit: context is carried forward through the cache, and computation is performed only on the new incoming frames, rather than reprocessing the entire audio or chunking it to a fixed maximum length.
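The incremental loop described above can be sketched as follows. This is a hedged illustration with hypothetical names (`stream_transcribe`, `dummy_step`), not the NeMo API: each step touches only the new chunk plus a fixed cached context, so per-step cost does not depend on how long the stream has been running.

```python
def stream_transcribe(chunks, step_fn, init_cache):
    """Consume audio chunk by chunk; context is carried forward in `cache`."""
    cache, pieces = init_cache, []
    for chunk in chunks:
        # Compute runs only on the new chunk; prior audio is summarized
        # by the cache, never reprocessed.
        piece, cache = step_fn(chunk, cache)
        pieces.append(piece)
    return "".join(pieces)

# Stand-in for one encoder/decoder step: it "decodes" the chunk and
# keeps a bounded cache of the last 3 chunks as context.
def dummy_step(chunk, cache):
    cache = (cache + [chunk])[-3:]  # fixed-size context window
    return f"[{chunk}]", cache

print(stream_transcribe(["a", "b", "c", "d"], dummy_step, []))
# → [a][b][c][d]
```

For the real pipeline, see the cache-aware streaming inference script linked above.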
Installation Video and Testing - Step by Step
RNNT decoder stalls after sentence boundaries in streaming mode
Great question, @RakshitAralimatti. To better handle real-world conversational dynamics such as interruptions and rapid turn-taking, we recently released a cache-aware model that jointly performs ASR and end-of-utterance (EOU) detection. The EOU signal can be used to explicitly trigger cache resets at turn boundaries, enabling robust behavior in interactive, streaming settings. You can find the model here: https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1
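The EOU-triggered reset can be sketched like this (hypothetical names, not the model's API): each decoded token carries an EOU flag, and when it fires, the streaming cache is cleared so the next turn starts from a clean context.

```python
def run_turns(events, init_cache):
    """Group streamed tokens into turns, resetting the cache at each EOU.

    `events` is a list of (token, is_eou) pairs, standing in for the
    joint ASR + EOU model's per-step output.
    """
    cache, turns, current = init_cache, [], []
    for token, is_eou in events:
        current.append(token)
        cache = cache + [token]      # context accumulates within a turn...
        if is_eou:                   # ...and is reset at the turn boundary
            turns.append(" ".join(current))
            current, cache = [], init_cache
    return turns

print(run_turns([("hi", False), ("there", True), ("bye", True)], []))
# → ['hi there', 'bye']
```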
MLX version planned?
colab notebooks do not work
Hi @kavyamanohar, thank you for the question. Unlike parakeet-tdt-0.6b-v2, this model was trained in a single stage. To enable proper punctuation and capitalization, we leveraged the Granary dataset and pipeline, which provides pseudo punctuation and capitalization labels generated using a strong LLM (e.g., Qwen-2.5-7B-Instruct).
Hi @TomSchelsen, thank you for the question. This blog is written from an end-user perspective, focusing on why and when one should use the Nemotron Speech ASR model. For that reason, we chose to compare against models that deliver similar accuracy and WER.
In particular, nemotron-speech-streaming-en-0.6b achieves accuracy comparable to (and in some cases better than) our leading streaming parakeet-ctc-1.1b-asr model across multiple evaluation datasets, while also providing the scaling and latency advantages highlighted in the blog. A comparison with parakeet-ctc-0.6b-asr is reasonable; however, that model does not match nemotron-speech-streaming-en-0.6b in overall accuracy and WER.
We will try to address this better in a follow-up blog and also share more interesting results using the model. Thank you!