Yes @Amirjab21, all the code is open-sourced :)
Training script: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py
Streaming config: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/conf/fastconformer/cache_aware_streaming/fastconformer_ctc_bpe_streaming.yaml
Inference script: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
Kunal Dhawan
Deploy Streaming nemotron speech model
Thanks for raising this, @Amirjab21. As discussed and confirmed in the Hugging Face model page thread, the model's forward pass maintains a fixed-size encoder cache and a fixed-size RNN-T decoder hidden state, both of which are independent of the total audio duration and do not grow with input length.
After retesting, we’re glad to see that you no longer observe a degradation in inference speed as audio length increases. This aligns with the intended design and expected performance characteristics of the cache-aware streaming architecture.
Thanks again for taking the time to investigate and share your findings, and please feel free to reach out if you encounter any other issues or have additional questions.
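To make the fixed-size-state point concrete, here is a toy sketch (not the NeMo implementation; `CACHE_FRAMES` is a hypothetical context length): modeling the encoder cache as a bounded window shows why its size stays constant no matter how much audio has been consumed.

```python
from collections import deque

# Toy illustration only: a cache-aware streaming encoder keeps a
# fixed-length context window, so its state size is independent of
# how much audio has already been processed.
CACHE_FRAMES = 70  # hypothetical fixed left-context length, in frames

# A bounded deque models the cache: appending past maxlen silently
# evicts the oldest frames.
cache = deque(maxlen=CACHE_FRAMES)

# Stream 10,000 frames ("long audio") and confirm the state never grows.
for frame_index in range(10_000):
    cache.append(frame_index)

print(len(cache))  # 70, regardless of total audio length
```

Because the cache is bounded, per-step memory and compute stay flat, which is why no slowdown should appear as audio length increases.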
Does decoding efficiency decrease as the audio length increases?
Smaller model planned?
Can we expect an ONNX quant?
Multilingual version planned?
Thank you for the question, @Amirjab21! This is one of the key advantages of a native streaming model. The audio is not processed in a single pass over the full input; instead, it is consumed incrementally in small chunks as they arrive, with relevant contextual information preserved in the model's cache. This design allows the model to handle arbitrarily long audio streams without an explicit duration limit: context is carried forward through the cache, and computation is performed only on the new incoming frames, rather than reprocessing the entire audio or chunking it to a fixed maximum length.
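The incremental loop described above can be sketched as follows. This is a hedged illustration with hypothetical names (`stream_transcribe`, `dummy_step`), not the NeMo API: each step touches only the new chunk plus a fixed cached context, so per-step cost does not depend on how long the stream has been running.

```python
def stream_transcribe(chunks, step_fn, init_cache):
    """Consume audio chunk by chunk; context is carried forward in `cache`."""
    cache, pieces = init_cache, []
    for chunk in chunks:
        # Compute runs only on the new chunk; prior audio is summarized
        # by the cache, never reprocessed.
        piece, cache = step_fn(chunk, cache)
        pieces.append(piece)
    return "".join(pieces)

# Stand-in for one encoder/decoder step: it "decodes" the chunk and
# keeps a bounded cache of the last 3 chunks as context.
def dummy_step(chunk, cache):
    cache = (cache + [chunk])[-3:]  # fixed-size context window
    return f"[{chunk}]", cache

print(stream_transcribe(["a", "b", "c", "d"], dummy_step, []))
# → [a][b][c][d]
```

For the real pipeline, see the cache-aware streaming inference script linked above.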
Installation Video and Testing - Step by Step
RNNT decoder stalls after sentence boundaries in streaming mode
Great question, @RakshitAralimatti. To better handle real-world conversational dynamics such as interruptions and rapid turn-taking, we recently released a cache-aware model that jointly performs ASR and end-of-utterance (EOU) detection. The EOU signal can be used to explicitly trigger cache resets at turn boundaries, enabling robust behavior in interactive, streaming settings. You can find the model here: https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1
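The EOU-triggered reset can be sketched like this (hypothetical names, not the model's API): each decoded token carries an EOU flag, and when it fires, the streaming cache is cleared so the next turn starts from a clean context.

```python
def run_turns(events, init_cache):
    """Group streamed tokens into turns, resetting the cache at each EOU.

    `events` is a list of (token, is_eou) pairs, standing in for the
    joint ASR + EOU model's per-step output.
    """
    cache, turns, current = init_cache, [], []
    for token, is_eou in events:
        current.append(token)
        cache = cache + [token]      # context accumulates within a turn...
        if is_eou:                   # ...and is reset at the turn boundary
            turns.append(" ".join(current))
            current, cache = [], init_cache
    return turns

print(run_turns([("hi", False), ("there", True), ("bye", True)], []))
# → ['hi there', 'bye']
```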
MLX version planned?
colab notebooks do not work
Hi @kavyamanohar, thank you for the question. Unlike parakeet-tdt-0.6b-v2, this model was trained in a single stage. To enable proper punctuation and capitalization, we leveraged the Granary dataset and pipeline, which provides pseudo punctuation and capitalization labels generated using a strong LLM (e.g., Qwen-2.5-7B-Instruct).
Hi @TomSchelsen, thank you for the question. This blog is written from an end-user perspective, focusing on why and when one should use the Nemotron Speech ASR model. For that reason, we chose to compare against models that deliver similar accuracy and WER.
In particular, nemotron-speech-streaming-en-0.6b achieves accuracy comparable to (and in some cases better than) our leading streaming parakeet-ctc-1.1b-asr model across multiple evaluation datasets, while also providing the scaling and latency advantages highlighted in the blog. A comparison with parakeet-ctc-0.6b-asr is reasonable; however, that model does not match nemotron-speech-streaming-en-0.6b in overall accuracy and WER.
We will try to address this better in a follow-up blog and also share more interesting results using the model. Thank you!