
Steps to build the llm

#10
by kundangraticare - opened

Kindly advise the steps to prepare text and build a custom LM to use in beam search.

Also, is it feasible to mix the provided lm_6.kenlm with my custom data?

Please guide.

Hi @kundangraticare ,

Just to be clear, are you asking about fine-tuning the MedASR model so that it can be used in beam search?

Thank you!

I am actually building a KenLM ARPA model and passing it to MedASR to increase accuracy.

Hey @kundangraticare , building a KenLM n-gram model for CTC beam search is the correct approach to increase accuracy.

First, ensure your LM training text matches the acoustic model's vocabulary. If MedASR uses subword tokens, tokenise your normalised text with the exact same spiece.model vocabulary before training KenLM. If it uses character-level CTC, train KenLM directly on properly normalised text. In all cases, make sure your normalisation matches the acoustic model's training transcripts.
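As a sketch of the normalisation step, here is one possible set of rules (lowercasing, stripping punctuation, collapsing whitespace). The exact rules here are hypothetical; match them to whatever conventions MedASR's training transcripts actually use:

```python
import re

def normalise(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so the
    LM training text resembles typical CTC transcript normalisation."""
    text = text.lower()
    # Keep only letters, digits, apostrophes, and spaces.
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalise("CT Chest, w/o contrast: No acute findings."))
# → ct chest w o contrast no acute findings
```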

Next, estimate the model using lmplz (choose the n-gram order based on corpus size; 4–6 is common, and lm_6.kenlm is a 6-gram), then convert the ARPA file to binary format using build_binary for optimised memory usage and faster loading.
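The estimation step above can be sketched as follows. The file names are placeholders, and the commented-out subprocess calls assume the KenLM binaries lmplz and build_binary are on your PATH:

```python
import subprocess

def kenlm_commands(order, text_path, arpa_path, binary_path):
    """Build the lmplz (estimate) and build_binary (compile) invocations
    for a given n-gram order."""
    estimate = ["lmplz", "-o", str(order), "--text", text_path, "--arpa", arpa_path]
    compile_ = ["build_binary", arpa_path, binary_path]
    return estimate, compile_

estimate, compile_ = kenlm_commands(6, "corpus.txt", "custom_6.arpa", "custom_6.kenlm")
# Uncomment once the KenLM binaries are installed:
# subprocess.run(estimate, check=True)
# subprocess.run(compile_, check=True)
```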

To combine your custom model with the provided lm_6.kenlm, you cannot merge compiled binaries directly. If you have access to the original ARPA files or training text, you can retrain or interpolate offline. Otherwise, if your decoder supports it, use shallow fusion (multi-LM decoding) during beam search so both language models are evaluated simultaneously with configurable weights.
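Conceptually, shallow fusion is just a log-linear combination of the two models' scores at each beam-search step. A minimal sketch (the function name and default weights are hypothetical; a real decoder applies this per hypothesis extension):

```python
def fused_lm_logp(logp_general, logp_custom, w_general=0.5, w_custom=0.5):
    """Log-linearly interpolate two language models' log-probabilities,
    as in shallow-fusion (multi-LM) beam-search decoding."""
    return w_general * logp_general + w_custom * logp_custom

# A candidate word scored by lm_6.kenlm (-1.0) and a custom LM (-3.0):
print(fused_lm_logp(-1.0, -3.0))  # → -2.0
```

Tuning w_general and w_custom on a validation set serves the same role as interpolation weights in an offline ARPA merge.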

Finally, systematically tune the LM weight (alpha) and word insertion bonus (beta) on a validation set. Proper tuning is critical for achieving actual Word Error Rate (WER) improvements.
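The tuning step is typically a simple grid search. A sketch, where `decode_wer` is a hypothetical callback that decodes your validation set with the given weights and returns its WER (the stand-in lambda below only illustrates the search):

```python
import itertools

def tune_alpha_beta(decode_wer, alphas, betas):
    """Grid-search the LM weight (alpha) and word insertion bonus (beta),
    returning the pair with the lowest validation-set WER."""
    return min(itertools.product(alphas, betas),
               key=lambda ab: decode_wer(*ab))

# Stand-in scoring function; a real run would decode audio and score WER.
best = tune_alpha_beta(lambda a, b: abs(a - 0.6) + abs(b - 1.5),
                       alphas=[0.4, 0.6, 0.8],
                       betas=[0.5, 1.5, 2.5])
print(best)  # → (0.6, 1.5)
```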

Thank you!

Google org

To build a KenLM to use instead of lm_6.kenlm, the main thing one needs to do is SentencePiece tokenization (e.g. turning "hello" into "▁he ll o"). However, because pyctcdecode by default recognizes the '▁' marker and tries to turn the subword pieces back into words, we need to substitute the '▁' marker with '#' (see this part of the example notebook).

Here's an example of how to perform this text processing in Python:

# !uv pip install -U sentencepiece huggingface_hub
import huggingface_hub
import sentencepiece

# Load the same SentencePiece model that MedASR uses.
tokenizer = sentencepiece.SentencePieceProcessor()
tokenizer.Load(huggingface_hub.hf_hub_download(repo_id='google/medasr', filename='spiece.model'))

# Tokenise into subword pieces, then replace the '▁' word marker with '#'
# so pyctcdecode does not try to rejoin the pieces into words.
print(' '.join(tokenizer.EncodeAsPieces('hello')).replace('▁', '#'))

The processed text can then be used to build a KenLM model in the standard way.
