Steps to build the LM
Could you advise on the steps to prepare text and build a custom LM for use in beam search?
Also, is it feasible to mix the provided lm_6.kenlm with my custom data?
Please guide me.
Hi @kundangraticare ,
Just to be clear, are you asking about fine-tuning the MedASR model so that it can be used in beam search?
Thank you!
I am actually building a KenLM ARPA model and passing it to MedASR to increase accuracy.
Hey @kundangraticare , building a KenLM n-gram model for CTC beam search is the correct approach to increase accuracy.
First, ensure your LM training text matches the acoustic model's vocabulary. If MedASR uses subword tokens, tokenise your normalised text with the exact same spiece.model vocabulary before training KenLM. If it uses character-level CTC, train KenLM directly on properly normalised text. In all cases, make sure normalisation matches the acoustic model's training transcripts.
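As an illustration only, text normalisation before LM training might look like the sketch below. The specific rules (lowercasing, stripping punctuation and digits) are assumptions; the real rules must mirror whatever was applied to MedASR's training transcripts.

```python
import re
import unicodedata

def normalise(text: str) -> str:
    """Illustrative normalisation for LM training text.

    The exact policy is an assumption here; it must match the
    acoustic model's transcript normalisation in practice.
    """
    text = unicodedata.normalize("NFKC", text)   # canonicalise Unicode forms
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)       # drop punctuation/digits (assumed policy)
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace
```

For example, `normalise("Hello, World!")` yields `"hello world"`.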
Next, estimate the model using lmplz (choose the n-gram order based on corpus size; 4–6 is common, and lm_6.kenlm is a 6-gram), then convert the ARPA file to binary format using build_binary for optimised memory usage and faster loading.
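A minimal sketch of those two steps, assuming the KenLM binaries (lmplz, build_binary) are on your PATH; the file names and the default order of 4 are placeholders:

```python
import subprocess

def kenlm_commands(corpus_path: str, arpa_path: str, binary_path: str, order: int = 4):
    """Build the two KenLM command lines (order=4 is an assumed default;
    lm_6.kenlm itself is a 6-gram)."""
    return [
        # Estimate an ARPA n-gram model from tokenised, normalised text.
        ["lmplz", "-o", str(order), "--text", corpus_path, "--arpa", arpa_path],
        # Compile the ARPA file to KenLM binary format for fast loading.
        ["build_binary", arpa_path, binary_path],
    ]

def build_kenlm(corpus_path: str, arpa_path: str, binary_path: str, order: int = 4):
    """Run both steps; requires the KenLM binaries to be installed."""
    for cmd in kenlm_commands(corpus_path, arpa_path, binary_path, order):
        subprocess.run(cmd, check=True)
```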
To combine your custom model with the provided lm_6.kenlm, you cannot merge compiled binaries directly. If you have access to the original ARPA files or training text, you can retrain or interpolate offline. Otherwise, if your decoder supports it, use shallow fusion (multi-LM decoding) during beam search so both language models are evaluated simultaneously with configurable weights.
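For illustration, combining two LMs' scores amounts to a weighted mix; a minimal sketch that linearly interpolates two per-hypothesis log10 probabilities (the weights are assumptions you would tune on a dev set):

```python
import math

def interpolated_logprob(lp_general: float, lp_domain: float,
                         w_general: float = 0.5, w_domain: float = 0.5) -> float:
    """Linearly interpolate two LM probabilities (given as log10 scores).

    lp_general / lp_domain: log10 probabilities from the two LMs.
    The 0.5/0.5 weights are placeholders, not recommended values.
    """
    p = w_general * 10 ** lp_general + w_domain * 10 ** lp_domain
    return math.log10(p)
```

Offline ARPA interpolation (or simply retraining on the concatenated corpora) bakes this mixing into a single model, which is the easier route when your decoder only accepts one KenLM file.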
Finally, systematically tune the LM weight (alpha) and the word insertion bonus (beta) on a validation set. Proper tuning is critical for achieving real word error rate (WER) improvements.
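The tuning step can be a simple grid search, sketched below. `evaluate_wer` is a hypothetical callback (not part of any library) that decodes the validation set with the given alpha/beta and returns the WER:

```python
from itertools import product

def tune_alpha_beta(evaluate_wer, alphas, betas):
    """Grid-search the (alpha, beta) pair minimising validation WER.

    evaluate_wer: user-supplied function (alpha, beta) -> WER; in practice
    it would rebuild/redecode with those weights and score against
    reference transcripts.
    """
    return min(product(alphas, betas), key=lambda ab: evaluate_wer(*ab))
```

A typical sweep might cover alpha in roughly 0.3–2.0 and beta in roughly 0.0–2.0 (ranges are a rough assumption; refine around the best cell).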
Thank you!
To build a KenLM to use instead of lm_6.kenlm, the main thing one needs to do is SentencePiece tokenization (e.g. turning "hello" into "▁he ll o"). However, because pyctcdecode by default recognizes the '▁' marker and tries to turn the subword pieces back into words, we need to substitute the '▁' marker with '#' (see this part of the example notebook).
Here's an example of how to perform this text processing in Python:
# !uv pip install -U sentencepiece huggingface_hub
import huggingface_hub
import sentencepiece

# Load the MedASR SentencePiece model from the Hub.
tokenizer = sentencepiece.SentencePieceProcessor()
tokenizer.Load(huggingface_hub.hf_hub_download(repo_id='google/medasr', filename='spiece.model'))

# Tokenise into subword pieces and replace the '▁' word marker with '#'.
print(' '.join(tokenizer.EncodeAsPieces('hello')).replace('▁', '#'))
The processed text can then be used to build a KenLM model in the standard way.
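Putting the pieces together, the same substitution can be applied line by line to a whole corpus before running lmplz. `tokenise_lines` below is an illustrative helper, not part of any library; it accepts any tokenizer object exposing SentencePiece's `EncodeAsPieces` method:

```python
def tokenise_lines(tokenizer, lines):
    """Return lmplz-ready lines: SentencePiece pieces joined by spaces,
    with the '\u2581' word marker replaced by '#'.

    tokenizer: an object with an EncodeAsPieces(str) -> list[str] method,
    e.g. a loaded sentencepiece.SentencePieceProcessor.
    """
    out = []
    for line in lines:
        pieces = tokenizer.EncodeAsPieces(line.strip())
        out.append(" ".join(pieces).replace("\u2581", "#"))
    return out
```

Writing the returned lines to a text file gives you the `--text` input for lmplz.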