Steps to build the LM
Could you advise on the steps to prepare text and build a custom LM for use in beam search?
Also, is it feasible to mix the provided lm_6.kenlm with my custom data?
Please guide me.
Hi @kundangraticare ,
Just to be clear, are you asking about fine-tuning the MedASR model so that it can be used in beam search?
Thank you!
I am actually building a KenLM ARPA model and passing it to MedASR to increase accuracy.
Hey @kundangraticare , building a KenLM n-gram model for CTC beam search is the correct approach to increase accuracy.
First, ensure your LM training text matches the acoustic model's vocabulary. If MedASR uses subword tokens, tokenise your normalised text with the exact same spiece.model vocabulary before training KenLM. If it uses character-level CTC, train KenLM directly on properly normalised text. In all cases, make sure normalisation matches the acoustic model's training transcripts.
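As an illustration only, text normalisation before LM training might look like the sketch below. The specific rules (lowercasing, stripping punctuation and digits) are assumptions; the real rules must mirror whatever was applied to MedASR's training transcripts.

```python
import re
import unicodedata

def normalise(text: str) -> str:
    """Illustrative normalisation for LM training text.

    The exact policy is an assumption here; it must match the
    acoustic model's transcript normalisation in practice.
    """
    text = unicodedata.normalize("NFKC", text)   # canonicalise Unicode forms
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)       # drop punctuation/digits (assumed policy)
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace
```

For example, `normalise("Hello, World!")` yields `"hello world"`.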
Next, estimate the model using lmplz (choose the n-gram order based on corpus size; 4–6 is common, and lm_6.kenlm is a 6-gram), then convert the ARPA file to binary format using build_binary for optimised memory usage and faster loading.
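A minimal sketch of those two steps, assuming the KenLM binaries (lmplz, build_binary) are on your PATH; the file names and the default order of 4 are placeholders:

```python
import subprocess

def kenlm_commands(corpus_path: str, arpa_path: str, binary_path: str, order: int = 4):
    """Build the two KenLM command lines (order=4 is an assumed default;
    lm_6.kenlm itself is a 6-gram)."""
    return [
        # Estimate an ARPA n-gram model from tokenised, normalised text.
        ["lmplz", "-o", str(order), "--text", corpus_path, "--arpa", arpa_path],
        # Compile the ARPA file to KenLM binary format for fast loading.
        ["build_binary", arpa_path, binary_path],
    ]

def build_kenlm(corpus_path: str, arpa_path: str, binary_path: str, order: int = 4):
    """Run both steps; requires the KenLM binaries to be installed."""
    for cmd in kenlm_commands(corpus_path, arpa_path, binary_path, order):
        subprocess.run(cmd, check=True)
```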
To combine your custom model with the provided lm_6.kenlm, you cannot merge compiled binaries directly. If you have access to the original ARPA files or training text, you can retrain or interpolate offline. Otherwise, if your decoder supports it, use shallow fusion (multi-LM decoding) during beam search so both language models are evaluated simultaneously with configurable weights.
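For illustration, combining two LMs' scores amounts to a weighted mix; a minimal sketch that linearly interpolates two per-hypothesis log10 probabilities (the weights are assumptions you would tune on a dev set):

```python
import math

def interpolated_logprob(lp_general: float, lp_domain: float,
                         w_general: float = 0.5, w_domain: float = 0.5) -> float:
    """Linearly interpolate two LM probabilities (given as log10 scores).

    lp_general / lp_domain: log10 probabilities from the two LMs.
    The 0.5/0.5 weights are placeholders, not recommended values.
    """
    p = w_general * 10 ** lp_general + w_domain * 10 ** lp_domain
    return math.log10(p)
```

Offline ARPA interpolation (or simply retraining on the concatenated corpora) bakes this mixing into a single model, which is the easier route when your decoder only accepts one KenLM file.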
Finally, systematically tune the LM weight (alpha) and the word insertion bonus (beta) on a validation set. Proper tuning is critical for achieving real word error rate (WER) improvements.
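The tuning step can be a simple grid search, sketched below. `evaluate_wer` is a hypothetical callback (not part of any library) that decodes the validation set with the given alpha/beta and returns the WER:

```python
from itertools import product

def tune_alpha_beta(evaluate_wer, alphas, betas):
    """Grid-search the (alpha, beta) pair minimising validation WER.

    evaluate_wer: user-supplied function (alpha, beta) -> WER; in practice
    it would rebuild/redecode with those weights and score against
    reference transcripts.
    """
    return min(product(alphas, betas), key=lambda ab: evaluate_wer(*ab))
```

A typical sweep might cover alpha in roughly 0.3–2.0 and beta in roughly 0.0–2.0 (ranges are a rough assumption; refine around the best cell).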
Thank you!
To build a KenLM to use instead of lm_6.kenlm, the main thing one needs to do is SentencePiece tokenization (e.g. turning "hello" into "▁he ll o"). However, because pyctcdecode by default recognizes the '▁' marker and tries to turn the subword pieces back into words, we need to substitute the '▁' marker with '#' (see this part of the example notebook).
Here's an example of how to perform this text processing in Python:
# !uv pip install -U sentencepiece huggingface_hub
import huggingface_hub
import sentencepiece

# Load the MedASR SentencePiece model from the Hub.
tokenizer = sentencepiece.SentencePieceProcessor()
tokenizer.Load(huggingface_hub.hf_hub_download(repo_id='google/medasr', filename='spiece.model'))

# Tokenise into subword pieces and replace the '▁' word marker with '#'.
print(' '.join(tokenizer.EncodeAsPieces('hello')).replace('▁', '#'))
The processed text can then be used to build a KenLM model in the standard way.
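Putting the pieces together, the same substitution can be applied line by line to a whole corpus before running lmplz. `tokenise_lines` below is an illustrative helper, not part of any library; it accepts any tokenizer object exposing SentencePiece's `EncodeAsPieces` method:

```python
def tokenise_lines(tokenizer, lines):
    """Return lmplz-ready lines: SentencePiece pieces joined by spaces,
    with the '\u2581' word marker replaced by '#'.

    tokenizer: an object with an EncodeAsPieces(str) -> list[str] method,
    e.g. a loaded sentencepiece.SentencePieceProcessor.
    """
    out = []
    for line in lines:
        pieces = tokenizer.EncodeAsPieces(line.strip())
        out.append(" ".join(pieces).replace("\u2581", "#"))
    return out
```

Writing the returned lines to a text file gives you the `--text` input for lmplz.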