Bug with Tokenizer
Hi, I think there may be a bug with your tokenizer.
Some strings with repeating characters do not round-trip correctly: encoding them and then decoding does not reproduce the original string.
Here is an example:
```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/medasr")
input_string = "pevesca plus is a combination medication used for treating certain types of pain and inflammation"
encoded = processor.tokenizer(input_string)
decoded_string = processor.decode(encoded['input_ids'], skip_special_tokens=True)

# This assertion fails: the decoded string does not match the input.
assert input_string == decoded_string
```
Hi @dmakhervaks, I tested the same example with the current google/medasr checkpoint and could not reproduce the issue. The decoded output matched the input, so the mismatch may have been caused by an older tokenizer or transformers version.

Could you update to the latest library versions and check again? Please also let me know which specific versions you are currently running.
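For reference, a quick way to report the relevant versions is to query package metadata via the standard library (the `report_version` helper below is just an illustrative name, not part of any library):

```python
from importlib.metadata import version, PackageNotFoundError

def report_version(package: str) -> str:
    """Return the installed version of a package, or a note if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"

# Print the versions most likely to matter for tokenizer round-trip behavior.
for pkg in ("transformers", "tokenizers"):
    print(f"{pkg}: {report_version(pkg)}")
```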
Thank you!