Model in float16 or int8 possible?

#1
by puettmann - opened

Hello! I tried the models and they work great. I think they are run in float32. Would it be possible to run them in float16 or int8 somehow to save even more memory, or would this degrade quality too much? Thanks!

Absolutely! Glad to hear they're working well for you.

If you are using the quickmt library for inference, you can pass ctranslate2 arguments to the Translator class, e.g.

from quickmt import Translator

# Pass compute_type through to ctranslate2 to quantize the model at load time
t = Translator(
    "./quickmt-zh-en", device="cpu", intra_threads=2, inter_threads=2, compute_type="int8"
)

The options for compute_type are: default, auto, int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32
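
If you're not sure which of these your hardware actually supports, ctranslate2 can tell you - a quick check using the ctranslate2 package directly:

import ctranslate2

# List the compute types available on this machine's CPU; pass "cuda" to check a GPU instead.
print(ctranslate2.get_supported_compute_types("cpu"))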

"int8" will work well for inference on CPU and give "int8_float16" or "int8_bfloat16" a try for GPU inference.

The quality should be very close, and the speed will likely improve, too - give it a try and report back!

Cheers!

Thanks for the reply, I tried it and it works really well! Thank you. I built a FastAPI endpoint that uses the quickmt models to translate shorter sentences or texts, and with the model loaded in int8, a translation usually takes less than half a second on CPU while the output is still really good.
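
For anyone curious, a minimal sketch of that kind of endpoint could look like the following - it assumes the Translator object can be called directly on a list of source strings (adjust to the actual quickmt API if yours differs), and the model path and route name are just placeholders:

from fastapi import FastAPI
from pydantic import BaseModel
from quickmt import Translator

app = FastAPI()

# Load the int8-quantized model once at startup so every request reuses it.
translator = Translator("./quickmt-zh-en", device="cpu", compute_type="int8")

class TranslationRequest(BaseModel):
    text: str

@app.post("/translate")
def translate(req: TranslationRequest):
    # The translator takes a list of sentences; return the first (and only) result.
    result = translator([req.text])
    return {"translation": result[0]}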

These models deserve way more recognition; they are much better than the OPUS models (which are also good, but a bit dated by now imo).

puettmann changed discussion status to closed
