Model in float16 or int8 possible?

#1
by puettmann - opened

Hello! I tried the models and they work great. I think they are run in float32. Would it be possible to run them in float16 or int8 somehow to save even more memory, or would this degrade quality too much? Thanks!

Absolutely! Glad to hear they're working well for you.

If you are using the quickmt library for inference, you can pass ctranslate2 arguments to the Translator class, e.g.

from quickmt import Translator

# Pass compute_type through to ctranslate2 to quantize the model at load time
t = Translator(
    "./quickmt-zh-en", device="cpu", intra_threads=2, inter_threads=2, compute_type="int8"
)

The options for compute_type are: default, auto, int8, int8_float32, int8_float16, int8_bfloat16, int16, float16, bfloat16, float32
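
If you're not sure which of these your hardware actually supports, ctranslate2 can tell you - a quick check using the ctranslate2 package directly:

import ctranslate2

# List the compute types available on this machine's CPU; pass "cuda" to check a GPU instead.
print(ctranslate2.get_supported_compute_types("cpu"))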

"int8" will work well for inference on CPU and give "int8_float16" or "int8_bfloat16" a try for GPU inference.

The quality should be very close, and the speed will likely improve, too - give it a try and report back!

Cheers!

Thanks for the reply, I tried it and it works really well! Thank you. I built a FastAPI endpoint that uses the quickmt models to translate shorter sentences or texts, and with the model loaded in int8, a translation usually takes less than half a second on CPU while the output is still really good.
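
For anyone curious, a minimal sketch of that kind of endpoint could look like the following - it assumes the Translator object can be called directly on a list of source strings (adjust to the actual quickmt API if yours differs), and the model path and route name are just placeholders:

from fastapi import FastAPI
from pydantic import BaseModel
from quickmt import Translator

app = FastAPI()

# Load the int8-quantized model once at startup so every request reuses it.
translator = Translator("./quickmt-zh-en", device="cpu", compute_type="int8")

class TranslationRequest(BaseModel):
    text: str

@app.post("/translate")
def translate(req: TranslationRequest):
    # The translator takes a list of sentences; return the first (and only) result.
    result = translator([req.text])
    return {"translation": result[0]}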

These models deserve way more recognition; they are much better than the OPUS models (which are also good, but a bit dated by now imo).

puettmann changed discussion status to closed
