add more benchmark numbers #2900
Conversation
sayakpaul left a comment
Thanks for the updates. I think for torchao, we could additionally mention the FP8 numbers on RTX 4090. WDYT?
| Most of these quantization backends can be combined with the memory optimization techniques offered in Diffusers. Let's explore CPU offloading, group offloading, and `torch.compile`. You can learn more about these techniques in the [Diffusers documentation](https://huggingface.co/docs/diffusers/main/en/optimization/memory).
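For reference, here is a minimal sketch of combining a quantized model with the offloading techniques mentioned in that paragraph. The model id (`black-forest-labs/FLUX.1-dev`), the 4-bit bitsandbytes config, and the `enable_group_offload` arguments are illustrative assumptions, not taken from the post:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Quantize the transformer with bitsandbytes 4-bit (illustrative choice).
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Model-level CPU offloading: each sub-model is moved to the GPU only while it runs.
pipe.enable_model_cpu_offload()

# Alternatively, group offloading moves smaller groups of layers on and off the GPU
# (argument names are an assumption; check the Diffusers memory docs linked above).
# pipe.transformer.enable_group_offload(
#     onload_device=torch.device("cuda"),
#     offload_device=torch.device("cpu"),
#     offload_type="leaf_level",
# )

image = pipe("a photo of a cat holding a sign", num_inference_steps=28).images[0]
```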
| - > **Note:** At the time of writing, bnb + `torch.compile` also works if bnb is installed from source and using pytorch nightly or with fullgraph=False.
| + > **Note:** At the time of writing, bnb + `torch.compile` works if bnb is installed from source and using pytorch nightly or with fullgraph=False.
https://github.com/bitsandbytes-foundation/bitsandbytes/releases/tag/0.46.0 is done, so bitsandbytes==0.46.0 should work with PyTorch nightly.
| | int8_weight_only | 17.020 GB | 22.473 GB | 8 seconds | ~851 seconds |
| | float8_weight_only | 17.016 GB | 22.115 GB | 8 seconds | ~545 seconds |
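For context, the rows above correspond to torchao weight-only quantization of the transformer. Here is a minimal sketch of how such a configuration could be applied in Diffusers; the model id is an assumption, and the `quant_type` string follows the names used in the table (recent Diffusers versions should also accept short aliases such as `int8wo`):

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

# "float8_weight_only" generally needs an fp8-capable GPU (e.g. Ada or Hopper).
quant_config = TorchAoConfig("int8_weight_only")  # or "float8_weight_only"

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

# torch.compile adds a one-time compilation cost on the first call;
# subsequent calls reuse the compiled graph.
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)

image = pipe("a photo of a cat holding a sign", num_inference_steps=28).images[0]
```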
| **bitsandbytes + `torch.compile`**: **Note:** To enable compatibility with `torch.compile`, make sure you're using the latest version of bitsandbytes and PyTorch nightlies (2.8).
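A sketch of the combination described in that note, assuming bitsandbytes >= 0.46.0 (or a source install) with a recent PyTorch nightly; the model id and compile settings are illustrative:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Placement follows the Diffusers bnb docs; direct GPU placement also works.
pipe.enable_model_cpu_offload()

# fullgraph=False lets torch.compile fall back to eager execution around any
# graph breaks introduced by the bnb kernels, per the note above.
pipe.transformer = torch.compile(pipe.transformer, fullgraph=False)

image = pipe("a photo of a cat holding a sign", num_inference_steps=28).images[0]
```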
Mention the hardware this was obtained on.
I'm not sure what hardware this was obtained on. Was it an A100?
Co-authored-by: Sayak Paul <[email protected]>
Hmm, I'm not sure. RTX 4090 is probably more common for developers, but it could make the blog post more complex.
In the diffusion world, the RTX 4090 is actually more common than the A100, H100, etc. But it's fine to leave it out.
@sayakpaul