For those who are not patient enough to wait for a matrix multiplication, the CuPy library mimics the NumPy API while providing CUDA acceleration. However, I would not recommend using it as a drop-in replacement for NumPy everywhere, only in the cases where acceleration is really needed. For my application, I need to multiply two (60000×2048) matrices. This is way too slow with NumPy on the CPU.
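To give an idea of what this looks like in practice, here is a minimal sketch of a matrix product that runs on the GPU when CuPy and a CUDA device are available, and falls back to NumPy otherwise. The helper name get_array_module is hypothetical, the matrices are much smaller than the real ones so the example runs quickly, and transposing the second operand is my assumption about how two matrices of the same shape get multiplied.

```python
import numpy as np


def get_array_module():
    """Return CuPy if it imports and sees a CUDA device, else NumPy.

    Hypothetical helper for this sketch, not part of either library.
    """
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp
    except Exception:
        # CuPy is missing, or no usable CUDA driver/device was found.
        pass
    return np


xp = get_array_module()

# Small stand-ins for the (60000 x 2048) matrices mentioned above.
rng = np.random.default_rng(0)
a_host = rng.standard_normal((600, 2048), dtype=np.float32)
b_host = rng.standard_normal((600, 2048), dtype=np.float32)

# Copy to the device (a no-op when xp is NumPy).
a = xp.asarray(a_host)
b = xp.asarray(b_host)

# Same syntax as NumPy; the second operand is transposed so shapes match.
c = a @ b.T  # shape (600, 600)

# Bring the result back to host memory if it lives on the GPU.
c_host = c if xp is np else c.get()
print(xp.__name__, c_host.shape)
```

Keeping the device transfers explicit like this makes it easier to limit CuPy to the one expensive operation and leave the rest of the code on plain NumPy.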
To install it, resist the temptation to simply run “pip install”. Instead, follow the official instructions carefully. If you use PyTorch, the CUDA toolkit version you install should match the one PyTorch was installed with. For reference, my last installation command was the following:
conda install -c conda-forge cupy cutensor cudatoolkit=11.0
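After installing, a quick check along these lines can help confirm that CuPy and PyTorch report the same CUDA version. This is only a sketch: the function names format_cuda_version and cuda_versions are hypothetical, and each entry is None when the corresponding package is missing or cannot reach a CUDA runtime.

```python
def format_cuda_version(code):
    """Turn the integer CUDA version code (e.g. 11000) into 'major.minor'."""
    return f"{code // 1000}.{(code % 1000) // 10}"


def cuda_versions():
    """Collect the CUDA version seen by CuPy and by PyTorch, if installed."""
    versions = {"cupy": None, "torch": None}
    try:
        import cupy
        # CUDA runtime version as an integer: 1000 * major + 10 * minor.
        versions["cupy"] = format_cuda_version(
            cupy.cuda.runtime.runtimeGetVersion()
        )
    except Exception:
        pass  # CuPy not installed or CUDA runtime not usable
    try:
        import torch
        # A string such as "11.0", or None for CPU-only builds.
        versions["torch"] = torch.version.cuda
    except Exception:
        pass  # PyTorch not installed
    return versions


print(cuda_versions())
```

If both entries are present and differ, that mismatch is the first thing I would look at when something fails.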