Using CUDA and the GPU to Accelerate Training/Testing

Code for this section is provided on GitHub, on this page.

In Torch, it is (almost) transparent to move parts of your computation graph to the GPU.

Basics: Tensors

First initialize the environment like this:

require 'cutorch'
print(  cutorch.getDeviceProperties(cutorch.getDevice()) )

This should produce something like:

{[deviceOverlap]            = 1
 [textureAlignment]         = 512
 [minor]                    = 0
 [integrated]               = 0
 [major]                    = 2
 [sharedMemPerBlock]        = 49152
 [regsPerBlock]             = 32768
 [computeMode]              = 0
 [multiProcessorCount]      = 16
 [totalConstMem]            = 65536
 [totalGlobalMem]           = 3220897792
 [memPitch]                 = 2147483647
 [maxThreadsPerBlock]       = 1024
 [name]                     = string : "GeForce GTX 580"
 [clockRate]                = 1566000
 [warpSize]                 = 32
 [kernelExecTimeoutEnabled] = 0
 [canMapHostMemory]         = 1}
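If your machine has several GPUs, you can also choose which device to use before allocating tensors. A minimal sketch, assuming at least one CUDA device is visible:

require 'cutorch'
print('devices: ' .. cutorch.getDeviceCount())
cutorch.setDevice(1)  -- devices are 1-indexed in Torch
-- CudaTensors created from now on are allocated on device 1

Tensors stay on the device that was current when they were created, so call setDevice before building your model or buffers.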

Now you can easily sum two tensors on the GPU by doing this:

t1 = torch.CudaTensor(100):fill(0.5)
t2 = torch.CudaTensor(100):fill(1)
t1:add(t2)

This summing happened on the GPU.

Now you can very easily move your tensors back and forth between the CPU and the GPU like this:

t1_cpu = t1:float()
t1[{}] = t1_cpu  -- copies the data back to the GPU, with no new alloc
t1_new = t1_cpu:cuda()  -- allocates a new tensor

Knowing this, a more subtle way of working with the GPU is to keep Tensors as DoubleTensors or FloatTensors, i.e. keep them on the CPU by default, and only move specific Tensors to the GPU when needed. We will see this now, when working with nn.
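A common pattern following this idea is to keep the full dataset on the CPU and copy one minibatch at a time into a reusable GPU buffer. A sketch (dataset and nbatches are hypothetical names; each dataset[i] is assumed to be a FloatTensor):

require 'cutorch'
batch_gpu = torch.CudaTensor()   -- reusable GPU buffer
for i = 1, nbatches do
   local batch_cpu = dataset[i]  -- FloatTensor, stays on the CPU
   -- single host-to-device copy, reusing the buffer's memory when possible:
   batch_gpu:resize(batch_cpu:size()):copy(batch_cpu)
   -- ... process batch_gpu on the GPU ...
end

Reusing one buffer avoids allocating a fresh CudaTensor per minibatch.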

Using the GPU with nn

The nn module provides modules which each contain their state, and some contain trainable parameters. When you create a module, the default type of these Tensors is the default type of Torch. If you want to create pure Cuda modules, then simply set the default type to Cuda, and just create your modules. These modules will therefore expect CudaTensors as inputs. It's often a bit too simplistic to set things up this way, as your dataset will typically be made of CPU-based tensors, and some of your nn modules might also be more efficient on the CPU, such that you'll prefer splitting the model in CPU and GPU pieces.

To use Cuda-based nn modules, you will need to import cunn:

require 'cunn'

You can easily cast your modules to any type available in Torch:

-- we define an MLP
mlp = nn.Sequential()
mlp:add(nn.Linear(ninput, 1000))
mlp:add(nn.Linear(1000, 1000))
mlp:add(nn.Linear(1000, 1000))
mlp:add(nn.Linear(1000, noutput))
-- and move it to the GPU:
mlp:cuda()

At this stage the network expects CudaTensor inputs. Given a FloatTensor input, you will simply need to retype it before feeding it to the model:

-- input
input = torch.randn(ninput)
-- retype and feed to network:
result = mlp:forward( input:cuda() )
-- the result is a CudaTensor; if your loss is CPU-based, you will
-- need to bring it back:
result_cpu = result:float()
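Putting this together, a full forward/backward pass with a GPU model and a CPU-based criterion could look like the following sketch. It reuses the mlp and input defined above, assumes noutput is defined, and uses nn.MSECriterion as a stand-in for whatever loss you actually train with:

crit = nn.MSECriterion()       -- lives on the CPU
target = torch.randn(noutput)  -- CPU-based target
output_cpu = mlp:forward(input:cuda()):float()
loss = crit:forward(output_cpu, target)
gradOutput = crit:backward(output_cpu, target)
-- the gradients must be retyped before flowing back into the GPU model:
mlp:backward(input:cuda(), gradOutput:cuda())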

Another solution, which completely abstracts this issue of type, is to insert Copy layers, which transparently copy the forward activations and backward gradients from one type to another:

-- we put the mlp in a new container:
mlp_auto = nn.Sequential()
mlp_auto:add(nn.Copy('torch.FloatTensor', 'torch.CudaTensor'))
mlp_auto:add(mlp)
mlp_auto:add(nn.Copy('torch.CudaTensor', 'torch.FloatTensor'))

This new mlp_auto expects FloatTensor inputs and outputs, so you can plug it into any of your existing trainers.
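For example, with the stock nn.StochasticGradient trainer (a sketch; dataset is a hypothetical table following the trainer's convention, i.e. dataset:size() is defined and dataset[i] = {input, target}, both FloatTensors):

criterion = nn.MSECriterion()  -- CPU criterion, matching the Float outputs
trainer = nn.StochasticGradient(mlp_auto, criterion)
trainer.learningRate = 0.01
trainer:train(dataset)

The trainer never needs to know that the middle of the model runs on the GPU.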


All the basic modules have been implemented on CUDA, and provide excellent performance. Note though that a lot of modules are still missing, and we welcome any external help to implement them!

tutorial_cuda.txt · Last modified: 2015/02/14 06:40 by clement