The ROT13 Algorithm implemented in CUDA using python and PyCuda


Well, What can I say? I needed a simple project to start with in order to learn the basics of CUDA programming and rot13 fit the bill (especially as rot13 is inherently parallelizable). Saying that, it is highly unlikely that rot13 will get any beneficial speedup using CUDA. Indeed the latency of transferring data to the card and back will probably negate any execution speed increase.

What it does do is provide a simple example of programming a CUDA device using Python. For kicks I will do some benchmarking to see what difference exists.

What will you need?

1st Attempt

Breakdown of the code:

import pycuda.driver as cuda   
import pycuda.autoinit   
import string   
import numpy      

assert cuda.Device.count() >= 1    
dev = cuda.Device(0)   
ctx = dev.make_context()  

The above is pretty much identitical to PyCuda's example code. We import the required modules and initialise Cuda. The "assert" statement makes sure we have a CUDA device available.

Then we create a Cuda instance with the device we want. In this case device "0", which is the first device. This is hard coded here, but you can make it selectable on the command line if you want.

mod = cuda.SourceModule("""

__global__ void rot13K( char* input, char* output )
    const int i = threadIdx.x;

    if(input[i] > 90)
      output[i] = ((input[i] - 84) % 26 ) + 97;
      output[i] = ((input[i] - 52) % 26 ) + 65;

} """,'nvcc',[],keep=True,no_extern_c=False)

Now the above code is key. Within the "SourceModule" function you write the code to be executed on the CUDA device. The code here is Nvidia's extended version of C for CUDA. You define your functions here. If a function has the definition "__global__" at the beginning, then it can be called from outside the device (i.e. from python). All functions you want to access from python need the "__global__" definition. Inside you can see a very simple implementation of the rot13 algorithm.

The function has two variables, which are pointers to locations in memory, locations that we will read/write to using python, as shown (and annotated) below:

def rot13it(chunk):            
     #//Chunk here is the a chunk of text to process

     input = numpy.array(list(chunk))    
     #//the above breaks the string into an array

     func = mod.get_function("rot13K")
     #get the CUDA function we write in SourceModule

     output = numpy.zeros_like(input)
     #prepare the output, make it the same size and type as the input

     func(cuda.In(input),cuda.Out(output), block=(input.nbytes,1,1))
     #Call the function, the input variable is the text to process,
     #output is where to place the result, while block=(X,1,1) is
     #number of blocks to simultaniously execute the code on.
     #In this case we execute a block for every character in the chunk
     #resulting in each character being rot13'd in parallel.    */

     return string.join(output,'')
     #The python function returns the rot13d code.   

The rest of the code deals with preprocessing/setup:

    devattr = cuda.Device.get_attributes(dev)
    #Get device attributes

    chunksplit = devattr[pycuda._driver.device_attribute.MAX_THREADS_PER_BLOCK]
    #Each Device has a set number of threads that can be executed in
    #parallel. The chunksplit variable saves the number of threads
    #available for parallel processing, so we can execute the right number
    #of simultanious executions.

    import os

    fd = open(file,'r')
    #Open the file and get descriptor for reading 
    size = os.stat(file).st_size
    #we need to know the file size to know how much to read
    while fd.tell() != size:         
        #While we have not reached the end
        chunk =
        #read the chunksize worths of the file, and put in the chunk variable
        #write out the result

 ctx.pop() #We are done with the context, so we close it. 

Thats all of this program done, now we can run it with the "hello" text in the file. In our case we get "uryyb" in response, which is correct.

 2nd Attempt

This attempt involved altering the rot13 algorithm so that it does not use branch statements. In theory not needing the processor to jump around sould allow faster execution. The code in the sourcemodule has changed to:

mod = cuda.SourceModule("""

__global__ void rot13K( char* input, char* output )
    const int i = threadIdx.x;

//rot13 implementation without needing branch statement. should be faster
     char cap;
     cap = input[i] & 32;
     input[i] &= ~cap;
     output[i] = ((input[i] >= 'A') && (input[i] <= 'Z') ? ((input[i] - 'A' + 13) % 26 + 'A') : input[i]) | cap;



The rest of the code is identical.


For benchmarking we test repeatadly using larger and larger text, and we time it using "real cpu time" value. We range from 0 to 4194304 characters, getting larger in multiples of 131072. The maximum filesize is ~4MB by the end of the benchmark.

The bash script that does the benchmarking is as follows:

echo "" > ./test
echo "Words,rot13_1,rot13_2,rot13_N"
for x in `seq 0 131072 4194304`; do

        wc=`wc ./test -m | cut -d' ' -f 1`

        perl -e "print 'H'x$x" > ./test

        /usr/bin/time -o ./timing --format %e  python ./ > /dev/null
        export CUDAtime=`cat ./timing`

        /usr/bin/time -o ./timing --format %e  python ./ > /dev/null
        export CUDA2time=`cat ./timing`

        /usr/bin/time -o ./timing --format %e  cat ./test | ./rot13 > /dev/null
        export NATIVEtime=`cat ./timing`

        echo $wc,$CUDAtime,$CUDA2time,$NATIVEtime

It executes all the versions and prints out the results in CSV format. The versions are:

From the results we plot a graph of Time Vs number of characters, as shown:


As you can see from the graph, the CUDA enabled rot13 program is much slower than the native one. the CUDA program starts out 2x slower, and by the last benchmark test it's over 5x slower than the CPU version.

I suspect that the main reason for this is the latency of the PCI-E Bus. The rot13 CUDA programs split the problem into 512byte chunks (that is how many processors my GPU has), transfer a single chunk, process and return it. This is repeated as long as there is data left in the file. The CPU version works in a similar way, but because RAM<=>CPU latency is magintudes lower than GPU<=>PCI-E<=>HOST, this is not much of a problem.

In order to improve these dysmal numbers, we need to do fewer HOST/GPU transfers. The best way to do this is fill up the entire GPU memory with data, process it, then move it all back. As this low-end GPU has 256MB of ram, that should reduce the memory transfer (and therefore processing time) by a large amount.

I will give the above idea a shot, and do further benchmarking in part 2 of this article.