The ROT13 Algorithm implemented in CUDA using Python and PyCUDA

Introduction

In part 1 we wrote our first rot13 CUDA program, and found it to be lacking in performance. It was more than 5x slower than the CPU version. In this article, we're going to try to fix that.

The problem

Latency. Specifically, the latency of the PCI-Express bus. The bus is very fast and has high bandwidth, but there is a minimum time it takes to transfer a chunk of data, regardless of its size. We were being very inefficient by transferring small chunks (512 bytes) in our programs. This time we are going to transfer much larger chunks.
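
To see that fixed cost for yourself, here is a minimal timing sketch (just an illustration, assuming PyCUDA and a working device; exact timings will vary, and the very first transfer also pays some one-off setup cost):

import time
import numpy
import pycuda.driver as cuda
import pycuda.autoinit

def time_transfer(nbytes):
        #Copy nbytes of host data to the GPU and return the elapsed time
        a = numpy.zeros(nbytes, dtype=numpy.uint8)
        a_gpu = cuda.mem_alloc(a.nbytes)
        start = time.time()
        cuda.memcpy_htod(a_gpu, a)
        return time.time() - start

print "512 bytes took", time_transfer(512), "seconds"
print "32MB took     ", time_transfer(32 * 1024 * 1024), "seconds"

On most systems the small transfer is nowhere near 65536 times faster than the large one, which is exactly the overhead we are fighting.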

How much larger?

Definitely more than 512 bytes. The question is how much more? My card has 256MB of memory, so ideally I would like to use all of it. In practice some of that memory will be unusable or already in use by something else. However much free memory we have, that is what we will use for transferring.
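
As a quick aside, the driver will tell us how much memory is actually free; a small sketch (assuming PyCUDA, and using the same call that appears in the full listing further down):

import pycuda.driver as cuda
import pycuda.autoinit

#mem_get_info() returns (free bytes, total bytes) for the current context
free, total = cuda.mem_get_info()
print free, "bytes free of", total, "bytes total"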

1st Attempt

We needed to make a few alterations to the Python code, but not that many. The main difference is that rather than transferring everything in 512-byte blocks, we transfer it in blocks of "constant memory per thread" * "number of threads".

In the case of the 8400GS, we have 512 threads and 65535 bytes of constant memory accessible per thread. This gives us 33553920 bytes of memory that we can transfer in one go. The remaining question is how to tell the GPU to process more than 512 bytes.
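
Those two numbers can be read straight from the device attributes; a small sketch of the arithmetic (assuming PyCUDA, using the same attributes as the full listing below):

import pycuda.driver as cuda
import pycuda.autoinit

dev = cuda.Device(0)
devattr = dev.get_attributes()

threads = devattr[cuda.device_attribute.MAX_THREADS_PER_BLOCK]
constmem = devattr[cuda.device_attribute.TOTAL_CONSTANT_MEMORY]

#On the 8400GS this multiplies out to the roughly 33.5 million bytes quoted above
print threads, "threads *", constmem, "bytes =", threads * constmem, "bytes per transfer"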

To tell the GPU to process more than 512 bytes, we had to alter the C source, adding a loop that first does the initial byte, then the byte 512 positions further on, and continues until it's done.

For example, a single thread starts by processing byte 0; it then jumps to byte 512, then 1024, and so on. With 512 threads running side by side, everything gets processed in 512-byte strides all the way to the end.
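
To make that access pattern concrete, here is a throwaway pure-Python sketch (not part of the GPU program) that prints which byte indices a few threads touch; it uses the same formula as the kernel below, i = idx + x * memchunk, with zero-based indices:

MT = 512        #number of threads, which is also the stride between passes
numloops = 4    #just a few passes for the demonstration

for idx in (0, 1, 511):
        print "thread", idx, "touches bytes", [idx + x * MT for x in range(numloops)]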

That's the basic explanation; now on to the source!

mod = cuda.SourceModule("""
  __global__ void rot13(char *input, int *memchunk, int *numloops)
  {
    int idx = threadIdx.x;

    int x = 0;
    int i;
    while (x != numloops[0])
    {
        //Each pass, this thread handles the byte memchunk[0] positions
        //further on than the one it handled last pass
        i = idx + (x * memchunk[0]);

        //Strip the case bit, rotate upper-case letters by 13,
        //then restore the original case
        char cap;
        cap = input[i] & 32;
        input[i] &= ~cap;
        input[i] = ((input[i] >= 'A') && (input[i] <= 'Z') ? ((input[i] - 'A' + 13) % 26 + 'A') : input[i]) | cap;

        x++;
    }
  }
""", 'nvcc', [], keep=True, no_extern_c=False, arch=None, code=None, cache_dir=None)

We now have a while loop and three arguments. "input" is a pointer to the memory block we've just uploaded; "memchunk" is the amount of memory that can be processed in one pass, which is the same as the number of threads that can execute simultaneously (in this case, 512); "numloops" is the number of times each thread has to loop to reach the end of the memory chunk (worked out in advance as "size of memory chunk" / "number of threads").

Now onto the rest of the code:

import pycuda.driver as cuda
import pycuda.autoinit
import numpy
import os

cuda.init()
assert cuda.Device.count() >= 1

dev = cuda.Device(0)
ctx = dev.make_context()

#Get device attributes
devattr = cuda.Device.get_attributes(dev)

#====================================================================|

#MT = Max Threads per block
MT = devattr[pycuda._driver.device_attribute.MAX_THREADS_PER_BLOCK]

#MC = MemoryChunk (how much we can process in one go)
MC = MT

#Device Memory available (AM) and total Memory (TM)
AM, TM = cuda.mem_get_info()

#====================================================================|

Nothing much changed in this section of code.

func = mod.get_function("rot13")



#====================================================================|

#This deals with reading data and allocating memory

print AM, "Memory free,",TM,"Total memory"

#size is how much data we transfer and process in one go
#(constant memory per thread * max threads per block)
size = devattr[pycuda.driver.device_attribute.TOTAL_CONSTANT_MEMORY] * MT

def rot13(chunk, size, MT):
        print "Allocating", size, "bytes for use"

        a = numpy.array(list(chunk), 'S1')

        a_gpu = cuda.mem_alloc(a.nbytes)

        cuda.memcpy_htod(a_gpu, a)

        #One pass of MT threads covers MT bytes, so each thread must
        #loop "size of this chunk" / MT times
        numloops = numpy.array(size / MT, 'int')

        #If the data is smaller than MT bytes the division gives 0,
        #but the kernel still needs to execute its loop once
        if numloops == 0:
                print "Taking into account: data smaller than minimum chunk size"
                numloops = 1

        print "Each kernel must loop", numloops, "time(s)"

        #pycuda must have numpy.array type to send data to gpu (looks for arr.nbytes)
        func(a_gpu, cuda.In(numpy.int_(MT)), cuda.In(numpy.int_(numloops)), block=(MT,1,1))

        cuda.memcpy_dtoh(a, a_gpu)
        return a

 

The above is where most of the changes occurred. The major difference is the addition of the "numloops" variable and the code that works out the number of loops needed for each memory chunk. The reason we have to recalculate it each time is that we can't guarantee that the entire allocated memory will be used in any one go. For example, the last chunk of the file might end before filling the entire memory; that means less data to process, which changes the number of loops we need to do.
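
As a quick sanity check of that arithmetic, here is the same rule in plain Python (illustration only, mirroring the numloops calculation in rot13() above):

MT = 512

def loops_needed(nbytes, MT):
        #One pass of MT threads covers MT bytes; always run at least one pass
        numloops = nbytes / MT
        if numloops == 0:
                numloops = 1
        return numloops

print loops_needed(33553920, MT)   #a full chunk        -> 65535 passes
print loops_needed(2048, MT)       #a small final chunk -> 4 passes
print loops_needed(100, MT)        #a tiny file         -> 1 pass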

file="./test"
fd = open(file,'r')
fsize = os.stat(file).st_size
while fd.tell() != fsize:
       chunk =  fd.read(size)
        os.sys.stdout.write(rot13(chunk,size,MT))
ctx.pop()

This section is (again) more or less identical to the previous code. We read in a set number of bytes from the file (the same size as the chunk we will transfer to the GPU), and keep doing that until we reach the end of the file.

 

Benchmarking

Well, how does our new code stack up against the previous benchmark? We are keeping the benchmark itself the same:

Note: There is no rot13_4. This was a massive failure (my first attempt at doing memory blocks) and it doesn't work properly, so not used.

We get the results in a CSV file again, which we will plot on a graph.

Results

 

Isn't that a difference! Our new code starts off a bit slower than native, but as the character count increases, the execution time changes very little; in fact it grows more slowly than the CPU version's. At around 1048576 characters (about 1MB) they reach parity in execution time, and from there on the GPU version is faster, culminating in a roughly 2x speedup at the end of the benchmark.

A massive improvement over the previous rot13 GPU attempts, which makes me curious to see how the Native and new GPU code compare on larger file sizes.

Benchmark 2

The next benchmark starts from 0 and goes to 536870912 characters (512MB of data), in 4194304-character steps, in order to see how much faster the GPU code is compared to the CPU code.

Participants: the native rot13 binaries (rot13, rot13_o1, rot13_o2, rot13_o3) and the new GPU version (rot13_5.py).

The bash script for this one is:

#!/bin/bash
echo "" > ./test
echo "Words,rot13_N,rot13_o1,rot13_o2,rot13_o3,rot13_5"
for x in `seq 0 4194304 536870912`; do

        perl -e "print 'H'x$x" > ./test

        wc=`wc ./test -m | cut -d' ' -f 1`

        /usr/bin/time -o ./timing --format %e  cat ./test | ./rot13 > /dev/null
        export NATIVEtime=`cat ./timing`

        /usr/bin/time -o ./timing --format %e  cat ./test | ./rot13_o1 > /dev/null
        export NATIVE1time=`cat ./timing`

        /usr/bin/time -o ./timing --format %e  cat ./test | ./rot13_o2 > /dev/null
        export NATIVE2time=`cat ./timing`

        /usr/bin/time -o ./timing --format %e  cat ./test | ./rot13_o3 > /dev/null
        export NATIVE3time=`cat ./timing`

        /usr/bin/time -o ./timing --format %e  python ./rot13_5.py > /dev/null
        export CUDA5time=`cat ./timing`

        echo $wc,$NATIVEtime,$NATIVE1time,$NATIVE2time,$NATIVE3time,$CUDA5time
done
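
To turn the script's output into a graph, one option is a short matplotlib sketch (an illustration only; it assumes the script's output was redirected to a file called results.csv):

import csv
import matplotlib.pyplot as plt

rows = list(csv.reader(open("results.csv")))
header, data = rows[0], rows[1:]
words = [int(r[0]) for r in data]

#One line per program; column 0 is the character count
for col in range(1, len(header)):
        plt.plot(words, [float(r[col]) for r in data], label=header[col])

plt.xlabel("Characters")
plt.ylabel("Seconds")
plt.legend()
plt.savefig("benchmark2.png")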

Results