While there is some instructional material on the web, I found it can be a bit obtuse. So, here is a typical first exercise: compute π by approximating an integral, \int_{0}^{1} 4/(1 + x^2) dx = π
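The identity comes straight from the arctangent antiderivative, and the code below then approximates the integral with the midpoint rule over N slices:

```latex
\int_{0}^{1} \frac{4}{1 + x^2}\, dx
  = 4 \arctan(x) \Big|_{0}^{1}
  = 4\left(\frac{\pi}{4} - 0\right)
  = \pi
```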
The serial version looks like (leaving out obvious details):
sum = 0.;
step = 1./(double)N;
for (i = 0; i < N; ++i) {
    x = (i + 0.5) * step;
    sum += 4./(1. + x*x);
}
sum *= step;  // pi == sum

The naive parallel version uses an array with one slot per thread, so each thread has somewhere to store its partial result.
for (i = 0; i < 8; ++i) buf[i] = 0.;
sum = 0.;
step = 1./(double)N;
#pragma omp parallel for private(i,x)
for (i = 0; i < N; ++i) {
    x = (i + 0.5) * step;
    buf[omp_get_thread_num()] += 4./(1. + x*x);
}
for (i = 0; i < 8; ++i) sum += buf[i];
sum *= step;  // pi == sum

The naive version is slow. I'm guessing the repeated calls to omp_get_thread_num() are part of it, though false sharing on buf[] is another likely culprit: adjacent slots sit on the same cache line, so every update invalidates that line for the other threads. You can improve it by declaring a large parallel block so that omp_get_thread_num() is only called once per thread:
for (i = 0; i < 8; ++i) buf[i] = 0.;
sum = 0.;
step = 1./(double)N;
#pragma omp parallel private(i,x,me)
{
    me = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < N; ++i) {
        x = (i + 0.5) * step;
        buf[me] += 4./(1. + x*x);
    }
}
for (i = 0; i < 8; ++i) sum += buf[i];
sum *= step;  // pi == sum

The "proper" version uses a reduce operation, which ensures all threads have finished their partial sums before the reduction is done. Waiting for all threads to complete is called a "barrier":
sum = 0.;
step = 1./(double)N;
#pragma omp parallel for private(i,x) reduction(+:sum)
for (i = 0; i < N; ++i) {
    x = (i + 0.5) * step;
    sum += 4./(1. + x*x);
}
sum *= step;  // pi == sum

The important bit in both parallel versions, which I didn't learn until after several tries, is the "private(i,x)" clause. It ensures that the threads do not overwrite each other's copies of those variables.
This tutorial (PDF) on OpenMP by Tim Mattson and Larry Meadows, both of Intel, is very useful, and works through such technical issues as caching.