While there is some instructional material on the web, I found it can be a bit obtuse. So, here is a typical first exercise: compute π by approximating an integral, \int_{0}^{1} 4/(1 + x^2) dx = π
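The identity comes straight from the arctangent antiderivative, and the code below then approximates the integral with the midpoint rule over N slices:

```latex
\int_{0}^{1} \frac{4}{1 + x^2}\, dx
  = 4 \arctan(x) \Big|_{0}^{1}
  = 4\left(\frac{\pi}{4} - 0\right)
  = \pi
```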
The serial version looks like (leaving out obvious details):
sum = 0.;
step = 1./(double)N;
for (i = 0; i < N; ++i) {
    x = (i + 0.5) * step;
    sum += 4./(1. + x*x);
}
sum *= step;  // pi == sum

The naive parallel version uses an array with one slot per thread, so each thread has somewhere to store its partial result.
for (i = 0; i < 8; ++i) buf[i] = 0.;
sum = 0.;
step = 1./(double)N;
#pragma omp parallel for private(i,x)
for (i = 0; i < N; ++i) {
    x = (i + 0.5) * step;
    buf[omp_get_thread_num()] += 4./(1. + x*x);
}
for (i = 0; i < 8; ++i) sum += buf[i];
sum *= step;  // pi == sum

The naive version is slow. I'm guessing the repeated calls to omp_get_thread_num() are part of it, though false sharing on buf[] is another likely culprit: adjacent slots sit on the same cache line, so every update invalidates that line for the other threads. You can improve it by declaring a large parallel block so that omp_get_thread_num() is only called once per thread:
for (i = 0; i < 8; ++i) buf[i] = 0.;
sum = 0.;
step = 1./(double)N;
#pragma omp parallel private(i,x,me)
{
    me = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < N; ++i) {
        x = (i + 0.5) * step;
        buf[me] += 4./(1. + x*x);
    }
}
for (i = 0; i < 8; ++i) sum += buf[i];
sum *= step;  // pi == sum

The "proper" version uses a reduce operation, which ensures all threads have finished their partial sums before the reduction is done. Waiting for all threads to complete is called a "barrier":
sum = 0.;
step = 1./(double)N;
#pragma omp parallel for private(i,x) reduction(+:sum)
for (i = 0; i < N; ++i) {
    x = (i + 0.5) * step;
    sum += 4./(1. + x*x);
}
sum *= step;  // pi == sum

The important bit in both parallel versions, which I didn't learn until after several tries, is the "private(i,x)" clause. It ensures that the threads do not overwrite each other's copies of those variables.
This tutorial (PDF) on OpenMP by Tim Mattson and Larry Meadows, both of Intel, is very useful, and works through such technical issues as caching.