370CT: Parallel Programming - 4

Dr Carey Pridgeon

2016-08-06


Reduction

Reduction - 1

  • Some problems require that the final result be a combination (for example, a sum) of the outputs from every thread in a threadpool.
  • To help these scale, rather than requiring you to protect the shared total with critical sections, OpenMP (and other libraries) provide reduction functionality that avoids that overhead.
  • During reduction, each thread in a threadpool has its own copy of the value being calculated. On completion these disparate values are reduced (combined) into a single value.

Reduction - 2

#pragma omp parallel for reduction(op:sum)

Where op can be one of:

  +   -   *   &   |   ^   &&   ||

Reduction - 3

  • Each Thread
    • Has a local copy of the variable used to store the value being reduced. Unlike a plain private variable, this copy is initialised to the identity value of the chosen operator (0 for +, 1 for *).
    • Performs its calculation and stores the result in its own (private) copy of the reduction variable.
  • Once the structured block exits, these disparate values are recombined into the global copy of the reduction variable. A minimal example follows.
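A minimal sketch of the mechanism, assuming a C++ compiler with OpenMP enabled (the array contents are arbitrary, chosen only so the expected total is obvious):

#include <iostream>
#include <vector>

int main() {
  std::vector<int> data(1000, 1);   // 1000 elements, each set to 1
  long total = 0;

  // Each thread accumulates into its own private copy of 'total';
  // the copies are combined with '+' when the loop completes.
  #pragma omp parallel for reduction(+:total)
  for (int i = 0; i < static_cast<int>(data.size()); ++i) {
    total += data[i];
  }

  std::cout << "Total: " << total << std::endl;   // expect 1000
  return 0;
}

Compiled without OpenMP support the pragma is simply ignored and the loop runs serially, which is a convenient way to check the result.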

Reduction - 4

We will convert this code to use reduction.

#include <iostream>

int main() {
  const int num_steps = 100000;
  double x, sum = 0.0;
  const double step = 1.0 / double(num_steps);
  for (int i = 1; i <= num_steps; i++) {
    x = double(i - 0.5) * step;        // midpoint of the i-th interval
    sum += 4.0 / (1.0 + x * x);        // integrand of 4/(1+x^2)
  }
  const double pi = step * sum;
  std::cout << "Pi is " << pi << std::endl;
  return 0;
}

Reduction - 5

  • Put this in your base C++ code and run it serially, as is, to check that it all works.
  • The sum variable must be used as the reduction variable.
  • Add the reduction clause with the correct operator, set x as private, then compile and run it again. One possible conversion is sketched below.
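One possible conversion, offered as a sketch rather than the definitive answer (only the pragma is new; everything else is the serial code above):

const int num_steps = 100000;
double x, sum = 0.0;
const double step = 1.0 / double(num_steps);

// sum is the reduction variable; x is private so each thread gets its own copy
#pragma omp parallel for private(x) reduction(+:sum)
for (int i = 1; i <= num_steps; i++) {
  x = double(i - 0.5) * step;
  sum += 4.0 / (1.0 + x * x);
}

const double pi = step * sum;
std::cout << "Pi is " << pi << std::endl;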

Moore's Law

Moore's Law - 1

  • Moore's Law refers to an observation made by Intel co-founder Gordon Moore in 1965. He noticed that the number of transistors per square inch on integrated circuits had doubled every year since their invention.
  • Since we have effectively reached the point where a single core cannot contain enough processing power to serve our needs, we have moved to multi-core, and thus to architectures capable of concurrent or parallel processing.

Moore's Law - 2

  • With multi-core architectures, we can use concurrent programming, parallel programming, or a mixture of both to maximise resource utilisation and minimise runtime.
  • Moore's Law was always only descriptive; it was never intended to be a rigorous mathematical description of the advancement of processor development. It has simply been found to be useful over time.
  • That time is ending.

Moore's Law - 3

  • We might refer to Moore's Law to argue that a problem would be better served by running on multiple cores, since it would be too much to expect a single core to manage.
  • For example, any non-trivial producer-consumer problem would be poorly suited to a single core, since that architecture would be inadequate for the task.
  • Similarly, an array sort which could be parallelised would be a poor choice for a single-core architecture.

Moore's Law - 4

  • Concurrent operations are those which may or may not happen simultaneously, but do not act on the same data.
  • Concurrency is possible because no single operation needs to complete at any particular point, as no other thread or operation depends on its outcome.

Moore's Law - 5

  • Moore's Law tells us that, since there is an upper limit on the performance of a single core, multiple cores are required to increase the quantity of processing that can be performed.
  • That's pretty much as specific as it gets, which is why it's not really an algorithmically defined law, more an observation backed by evidence.

Moore's Law - possible examples

  • Web servers: the same operation, but unique data. They provide access to web content in separate instances (sessions) that are independent of other users of that content. Using a single core to provide this service would fail to scale.
  • Early Microsoft operating systems were unable to multi-task, and usually ran on single-core architectures, where multi-tasking would have slowed the machines significantly.

Amdahl's Law

Amdahl's Law - 1

  • Amdahl's Law, which we need not cover algorithmically, tells us how to identify the part of our program which might best benefit from optimisation.
  • It may well be, though not always, that this optimisation takes the form of serial, concurrent, or parallel computation.
  • I would say that if you are only discovering during the optimisation phase that an aspect of your program can be distributed, you have a case of very poor program design.

Amdahl's Law - 2

  • Your approach depends on the amount of time the optimisation will save.
  • Is it a fairly trivial 'out of the way' optimisation, like shifting file loading into the background on a concurrent thread? (A sketch of this follows the list.)
  • A sorting operation might be well served by a parallel or concurrent sort.
  • Array searching may also benefit from some form of threading, depending on the size of the array.
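A minimal sketch of backgrounding a file load with std::async; the file name "data.txt" and the load_file helper are hypothetical, for illustration only:

#include <fstream>
#include <future>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: read an entire file into a string.
std::string load_file(const std::string& path) {
  std::ifstream in(path);
  std::ostringstream contents;
  contents << in.rdbuf();
  return contents.str();
}

int main() {
  // Start the load on a background thread; the main thread carries on
  // (drawing menus, handling input, and so on).
  std::future<std::string> pending = std::async(std::launch::async, load_file, "data.txt");

  // ... other work here ...

  // Block only at the point where the data is actually needed.
  std::string data = pending.get();
  std::cout << "Loaded " << data.size() << " bytes\n";
  return 0;
}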

Amdahl's Law - 3

  • Anything that might slow your program (menus, GUIs) and whose effect could be mitigated by offloading onto a thread should be offloaded.
  • These may not be textbook applications of Amdahl's Law, but in practical application they are relevant.

Amdahl's Law - 4

  • When Amdahl's Law was first widely applied, parallelisation was not generally considered; the primary target was simply optimisation of the identified region where speedup could be applied. Now parallelisation is the usual route.
  • Because it did not consider parallel scaling, Amdahl's Law doesn't really scale well. A modification was therefore created: Gustafson's Law.

Gustafson's law

Gustafson's law - 1

  • A law in computer science which says that computations involving arbitrarily large data sets can be efficiently parallelized.

\(S(P) = P - a \cdot (P - 1)\)

  • Where \(P\) is the number of processors, \(S\) is the speedup, and \(a\) is the non-parallelisable fraction of the process.
  • You will not be asked to reproduce or apply this formula in an exam.
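  • As a quick worked illustration (the numbers are chosen arbitrarily): with \(P = 16\) processors and a non-parallelisable fraction \(a = 0.1\), the predicted scaled speedup is \(S(16) = 16 - 0.1 \cdot (16 - 1) = 14.5\).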

Gustafson's law - 2

  • Gustafson's law states that increases in computing power will prompt researchers to add more data and simulate more variables, giving a more accurate result.
  • (In simpler terms, computer scientists will always seek to add more detail to their simulations if the power and storage to do so are available.)
  • This law is entirely concerned with the nature of the dataset to be processed and the program that processes it.

Gustafson's law - 3

  • Hardware-related speed overheads are not included.
  • This means improvements brought about by using GPUs are not accounted for, so there are limitations to the application of this law.
  • But it is a guideline, not an exact measure, and should be treated as such.

Gustafson's law - 4

  • The scalability of parallelization is limited by the non-parallelizable (serial) portion of the workload.
  • So, in any program that is to rely heavily on parallel or distributed processing (Gustafson's law really doesn't distinguish between these), the purely serial portions must be minimised.
  • Put another way, applying the ideas embodied in Amdahl's Law, shifting as much of the work as possible out to concurrency is a good thing.

MPI

Introduction

  • MPI (Message Passing Interface) is a communication system used on distributed systems (clusters).
  • We will use OpenMPI, an open source implementation of the MPI standard specification (which all implementations must follow).
  • You can read the standard if you want, and have a couple of weeks spare, but you don't need to in order to pass this module.
  • OpenMPI is maintained by a consortium of academic and industry partners.
  • MPI is not a language itself; it's a library that compilers for various languages (C, C++, Python, Java, Fortran) can use to enable cluster computation.

Message Passing

  • MPI works via a message passing system. Messages contain data when sent by the user. MPI uses the same message system to handle its inter-node communication, but we won't cover that.
  • One node (usually the one that initially loads the program) manages the distributed program by sending and receiving these messages and collating the results. A minimal sketch follows this list.
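A minimal sketch of that pattern, assuming the OpenMPI headers are installed; the payload each worker sends is made up purely for illustration:

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes in total?

  if (rank == 0) {
    // The managing node collects one integer from every other rank.
    int total = 0;
    for (int src = 1; src < size; ++src) {
      int value = 0;
      MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      total += value;
    }
    std::cout << "Rank 0 collated a total of " << total << std::endl;
  } else {
    // Each worker sends a stand-in result back to rank 0.
    int value = rank * 10;
    MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}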

Inter Node Comms

  • MPI isn't hard to set up, and will work on any connected set of computers that share the same operating system. MPI uses ssh to communicate between nodes, and all nodes should be on the same subnet.
  • This isn't a strict requirement for MPI, but if the computers aren't on the same subnet or in the same building, you will incur network overheads.
  • Your accounts are already set up so SSH will work with MPI across the cluster, so don't alter them.

Usage

Basic Usage - 1

  • MPI has a wrapper for gcc, called mpic++. Using this to compile MPI code builds in all dependencies.
  • The resulting programs have to be passed as parameters to mpirun.
  • Programs passed to mpirun need not have been compiled for MPI, but they usually are.
  • For mpirun to work you need to give it a list of machines set up to communicate using MPI's protocols, and some other experiment-specific instructions. A rough example follows this list.
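A rough example of the workflow (the program name and the hostfile name are made up; use whatever the lab instructions specify):

mpic++ -o my_prog my_prog.cpp
mpirun -np 4 --hostfile machines ./my_prog

Here -np sets how many processes to launch, and --hostfile points at a plain text list of machine names, one per line.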

Basic Usage - 3

  • MPI runs a duplicated copy of the same process across all the compute nodes.
  • The process is duplicated automagically; what we control is the distribution of the data to be processed within that distributed program.

    [Figure: mpi.png]

No Parallel Programming with MPI

Why Not

  • MPI can do its thing quite easily on a multi-core system, as you may well see once you play with the exercises.
  • Working on a multi-core system means parallel, not distributed, and you can mix the two paradigms easily with MPI.
  • We don't want to, though, because MPI's strength is distribution, and it can be used in conjunction with OpenMP, a library that is aimed specifically at parallelism.

Exercise 1

intro to mpirun

  • Go through exercise_d1 to get an initial idea of how MPI works over the cluster.
  • We will start MPI properly next week.