370CT: Parallel Programming

Dr Carey Pridgeon

2016-08-06

Created: 2017-02-08 Wed 12:40

Gustafson's law

Gustafson's law - 1

  • A law in computer science which says that computations involving arbitrarily large data sets can be efficiently parallelized.

\(S(P) = P - a(P - 1)\)

  • Where \(P\) is the number of processors, \(S\) is the speedup, and \(a\) is the non-parallelisable fraction of the process.
  • Gustafson's law states that as computing power increases, researchers will add more data and simulate more variables, giving a more accurate result.
  • This law concerns only the nature of the dataset to be processed and the program that processes it. Hardware-related speed overheads are not included.
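The formula above can be evaluated directly. A minimal C++ sketch (the function name `gustafson_speedup` is mine for illustration, not from any library):

```cpp
#include <cassert>
#include <cmath>

// Gustafson's law: S(P) = P - a * (P - 1),
// where P is the processor count and a is the serial fraction.
double gustafson_speedup(int P, double a) {
    return P - a * (P - 1);
}
// With a 10% serial fraction and 16 processors:
// S = 16 - 0.1 * 15 = 14.5
```

Note that even with a fixed serial fraction the scaled speedup keeps growing with \(P\), which is the point of the law.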

Gustafson's law - 2

  • The scalability of parallelization is limited by the non-parallelizable (serial) portion of the workload.
  • The serial fraction is the percentage of code that is not parallelized.
  • Runtime spent in non-parallel portions needs to be minimised.
  • Gustafson noted that problem sizes grow as computers become more powerful. His law applies as well to distributed computing.

Gustafson's law - 3

  • As the problem size grows, the work required for the parallel part of the problem frequently grows much faster than the serial part.
  • If this is true for a given application, then as the problem size grows the serial fraction decreases and speedup improves.
  • All this is relevant, and included for reference, but as a rule you won't use it directly. You should just keep it in mind.

Parallel Programming

Parallel Computing

  • Concurrent programs have limitations that prevent them scaling well in the presence of many cores. Because of this, parallel programming is used.
  • Parallel programming allows us to decompose a problem so that it can be executed across multiple cores simultaneously.

High level description: Fork and Join - 1

  • Fork and Join is used to distribute the workload across multiple cores.
  • Sets of worker threads, managed within a threadpool, carry out a shared task.

fork_join.png
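The fork/join pattern can be sketched in plain C++ with `std::async`. This is a hand-rolled illustration of the idea, not how any particular threadpool library implements it internally (`parallel_sum` is a hypothetical name):

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Fork: split the input range across worker tasks. Join: wait and combine.
long parallel_sum(const std::vector<int>& data, unsigned workers) {
    std::vector<std::future<long>> futures;
    size_t chunk = data.size() / workers;
    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = w * chunk;
        size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
        // Fork: each worker sums its own sub-range; no cross-talk between workers.
        futures.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin, data.begin() + end, 0L);
        }));
    }
    // Join: combine the partial results once every worker has finished.
    long total = 0;
    for (auto& f : futures) total += f.get();
    return total;
}
```

Each forked task works on disjoint data, so no synchronisation is needed until the join.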

High level description: Fork and Join - 2

  • Jobs taking place in the forks cannot cross talk.
  • Critical Sections must be minimised, or altogether absent, if your program is to be truly parallel.
  • Your algorithm must either be inherently parallel or reworked until it is.
  • No parallelisation library allows you to specify the thread code itself, but you can usually configure some threadpool behaviour.

Classes of parallel programming

Classes of parallel programming - 1

  • Coarse-grained - thread based.
    • Limited to thread-level problem decomposition.
    • Simpler to learn and implement.
    • Parallelisation can be de-activated easily.
    • Managed at compile time.

Classes of parallel programming - 2

  • Fine grained - Task based.
    • Threads are still used, but the problem is decomposed further into a series of tasks.
    • More complex to code.
    • Parallelisation cannot be easily de-activated for debugging.
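A minimal sketch of task-based decomposition using standard C++ `std::async` — the granularity is deliberately exaggerated (one task per element) to show the idea, and `square_all` is an illustrative name, not TBB code:

```cpp
#include <cassert>
#include <future>
#include <vector>

// Fine-grained decomposition: the work is split into many small tasks,
// and the runtime decides how those tasks map onto threads.
std::vector<int> square_all(const std::vector<int>& in) {
    std::vector<std::future<int>> tasks;
    for (int v : in)                           // one task per element
        tasks.push_back(std::async([v] { return v * v; }));
    std::vector<int> out;
    for (auto& t : tasks) out.push_back(t.get());
    return out;
}
```

Because the parallelism lives in the task launches rather than in one removable pragma, you cannot simply switch it off for debugging — which is the drawback noted above.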

OpenMP - 1

  • The Parallelisation manager decomposes your problem into threads.
  • You can control the number of threads.
  • You can control the number of jobs to be allocated to each thread.
  • Unless you are coding for a fixed platform, this is usually pointless.
  • Parallel regions are defined via pragma statements, e.g.

    #pragma omp parallel [region specific instructions]
    

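As a rough illustration of such a pragma in context (assuming GCC or Clang with `-fopenmp`; compiled without that flag the pragma is simply ignored and the loop runs serially with the same result, which is why OpenMP parallelisation is easy to de-activate):

```cpp
#include <cassert>
#include <vector>

// A parallel region on a loop: the iterations are split across
// the threadpool's threads. Each iteration touches only v[i],
// so there is no shared state to protect.
std::vector<double> scale(std::vector<double> v, double k) {
    #pragma omp parallel for
    for (long i = 0; i < (long)v.size(); ++i)
        v[i] *= k;
    return v;
}
```

The function name `scale` is mine; the pragma syntax is standard OpenMP.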
OpenMP - 2

  • The threading you covered in Concurrency is what gets used under the hood in OpenMP.
  • OpenMP manages threads for you in such a way that the threading interface is abstracted away, enabling easy parallel programming.

Intel Thread Building Blocks

  • Threads are still used, but the problem is decomposed further into a series of tasks.
  • The manager does not give you control over the number of threads.
  • Processors that run out of work can 'steal' tasks from the back of other processors' queues.

Task Stealing - 1

  • With multiple processors co-operating and using task stealing, things change.
  • Each processor still has a stack for storing frames whose execution has been suspended; however, these stacks are more like deques, in that suspended states can be removed from either end.
  • A processor can still only remove states from its own stack from the same end that it puts them on.
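The deque discipline described above can be sketched as a toy, single-threaded data structure — no locking or real concurrency here, and `WorkDeque` is an illustrative name, not any scheduler's actual implementation:

```cpp
#include <cassert>
#include <deque>
#include <optional>

// Toy work-stealing deque: the owning processor pushes and pops at the
// back; an idle (stealing) processor takes from the front, the opposite end.
struct WorkDeque {
    std::deque<int> tasks;
    void push(int t) { tasks.push_back(t); }      // owner only
    std::optional<int> pop() {                    // owner only, from the back
        if (tasks.empty()) return std::nullopt;
        int t = tasks.back(); tasks.pop_back(); return t;
    }
    std::optional<int> steal() {                  // thief, from the front
        if (tasks.empty()) return std::nullopt;
        int t = tasks.front(); tasks.pop_front(); return t;
    }
};
```

The owner always gets its most recently suspended state back first, while thieves take the oldest ones, which tend to represent the largest remaining chunks of work.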

Task Stealing - 2

  • Any processor which is not currently working (having finished its own work, or not yet having been assigned any) picks another processor at random through the scheduler and tries to "steal" work (suspended states) from the opposite end of that processor's stack, which the stealing processor can then begin to execute.
  • This imposes an effective upper limit of about 16 processors per thread pool before performance suffers [Limits of Work-Stealing Scheduling, Carsten Griwodz, 2009].
  • At 32 cores the stealing overhead can reach 47% of the total system load, an untenable usage penalty.

Task Stealing - 3

taskSteal1.png

Task Stealing - 4

taskSteal2.png

Task Stealing - 5

taskSteal3.png

Task Stealing - 6

taskSteal4.png

Intel Thread Building Blocks - 1

  • Parallel regions are defined via function templates, e.g.

    class classname {
      int classLocalVariables;   // per-object state shared by all iterations
     public:
      void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            // thread body
        }
      }
      classname(int classLocals) :
               classLocalVariables(classLocals)
      {}
    };
    

Intel Thread Building Blocks - 2

  • tbb::blocked_range
    • Your range is the total number of loop iterations; the blocked_range value sets how many of those iterations can be assigned to one thread (one iteration per thread is bad; splitting the range in half will get you two threads).

Chunking and Grain size briefly - 1

    parallel_for(blocked_range<size_t>(0, to_scan.size(), 100),

  • This code calls the parallel_for template, defining the range for the problem as being from 0 to to_scan.size(), and the grain size as 100.
  • The grain size defines the approximate number of array elements that will exist in each sub-range.
  • In other words, if the grain size is set to 100, then the composite work task will be divided up into subtasks where each subtask will perform the required operation on about 100 array elements.

[source -Intel online documentation]
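The arithmetic implied here is just a ceiling division. A small sketch, assuming for simplicity that the range is split into even grain-sized pieces (TBB's actual partitioners are free to split differently):

```cpp
#include <cassert>
#include <cstddef>

// With grain size g, a range of n elements is divided into roughly
// ceil(n / g) subtasks of about g elements each.
std::size_t chunk_count(std::size_t n, std::size_t grain) {
    return (n + grain - 1) / grain;   // ceiling division
}
// e.g. 1000 elements with grain size 100 -> about 10 subtasks
```

More subtasks than processors is exactly what gives the scheduler idle-time stealing targets, at the cost of some per-task overhead.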

Chunking and Grain size briefly - 2

  • Chunking is the number of iterations allocated to each thread.
  • Grain size is the amount of work each processor should do in a single chunk.
  • This gives Task Stealing the ability to distribute the work to idle processors.

Chunking and Grain size briefly - 3

  • Grain size and cost

many_overhead.png less_overhead.png

  • It really is, in almost all cases, simpler to let the auto_partitioner sort all this out for you.
  • We won't be covering manual grain-size control in TBB; that's a whole different rabbit hole and we don't have time.

This week's programming

  • Rather than start parallel programming, we'll start on assignment coding.
  • Given that having me see you make progress is a big part of the marking scheme, doing this is very important.