370CT: Revision lecture

Dr Carey Pridgeon

2016-08-06 Tue

Created: 2017-04-02 Sun 21:29

General Exam Advice

Preparation - 1

  • Stop revising the day before the exam; you need to absorb what you know, not try to cram more in. Give your brain time to organise what you've learned.
  • Read the exam paper, all four questions, and rank them by difficulty, then leave the easiest one till last.
  • Do not rush, and try to attempt all the questions. Even a half-right answer can get some marks.

Preparation - 2

  • Read through the entire paper once you've finished to check your answers.
  • Drawings are acceptable if you can't think of the exact words to describe the answer.
  • I do not guarantee that these slides cover 100% of what is needed for the exam; for that you will need to have attended the lectures and done the exercises as advised.
  • That said, alongside those, they will provide all the revision notes you should require.
  • Before the exam, Nostromo will remain available.

Revision

Moore's Law - 1

  • Moore's law refers to an observation made by Intel co-founder Gordon Moore in 1965. He noticed that the number of transistors per square inch on integrated circuits had doubled every year since their invention.
  • Moore's Law is about transistor density and the size of the paths in chips.
  • 5 nanometres is roughly the limit; at or below that scale, quantum tunnelling becomes a major problem.
  • Since we have effectively passed the point where a single core can contain enough processing power to serve our needs, we have moved to multi-core, and thus concurrent or parallel processing capable, architectures.

Moore's Law - 2

  • With multi-core architectures, we can use concurrent or parallel programming, or a mixture of both, to maximise resource utilisation and minimise runtime.
  • Moore's Law was always only descriptive; it was never intended to be a rigorous mathematical description of the advancement of processor development. It has simply been found to be useful over time.
  • That time is ending.

Moore's Law - 3

  • We might refer to Moore's Law to argue that a problem would be better served by running on multiple cores, since it would be too much to expect a single core to manage.
  • For example, any non-trivial producer-consumer problem would be entirely unsuited to a single core, since that architecture would be inadequate for the task.
  • Similarly, an array sort which could be parallelised would be a poor choice for a single-core architecture.

Moore's Law - 4

  • Concurrent operations are those which may or may not happen simultaneously, but do not act on the same data.
  • Concurrency is obtained because one operation need not complete at any particular point, as no other thread or operation is dependent on its outcome.

Moore's Law - 5

  • Moore's Law tells us that, since there is an upper limit on the performance of a single core, multiple cores are required to increase the quantity of processing that can be performed.
  • That's pretty much as specific as it gets, which is why it's not really an algorithmically defined law, more an observation backed by evidence.

Moore's Law: possible examples

  • Web servers – same operation, but unique data. They provide access to web content in separate instances (sessions) that are independent of other users of that content. Using a single core to provide this service would fail to scale.
  • Early Microsoft operating systems were unable to multitask, and usually ran on single-core architectures, where multitasking would have slowed the machines significantly.

Amdahl's Law

Amdahl's Law - 1

  • Amdahl's Law, which we will not cover algorithmically, tells us how to identify the part of our program which might best benefit from optimisation.
  • It may well be, though not always, that this optimisation can take the form of serial, concurrent, or parallel computation.
  • I would say that if you are only discovering during the optimisation phase that an aspect of your program can be distributed, you have a case of very poor program design.

Amdahl's Law - 2

  • Your approach depends on the amount of time the optimisation will save.
  • Is it a fairly trivial 'out of the way' optimisation, like shifting file loading into the background onto a concurrent thread? (A sketch of this follows below.)
  • A sorting operation might be well served by a parallel or concurrent sort.
  • Array searching may also benefit from some form of threading, depending on the size of the array; it could simply be that you need to code it more efficiently.
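  • A minimal sketch of that kind of 'out of the way' offload using plain C++ threads (OpenMP could do the same job); the loadFile helper and the filename are invented for illustration:
#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Hypothetical loader: reads a whole file into memory.
std::vector<char> loadFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

int main() {
    std::vector<char> data;

    // Push the file load onto a background thread so the main thread
    // can get on with other, independent work in the meantime.
    std::thread loader([&data] { data = loadFile("level1.dat"); });

    // ... other work that does not need 'data' yet ...

    loader.join(); // wait for the load to finish before using 'data'
    return 0;
}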

Amdahl's Law - 3

  • Anything that might slow your program (menus, GUIs) and whose effect could be mitigated by offloading into threading should be offloaded.
  • These may not be the textbook applications of Amdahl's Law, but in practical terms they are relevant.

Amdahl's Law - 4

  • When Amdahl's Law was first widely applied, parallelisation was not generally considered; the primary target was more traditional serial optimisation of the region identified as the one to which speedup could be applied.
  • Since parallelisation wasn't considered, Amdahl's Law as it stands doesn't really scale well. Therefore a modification was created: Gustafson's law.

Gustafson's law

Gustafson's law - 1

  • A law in computer science which says that computations involving arbitrarily large data sets can be efficiently parallelized.

Gustafson's law - 2

  • Gustafson's law states that increases in computing power will prompt researchers to add more data, to simulate more variables more fully, giving a more accurate result.
  • (In simpler terms, Computer Scientists will always seek to add more detail to their simulations if the power and storage to do so is available.)
  • This law is concerned entirely with the nature of the dataset to be processed and the program that processes it. The hardware used and communication latency are not considered, as these vary too much.

Gustafson's law - 3

  • This means improvements brought about by using GPUs are not accounted for, so there are limitations to the application of this law.
  • But it is a guideline, not an exact measure, and should be treated as such.
  • You could drill down to a more exact measure on a per experiment level, but as few people do, you would have little to compare your data against.

Gustafson's law - 4

  • The scalability of parallelization is limited by the non-parallelizable (serial) portion of the workload.
  • So, in any program that is to rely heavily on parallel or distributed processing (Gustafson's law really doesn't distinguish between these), the purely serial portions must be minimised.
  • Put another way, applying the ideas embodied in Amdahl's Law, shifting as much of the work as possible out to concurrency is a good thing.

OpenMP Clauses

Pragmas - 1

  • A pragma is a directive used during compile time to control what the compiler is to do with a given block of code.
  • Outside of parallel programming I mostly use them to specify compile time inclusion or exclusion of code.
  • In OpenMP, pragmas are used to tell the compiler how to incorporate your code with OpenMP.

Pragmas - 2

#pragma omp parallel
  • Declares that the code scope that follows this pragma will be compiled as a parallel block.
  • Scope here is whatever normally defines scope in the language used (in C++, a brace-delimited block).
  • What type of parallel block it becomes is determined by further options.

Pragma practice

  • Using the provided base code, run this.
#pragma omp parallel 
  {
   int i = omp_get_thread_num();
   std::cout << i << std::endl;
  }
  • If all goes well you should have a program that prints out the number of each thread that OpenMP makes.
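  • If you don't have the base code to hand, a self-contained equivalent might look like this (with g++ you would compile with -fopenmp; the critical section is only there to stop the output lines interleaving):
#include <omp.h>
#include <iostream>

int main() {
    #pragma omp parallel
    {
        int i = omp_get_thread_num(); // each thread asks for its own id
        #pragma omp critical
        std::cout << "Hello from thread " << i << std::endl;
    }
    return 0;
}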

OMP Options

  • Basic extensions to the parallel directive:

    #pragma omp parallel for
    
  • This tells OpenMP to convert the next code block (which must be a for loop with no breakouts) into a threadpool.
#pragma omp parallel sections
  • This allows different functions to be run in the threads of a threadpool.
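  • A minimal sketch of both forms; the loop body and the two placeholder functions are invented for illustration:
#include <omp.h>
#include <iostream>

void taskA() { std::cout << "task A\n"; } // stand-in functions,
void taskB() { std::cout << "task B\n"; } // invented for illustration

int main() {
    int data[1000];

    // parallel for: the loop iterations are divided among the threadpool
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        data[i] = i * i;
    }

    // parallel sections: each section is run by a thread of the pool
    #pragma omp parallel sections
    {
        #pragma omp section
        taskA();
        #pragma omp section
        taskB();
    }

    std::cout << data[999] << std::endl;
    return 0;
}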

Barriers - 1

  • Each threadpool is followed by a barrier. All threads wait until every other thread has reached this barrier; only then can the threadpool pass through it and the program continue.
  • Barriers are implicit; they are always present, so they must be disabled if you don't want to use them.
  • To disable one, you add a nowait clause to the worksharing construct (for or sections) inside the parallel region.

Barriers - 2

  • Barriers can be disabled for for-loop regions if no following code depends on the output of the threadpool.
  • Barriers can be disabled for sections if you want a function launched by one to have a longer runtime (e.g. a menu thread), or if no following code depends on the output.
#pragma omp sections nowait
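  • A sketch of where the nowait clause actually attaches: it belongs on the for or sections construct inside an enclosing parallel region, not on the combined parallel form. The two placeholder functions are invented for illustration:
#include <omp.h>
#include <iostream>

void backgroundMenu() { /* hypothetical long-running menu loop */ }
void shortJob()       { /* hypothetical quick task */ }

int main() {
    int data[100];

    #pragma omp parallel
    {
        // No barrier after this loop: threads that finish early move
        // straight on to the sections below instead of waiting here.
        #pragma omp for nowait
        for (int i = 0; i < 100; i++) {
            data[i] = i * 2;
        }

        #pragma omp sections nowait
        {
            #pragma omp section
            backgroundMenu();
            #pragma omp section
            shortJob();
        }
    } // the parallel region itself still ends with an implicit barrier

    std::cout << data[99] << std::endl;
    return 0;
}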

Variable settings

Variable settings

OpenMP lets you assign properties to variables that affect how they function within the structured block.

Shared

#pragma omp parallel shared (x)
  • In OpenMP, all variables are shared by default in parallel regions, unless declared within the scope of a given parallel region.
  • Explicitly specifying them as such leads to clearer code that is easier to debug.
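  • A minimal sketch of shared in use; the atomic pragma is just to keep the concurrent updates to the single shared x safe:
#include <omp.h>
#include <iostream>

int main() {
    int x = 0; // declared outside the region, so shared by default anyway

    // shared(x) makes the default explicit: every thread sees and
    // modifies the same x, so the update needs protecting.
    #pragma omp parallel shared(x)
    {
        #pragma omp atomic
        x += 1;
    }
    std::cout << "x = " << x << std::endl; // one increment per thread
    return 0;
}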

Private

#pragma omp parallel private (x)
  • This type of variable, again single or compound type, is duplicated and provided uninitialized to each thread in the block.
  • By default these are used internally by each thread then discarded. If this isn’t the required result, there are other options.
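  • A sketch of private in use; each thread prints its own throwaway copy:
#include <omp.h>
#include <iostream>

int main() {
    int x = 42;

    // Each thread gets its own uninitialized copy of x; the copies are
    // used internally by the thread and then discarded.
    #pragma omp parallel private(x)
    {
        x = omp_get_thread_num(); // must be assigned before it is read
        #pragma omp critical
        std::cout << "my private x = " << x << std::endl;
    }
    return 0;
}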

Firstprivate

#pragma omp parallel firstprivate (x)
  • Declares one or more list items to be private to each thread or task.
  • Each thread's private copy is initialised with the value the variable held just before the structured block began.
  • So a value can be set once outside the block and then used as a starting point inside every thread, without the threads interfering with the original.
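  • A sketch of firstprivate; each private copy starts from the value x held before the block:
#include <omp.h>
#include <iostream>

int main() {
    int x = 10;

    // Each thread's private copy starts out holding the value x had
    // before the block began (10 here), unlike plain private.
    #pragma omp parallel firstprivate(x)
    {
        x += omp_get_thread_num(); // works on an initialised private copy
        #pragma omp critical
        std::cout << "thread " << omp_get_thread_num()
                  << " sees x = " << x << std::endl;
    }
    return 0;
}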

Lastprivate

#pragma omp parallel for lastprivate (x)
  • Declares one or more list items to be private to each thread (lastprivate is only valid on worksharing constructs such as for and sections, hence the combined form above).
  • After the structured block completes, the original variable is updated with the value from the sequentially last loop iteration (or lexically last section).
  • The private copies start uninitialized; on completion the original holds whatever that last iteration or section left in its copy.
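  • A sketch of lastprivate on a parallel for loop:
#include <omp.h>
#include <iostream>

int main() {
    int x = 0;

    // After the loop, x holds the value from the sequentially last
    // iteration (i == 99), regardless of which thread ran it.
    #pragma omp parallel for lastprivate(x)
    for (int i = 0; i < 100; i++) {
        x = i * i;
    }
    std::cout << x << std::endl; // 9801
    return 0;
}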

threadprivate

#pragma omp threadprivate (x)
  • The threadprivate directive (a directive in its own right, applied to a global, file-scope, or static variable, rather than a clause on parallel) makes that variable local and persistent to each thread across the execution of multiple parallel regions.
  • This allows some simplification of programming logic when privates are used repeatedly.
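  • A sketch of threadprivate on a file-scope variable; the persistence between the two regions assumes the thread count does not change:
#include <omp.h>
#include <iostream>

int counter = 0;                   // file-scope variable
#pragma omp threadprivate(counter) // each thread keeps its own persistent copy

int main() {
    // First region: every thread bumps its own copy of counter.
    #pragma omp parallel
    counter += 1;

    // Second region: each thread's copy has persisted from the first
    // region (assuming the same number of threads is used again).
    #pragma omp parallel
    {
        #pragma omp critical
        std::cout << "thread " << omp_get_thread_num()
                  << " counter = " << counter << std::endl;
    }
    return 0;
}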

Parallel Programming

Parallel Computing

  • Concurrent programs have limitations that prevent them scaling well in the presence of many cores. Because of this, parallel programming is used.
  • Parallel programming allows us to decompose a problem so that it can be executed in parallel across multiple cores simultaneously.

High level description: Fork and Join - 1

  • Fork and Join is used to distribute the workload across multiple cores.
  • Sets of worker threads do a shared task, managed within a threadpool.

[Figure: fork_join.png, the fork and join model]

High level description: Fork and Join - 2

  • Jobs taking place in the forks cannot cross-talk.
  • Critical sections must be minimised, or altogether absent, if your program is to be truly parallel.
  • Your algorithm must either be inherently parallel or be hacked about with until it is.
  • No parallelisation library allows you to specify the individual thread code directly, but you can usually specify some threadpool behaviour.

Reduction and Sub Communicators

MPI::COMM_WORLD

  • Comm World handles communication between all the nodes (single computers, or processing units if you are using multiple cores per machine). Each node can talk to any other node, but only via Comm World.
  • When we send a message, we send it to Comm World.
  • When we receive it, we receive it not from the source node, but from Comm World, from the buffer it reserved for the origin node.
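  • A minimal point-to-point sketch (not from the module code) of one node sending a float to another through the world communicator, using the same C-style calls as the reduce example later:
#include <mpi.h>
#include <iostream>

// Run with at least two processes, e.g. mpirun -np 2 ./a.out
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        float value = 3.14f;
        MPI_Send(&value, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        float value;
        MPI_Recv(&value, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::cout << "rank 1 received " << value << std::endl;
    }

    MPI_Finalize();
    return 0;
}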

Reduction - 1

  • Reduction works much like it does in OpenMP, but with a few significant differences.
  • That is to say, the result is the same, but the means by which the result is achieved is substantially different, and would be even if we used MPI only on a single machine in an entirely parallel mode.

Reduction - 2

  • We do need to talk about the reduce function parameters a little
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm)
  • This is from the code I've given you
MPI_Reduce(&my_sum,&sum,1,MPI_FLOAT,MPI_SUM,0,MPI_COMM_WORLD);
  • Random wittering about significance of various parameters follows.
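  • Matching that call against the prototype, parameter by parameter:
MPI_Reduce(&my_sum,         // sendbuf:  each node's local partial result
           &sum,            // recvbuf:  where the combined result lands (root only)
           1,               // count:    one element per node
           MPI_FLOAT,       // datatype: the elements are floats
           MPI_SUM,         // op:       combine by summing (see the list below)
           0,               // root:     rank 0 receives the final result
           MPI_COMM_WORLD); // comm:     reduce across the whole world communicator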

Reduction - 3

  • [ MPI_MAX] maximum
  • [ MPI_MIN] minimum
  • [ MPI_SUM] sum
  • [ MPI_PROD] product

Reduction - 4

  • [ MPI_LAND] logical and
  • [ MPI_BAND] bit-wise and
  • [ MPI_LOR] logical or
  • [ MPI_BOR] bit-wise or

Reduction - 5

  • [ MPI_LXOR] logical xor
  • [ MPI_BXOR] bit-wise xor
  • [ MPI_MAXLOC] max value and location
  • [ MPI_MINLOC] min value and location

Sub Dividing the World Communicator

SubGroups - 1

  • We give MPI its total list of available nodes on startup; thereafter we can split them up.
  • I have no exercises to demonstrate this; transferring data and your assignment are probably quite enough at this point.
  • Why do it?
  • Sometimes you might have a lot of nodes in your cluster, and more than one problem to solve at once, so you create sub groups and assign tasks to them.
  • Each sub group has its own master node.
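  • Although there are no exercises for this, a minimal sketch of the splitting call (in the spirit of the tutorial linked on the next slide) might look like the following; splitting by even and odd rank is an arbitrary choice for illustration:
#include <mpi.h>
#include <iostream>

// Split MPI_COMM_WORLD into two sub-communicators by even/odd rank.
// Each sub group gets its own ranks, and so its own rank-0 master.
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int colour = world_rank % 2; // nodes with the same colour end up together
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &sub_comm);

    int sub_rank;
    MPI_Comm_rank(sub_comm, &sub_rank);
    std::cout << "world rank " << world_rank
              << " -> group " << colour
              << ", sub rank " << sub_rank << std::endl;

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}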

SubGroups - 2

[Figure: comm_split.png, from http://mpitutorial.com/tutorials/introduction-to-groups-and-communicators/]

SubGroups - 3

  • This is why all MPI functions have a communicator parameter: because you can have more than one.

MPI Datatypes

MPI does not do Pointers

  • Pointers are machine specific; they are always relative to some location on a specific machine.
  • Move to another machine, and the address becomes much harder to interpret.
  • Not impossible perhaps, but pointlessly time-wasting, so MPI doesn't do it.

MPI does not do Objects

  • Objects are already fairly complicated containers, so rather hard for any language or library to serialise and transmit.
  • MPI cannot understand objects at all. You could not, for instance, send a std::string from one node to another using MPI, although you could send the contents of a std::string.
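  • For example, a sketch of shipping the contents of a std::string, length first and then the raw characters; the helper names and tag values are invented for illustration:
#include <mpi.h>
#include <string>
#include <vector>

// You can't send a std::string object, but you can send its characters.
void sendString(const std::string& s, int dest) {
    int len = static_cast<int>(s.size());
    std::vector<char> chars(s.begin(), s.end());       // copy out the raw characters
    MPI_Send(&len, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);            // length first
    MPI_Send(chars.data(), len, MPI_CHAR, dest, 1, MPI_COMM_WORLD); // then the chars
}

std::string recvString(int source) {
    int len = 0;
    MPI_Recv(&len, 1, MPI_INT, source, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::vector<char> buf(len);
    MPI_Recv(buf.data(), len, MPI_CHAR, source, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return std::string(buf.begin(), buf.end());         // rebuild the string locally
}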

MPI can't do Structs either

  • The contents of a struct are always positioned relative to the beginning of that struct.
  • The beginning position of the struct is unknown to the destination computer, and MPI has no means to transmit this information, since it would only be a relative position, and thus meaningless.

It's all relative really

  • These are mostly manifestations of the same fundamental problem: relative position in physical memory.
  • Processes are allocated a virtual memory block, whose addresses mean nothing outside that block.
  • MPI sidesteps this problem by transforming the data structures we are used to into blocks consisting of the data, its position relative to other pieces of data, its type, and the other housekeeping information required to convert it to and from the data structures our program code on the nodes uses.

Random scribbling on the board to attempt to demonstrate my point

  • why are you reading the slide, look at the board…..

MPI Datatypes

MPI composite type creation - 1

  • Starting with a struct like this
struct something {
 int awholenumber;
 int anotherwholenumber;
 int yetanotherwholenumber;
 bool atrueornotthing;
 float afloatynumber;
 float asecondfloatynumber;
 double alongerfloatynumber;
 double asecondlongerfloatynumber;
 char  awholelistofletters[10];
};

MPI composite type creation - 2

  • You need to extract the structural information needed to read information from, and write information to, datatypes in C++ that match the one you're using.
  • So we need to know what types there are (expressed as their MPI equivalents), what order they can be found in (this is very important), and how many of each to expect.
const int count = 5;  
MPI::Datatype typesInStruct[count] = {MPI::INT, MPI::BOOL, MPI::FLOAT, MPI::DOUBLE, MPI::CHAR};
int arrayBlockLengths [count] = {3,1,2,2,10};

MPI composite type creation - 3

  • Finally we need memory locations relative to some position MPI can understand. In this case, the start of a struct.
  • This might not always be 0, but it only needs to be a relative position; MPI can work the rest out to read or fill in the struct correctly, because it will only ever be doing it one variable or struct at a time.

MPI composite type creation - 4

//Specify the starting memory address of each block,
//relative to the start of the struct
MPI::Aint objAddress;
MPI::Aint arrayDisplacements[count];
something sbuf; //Just has to be an instance of the struct type,
                //not the one you're actually sending
objAddress = MPI::Get_address(&sbuf);
arrayDisplacements[0] = MPI::Get_address(&sbuf.awholenumber)        - objAddress;
arrayDisplacements[1] = MPI::Get_address(&sbuf.atrueornotthing)     - objAddress;
arrayDisplacements[2] = MPI::Get_address(&sbuf.afloatynumber)       - objAddress;
arrayDisplacements[3] = MPI::Get_address(&sbuf.alongerfloatynumber) - objAddress;
arrayDisplacements[4] = MPI::Get_address(&sbuf.awholelistofletters) - objAddress;

MPI composite type creation - 5

  • Now we have all the pieces, we can create our MPI datatype to transport our data over the MPI network:

    MPI::Datatype mpiHostStruct;
    mpiHostStruct = MPI::Datatype::Create_struct(count,arrayBlockLengths,arrayDisplacements,typesInStruct);
    mpiHostStruct.Commit();
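  • Once committed, the new type can be used like any built-in MPI datatype. A sketch of sending one instance of the struct (the destination rank and tag are invented for illustration):

    something payload;         // the struct instance you actually want to send
    payload.awholenumber = 7;  // ... fill in the rest as needed ...

    // Send one whole struct to rank 1, tag 0, using the committed type.
    MPI::COMM_WORLD.Send(&payload, 1, mpiHostStruct, 1, 0);

    // On the receiving node:
    something incoming;
    MPI::COMM_WORLD.Recv(&incoming, 1, mpiHostStruct, 0, 0);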