Parallel Programming - Basic Theory
-
Preface
-
Entities Of Parallel Systems
-
Processes
-
Resources
-
Schedulers
-
Synchronizers
-
Short Terms Dictionary
-
Kernel
-
The system call interface
-
The use of system calls
-
Accessing Resources
-
Mutual Exclusion
-
Deadlock
-
Starvation
-
Race Condition
-
Busy Wait
-
POSIX
1003.1b (formerly 1003.4) Functions
-
Handling Events And Notifications
-
Asynchronous Notification
-
Synchronous Notification
-
Event Loop
-
Process Scheduling
-
Process Preemption
-
Context Switch
-
Round-Robin Scheduling
-
Prioritized Scheduling
-
Load Balancing
-
Commands Pipe lining
-
Implementations Of Parallel Systems
-
Implementations In Software
-
Multi Processes
-
Multi Threading Kernels
-
Multi Threading Libraries
-
Implementations In Hardware
-
SMP - Symmetric Multi-Processing
-
Clustering
-
Tools And Methods For Parallel Applications
Development
Preface
Writing sequential programs is the what most, if not all, programmers
are being initially trained to do. At a certain phase in their life
they discover that there are other models for programs. One of them
is using parallelism. This means that instead of having your program
carried out in a sequence, one instruction at a time, it is being executed
by several different entities simultaneously. This can sometimes make the
programs simpler to design, and may also run faster then a matching sequential
program.
For example, if you have a computer with several processors,
they might each be running a part of the program simultaneously, and thus
complete it faster then if only one processor would have to run the
whole program.
Another example is if you have a program that needs to read data from
a disk, and meanwhile make some heavy calculations on it. Since the
disk can transfer data to memory without intervention of the CPU, it would
make sense to split your program into two parts running in parallel: one
handles I/O, and reads data into memory. The other does the heavy calculations.
You could do it all in one sequential part, that majorly does the calculations,
and from time to time goes to read another block of data, but it is harder
to write it this way, or to be efficient (how will the program know when
the last block of data was read, and it is time to read another block?)
Entities Of Parallel Systems
A parallel system is a system (software and/or hardware) that allows
one to write programs whose different parts are carried out in different
threads of execution.
In order to better understand what a parallel (or parallelic) system
is, we should check what are the different components such a system is
made of.
Processes
A process is an entity that executes a computer program, and manipulates
all other resources in order to fulfill the mission of the program. Several
(or many) processes may be running on the same computer, or on different
computers. Usually a single process is running on a single computer. Also,
usually each process has its own address space, as well as a few other
private resources (an private execution stack, for example).
Resources
A resource is an entity that can be used by a process to perform some specific
operation. There could be physical resources, as well as logical (or virtual)
resources.
A physical resource could be, for example, a disk, which is
used for saving files. Or a keyboard, which is used to read data from the
user.
A logical resource could be a shared memory area, which several
processes may access in order to read data to, or read data from.
Schedulers
A scheduler is an entity specifying when processes will run, on which
CPUs (executors) they will run, and in what order. Such a scheduler
may be controlling a single computer, or several computers (see clustering
below).
Synchronizers
A synchronizer is an entity used to control the access of processes
to resources. Some synchronizers are used to make sure only a single
process uses a given resource at any one time. Other synchronizers are
used to make sure a given resource is accessed in a given order of priority
by different processes. There is a host of other types of synchronizers
to be dealt with. You will most likely bump into synchronizers such as
Mutexes, Semaphores, Monitors, Conditions, etc.
Short Terms Dictionary
Kernel
Kernel is the nucleus of the operating system:
-
Management of the virtual processors (scheduling)
-
IPC (inter process communication)
-
Interrupt handling
-
I/O device management
-
Memory management
For
ex. file management is outside the kernel.
The operating system of the embedded systems is mostly a kernel.
The most crucial resource that is under of the control of the operating
system is processor(s).
The
system call interface
The system call interface defines
all the services of the operating system to the programs.
The basis of the different
variants of the UNIX operating system is different system call interface.
SVID (System V interface
definitions )
POSIX standard (IEEE)
POSIX 1003.4 is API (Application
Program Interface) for the real time systems.
The use
of system calls
An
example read system call (c programming language):
n = read(file_descriptor,buffer_address,number_of_bytes);
n
is return value (how many characters actually read or error 1).
To
save testing time, inspect always the return code!
if(( n = read( fd,buf,numb )) == -1)
{
printf("Read error 1\n") ;
exit(1);
};
You
can find a descriptive text of the error occurred:
#include <stdio.h>
.......
extern int errno,sys_nerr;
extern char *sys_errlist[];
.......
fprintf(stderr,"%s",sys_errlist[errno];
.......
System
calls are run in the system state (also privileged instructions).
Accessing Resources
-
Mutual Exclusion
-
A mechanism used to make sure no two processes are trying to use a resource
at the same time. Used to avoid corrupting the internal state state
of this resource. (mantra - that great big nurse sitting in front of the
doctor's room, making sure no one gets in until the previous person comes
out).
-
Deadlock
-
A situation in which a group of two or more processes are all waiting
for a set of resources that are currently taken by other processes in the
same group, or waiting for events that are supposed to be sent from other
processes in the group. (mantra - when was the last time you were anxiously
dialing to your boyfriend, finding his phone line constantly busy, while
he was trying to call you back all this time? Each one of you was making
one phone busy, while trying to dial the other's phone's number. If none
of you would have let go of his/her phone, you'd be in a deadlock. The
phone is, indeed, a valuable resource).
-
Starvation
-
A situation where a process that tries to access some resource is never
granted access to that resource (mantra - think of the last time you
went into your fast-food restaurant, approached the counter, and were always
pushed back to the end of the line by the crowd, verbally "starving to
death").
-
Race Condition
-
A situation in which two (or more) processes are doing some competing operations
at the same time, and the results might come up screwed up due to collisions.
(mantra - think of two people trying to get into the same door at the same
time, jamming each other's way in).
-
Busy Wait
-
A situation in which a process that is waiting for a resource to become
free, enters a loop of constantly polling the resource in order to find
if it is free. In the process it consumes all available CPU power to
perform its constant polls. (mantra - when you are waiting for the cable
guy to come and connect you new apartment, and keep looking out the window
all the time to see when the technician arrives).
Process Scheduling
-
Process Preemption
-
An event of the system suspending the execution of a process that is in
the middle of putting the CPU to good use, in order to let other processes
run. This makes sure no single process will take over the full CPU, even
if it gets into an infinite loop and does not make any blocking system
calls.
-
Context Switch
-
The method by which the operating system makes one CPU suspend the execution
of on process, and continue executing a second process. This is called
a context switch since the CPU needs to switch to using the full context
of the second process - its execution stack, its memory area, the values
the registers contained when last executing this process, etc. A context
switch may occur when the running process blocks waiting for some resource
to become free, when it yields the CPU voluntarily, or when it is being
preempted by the operating system's scheduler. (mantra - you talk to your
mom on the phone, while trying to watch a very interesting T.V. program
- you switch your attention between you T.V. and your mom constantly, always
taking a short while to remember where you were before the last switch).
-
Round-Robin Scheduling
-
A scheduling algorithm in which several processes are being executed by
an executed one after the other in turn, each getting the same amount of
time to run. After this time slice passes, a context-switch is made, where
the first process on the list gets to run next, while the suspended process
is being put back at the end of the list. (mantra - when the ideal teacher
in class makes sure every pupil gets a chance to speak in turn - well,
as if that ideal teacher exists...).
-
Prioritized Scheduling
-
A scheduling algorithm in which a priority is assigned to each process,
and when several processes are waiting for the CPU to get executed, the
one with highest priority is being run, usually until it gets blocked waiting
for some resource, or until a process with a higher priority wakes up and
demands CPU time. (mantra - on a T.V. debate, some people are given higher
priority then others, and interrupt everybody else when they have something
to say. On Israeli T.V. - see under 'Popolitica').
-
Load Balancing
A method by which several tasks are scheduled for separate CPUs or
computers, based on the current load on each CPU. Sometimes this includes
the ability to migrate processes from one CPU (or computer) to another
during its execution.
Different scheduling algorithms
-
FREE algorithms / process is not interrupted
-
TIME algorithms for ex. ROUND/ROBIN algorithm / a constant slice of time
to the process
-
MLQ (multi level queue) is a TIME algorithm / multiple queues for processes
last queue is ROUND/ROBIN (UNIX, DEC VMS, IBM VMS)
-
PRE- EMPTIVE algorithms / lower priority process is interrupted (real-
time systems)
Task (process) switch:
The scheduler task saves all hw registers of the running task and restores
registers of the new running task from the memory saving area.
The operating system has a data structure called PCB (process control
block) to save the state of every process.
Data to be saved is at least IP (instruction pointer ) or also called
PC (program counter) and stack pointer.
Many tasks can share the same code segment every process must have own
:
-
PCB
-
Data segment
-
Stack (segment)
PCB list

POSIX
1003.1b (formerly 1003.4) Functions
Scope:
This standard defines a standard operating system interface and environment
to support application
portability at the source code level. It is intended to be used
by both application developers and system
implementors.
(1)
Terminology, concepts, and definitions and specifications that govern structures,
headers, environment
variables, and related requirements
(2)
Definitions for system service interfaces and subroutines
(3)
Language specific system services for the C programming language
(4)
Interface issues, including portability, error handling, and error recovery
Not:
user interface, graphics interfaces, dbms, ...
Real- time operating system
components:
1.
Timers:
The
ability to set and read high resolution internal timers
2.
Priority Scheduling:
A
priority based pre-emptive scheduling
3.
Shared memory:
The ability to map common physical space into independent process specific
virtual spaces
4.
Real time files:
The ability to create and access files with deterministic performance
5.
Semaphores:
Efficient synchronization primitives ( P and V operations)
6.
Inter process communication:
Synchronous and asynchronous message passing capabilities with facilities
for flow and resource
control
7.
Asynchronous event notification:
A mechanism which provides queuing capabilities, deterministic delivery
and minimal data passing
8.
Process memory locking:
The ability to guarantee memory residence for sections of a process virtual
address space
9.
Asynchronous I/O :
The ability to overlap applications processing and I/O operations initiated
by the application
10.
I/O Synchronized:
The
ability to establish assurance of I/O completion at different logical levels
of completion
(Synchronous I/O - we wait
data from disk/ Synchrozined I/O is different matter )
The POSIX 1003.4 will also
identify the real-time performance metrics.
The
process dispatch latency time:
"The time from which an asynchronous device interrupts the machine causing
the highest priority process
to execute until that process returns from the system call that caused
it to block on that external device".
Interrupt
latency: time from interrupt to the first instruction in the service routine.
Context
switch time:
Different meanings in litterature!
Only time to save registers of the old process and restore new or
same as "dispatch latency"
An example of a hard real time system is a digital fly-by-wire
control system of an aircraft:
No lateness is accepted under any circumstances, otherwise the aircraft
is not controllable.
Useless results if late, if the control system does not respond timely,
the result is a hole in the ground.
Catastrophic failure, which needs no explanation in the case of an
aircraft crash.
Cost of missing deadline is infinitely high, the lives of people depend
on the correct working of the control system of the aircraft.
A soft real-time system can be a vending machine:
Rising cost for lateness of results: As it will take longer to treat
a customer when the performance of the vending machine is degrading, less
customers pay at this machine which results in less profits for the shop
owner.
Accept lower performance for lateness, it is not catastrophic when
deadlines are not met. It will take longer to handle one client with the
vending machine.
Implementations Of Parallel Systems
Parallel systems implementation may be done in software, in hardware,
or as a combination of both. It can also be made using symmetric elements,
all of which are capable of performing the same tasks in the same level
of efficiency, or using units that are specializing in in different jobs.
Implementations In Software
Software implementations of parallel systems usually have to handle
the task of letting many processes run on a limited amount of hardware,
using it in the most efficient way possible. Most efficient might mean
"making sure the CPU is never idle", or better yet "making sure the whole
system finishes its tasks as quickly as possible, in human time".
Multi Processes
All Unix systems, as well as many other operating systems, are multi-process
systems. In most of these systems, each process is given its own address
space, and they all share the same CPU in turn. The scheduling algorithm
might be a simple round-robin algorithm, or
one that takes priorities into account (priorities are mostly needed
for real-time programs, that must be granted some limits on how long
it'll take since an event they need arrives, until they are allowed to
execute it).
Multi Threading Kernels
A system similar to the multi-process system, except that here a process
may be divided into several threads. All threads share the same data
area, and are being scheduled in a similar way to how processes are scheduled.
Due to the sharing of data area (and few other resources), a context-switch
between two threads of the same process is faster then a context switch
between two processes. Also, passing data between threads is easier then
between two processes, due to the shared address space. On the other
hand, protection of threads from one another is weaker then that given
to processes - if one thread corrupts its memory area, all other threads
in its process may be directly affected. Threads supported by the kernel
are sometimes called 'Light-Weight Processes' (LWP).
Multi Threading Libraries
Similar to the kernel-based threads, except that the operating system's
kernel is not aware of these threads - they are created and handled by
a library. The advantage is that "context-switching" between them is faster
- it does not involve any system call (with the overhead of switching into
kernel mode). On the other hand, if one thread gets blocked by the
operating system when invoking some system call (e.g. sleep(),
or waiting for input from a network device), the whole process gets suspended,
including all its threads. Thus a multi-threaded library implementation
is suitable for calculation-intensive applications, not for I/O intensive
applications.
Implementations In Hardware
Implementations of parallel systems often involve using extra hardware
- several CPUs that can execute different processes in parallel being the
most notable hardware addition.
SMP - Symmetric Multi-Processing
This method is used to contain several CPUs in a single computer, each
of which has access to the same set of memory chips, and each working as
a general-purpose CPU, that can execute any process in the system.
The operating system is responsible for assigning newly created processes
to specific CPU, using some load-balancing
algorithm. Sometimes these systems also handle moving a process from one
CPU to another, that just became free.
In the past it was a simple task (after the current CPU finished executing
one command, simple copy its registers contents into another CPU, and set
it to continue execution from the same position). However, these days a
CPU executes several commands of a process simultaneously, using pipelining
techniques, and also contains an internal cache containing some data and
commands, making such a process migration harder to implement, and more
wasteful (all the partially-completed commands in the pipeline have to
be flashed, the cache of the second CPU has to be reloaded, etc).
Clustering
Clustering is a technique used to make several computers act as one
larger machine, splitting tasks amongst them. They allow one to take
several cheap stations, and combine them together to a larger system. It
also allows for more redundancy for the system - if one machine in the
cluster dies off, the other machines can cover up for it until the malfunctioning
machine is repaired, and all this without bringing the whole system down.
This type of setup is thus common in systems that must run 24 hours non-stop.
Clustering is often implemented in software, often using a protocol
named PVM to communicate between the different machines. Examples of such
systems are Beowulf, for Linux systems, or the clustering system by Tandem
corporation.
Tools And Methods For Parallel Applications Development
Writing a parallel application takes a different approach then writing
a sequential program. After we decide what needs to be done, we need
to decide who gets to do what, and find points where extra parallelism
would be beneficial. We then need to decide how our different runnables
are going to communicate with one another - sometimes a whole slew
of different communications methods is used in one large parallel application,
each of which fits a particular need best. We then come to the art of debugging
parallel applications, which requires some techniques not required when
debugging sequential applications.
Designing A Parallel Application
The first step in designing a parallel application, is determining what
level or parallelism, if at all, is beneficial to the problem our application
tries to solve. An important factor is the experience of the programmers
with parallel systems. Take into account some extra overhead needed to
fix hard bugs that stem from timing problems, race conditions, deadlocks
and the like.
Once we decided to use parallel programming, we should work on decomposing
our system into units that would logically belong to a single runnable.
Sometimes we find very natural divisions, other times only experience
will help us, or better, looking at other similar applications for which
we could find some success record. If we're programming in order to
learn, we should mostly experiment, write code, test it, dump bad ideas,
and be ready to write again from scratch. If we see our design leads to
new complexities, its probably time for a change.
Communications Frameworks
A very important factor for the success or a parallel application, is choosing
an appropriate communications framework. There are several such framework
in common use, and for anything but simplistic and experimental work, we
should consider using one of them. We'll show here a few examples, thought
of-course other methods (including methods implemented by various commercial
products) exist.
ONC RPC - Remote Procedure Call
Remote Procedure Calls (RPC) are a method originally developed by Sun
microsystems©, allows one process to activate a procedure in a second
process, passing it parameters, and optionally getting a result back from
the call.
Third-Party Libraries Supporting Process/Thread Abstractions
Various third-party libraries exist, whose purpose is to ease the development
of cross-platform applications. Of those, various libraries try to
make multi-process and multi-threaded programming easier.
ACE (Adaptive
Communications Environment) is a large C++ library, developed at the Washington
university in St. Louis. ACE attempts to supply abstractions for a lot
of system programming concepts, including sockets, pipes, share memory,
processes and threads. These abstractions allow one source-code to be compiled
by different compilers on different operating systems, from PCs running
Linux, BSD and windows systems, through most types of Unix for workstations,
and up to IBM's MVS open edition, and not to forget several real-time operating
systems, such as VxWorks and LynxOS. There is also a version of ACE ported
to Java, named JACE.
Rogue Wave© is a company known for writing commercial libraries
that are used to ease the development of applications. One of their libraries
is named 'Threads++', and is used to make multi-threaded programming easier.
This library is something to consider when developing a commercial multi-threaded
application. Refer to Rogue Wave's
home page for more information.
Debugging And Logging Techniques
The first problem is that things happen in different processes or threads
at the same time, and we need a debugger that can follow all relevant
runnables simultaneously. Most debuggers now can handle processes properly,
and on platforms with multi-threading support, usually the native debugger
can handle threads as well. Of course, we need some kind of IDE if we want
to use these debuggers without loosing our sanity.
However, because of the complex nature of such applications, and their
sensitivity to timing issues, stopping the application in order to examine
it with a debugger is something we sometimes cannot afford. Thus, our best
shot would be at extensive logging, that can be turned on and off while
the application is running. Such a logging facility must allow us to
see which process or thread wrote each log record, and exactly when, to
be able to deduce anything out of this info. We would also need to make
the format of the log files easy to parse, in order to locate interesting
events. It is advised that such a logging mechanism be devised early in
the life of the project, as it can save hours of battling processes later.
One of the first problems we face when looking for the the cause of
a bug in a parallel application, is finding the responsible process.
Many times several processes are sending messages in a chain, and this
chain breaks somewhere along the way. One method that could be used
to debug such problems between two interacting processes, is simply suspending
both of them. Then running the one that should initiate the message with
a debugger, checking that the data it contains is legal and that the message
it is about to send contains correct information. Then we can attach a
debugger to the second process, set a breakpoint on the function that is
supposed to receive the message, and resume the execution of this process.
Of-course, this method cannot be employed when the message handling is
sensitive to some timing constraints. In that case, only extensive logging
will help us.