Parallel Programming - Basic Theory

      1. Preface
      2. Entities Of Parallel Systems
        1. Processes
        2. Resources
        3. Schedulers
        4. Synchronizers
      3. Short Terms Dictionary
        1. Kernel
        2. The system call interface
          1. The use of system calls
        3. Accessing Resources
          1. Mutual Exclusion
          2. Deadlock
          3. Starvation
          4. Race Condition
          5. Busy Wait
        4. POSIX 1003.1b (formerly 1003.4) Functions
        5. Handling Events And Notifications
          1. Asynchronous Notification
          2. Synchronous Notification
          3. Event Loop
        6. Process Scheduling
          1. Process Preemption
          2. Context Switch
          3. Round-Robin Scheduling
          4. Prioritized Scheduling
          5. Load Balancing
          6. Commands Pipe lining
      4. Implementations Of Parallel Systems
        1. Implementations In Software
          1. Multi Processes
          2. Multi Threading Kernels
          3. Multi Threading Libraries
        2. Implementations In Hardware
          1. SMP - Symmetric Multi-Processing
          2. Clustering
      5. Tools And Methods For Parallel Applications Development
        • Designing A Parallel Application
        • Communications Frameworks
          • ONC RPC - Remote Procedure Call
        • Third-Party Libraries Supporting Process/Thread Abstractions
        • Debugging And Logging Techniques

      Preface

      Writing sequential programs is the what most, if not all, programmers are being initially trained to do. At a certain phase in their life they discover that there are other models for programs. One of them is using parallelism. This means that instead of having your program carried out in a sequence, one instruction at a time, it is being executed by several different entities simultaneously. This can sometimes make the programs simpler to design, and may also run faster then a matching sequential program.

      For example, if you have a computer with several processors, they might each be running a part of the program simultaneously, and thus complete it faster then if only one processor would have to run the whole program.

      Another example is if you have a program that needs to read data from a disk, and meanwhile make some heavy calculations on it. Since the disk can transfer data to memory without intervention of the CPU, it would make sense to split your program into two parts running in parallel: one handles I/O, and reads data into memory. The other does the heavy calculations. You could do it all in one sequential part, that majorly does the calculations, and from time to time goes to read another block of data, but it is harder to write it this way, or to be efficient (how will the program know when the last block of data was read, and it is time to read another block?)

      Entities Of Parallel Systems

      A parallel system is a system (software and/or hardware) that allows one to write programs whose different parts are carried out in different threads of execution.
      In order to better understand what a parallel (or parallelic) system is, we should check what are the different components such a system is made of.

      Processes

      A process is an entity that executes a computer program, and manipulates all other resources in order to fulfill the mission of the program. Several (or many) processes may be running on the same computer, or on different computers. Usually a single process is running on a single computer. Also, usually each process has its own address space, as well as a few other private resources (an private execution stack, for example).

      Resources

      A resource is an entity that can be used by a process to perform some specific operation. There could be physical resources, as well as logical (or virtual) resources.
      A physical resource could be, for example, a disk, which is used for saving files. Or a keyboard, which is used to read data from the user.
      A logical resource could be a shared memory area, which several processes may access in order to read data to, or read data from.

      Schedulers

      A scheduler is an entity specifying when processes will run, on which CPUs (executors) they will run, and in what order. Such a scheduler may be controlling a single computer, or several computers (see clustering below).

      Synchronizers

      A synchronizer is an entity used to control the access of processes to resources. Some synchronizers are used to make sure only a single process uses a given resource at any one time. Other synchronizers are used to make sure a given resource is accessed in a given order of priority by different processes. There is a host of other types of synchronizers to be dealt with. You will most likely bump into synchronizers such as Mutexes, Semaphores, Monitors, Conditions, etc.

      Short Terms Dictionary

      Kernel

       Kernel is the nucleus of the operating system:
      • Management of the virtual processors (scheduling)
      • IPC (inter process communication)
      • Interrupt handling
      • I/O device management
      • Memory management
                  For ex. file management is outside the kernel.
                  The operating system of the embedded systems is mostly a kernel.
                  The most crucial resource that is under of the control of the operating system is processor(s).
       

      The system call interface

              The system call interface defines all the services of the operating system to the programs.
       
       
       

              The basis of the different variants of the UNIX operating system is different system call interface.

              SVID (System V interface definitions )

              POSIX standard (IEEE)
              POSIX 1003.4 is API (Application Program Interface) for the real time systems.

      The use of system calls

                  An example read system call (c programming language):

                      n = read(file_descriptor,buffer_address,number_of_bytes);

                  n is return value (how many characters actually read or error 1).

                  To save testing time, inspect always the return code!

                      if(( n = read( fd,buf,numb )) == -1)
                      {
                      printf("Read error 1\n") ;
                      exit(1);
                      };

                  You can find a descriptive text of the error occurred:

                      #include <stdio.h>
                      .......
                      extern int errno,sys_nerr;
                      extern char *sys_errlist[];
                      .......
                      fprintf(stderr,"%s",sys_errlist[errno];
                      .......

                  System calls are run in the system state (also privileged instructions).
       

      Accessing Resources

        Mutual Exclusion
        A mechanism used to make sure no two processes are trying to use a resource at the same time. Used to avoid corrupting the internal state state of this resource. (mantra - that great big nurse sitting in front of the doctor's room, making sure no one gets in until the previous person comes out).

        Deadlock
        A situation in which a group of two or more processes are all waiting for a set of resources that are currently taken by other processes in the same group, or waiting for events that are supposed to be sent from other processes in the group. (mantra - when was the last time you were anxiously dialing to your boyfriend, finding his phone line constantly busy, while he was trying to call you back all this time? Each one of you was making one phone busy, while trying to dial the other's phone's number. If none of you would have let go of his/her phone, you'd be in a deadlock. The phone is, indeed, a valuable resource).

        Starvation
        A situation where a process that tries to access some resource is never granted access to that resource (mantra - think of the last time you went into your fast-food restaurant, approached the counter, and were always pushed back to the end of the line by the crowd, verbally "starving to death").

        Race Condition
        A situation in which two (or more) processes are doing some competing operations at the same time, and the results might come up screwed up due to collisions. (mantra - think of two people trying to get into the same door at the same time, jamming each other's way in).

        Busy Wait
        A situation in which a process that is waiting for a resource to become free, enters a loop of constantly polling the resource in order to find if it is free. In the process it consumes all available CPU power to perform its constant polls. (mantra - when you are waiting for the cable guy to come and connect you new apartment, and keep looking out the window all the time to see when the technician arrives).

       

      Process Scheduling

        Process Preemption
        An event of the system suspending the execution of a process that is in the middle of putting the CPU to good use, in order to let other processes run. This makes sure no single process will take over the full CPU, even if it gets into an infinite loop and does not make any blocking system calls.

        Context Switch
        The method by which the operating system makes one CPU suspend the execution of on process, and continue executing a second process. This is called a context switch since the CPU needs to switch to using the full context of the second process - its execution stack, its memory area, the values the registers contained when last executing this process, etc. A context switch may occur when the running process blocks waiting for some resource to become free, when it yields the CPU voluntarily, or when it is being preempted by the operating system's scheduler. (mantra - you talk to your mom on the phone, while trying to watch a very interesting T.V. program - you switch your attention between you T.V. and your mom constantly, always taking a short while to remember where you were before the last switch).

        Round-Robin Scheduling
        A scheduling algorithm in which several processes are being executed by an executed one after the other in turn, each getting the same amount of time to run. After this time slice passes, a context-switch is made, where the first process on the list gets to run next, while the suspended process is being put back at the end of the list. (mantra - when the ideal teacher in class makes sure every pupil gets a chance to speak in turn - well, as if that ideal teacher exists...).

        Prioritized Scheduling
        A scheduling algorithm in which a priority is assigned to each process, and when several processes are waiting for the CPU to get executed, the one with highest priority is being run, usually until it gets blocked waiting for some resource, or until a process with a higher priority wakes up and demands CPU time. (mantra - on a T.V. debate, some people are given higher priority then others, and interrupt everybody else when they have something to say. On Israeli T.V. - see under 'Popolitica').

        Load Balancing

        A method by which several tasks are scheduled for separate CPUs or computers, based on the current load on each CPU. Sometimes this includes the ability to migrate processes from one CPU (or computer) to another during its execution.
         
         

        Different scheduling algorithms

              • FREE algorithms / process is not interrupted
              • TIME algorithms for ex. ROUND/ROBIN algorithm / a constant slice of time to the process
              • MLQ (multi level queue) is a TIME algorithm / multiple queues for processes last queue is ROUND/ROBIN (UNIX, DEC VMS, IBM VMS)
              • PRE- EMPTIVE algorithms / lower priority process is interrupted (real- time systems)
            Task (process) switch:
            The scheduler task saves all hw registers of the running task and restores registers of the new running task from the memory saving area.

            The operating system has a data structure called PCB (process control block) to save the state of every process.

            Data to be saved is at least IP (instruction pointer ) or also called PC (program counter) and stack pointer.

            Many tasks can share the same code segment every process must have own :

              • PCB
              • Data segment
              • Stack (segment)
            PCB list
         
      POSIX 1003.1b (formerly 1003.4) Functions

              Scope:

                  This standard defines a standard operating system interface and environment to support application
                  portability at the source code level. It is intended to be used by both application developers and system
                  implementors.

                  (1) Terminology, concepts, and definitions and specifications that govern structures, headers, environment
                  variables, and related requirements

                  (2) Definitions for system service interfaces and subroutines

                  (3) Language specific system services for the C programming language

                  (4) Interface issues, including portability, error handling, and error recovery

                  Not: user interface, graphics interfaces, dbms, ...

              Real- time operating system components:

                  1. Timers:

                  The ability to set and read high resolution internal timers

                  2. Priority Scheduling:

                  A priority based pre-emptive scheduling

                  3. Shared memory:

                      The ability to map common physical space into independent process specific virtual spaces

                  4. Real time files:

                      The ability to create and access files with deterministic performance

                  5. Semaphores:

                      Efficient synchronization primitives ( P and V operations)

                  6. Inter process communication:

                      Synchronous and asynchronous message passing capabilities with facilities for flow and resource
                      control

                  7. Asynchronous event notification:

                      A mechanism which provides queuing capabilities, deterministic delivery and minimal data passing

                  8. Process memory locking:

                      The ability to guarantee memory residence for sections of a process virtual address space

                  9. Asynchronous I/O :

                      The ability to overlap applications processing and I/O operations initiated by the application

                  10. I/O Synchronized:

                  The ability to establish assurance of I/O completion at different logical levels of completion

              (Synchronous I/O - we wait data from disk/ Synchrozined I/O is different matter )

              The POSIX 1003.4 will also identify the real-time performance metrics.

                  The process dispatch latency time:
                  "The time from which an asynchronous device interrupts the machine causing the highest priority process
                  to execute until that process returns from the system call that caused it to block on that external device".

                  Interrupt latency: time from interrupt to the first instruction in the service routine.

                  Context switch time:
                  Different meanings in litterature!
                  Only time to save registers of the old process and restore new or
                  same as "dispatch latency"
       
       

        An example of a hard real time system is a digital fly-by-wire control system of an aircraft:
        No lateness is accepted under any circumstances, otherwise the aircraft is not controllable.
        Useless results if late, if the control system does not respond timely, the result is a hole in the ground.
        Catastrophic failure, which needs no explanation in the case of an aircraft crash.
        Cost of missing deadline is infinitely high, the lives of people depend on the correct working of the control system of the aircraft.

        A soft real-time system can be a vending machine:
        Rising cost for lateness of results: As it will take longer to treat a customer when the performance of the vending machine is degrading, less customers pay at this machine which results in less profits for the shop owner.
        Accept lower performance for lateness, it is not catastrophic when deadlines are not met. It will take longer to handle one client with the vending machine.

      Implementations Of Parallel Systems

      Parallel systems implementation may be done in software, in hardware, or as a combination of both. It can also be made using symmetric elements, all of which are capable of performing the same tasks in the same level of efficiency, or using units that are specializing in in different jobs.

      Implementations In Software

      Software implementations of parallel systems usually have to handle the task of letting many processes run on a limited amount of hardware, using it in the most efficient way possible. Most efficient might mean "making sure the CPU is never idle", or better yet "making sure the whole system finishes its tasks as quickly as possible, in human time".

      Multi Processes

      All Unix systems, as well as many other operating systems, are multi-process systems. In most of these systems, each process is given its own address space, and they all share the same CPU in turn. The scheduling algorithm might be a simple round-robin algorithm, or one that takes priorities into account (priorities are mostly needed for real-time programs, that must be granted some limits on how long it'll take since an event they need arrives, until they are allowed to execute it).

      Multi Threading Kernels

      A system similar to the multi-process system, except that here a process may be divided into several threads. All threads share the same data area, and are being scheduled in a similar way to how processes are scheduled. Due to the sharing of data area (and few other resources), a context-switch between two threads of the same process is faster then a context switch between two processes. Also, passing data between threads is easier then between two processes, due to the shared address space. On the other hand, protection of threads from one another is weaker then that given to processes - if one thread corrupts its memory area, all other threads in its process may be directly affected. Threads supported by the kernel are sometimes called 'Light-Weight Processes' (LWP).

      Multi Threading Libraries

      Similar to the kernel-based threads, except that the operating system's kernel is not aware of these threads - they are created and handled by a library. The advantage is that "context-switching" between them is faster - it does not involve any system call (with the overhead of switching into kernel mode). On the other hand, if one thread gets blocked by the operating system when invoking some system call (e.g. sleep(), or waiting for input from a network device), the whole process gets suspended, including all its threads. Thus a multi-threaded library implementation is suitable for calculation-intensive applications, not for I/O intensive applications.

      Implementations In Hardware

      Implementations of parallel systems often involve using extra hardware - several CPUs that can execute different processes in parallel being the most notable hardware addition.

      SMP - Symmetric Multi-Processing

      This method is used to contain several CPUs in a single computer, each of which has access to the same set of memory chips, and each working as a general-purpose CPU, that can execute any process in the system. The operating system is responsible for assigning newly created processes to specific CPU, using some load-balancing algorithm. Sometimes these systems also handle moving a process from one CPU to another, that just became free.

      In the past it was a simple task (after the current CPU finished executing one command, simple copy its registers contents into another CPU, and set it to continue execution from the same position). However, these days a CPU executes several commands of a process simultaneously, using pipelining techniques, and also contains an internal cache containing some data and commands, making such a process migration harder to implement, and more wasteful (all the partially-completed commands in the pipeline have to be flashed, the cache of the second CPU has to be reloaded, etc).

      Clustering

      Clustering is a technique used to make several computers act as one larger machine, splitting tasks amongst them. They allow one to take several cheap stations, and combine them together to a larger system. It also allows for more redundancy for the system - if one machine in the cluster dies off, the other machines can cover up for it until the malfunctioning machine is repaired, and all this without bringing the whole system down. This type of setup is thus common in systems that must run 24 hours non-stop.

      Clustering is often implemented in software, often using a protocol named PVM to communicate between the different machines. Examples of such systems are Beowulf, for Linux systems, or the clustering system by Tandem corporation.

      Tools And Methods For Parallel Applications Development

      Writing a parallel application takes a different approach then writing a sequential program. After we decide what needs to be done, we need to decide who gets to do what, and find points where extra parallelism would be beneficial. We then need to decide how our different runnables are going to communicate with one another - sometimes a whole slew of different communications methods is used in one large parallel application, each of which fits a particular need best. We then come to the art of debugging parallel applications, which requires some techniques not required when debugging sequential applications.

      Designing A Parallel Application

      The first step in designing a parallel application, is determining what level or parallelism, if at all, is beneficial to the problem our application tries to solve. An important factor is the experience of the programmers with parallel systems. Take into account some extra overhead needed to fix hard bugs that stem from timing problems, race conditions, deadlocks and the like.

      Once we decided to use parallel programming, we should work on decomposing our system into units that would logically belong to a single runnable. Sometimes we find very natural divisions, other times only experience will help us, or better, looking at other similar applications for which we could find some success record. If we're programming in order to learn, we should mostly experiment, write code, test it, dump bad ideas, and be ready to write again from scratch. If we see our design leads to new complexities, its probably time for a change.

      Communications Frameworks

      A very important factor for the success or a parallel application, is choosing an appropriate communications framework. There are several such framework in common use, and for anything but simplistic and experimental work, we should consider using one of them. We'll show here a few examples, thought of-course other methods (including methods implemented by various commercial products) exist.

      ONC RPC - Remote Procedure Call

      Remote Procedure Calls (RPC) are a method originally developed by Sun microsystems©, allows one process to activate a procedure in a second process, passing it parameters, and optionally getting a result back from the call.
       
       

      Third-Party Libraries Supporting Process/Thread Abstractions

      Various third-party libraries exist, whose purpose is to ease the development of cross-platform applications. Of those, various libraries try to make multi-process and multi-threaded programming easier.

      ACE (Adaptive Communications Environment) is a large C++ library, developed at the Washington university in St. Louis. ACE attempts to supply abstractions for a lot of system programming concepts, including sockets, pipes, share memory, processes and threads. These abstractions allow one source-code to be compiled by different compilers on different operating systems, from PCs running Linux, BSD and windows systems, through most types of Unix for workstations, and up to IBM's MVS open edition, and not to forget several real-time operating systems, such as VxWorks and LynxOS. There is also a version of ACE ported to Java, named JACE.

      Rogue Wave© is a company known for writing commercial libraries that are used to ease the development of applications. One of their libraries is named 'Threads++', and is used to make multi-threaded programming easier. This library is something to consider when developing a commercial multi-threaded application. Refer to Rogue Wave's home page for more information.

      Debugging And Logging Techniques

      The first problem is that things happen in different processes or threads at the same time, and we need a debugger that can follow all relevant runnables simultaneously. Most debuggers now can handle processes properly, and on platforms with multi-threading support, usually the native debugger can handle threads as well. Of course, we need some kind of IDE if we want to use these debuggers without loosing our sanity.

      However, because of the complex nature of such applications, and their sensitivity to timing issues, stopping the application in order to examine it with a debugger is something we sometimes cannot afford. Thus, our best shot would be at extensive logging, that can be turned on and off while the application is running. Such a logging facility must allow us to see which process or thread wrote each log record, and exactly when, to be able to deduce anything out of this info. We would also need to make the format of the log files easy to parse, in order to locate interesting events. It is advised that such a logging mechanism be devised early in the life of the project, as it can save hours of battling processes later.

      One of the first problems we face when looking for the the cause of a bug in a parallel application, is finding the responsible process. Many times several processes are sending messages in a chain, and this chain breaks somewhere along the way. One method that could be used to debug such problems between two interacting processes, is simply suspending both of them. Then running the one that should initiate the message with a debugger, checking that the data it contains is legal and that the message it is about to send contains correct information. Then we can attach a debugger to the second process, set a breakpoint on the function that is supposed to receive the message, and resume the execution of this process. Of-course, this method cannot be employed when the message handling is sensitive to some timing constraints. In that case, only extensive logging will help us.