
This tutorial shows how to create many threads and wait until they have all finished their jobs (no synchronisation between the threads is needed). I wrote it because, after reading the Multithreaded Application Tutorial, it was still not obvious to me how to write such a program. I developed my application on OSX, but the resulting code should work on any system.

Let's assume that we have the following loop:

...
for i:=1 to n do begin
power(i,0.5)
end;
...

This loop calls the "power" function n times, one call after another. Let's use threads to perform the same task in parallel.

For modern multiple-core computers, using multiple threads can dramatically increase performance. However, modern computers are very fast and threaded code is often harder to debug and maintain than serial processing. One should consider whether the processing time saved justifies the complexity of multi-threaded programming and whether the algorithm is well suited for parallel computing. Finally, note that some algorithms that can benefit from parallel computing may benefit from using CPU threads (as shown here), while others might be well suited for the GPU (where tools like OpenCL may be optimal).

## Managing memory.

Since multiple threads will be working simultaneously, you need to ensure that they do not have memory contention issues. We will have contention issues if multiple threads write to the same memory locations. Some algorithms do not lend themselves to multi-threading, because each computation depends on earlier results. On the other hand, multi-threading works very efficiently on problems where the computations can be performed independently and in parallel. In this example we will solve a problem whose computations are completely independent and which is therefore easy to attack with multiple threads. More advanced algorithms will have to use memory locking features to avoid contention.
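The tutorial's own example needs no locking, but as a rough illustration of the locking mentioned above, here is a minimal sketch in which two threads update one shared counter. It assumes the critical-section routines of the Free Pascal System unit (InitCriticalSection, EnterCriticalSection, LeaveCriticalSection, DoneCriticalSection). The names TAdder, Total and TotalLock are made up for this sketch.

```pascal
program lockdemo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cthreads,{$endif}
  Classes;

var
  Total: Int64 = 0;               // shared by all threads
  TotalLock: TRTLCriticalSection; // protects Total

type
  // Hypothetical worker: adds 1 to the shared total FCount times
  TAdder = class(TThread)
  private
    FCount: integer;
  protected
    procedure Execute; override;
  public
    constructor Create(ACount: integer);
  end;

constructor TAdder.Create(ACount: integer);
begin
  FCount := ACount;
  inherited Create(False); // start running immediately
end;

procedure TAdder.Execute;
var
  i: integer;
begin
  for i := 1 to FCount do begin
    EnterCriticalSection(TotalLock); // only one thread at a time past this point
    try
      Inc(Total);
    finally
      LeaveCriticalSection(TotalLock);
    end;
  end;
end;

var
  a, b: TAdder;
begin
  InitCriticalSection(TotalLock);
  a := TAdder.Create(100000);
  b := TAdder.Create(100000);
  a.WaitFor;
  b.WaitFor;
  a.Free;
  b.Free;
  DoneCriticalSection(TotalLock);
  Writeln('Total = ', Total);
end.
```

Without the critical section, the two threads could overwrite each other's increments and the final total could come out too small. Locking on every increment is slow, which is one reason the rest of this tutorial gives each thread its own portion of memory instead.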

In our example, we will have each thread write to distinct memory locations. Specifically, we will create an array 1..n and compute the value power(i,0.5) where i is in the range 1..n. Each thread will be given an independent portion of the range to compute. Consider n=1000. If we use one thread, it will be tasked with the whole range 1..1000, whereas if we use two threads one will tackle 1..500 and the other 501..1000. This way threads will be working to fill different portions of our memory array.
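The split described above can be sketched as a small helper that returns the range for one thread. This is only an illustration: ChunkBounds and its parameter names are hypothetical, and it assumes that the last thread takes whatever remainder is left when n does not divide evenly.

```pascal
program chunkdemo;
{$mode objfpc}{$H+}

// Compute the inclusive range first..last that thread number
// threadIndex (counting from 1) should process out of 1..nValues.
// Hypothetical helper, not part of any library.
procedure ChunkBounds(threadIndex, nThreads, nValues: integer;
                      out first, last: integer);
var
  perThread: integer;
begin
  perThread := nValues div nThreads;
  first := (threadIndex - 1) * perThread + 1;
  if threadIndex < nThreads then
    last := first + perThread - 1
  else
    last := nValues; // the last thread also takes any remainder
end;

var
  t, first, last: integer;
begin
  // Split 1..1000 across three threads
  for t := 1 to 3 do begin
    ChunkBounds(t, 3, 1000, first, last);
    Writeln('thread ', t, ': ', first, '..', last);
  end;
end.
```

With three threads the ranges are 1..333, 334..666 and 667..1000, so no two threads ever write to the same array element.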

## 1. Detect number of cores available.

A computer with only a single core will not benefit from threading, whereas a computer with four physical cores each with hyperthreading (able to run two tasks simultaneously) will be able to process up to eight tasks at once. The following unit "cpucount" reports the number of logical cores available. You can use this to determine how many threads your program should run on a given computer. For computers with four or more cores, you may want to run n-1 threads (where n is the core count), reserving one core for the graphical interface and other tasks, as the performance difference between n and n-1 threads will not be great with this many cores.
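The n-1 rule of thumb above can be written as a tiny helper. This is a sketch only: SuggestedThreadCount is a hypothetical name, and the core count is passed in as a parameter so that the example runs without the cpucount unit.

```pascal
program threadcountdemo;
{$mode objfpc}{$H+}

// Hypothetical helper: suggest a worker-thread count from a core count.
// With four or more cores, leave one core free for the GUI and other tasks.
function SuggestedThreadCount(cores: integer): integer;
begin
  if cores >= 4 then
    Result := cores - 1
  else if cores >= 1 then
    Result := cores
  else
    Result := 1; // fall back to one thread if the count is unknown
end;

begin
  Writeln('2 cores -> ', SuggestedThreadCount(2), ' threads');
  Writeln('4 cores -> ', SuggestedThreadCount(4), ' threads');
  Writeln('8 cores -> ', SuggestedThreadCount(8), ' threads');
end.
```

In a real program the argument would be the value returned by GetLogicalCpuCount.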

unit cpucount;
interface
//returns number of cores: a computer with two hyperthreaded cores will report 4
function GetLogicalCpuCount: Integer;

implementation

{$IF defined(windows)}
uses windows;
{$endif}

{$IF defined(darwin) or defined(freebsd)}
uses ctypes, sysctl;
{$endif}

{$IFDEF Linux}
uses ctypes;
const _SC_NPROCESSORS_ONLN = 83;
function sysconf(i: cint): clong; cdecl; external name 'sysconf';
{$ENDIF}

function GetLogicalCpuCount: integer;
// returns a good default for the number of threads on this system
{$IF defined(windows)}
//returns total number of processors available to system including logical hyperthreaded processors
var
  i: Integer;
  ProcessAffinityMask, SystemAffinityMask: DWORD_PTR;
  Mask: DWORD;
  SystemInfo: SYSTEM_INFO;
begin
  if GetProcessAffinityMask(GetCurrentProcess, ProcessAffinityMask, SystemAffinityMask) then begin
    Result := 0;
    for i := 0 to 31 do begin
      Mask := DWord(1) shl i;
      if (ProcessAffinityMask and Mask)<>0 then
        inc(Result);
    end;
  end else begin
    //can't get the affinity mask so we just report the total number of processors
    GetSystemInfo(SystemInfo);
    Result := SystemInfo.dwNumberOfProcessors;
  end;
end;
{$ELSEIF defined(UNTESTEDsolaris)}
//untested: sysconf and _SC_NPROC_ONLN would also have to be declared for this platform
begin
  Result := sysconf(_SC_NPROC_ONLN);
end;
{$ELSEIF defined(freebsd) or defined(darwin)}
var
  mib: array[0..1] of cint;
  len: cint;
  t: cint;
begin
  mib[0] := CTL_HW;
  mib[1] := HW_NCPU;
  len := sizeof(t);
  fpsysctl(pchar(@mib), 2, @t, @len, Nil, 0);
  Result:=t;
end;
{$ELSEIF defined(linux)}
begin
  Result:=sysconf(_SC_NPROCESSORS_ONLN);
end;
{$ELSE}
//unknown platform: assume a single core
begin
  Result:=1;
end;
{$ENDIF}

end.

## 2. Create a custom threads class.

I use a separate unit for defining the behavior of the threads. Note that I am setting "FreeOnTerminate" to false, so my program will need to dispose of each thread when it is done. This makes it easier to juggle multiple threads: if you set FreeOnTerminate to true and launch multiple very fast jobs, it is possible that a thread will be released before your program checks whether it has completed, and checking a non-existent thread can cause an exception. By setting FreeOnTerminate to false I can ensure that each thread completed successfully.

unit mythreads;
{$mode objfpc}{$H+}
interface
uses
  Classes, SysUtils, Math;
type
  TData = array of double;
  PData = ^TData;
type
  TMyThread = class(TThread)
  private
  protected
    tPtr: PData;
    tstart,tfinish: integer;
    procedure Execute; override;
  public
    property Terminated; //make the protected property visible to the main program
    Constructor Create(lstart, lfinish: integer; var lPtr: PData);
  end;

implementation

constructor TMyThread.Create(lstart, lfinish: integer; var lPtr: PData);
begin
  FreeOnTerminate := False; //the main program frees each thread itself
  tstart := lstart;
  tfinish := lfinish;
  tPtr := lPtr;
  inherited Create(false); //false: start running immediately
end;

procedure TMyThread.Execute;
var
  i: integer;
begin
  for i := tstart to tfinish do
    tPtr^[i] := power(i,0.5);
  Terminate; //sets Terminated, so the main program can see that the work is done
end;

end.

## 3. Write the main program.

Note that there are two ways to determine whether all the threads have completed. You can use the built-in "waitFor" method. This works very nicely, but on my OSX computer I noted that it only checks about every 100ms. This is fine for real-world programs (we only use threading for computationally slow problems) and it reduces thread overhead. However, for quick example benchmarks it can hide the benefits of threading, because every run takes at least 100ms regardless of the number of threads. Therefore, in this example I check each thread's "Terminated" status every 2ms, which gives more accurate benchmark timing.

Remember to free each thread when you are done with it. Since we set "FreeOnTerminate := False" the program needs to do this explicitly.
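When the coarse timing of "waitFor" does not matter, the wait-then-free pattern can be sketched as follows. This is a minimal illustration, not the benchmark program of this tutorial: TSquareThread, Answer and kThreads are hypothetical names, and the work done by each thread is deliberately trivial.

```pascal
program waitfordemo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cthreads,{$endif}
  Classes;

type
  // Hypothetical worker that squares its own index
  TSquareThread = class(TThread)
  private
    FIndex: integer;
  protected
    procedure Execute; override;
  public
    Answer: integer;
    constructor Create(AIndex: integer);
  end;

constructor TSquareThread.Create(AIndex: integer);
begin
  FIndex := AIndex;
  FreeOnTerminate := False; // we free the thread ourselves below
  inherited Create(False);  // start running immediately
end;

procedure TSquareThread.Execute;
begin
  Answer := FIndex * FIndex;
end;

const
  kThreads = 4;
var
  threads: array[1..kThreads] of TSquareThread;
  i: integer;
begin
  // Launch all threads first so that they run in parallel
  for i := 1 to kThreads do
    threads[i] := TSquareThread.Create(i);
  for i := 1 to kThreads do begin
    threads[i].WaitFor; // blocks until Execute has returned
    Writeln(i, ' squared = ', threads[i].Answer);
    threads[i].Free;    // required because FreeOnTerminate is False
  end;
end.
```

Because every thread object stays alive until it is freed explicitly, it is always safe to call WaitFor on it and to read its result afterwards.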

Tips: in my Lazarus IDE I was not able to debug multi-threaded applications unless I used 'pthreads'. I have read that programs run faster with 'cmem', but I strongly recommend that you check this for your particular case (my program hangs when I use 'cmem').

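{$mode objfpc}{$H+}
uses
  //    cmem,pthreads,
  {$ifdef unix}cthreads,{$endif} //thread manager, needed on Unix-like systems
  Classes, SysUtils, Math, cpucount, mythreads;

const
  kValues = 10000000; //illustrative workload size (an assumption; adjust for your machine)

//baseline: fill the array in a single loop
procedure SerialTest(nValues: integer);
var
  dataArray: TData;
  i: integer;
  StartMS: double;
begin
  if (nValues < 1) then exit;
  StartMS:=timestamptomsecs(datetimetotimestamp(now));
  setlength(dataArray, nValues+1);//+1 so that indices 1..nValues are valid (index 0 is unused)
  for i:=1 to nValues do
    dataArray[i] := power(i,0.5);
  Writeln('Serially processed '+inttostr(nValues)+' values in '+floattostr(timestamptomsecs(datetimetotimestamp(now))-StartMS)+'ms, with '+inttostr(nValues)+'^0.5 = '+floattostr(dataArray[nValues]));
end;

//split 1..nValues into contiguous ranges and give each range to its own thread
procedure ParallelTest(nThreadsIn, nValues: integer);
var
  threadArray: array of TMyThread;
  dataArray: TData;
  lData : PData;
  StartMS: double;
  nThreads, nValuesPerThread, lStart, lFinish, i: integer;
  done: boolean;
begin
  if (nThreadsIn < 1) or (nValues < 1) then exit;
  StartMS:=timestamptomsecs(datetimetotimestamp(now));
  setlength(dataArray, nValues+1);//+1 so that indices 1..nValues are valid (index 0 is unused)
  lData := @dataArray;
  nThreads := nThreadsIn;
  if nThreads > nValues then nThreads := nValues;
  setlength(threadArray, nThreads+1);//+1 so that indices 1..nThreads are valid
  nValuesPerThread := nValues div nThreads;
  lStart := 1;
  for i:=1 to nThreads do begin
    if i < nThreads then
      lFinish := lStart+nValuesPerThread-1
    else
      lFinish:=  nValues; //the last thread takes any remainder
    threadArray[i] := TMyThread.Create(lStart, lFinish, lData);
    lStart := lFinish+1;
  end;
  //for i:=1 to nThreads do threadArray[i].waitFor;  //appears to sleep for 100ms on OSX
  repeat //check every 2ms whether all threads report Terminated
    done := true;
    for i:=1 to nThreads do
      if not threadArray[i].Terminated then done := false;
    if not done then sleep(2);
  until done;
  for i:=1 to nThreads do
    threadArray[i].Free; //required because FreeOnTerminate is False
  Writeln('Processed '+inttostr(nValues)+' values with '+inttostr(nThreads)+' threads in '+floattostr(timestamptomsecs(datetimetotimestamp(now))-StartMS)+'ms, with '+inttostr(nValues)+'^0.5 = '+floattostr(dataArray[nValues]));
end;

begin
  Writeln('Computer reports '+inttostr(GetLogicalCpuCount)+' cores: probably optimal number of threads ');
  SerialTest(kValues);
  ParallelTest(GetLogicalCpuCount, kValues);
end.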