Difference between revisions of "Functions for descriptive statistics"

Revision as of 05:21, 16 February 2020


Descriptive statistics aim at characterising empirical data by summative parameters (and also by tables and plots).

Standard functions defined in math unit

The unit math of the run-time library provides a plethora of routines for descriptive statistics.

  • mean: Returns the mean value of an array.
  • stdDev: Returns the (sample) standard deviation of an array.
  • popNStdDev: Returns the (population) standard deviation of an array.
  • meanAndStdDev: Returns mean and standard deviation of an array.
  • momentSkewKurtosis: Returns the first four moments of an array.
  • variance: Returns the (sample) variance of an array.
  • popnVariance: Returns the (population) variance of an array.
  • totalVariance: Returns the total variance of an array.
  • sumOfSquares: Returns the sum of squares of an array.
  • sum: Returns the sum of values of an array.
  • sumsAndSquares: Returns sum and sum of squares of the values in an array.

These functions take open array parameters. They therefore accept either a static array of predefined length (e.g. array[1..100] of float) or a 0-based dynamic array (e.g. array of extended) whose length has been set beforehand with the setLength procedure.
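As a minimal sketch of how these routines are called, the following program passes both a static and a dynamic array to functions of the math unit. The program name, variable names and sample values are illustrative assumptions. The variables that receive the results of meanAndStdDev are declared as float, the result type used by the math unit.

```pascal
program MathStatsDemo;

{$mode objfpc}

uses
  math;

var
  // static array of predefined length
  fixedData: array[1..5] of double = (2.0, 4.0, 4.0, 5.0, 10.0);
  // dynamic array; its length has to be set before it is filled
  dynData: array of double;
  i: integer;
  m, s: float;
begin
  setLength(dynData, 5);
  for i := 0 to high(dynData) do
    dynData[i] := fixedData[i + 1];

  writeln('mean (static array):  ', mean(fixedData):0:4);
  writeln('mean (dynamic array): ', mean(dynData):0:4);
  writeln('sample std dev:       ', stdDev(dynData):0:4);
  writeln('population std dev:   ', popnStdDev(dynData):0:4);
  writeln('sample variance:      ', variance(dynData):0:4);
  writeln('sum:                  ', sum(dynData):0:4);

  // procedure variant returning both values at once
  meanAndStdDev(dynData, m, s);
  writeln('meanAndStdDev:        ', m:0:4, ' / ', s:0:4);
end.
```

For the sample values 2, 4, 4, 5, 10 the mean is 5 and the sum of squared deviations is 36, so the sample variance is 36 / 4 = 9 and the sample standard deviation is 3, whereas the population variance is 36 / 5 = 7.2.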


Standard functions defined in other units

  • length (unit system): Returns the number of elements (n) of an array.

Custom functions

Some functions aren't defined in the RTL. The following subsections list source code for some commonly used measures of centrality and dispersion. Where not otherwise specified, the code is provided under a BSD license.

Expanded and thoroughly tested versions of these functions are also available, together with other useful statistical code, in the QUANTUM SALIS project.

Median

The term median denotes the 50% quantile of a sample, i.e. the value separating the higher half of a data vector from its lower half. It can be calculated with:

type
  TExtArray = array of Extended;

// Based on Shell Sort - avoiding recursion allows for sorting
// of very large arrays, too
function sortExtArray(const data: TExtArray): TExtArray;
var
  data2: TExtArray;
  arrayLength, i, j, k: longint;
  h: extended;
begin
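  arrayLength := high(data); // index of the last element, i.e. length - 1
  data2 := copy(data, 0, arrayLength + 1);
  // start with half the number of elements, so that arrays with
  // only two elements get at least one pass with k = 1
  k := (arrayLength + 1) div 2;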
  while k > 0 do
  begin
    for i := 0 to arrayLength - k do
    begin
      j := i;
      while (j >= 0) and (data2[j] > data2[j + k]) do
      begin
        h := data2[j];
        data2[j] := data2[j + k];
        data2[j + k] := h;
        if j > k then
          dec(j, k)
        else
          j := 0;
      end;
    end;
    k := k div 2
  end;
  result := data2;
end;

function median(const data: TExtArray): extended;
var
  centralElement: integer;
  sortedData: TExtArray;
begin
  sortedData := sortExtArray(data);
  centralElement := length(sortedData) div 2;
  if odd(length(sortedData)) then
    result := sortedData[centralElement]
  else
    result := (sortedData[centralElement - 1] + sortedData[centralElement]) / 2;
end;

Of course, the function sortExtArray may be replaced with another sorting algorithm, e.g. QuickSort. The Shell Sort algorithm presented here is not recursive and therefore has the advantage that it is able to sort very large vectors even on machines with a very small amount of memory (albeit at the expense of slightly reduced speed compared to QuickSort).
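As an illustration of such a replacement, the following sketch computes the median with a recursive QuickSort instead of Shell Sort. The names quickSort, sortExtArrayQuick and the demo program are hypothetical and not part of the RTL or of the code above. This is only one possible variant (Hoare-style partitioning with the middle element as pivot).

```pascal
program MedianQuickSortDemo;

{$mode objfpc}

type
  TExtArray = array of Extended;

// Recursive in-place QuickSort of a[lo..hi], middle element as pivot
procedure quickSort(var a: TExtArray; lo, hi: longint);
var
  i, j: longint;
  pivot, tmp: extended;
begin
  if lo >= hi then
    exit;
  i := lo;
  j := hi;
  pivot := a[lo + (hi - lo) div 2];
  repeat
    while a[i] < pivot do inc(i);
    while a[j] > pivot do dec(j);
    if i <= j then
    begin
      tmp := a[i];
      a[i] := a[j];
      a[j] := tmp;
      inc(i);
      dec(j);
    end;
  until i > j;
  quickSort(a, lo, j);
  quickSort(a, i, hi);
end;

// Returns a sorted copy; the input array is left unchanged
function sortExtArrayQuick(const data: TExtArray): TExtArray;
begin
  result := copy(data, 0, length(data));
  quickSort(result, 0, high(result));
end;

function median(const data: TExtArray): extended;
var
  centralElement: longint;
  sortedData: TExtArray;
begin
  sortedData := sortExtArrayQuick(data);
  centralElement := length(sortedData) div 2;
  if odd(length(sortedData)) then
    result := sortedData[centralElement]
  else
    result := (sortedData[centralElement - 1] + sortedData[centralElement]) / 2;
end;

var
  evenData, oddData: TExtArray;
begin
  setLength(evenData, 4);
  evenData[0] := 7; evenData[1] := 1; evenData[2] := 3; evenData[3] := 9;
  setLength(oddData, 3);
  oddData[0] := 4; oddData[1] := 2; oddData[2] := 8;
  writeln('median (even n): ', median(evenData):0:2);
  writeln('median (odd n):  ', median(oddData):0:2);
end.
```

The sorted vectors are 1, 3, 7, 9 and 2, 4, 8, so the medians are (3 + 7) / 2 = 5 and 4, respectively. Note that, as in the original code, median must not be called with an empty array.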

Standard error of the mean

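The standard error of the mean (SEM) is a measure that estimates how precisely the sample mean approximates the true mean of the population. It is obtained by dividing the sample standard deviation by the square root of the sample size n.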

Calculation of SEM is simple:

function sem(const data: array of Extended): extended;
begin
  sem := stddev(data) / sqrt(length(data));
end;
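A possible way of calling this function is sketched below. The program name and the sample values are assumptions chosen for illustration. The function sem is repeated so that the program can be compiled on its own.

```pascal
program SemDemo;

{$mode objfpc}

uses
  math;

// same function as listed above
function sem(const data: array of Extended): extended;
begin
  sem := stddev(data) / sqrt(length(data));
end;

var
  data: array of extended;
begin
  setLength(data, 5);
  data[0] := 2;
  data[1] := 4;
  data[2] := 4;
  data[3] := 5;
  data[4] := 10;

  writeln('n   = ', length(data));
  writeln('SD  = ', stddev(data):0:4);
  writeln('SEM = ', sem(data):0:4);
end.
```

With these values the sample standard deviation is 3 and n is 5, so the SEM equals 3 divided by the square root of 5. The function should only be called with at least two values, since a single value carries no information about dispersion.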

Coefficient of variation

The coefficient of variation (CoV or CV), also known as relative standard deviation (RSD), is a measure of dispersion that is standardised with respect to the data's mean.

It can be calculated with:

{**
  calculates the coefficient of variation (CV or CoV) of a vector of extended
}
function cv(const data: array of Extended): extended;
begin
  result := stddev(data) / mean(data);
end;
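The following sketch shows one way of using this function and of reporting the result as a percentage. The program name and the sample values are illustrative assumptions. The function cv is repeated so that the program is self-contained.

```pascal
program CvDemo;

{$mode objfpc}

uses
  math;

// same function as listed above
function cv(const data: array of Extended): extended;
begin
  result := stddev(data) / mean(data);
end;

var
  data: array of extended;
begin
  setLength(data, 5);
  data[0] := 2;
  data[1] := 4;
  data[2] := 4;
  data[3] := 5;
  data[4] := 10;

  writeln('mean = ', mean(data):0:4);
  writeln('SD   = ', stddev(data):0:4);
  writeln('CV   = ', cv(data):0:4);
  // the CV is often reported as a percentage
  writeln('CV % = ', (cv(data) * 100):0:1);
end.
```

Here the standard deviation is 3 and the mean is 5, so the CV is 0.6. The CV is not defined for data with a mean of zero, so callers may want to check the mean before calling cv.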