Difference between revisions of "IEEE 754 formats"

From Free Pascal wiki
m (Added back page link)
 
(14 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
{{IEEE 754 formats}}
 
{{IEEE 754 formats}}
  
 +
<syntaxhighlight lang="pascal" inline>single</syntaxhighlight>, <syntaxhighlight lang="pascal" inline>double</syntaxhighlight> and <syntaxhighlight lang="pascal" inline>extended</syntaxhighlight> are [[FPC]]'s [[Data type|data types]] implementing [[Pascal]]’s [[Real|<syntaxhighlight lang="pascal" inline>real</syntaxhighlight>]].
 +
All of them are implemented according to the IEEE 754 standard, where <syntaxhighlight lang="pascal" inline>single</syntaxhighlight> is “single-precision”, <syntaxhighlight lang="pascal" inline>double</syntaxhighlight> is “double-precision”, and <syntaxhighlight lang="pascal" inline>extended</syntaxhighlight> has 80 bits.
  
Back to [[Data type|data types]].
+
== <syntaxhighlight lang="pascal" inline>single</syntaxhighlight> ==
 
 
 
 
<syntaxhighlight lang="pascal" enclose="none">single</syntaxhighlight>, <syntaxhighlight lang="pascal" enclose="none">double</syntaxhighlight> and <syntaxhighlight lang="pascal" enclose="none">extended</syntaxhighlight> are Pascal's [[Data type|data types]] implementing the platform-dependent [[Real|<syntaxhighlight lang="pascal" enclose="none">real</syntaxhighlight>]].
 
All of them are implemented according IEEE standard 754, where <syntaxhighlight lang="pascal" enclose="none">single</syntaxhighlight> is “single-precision”, <syntaxhighlight lang="pascal" enclose="none">double</syntaxhighlight> is “double-precision”, and <syntaxhighlight lang="pascal" enclose="none">extended</syntaxhighlight> has 80 bits.
 
 
 
== <syntaxhighlight lang="pascal" enclose="none">single</syntaxhighlight> ==
 
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 23: Line 19:
  
 
Definition of a data field of data type Single:
 
Definition of a data field of data type Single:
<syntaxhighlight lang=pascal>
+
<syntaxhighlight lang="pascal">
  var
+
var
    s: Single;
+
  s: single;
 
</syntaxhighlight>
 
</syntaxhighlight>
  
 
Examples of assigning valid values:
 
Examples of assigning valid values:
  
<syntaxhighlight lang=pascal>
+
<syntaxhighlight lang="pascal">
    s := -123.45678;
+
s := -123.45678;
    s := 0;
+
// Note: '0' is an integer literal. 0.0 is a real literal.
    s := 123.45678;
+
// Here, FPC will make an implicit typecast from integer to single:
 +
s := 0;
 +
// a positive sign is optional
 +
s := 123.45678;
 
</syntaxhighlight>
 
</syntaxhighlight>
  
Line 40: Line 39:
 
<!-- no syntaxhighlight, because this is a counter-example -->
 
<!-- no syntaxhighlight, because this is a counter-example -->
 
<syntaxhighlight lang = "text">
 
<syntaxhighlight lang = "text">
    s := '-123.45678';
+
s := '-123.45678';
    s := '0';
+
s := '0';
    s := '123.45678';
+
s := '123.45678';
 
</syntaxhighlight>
 
</syntaxhighlight>
  
The difference between the two examples is that the upper example is the assignment of Integer and FloatingCommand literals, while the assignment of the lower example is literals of the String type.
+
The difference between the two examples is that the upper example is the assignment of Integer and Floating literals, while the assignment of the lower example is literals of the String type.
  
 
=== Binary floating-point format ===
 
=== Binary floating-point format ===
Line 65: Line 64:
 
|}
 
|}
  
== <syntaxhighlight lang="pascal" enclose="none">double</syntaxhighlight> ==
+
== <syntaxhighlight lang="pascal" inline>double</syntaxhighlight> ==
  
 
Any value stored as a double requires 64 bits, formatted as shown in the table below:
 
Any value stored as a double requires 64 bits, formatted as shown in the table below:
Line 82: Line 81:
 
| Fraction f of the number 1.f
 
| Fraction f of the number 1.f
 
|}
 
|}
 +
 +
Example of converting from raw data to ''double'' (Data is <syntaxhighlight lang="pascal" inline>array [0..7] of byte</syntaxhighlight>):
 +
 +
<syntaxhighlight lang="pascal">
 +
function ToDouble(const Data; IntelEndianness: Boolean = False):Double;inline;
 +
var
 +
  ADouble: Double absolute Data;
 +
  AQWord: QWord absolute Result;
 +
begin
 +
  Result := ADouble;
 +
  if not IntelEndianness then
 +
    AQWord := SwapEndian(AQWord);
 +
end;
 +
</syntaxhighlight>
 +
 +
== <syntaxhighlight lang="pascal" inline>extended</syntaxhighlight> ==
 +
<syntaxhighlight lang="pascal" inline>extended</syntaxhighlight> is an 80-bit wide floating-point data type.
 +
There are <abbr title="floating-point unit">FPU</abbr>s that internally use 80 bits for increased precision.
 +
FPC lets you use this gain in precision.
 +
 +
{{Note|If a platform does not support the <syntaxhighlight lang="pascal" inline>extended</syntaxhighlight> data type, it is mapped to the largest available floating-point type, i.e. usually <syntaxhighlight lang="pascal" inline>double</syntaxhighlight>.}}
 +
 +
==Using the maximum precision for constants==
 +
 +
In this example:
 +
<syntaxhighlight lang="pascal">
 +
var
 +
  f: double;
 +
  n: integer = 1758;
 +
  m: integer = 0;
 +
begin
 +
  f := n * 1.2E6 + (2*m+1) * 50E3; // 2109650048
 +
</syntaxhighlight>
 +
 +
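FPC treats the constant <tt>1.2E6</tt> as <tt>Single</tt>, because it fits into the range of the <tt>Single</tt> type. This loses precision in the calculation: the result is 2109650048 instead of 2109650000. That is consistent with the sum being rounded to <tt>Single</tt>, whose 24-bit significand spaces adjacent values 128 apart between 2^30 and 2^31, so 2109650000 is not representable.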
 +
 +
You can use the FPC command-line option <tt>-CF64</tt> to force floating-point constants to have at least 64 bits of precision. Alternatively, the directive <tt>{$MINFPCONSTPREC <n>}</tt> forces the compiler to evaluate all floating-point constants with the precision given by <n>, which can be <tt>32</tt>, <tt>64</tt> or <tt>DEFAULT</tt> (<tt>80</tt> is not supported for implementation reasons). In FPC v3.3.1+, you can also use the directive <tt>{$excessprecision on}</tt>, which also exists in Delphi.
  
 
{{Data types}}
 
{{Data types}}

Latest revision as of 21:41, 5 April 2023


single, double and extended are FPC's data types implementing Pascal’s real. All of them are implemented according to the IEEE 754 standard, where single is “single-precision”, double is “double-precision”, and extended has 80 bits.

single

value range: 1.5E-45 .. 3.4E38
accuracy: 6 to 9 significant decimal digits
memory requirement: 4 bytes (32 bits)
property: A variable of type single can hold floating-point values as well as signed and unsigned integer values.

Assigning values of any other kind causes the compiler to report an error and abort compilation, so no executable program is created.

Declaration of a variable of type single:

var
  s: single;

Examples of assigning valid values:

s := -123.45678;
// Note: '0' is an integer literal. 0.0 is a real literal.
// Here, FPC will make an implicit typecast from integer to single:
s := 0;
// a positive sign is optional
s := 123.45678;

Examples of assigning invalid values:

s := '-123.45678';
s := '0';
s := '123.45678';

The difference between the two examples is that the first assigns integer and floating-point literals, whereas the second attempts to assign string literals, which the compiler rejects.
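If a number arrives as text, it has to be converted explicitly before it can be stored in a single. The following is a minimal sketch using the standard Val procedure; the helper name ParseSingleOr and its Fallback parameter are made up for illustration and are not part of FPC.

```pascal
program StringToSingle;
{$mode objfpc}

// Hypothetical helper: converts Source to a Single, or returns Fallback
// if Source is not a valid number.
function ParseSingleOr(const Source: string; Fallback: Single): Single;
var
  ErrorPos: Word;
begin
  // Val leaves ErrorPos at 0 on success; otherwise it holds the position
  // of the first character that could not be parsed.
  Val(Source, Result, ErrorPos);
  if ErrorPos <> 0 then
    Result := Fallback;
end;

var
  s: Single;
begin
  s := ParseSingleOr('-123.45678', 0.0);
  WriteLn(s:0:5);
  s := ParseSingleOr('not a number', 0.0);
  WriteLn(s:0:5);
end.
```

Val always expects a period as the decimal separator, which matches the notation of Pascal literals.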

Binary floating-point format

Any value stored as a single requires 32 bits, formatted as shown in the table below:

Bits Usage
31 Sign (0 = positive, 1 = negative)
30 to 23 Exponent, biased by 127
22 to 0 Fraction f of the number 1.f
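The three fields of the table can be extracted with shifts and masks. This is only a sketch: the function names are invented for this example, and it assumes that SizeOf(Single) is 4 and that the platform stores floating-point values in the same byte order as integers.

```pascal
program SingleFields;
{$mode objfpc}

// Returns the raw 32-bit pattern of a Single (assumes SizeOf(Single) = 4).
function SingleBits(Value: Single): LongWord;
begin
  Move(Value, Result, SizeOf(Result));
end;

// Bit 31: 0 = positive, 1 = negative.
function SignBit(Value: Single): LongWord;
begin
  Result := SingleBits(Value) shr 31;
end;

// Bits 30 to 23: exponent, still biased by 127.
function BiasedExponent(Value: Single): LongWord;
begin
  Result := (SingleBits(Value) shr 23) and $FF;
end;

// Bits 22 to 0: fraction f of the number 1.f.
function FractionBits(Value: Single): LongWord;
begin
  Result := SingleBits(Value) and $7FFFFF;
end;

var
  s: Single;
begin
  s := -123.45678;
  WriteLn('sign     = ', SignBit(s));
  WriteLn('exponent = ', BiasedExponent(s), ' (biased by 127)');
  WriteLn('fraction = ', FractionBits(s));
end.
```

For example, 1.0 has sign 0, biased exponent 127 and fraction 0, while 1.5 differs only in that the top fraction bit is set.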

double

Any value stored as a double requires 64 bits, formatted as shown in the table below:

Bits Usage
63 Sign (0 = positive, 1 = negative)
62 to 52 Exponent, biased by 1023
51 to 0 Fraction f of the number 1.f

Example of converting from raw data to double (Data is array [0..7] of byte):

function ToDouble(const Data; IntelEndianness: Boolean = False): Double; inline;
var
  ADouble: Double absolute Data;
  AQWord: QWord absolute Result;
begin
  Result := ADouble;
  if not IntelEndianness then
    AQWord := SwapEndian(AQWord);
end;
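As an alternative sketch, the same conversion can be written without untyped parameters or absolute variables by assembling the 64-bit pattern byte by byte. BytesToDouble is a hypothetical name, not an FPC routine. The code assumes that Data holds at least 8 bytes and that the platform stores doubles in the same byte order as 64-bit integers.

```pascal
program DecodeDouble;
{$mode objfpc}

// Hypothetical helper: builds a Double from the first 8 bytes of Data.
// LittleEndian tells whether Data[0] is the least significant byte.
function BytesToDouble(const Data: array of Byte; LittleEndian: Boolean): Double;
var
  Raw: QWord;
  i: Integer;
begin
  Raw := 0;
  for i := 0 to 7 do
    if LittleEndian then
      Raw := Raw or (QWord(Data[i]) shl (8 * i))
    else
      Raw := (Raw shl 8) or Data[i];
  // Copy the bit pattern unchanged into the floating-point result.
  Move(Raw, Result, SizeOf(Result));
end;

begin
  // 3F F0 00 00 00 00 00 00 is 1.0 in big-endian byte order:
  // sign 0, biased exponent $3FF = 1023, fraction 0.
  WriteLn(BytesToDouble([$3F, $F0, 0, 0, 0, 0, 0, 0], False):0:1);
end.
```

Because the bit pattern is assembled arithmetically, the helper does not need to know the byte order of the machine it runs on, only that of the incoming data.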

extended

extended is an 80-bit wide floating-point data type. There are FPUs that internally use 80 bits for increased precision. FPC lets you use this gain in precision.

Note: If a platform does not support the extended data type, it is mapped to the largest available floating-point type, i.e. usually double.
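One rough way to see which case applies on a given target is to compare storage sizes. This is a sketch with an invented function name; it only looks at SizeOf, which includes any padding, so it says nothing about the exact number of significant bits.

```pascal
program ExtendedCheck;
{$mode objfpc}

// Hypothetical helper: True if Extended occupies more storage than Double
// on the target this program was compiled for.
function ExtendedIsWiderThanDouble: Boolean;
begin
  Result := SizeOf(Extended) > SizeOf(Double);
end;

begin
  WriteLn('SizeOf(Double)   = ', SizeOf(Double));
  WriteLn('SizeOf(Extended) = ', SizeOf(Extended));
  if ExtendedIsWiderThanDouble then
    WriteLn('extended is a separate, wider type here')
  else
    WriteLn('extended appears to be mapped to double here');
end.
```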

Using the maximum precision for constants

In this example:

var
  f: double;
  n: integer = 1758;
  m: integer = 0;
begin
  f := n * 1.2E6 + (2*m+1) * 50E3; // 2109650048

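FPC treats the constant 1.2E6 as Single, because it fits into the range of the Single type. This loses precision in the calculation: the result is 2109650048 instead of 2109650000. That is consistent with the sum being rounded to Single, whose 24-bit significand spaces adjacent values 128 apart between 2^30 and 2^31, so 2109650000 is not representable.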

You can use the FPC command-line option -CF64 to force floating-point constants to have at least 64 bits of precision. Alternatively, the directive {$MINFPCONSTPREC <n>} forces the compiler to evaluate all floating-point constants with the precision given by <n>, which can be 32, 64 or DEFAULT (80 is not supported for implementation reasons). In FPC v3.3.1+, you can also use the directive {$excessprecision on}, which also exists in Delphi.
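As a sketch of the directive in use, the calculation from the example can be wrapped in a function and compiled with {$MINFPCONSTPREC 64}. The function name Compute is made up for this illustration.

```pascal
program ConstPrecision;
{$mode objfpc}
{$MINFPCONSTPREC 64}  // treat floating-point constants as at least 64-bit

// Hypothetical wrapper around the expression from the example above.
function Compute(n, m: LongInt): Double;
begin
  Result := n * 1.2E6 + (2 * m + 1) * 50E3;
end;

begin
  WriteLn(Compute(1758, 0):0:0);
end.
```

With the constants held as double, every intermediate value of this expression is exactly representable, so the program should print 2109650000 rather than 2109650048.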

