Fast direct pixel access


Introduction

Standard graphical LCL components provide a Canvas object for common drawing. However, most of the available graphic routines carry some overhead as the price of universality, platform independence and safety. To achieve the best drawing speed it can be useful to use specialized bitmap structures and routines. You can also use existing graphics libraries.

Direct pixel access in libraries is generally slowed down by several factors:

  • Coordinate limits checking
  • Cropping
  • Facility for automatic image updating and redrawing (notifications and mass operation optimization)
  • Abstract program constructions such as properties, static and virtual methods, dynamic two-dimensional arrays
  • Support for multiple pixel formats
  • Support for multiple platforms
  • Support for bit-level pixel size and addressing

This article shows how to build a custom bitmap structure that is then copied to a TBitmap to render it on the screen.

Pixel format

A pixel type can be written in various ways to define its size and internal channel structure. Here you can find some known ways to define a pixel type.

Simple ordinal type

We can use simple integer pixels, which are fast to handle on a 32-bit platform:

TFastBitmapPixel = Integer;

Record

Or more abstract pixels with separated components:

TFastBitmapPixelComponents = packed record
  Blue: Byte;
  Green: Byte;
  Red: Byte;
  Alpha: Byte;
end;

Subrange

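It is even possible to go further to the bit level and define a 16-bit RGB pixel as used by some LCD displays. Note that the record has to be declared bitpacked; with packed alone each subrange field still occupies a whole byte:

TFastBitmapPixelComponents16Bit = bitpacked record
  Blue: 0..31; // 5 bits
  Green: 0..63; // 6 bits
  Red: 0..31; // 5 bits
end;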
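
Free Pascal packs subrange fields down to individual bits only when the record is declared bitpacked. The following is a minimal sketch of converting between 8-bit channels and such a 16-bit pixel; the names TRgb565, PackRgb565 and Expand5 are invented for this example and are not part of any library.

```pascal
program Rgb565Demo;
{$mode objfpc}

type
  // Hypothetical name; 5 + 6 + 5 bits stored in one 16-bit value.
  TRgb565 = bitpacked record
    Blue: 0..31;   // 5 bits
    Green: 0..63;  // 6 bits
    Red: 0..31;    // 5 bits
  end;

// Reduce 8-bit channels to 5/6/5 bits by dropping the low bits.
function PackRgb565(R, G, B: Byte): TRgb565;
begin
  Result.Red := R shr 3;
  Result.Green := G shr 2;
  Result.Blue := B shr 3;
end;

// Expand a 5-bit value back to 8 bits by replicating its top bits.
function Expand5(V: Byte): Byte;
begin
  Result := (V shl 3) or (V shr 2);
end;

var
  P: TRgb565;
begin
  P := PackRgb565(255, 128, 0);
  WriteLn('SizeOf = ', SizeOf(P));
  WriteLn('R=', P.Red, ' G=', P.Green, ' B=', P.Blue);
  WriteLn('Red expanded = ', Expand5(P.Red));
end.
```

Dropping the low bits is the simplest reduction; rounding or dithering would give better visual results at some extra cost.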

Pointer

A pixel can be a pointer, which is useful when the pixel value itself is rather large or can be compressed somehow.

TFastBitmapPixelComponentsValue = packed record
  Blue: Word;
  Green: Word;
  Red: Word;
  Alpha: Word;
end;

TFastBitmapPixelComponents = ^TFastBitmapPixelComponentsValue;

Class

Another option is to use polymorphism of classes. But this requires calling Create and Destroy for each pixel instance and causes further background processing for allocating and deallocating memory on the heap.

TFastBitmapPixel = class
  procedure Clear; virtual;
end;

TFastBitmapPixelComponents = class(TFastBitmapPixel)
  Blue: Word;
  Green: Word;
  Red: Word;
  Alpha: Word;
  procedure Clear; override;
end;

TFastBitmapPixelByte = class(TFastBitmapPixel)
  Value: Byte;
  procedure Clear; override;
end;

Variant parts in record

A notable implementation of the pixel type is the Graphics32 library, which defines a pixel as a record with variant parts (a case construction) so that the same pixel data can be accessed in different ways, like a union. This type provides four ways to access the pixel data and its internal structure:

TColor32 = type Cardinal;

TColor32Component = (ccBlue, ccGreen, ccRed, ccAlpha);

TColor32Entry = packed record
  case Integer of
    0: (B, G, R, A: Byte);
    1: (ARGB: TColor32);
    2: (Planes: array[0..3] of Byte);
    3: (Components: array[TColor32Component] of Byte);
end;
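
To illustrate how the four views overlap, here is a short sketch that repeats the declarations above and writes through one view while reading through another. It assumes a little-endian CPU, where B is the lowest byte of ARGB; the program name and the sample values are arbitrary.

```pascal
program Color32Demo;
{$mode objfpc}

type
  TColor32 = type Cardinal;
  TColor32Component = (ccBlue, ccGreen, ccRed, ccAlpha);
  TColor32Entry = packed record
    case Integer of
      0: (B, G, R, A: Byte);
      1: (ARGB: TColor32);
      2: (Planes: array[0..3] of Byte);
      3: (Components: array[TColor32Component] of Byte);
  end;

var
  C: TColor32Entry;
begin
  C.ARGB := $FF336699;           // write the whole pixel at once
  WriteLn(SizeOf(C));            // all four views share the same 4 bytes
  WriteLn(C.R, ' ', C.G, ' ', C.B, ' ', C.A);
  C.Components[ccAlpha] := 128;  // change one channel by name
  C.Planes[0] := 0;              // same byte as C.B
  WriteLn(C.B, ' ', C.A);
end.
```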

Bitmap structure

A bitmap class should provide direct pixel access by X, Y coordinates. But some graphic operations can be further optimized by not calculating coordinates for every pixel and instead shifting a pixel pointer by simple pointer addition. Some mass operations, such as filling a rectangular region, can be optimized using the Move and FillChar functions.

Two dimensional dynamic array

This is the native way to express a two-dimensional array in Pascal. Internally it is implemented as a pointer to an array of pointers to data, because a dynamic array is in fact a pointer to the array data. Calculating a pixel position is then a matter of fetching the row pointer and adding the horizontal position to it.

interface

type
  TFastBitmap = class
  private
    function GetSize: TPoint;
    procedure SetSize(const AValue: TPoint);
  public
    Pixels: array of array of TFastBitmapPixel;
    property Size: TPoint read GetSize write SetSize;
  end;

implementation

{ TFastBitmap }

function TFastBitmap.GetSize: TPoint;
begin
  Result.X := Length(Pixels);
  if Result.X > 0 then 
    Result.Y := Length(Pixels[0])
  else 
    Result.Y := 0;
end;

procedure TFastBitmap.SetSize(const AValue: TPoint);
begin
  SetLength(Pixels, AValue.X, AValue.Y);
end;
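
A small usage sketch of the same storage, assuming the integer pixel type from earlier. Because SetLength receives AValue.X first, the first index is X and each inner array holds the pixels that share one X value, so the inner loop below runs over Y to walk contiguous memory.

```pascal
program DynArrayDemo;
{$mode objfpc}

type
  TFastBitmapPixel = Integer;

var
  Pixels: array of array of TFastBitmapPixel;
  X, Y: Integer;
begin
  SetLength(Pixels, 4, 3);  // 4 entries in X, 3 in Y, zero-initialized
  for X := 0 to High(Pixels) do
    for Y := 0 to High(Pixels[X]) do
      Pixels[X, Y] := X + Y * 10;
  WriteLn(Length(Pixels), ' x ', Length(Pixels[0]));
  WriteLn(Pixels[3, 2]);
end.
```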

Raw dynamic memory

It is good to keep the whole bitmap in one compact memory area. Such a memory block behaves like the video memory of a video card. The position of a pixel has to be calculated with the equation Y * Width + X, which costs an addition and a multiplication. Access to pixels is still fast thanks to inlining of the GetPixel and SetPixel methods, but more instructions are needed than in the case of a two-dimensional dynamic array.

interface

type
  PFastBitmapPixel = ^TFastBitmapPixel;
  TFastBitmap = class
  private
    FPixelsData: PByte;
    FSize: TPoint;
    function GetPixel(X, Y: Integer): TFastBitmapPixel; inline;
    procedure SetPixel(X, Y: Integer; const AValue: TFastBitmapPixel); inline;
    procedure SetSize(const AValue: TPoint);
  public
    constructor Create;
    destructor Destroy; override;
    property Size: TPoint read FSize write SetSize;
    property Pixels[X, Y: Integer]: TFastBitmapPixel read GetPixel write SetPixel;
  end;

implementation

{ TFastBitmap }

function TFastBitmap.GetPixel(X, Y: Integer): TFastBitmapPixel;
begin
  Result := PFastBitmapPixel(FPixelsData + (Y * FSize.X + X) * SizeOf(TFastBitmapPixel))^;
end;

procedure TFastBitmap.SetPixel(X, Y: Integer; const AValue: TFastBitmapPixel);
begin
  PFastBitmapPixel(FPixelsData + (Y * FSize.X + X) * SizeOf(TFastBitmapPixel))^ := AValue;
end;

procedure TFastBitmap.SetSize(const AValue: TPoint);
begin
  if (FSize.X = AValue.X) and (FSize.Y = AValue.Y) then
    Exit;
  FSize := AValue;
  FPixelsData := ReAllocMem(FPixelsData, FSize.X * FSize.Y * SizeOf(TFastBitmapPixel));
end;

constructor TFastBitmap.Create;
begin
  Size := Point(0, 0);
end;

destructor TFastBitmap.Destroy;
begin
  FreeMem(FPixelsData);
  inherited Destroy;
end;

Strict Pointer pixel access

We can eliminate some of the coordinate multiplications with low-level pixel access using pointers only. Then only addition (incrementation) is needed to change the current pixel position.

interface

type
  PFastBitmapPixel = ^TFastBitmapPixel;
  TFastBitmap = class
  private
    FPixelsData: PByte;
    FSize: TPoint;
    procedure SetSize(const AValue: TPoint);
  public
    constructor Create;
    destructor Destroy; override;
    property Size: TPoint read FSize write SetSize;
    function GetPixelAddress(X, Y: Integer): PFastBitmapPixel; inline;
    function GetPixelSize: Integer; inline;
  end;

implementation

{ TFastBitmap }

procedure TFastBitmap.SetSize(const AValue: TPoint);
begin
  if (FSize.X = AValue.X) and (FSize.Y = AValue.Y) then Exit;
  FSize := AValue;
  FPixelsData := ReAllocMem(FPixelsData, FSize.X * FSize.Y * SizeOf(TFastBitmapPixel));
end;

constructor TFastBitmap.Create;
begin
  Size := Point(0, 0);
end;

destructor TFastBitmap.Destroy;
begin
  FreeMem(FPixelsData);
  inherited Destroy;
end;

function TFastBitmap.GetPixelAddress(X, Y: Integer): PFastBitmapPixel;
begin
  Result := PFastBitmapPixel(FPixelsData) + Y * FSize.X + X;
end;

function TFastBitmap.GetPixelSize: Integer;
begin
  Result := SizeOf(TFastBitmapPixel);
end;

In this case drawing pixels is less readable:

procedure RandomImage(FastBitmap: TFastBitmap);
var
  X, Y: Integer;
  PRow: PFastBitmapPixel;
  PPixel: PFastBitmapPixel;
begin
  with FastBitmap do begin
    PRow := GetPixelAddress(0, 0); // start at the first row so the loop stays inside the buffer
    for Y := 0 to Size.Y - 1 do begin
      PPixel := PRow;
      for X := 0 to Size.X - 1 do begin
        PPixel^ := Random(256) or (Random(256) shl 16) or (Random(256) shl 8);
        Inc(PPixel);
      end;
      Inc(PRow, Size.X);
    end;
  end;
end;

Pixel operation optimization

Basic line algorithm

This is the naive form, which is readable but pays for it with slower processing.

procedure TFastBitmap.HorizontalLine(X, Y, Length: Integer; Color: TFastBitmapPixel);
var
  I: Integer;
begin
  for I := 0 to Length - 1 do
    Pixels[X + I, Y] := Color;
end;

Pointers

With pointers we can eliminate most of the pixel address additions and multiplications performed by Pixels property access. Only a fast increment operation is performed.

procedure TFastBitmap.HorizontalLine(X, Y, Length: Integer; Color: TFastBitmapPixel);
var
  I: Integer;
  P: PFastBitmapPixel;
begin
  P := PFastBitmapPixel(FPixelsData + (Y * Size.X + X) * SizeOf(TFastBitmapPixel));
  for I := 0 to Length - 1 do begin
    P^ := Color;
    Inc(P);
  end;
end;

Mass fill using FillDWord

Access using pointers and incrementation is the fastest possible with conventional single operations. But most of today's CPUs offer instructions for mass operations, such as MOVS and STOS on the x86 architecture. The pixel size should be 1, 2 or 4 bytes to be able to use this optimization (FillByte, FillWord or FillDWord respectively); the example assumes 4-byte pixels.

procedure TFastBitmap.HorizontalLine(X, Y, Length: Integer; Color: TFastBitmapPixel);
var
  P: PFastBitmapPixel;
begin
  P := PFastBitmapPixel(FPixelsData + (Y * Size.X + X) * SizeOf(TFastBitmapPixel));
  FillDWord(P^, Length, Color);
end;
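
The same idea extends to rectangular regions by issuing one FillDWord call per row. The following is a minimal sketch under the assumption of 4-byte pixels stored in one raw buffer; FillRect and its parameters are names invented for this example, and no clipping is performed, so the caller has to keep the rectangle inside the buffer.

```pascal
program FillRectDemo;
{$mode objfpc}

type
  TFastBitmapPixel = LongWord;  // assumed 4 bytes so FillDWord applies
  PFastBitmapPixel = ^TFastBitmapPixel;

// Fill a W x H rectangle at (X, Y) in a buffer with BufWidth pixels per row.
procedure FillRect(Data: PFastBitmapPixel; BufWidth, X, Y, W, H: Integer;
  Color: TFastBitmapPixel);
var
  Row: Integer;
  P: PFastBitmapPixel;
begin
  P := Data + Y * BufWidth + X;  // address of the first pixel
  for Row := 0 to H - 1 do begin
    FillDWord(P^, W, Color);     // one mass store per row
    Inc(P, BufWidth);            // same column, one row down
  end;
end;

const
  Width = 8;
  Height = 4;
var
  Data: PFastBitmapPixel;
begin
  Data := AllocMem(Width * Height * SizeOf(TFastBitmapPixel)); // zero-filled
  FillRect(Data, Width, 2, 1, 3, 2, $FFFFFF);
  WriteLn(Data[1 * Width + 2]);  // inside the rectangle
  WriteLn(Data[0]);              // outside the rectangle
  FreeMem(Data);
end.
```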

Inlining

If code is notably small, as in the SetPixel and GetPixel methods, it is better to inline the instructions rather than push and pop values on the stack and execute call and ret instructions. This optimization is even more significant when the operation is executed many times, as pixel operations are.

procedure TFastBitmap.HorizontalLine(X, Y, Length: Integer; Color: TFastBitmapPixel); inline;
var
  P: PFastBitmapPixel;
begin
  P := PFastBitmapPixel(FPixelsData + (Y * Size.X + X) * SizeOf(TFastBitmapPixel));
  FillDWord(P^, Length, Color);
end;

DMA

If a memory block has to be copied to another memory location or to device memory, DMA (Direct Memory Access) can be used. The CPU does not have to be involved in the copy operation and can do further processing. This kind of optimization can be used in OpenGL for copying data to video card memory.

Drawing bitmap on screen

In this test let us assume that we have a simple bitmap structure designed as a two-dimensional byte array where each pixel has 256 possible colors. This could be a gray image or a palette-mapped image. All image manipulation is done with custom functions with direct pixel access. Thanks to the custom data structure, functions can be optimized for faster block memory operations if necessary.

To display the image on a form, the custom bitmap has to be copied to the canvas area of some TWinControl. The image has to be copied repeatedly if a moving image is generated. Every bitmap copy in memory takes some time, so our aim is to perform as few copy operations as possible and to copy our bitmap to the screen directly where possible.

You can draw the image as fast as possible in a simple loop:

repeat
  FastBitmapToBitmap(FastBitmap, Image1.Picture.Bitmap);
  Application.ProcessMessages;
until Terminated;

Alternatively, draw the image using a Timer with a defined drawing interval. If nothing has changed on the bitmap there is no need to copy it to the screen, so a simple RedrawPending flag can be used. Because Redraw only sets the flag and the actual drawing is deferred to the timer, several redraw requests between two timer ticks result in a single drawn frame.

TForm1 = class(TForm)
published
  procedure Timer1Execute(Sender: TObject);
  ...  
public
  RedrawPending: Boolean;
  Drawing: Boolean;
  FastBitmap: TFastBitmap;
  procedure Redraw;
  ...
end;
  
procedure TForm1.Redraw;
begin
  RedrawPending := True;
end;
  
procedure TForm1.Timer1Execute(Sender: TObject);
begin
  if (not Drawing) and RedrawPending then 
  try
    Drawing := True;
    CustomProcessing(FastBitmap);
    FastBitmapToBitmap(FastBitmap, Image1.Picture.Bitmap);        
  finally
    RedrawPending := False;
    Drawing := False;
  end;
end;

Draw methods

  • These methods show various ways to draw a bitmap.
  • FastBitmap needs to be already initialized to the desired size. The target Bitmap (Canvas) should have the same size as FastBitmap. Use something like Bitmap.SetSize(FastBitmap.Size.X, FastBitmap.Size.Y).
  • These drawing methods do not assume an exact pixel color format. For illustration they use an abstract function called FastPixelToTColor, which needs to be implemented according to the color format actually chosen.
  • For simplicity, these methods assume a bitmap structure accessible through a two-dimensional Pixels array, usually implemented as a property. For even better performance various further optimizations can be used. See #Pixel operation optimization and #Strict Pointer pixel access.
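
As one possible implementation of FastPixelToTColor for the grayscale case, the sketch below assumes that TFastBitmapPixel is a Byte holding a gray level. TColor is declared locally as a stand-in so the program compiles without the LCL; in a real LCL program it comes from the Graphics unit.

```pascal
program GrayToColorDemo;
{$mode objfpc}

type
  TFastBitmapPixel = Byte;  // assumed: one gray level per pixel
  TColor = LongInt;         // stand-in; LCL programs use Graphics.TColor

// TColor stores channels as $00BBGGRR, so gray repeats the value three times.
function FastPixelToTColor(Pixel: TFastBitmapPixel): TColor;
begin
  Result := LongInt(Pixel) or (LongInt(Pixel) shl 8) or (LongInt(Pixel) shl 16);
end;

begin
  WriteLn(HexStr(FastPixelToTColor(0), 6));    // 000000
  WriteLn(HexStr(FastPixelToTColor(128), 6));  // 808080
  WriteLn(HexStr(FastPixelToTColor(255), 6));  // FFFFFF
end.
```

For a palette-mapped image the function would instead look the value up in a 256-entry color table.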

TBitmap.Canvas.Pixels

This is the most straightforward but slowest method:

procedure FastBitmapToBitmap(FastBitmap: TFastBitmap; Bitmap: TBitmap);
var
  X, Y: Integer;
begin
  for Y := 0 to FastBitmap.Size.Y - 1 do
    for X := 0 to FastBitmap.Size.X - 1 do
      Bitmap.Canvas.Pixels[X, Y] := FastPixelToTColor(FastBitmap.Pixels[X, Y]);
end;

TBitmap.Canvas.Pixels with Update locking

The previous method can be sped up by update locking, which reduces per-pixel update and event signaling.

procedure FastBitmapToBitmap(FastBitmap: TFastBitmap; Bitmap: TBitmap);
var
  X, Y: Integer;
begin
  Bitmap.BeginUpdate(True);
  try
    for Y := 0 to FastBitmap.Size.Y - 1 do
      for X := 0 to FastBitmap.Size.X - 1 do
        Bitmap.Canvas.Pixels[X, Y] := FastPixelToTColor(FastBitmap.Pixels[X, Y]);
  finally
    Bitmap.EndUpdate(False);
  end;
end;

TLazIntfImage

TLazIntfImage is a memory image. It can store transparency and 16-bit values for each channel. TBitmap is compatible with Delphi and uses the TColor type for pixels, which does not contain alpha information, so TLazIntfImage is better suited for image processing. This class provides faster access to pixels because it is an array in memory like our TFastBitmap.

Here we copy TFastBitmap grayscale pixels into a TLazIntfImage to convert it into a TBitmap.

uses
  ..., LCLType, LCLProc, LCLIntf, IntfGraphics;

procedure FastBitmapToBitmap(FastBitmap: TFastBitmap; Bitmap: TBitmap);
var
  X, Y: Integer;
  TempIntfImage: TLazIntfImage;
begin
  // The temporary image could be pre-created and held by the owner class
  // to avoid creating it anew for each frame
  TempIntfImage := Bitmap.CreateIntfImage;
  try
    for Y := 0 to FastBitmap.Size.Y - 1 do
      for X := 0 to FastBitmap.Size.X - 1 do begin
        TempIntfImage.Colors[X, Y] := TColorToFPColor(FastPixelToTColor(FastBitmap.Pixels[X, Y]));
      end;
    Bitmap.LoadFromIntfImage(TempIntfImage);
  finally
    TempIntfImage.Free;
  end;
end;

We can also work directly with TLazIntfImage pixels, which can serve our purpose well. To do this, create a TLazIntfImage, set the pixel format and then use GetDataLineStart to access scanlines. This is still an indirect method, because a TLazIntfImage needs to be copied to a TBitmap to be drawn.

TBGRABitmap.ScanLine

There is a graphics library, BGRABitmap, which allows access to scan lines. The overall speed of this method is pretty good. Drawing is done directly to the Canvas of a control such as TForm or TPaintBox. The pixel format is 32-bit color with alpha channel, i.e. 8 bits for each channel.

Using TBitmap.ScanLine was a method used frequently in Delphi, but TBitmap.ScanLine is not supported by LCL. The ScanLine property gives access to the memory starting point of each row's raw data. Direct manipulation of pixels is then much faster than using the Pixels property, as no additional events are fired.

We can copy our grayscale FastBitmap data to a TBGRABitmap to render it on the screen.

uses
  ..., BGRABitmap, BGRABitmapTypes;

procedure FastBitmapToCanvas(FastBitmap: TFastBitmap; Canvas: TCanvas);
var
  X, Y: Integer;
  P: PBGRAPixel;
  bgra: TBGRABitmap;
begin
  bgra := TBGRABitmap.Create(FastBitmap.Size.X, FastBitmap.Size.Y);
  try
    with FastBitmap do
      for Y := 0 to Size.Y - 1 do begin
        P := bgra.ScanLine[Y];
        for X := 0 to Size.X - 1 do begin
          // Red/blue channel order is ignored, which is fine for gray values
          PLongWord(P)^ := LongWord(FastPixelToTColor(Pixels[X, Y])) or $ff000000;
          Inc(P);
        end;
      end;
    bgra.InvalidateBitmap; // Changed by direct access
    bgra.Draw(Canvas, 0, 0, False);
  finally
    bgra.Free;
  end;
end;

We can also use TBGRABitmap only. Where possible, this library works with device-independent bitmaps of the operating system, so it generally provides direct or quasi-direct pixel access.

The BGRABitmap tutorial shows how to access pixels directly.

TBitmap.RawImage

This method is so far the fastest compared to the previous ones, but it is more complicated because special care has to be given to the bitmap data structure. The example assumes that the bitmap PixelFormat is pf24bit. The raw data layout may differ across platforms.

uses
  ..., GraphType;

procedure FastBitmapToBitmap(FastBitmap: TFastBitmap; Bitmap: TBitmap);
var
  X, Y: Integer;
  PixelPtr: PByte;
  PixelRowPtr: PByte;
  RawImage: TRawImage;
  BytePerPixel: Integer;
  Color: TColor;
begin
  Bitmap.BeginUpdate(False);
  try
    RawImage := Bitmap.RawImage;
    PixelRowPtr := RawImage.Data;
    BytePerPixel := RawImage.Description.BitsPerPixel div 8;
    for Y := 0 to FastBitmap.Size.Y - 1 do begin
      PixelPtr := PixelRowPtr;
      for X := 0 to FastBitmap.Size.X - 1 do begin
        Color := FastPixelToTColor(FastBitmap.Pixels[X, Y]);
        // Copy only as many bytes as one pixel occupies (3 for pf24bit) so the
        // last pixel does not overrun the buffer. Assumes a little-endian CPU;
        // channel order is ignored, which is fine for gray values.
        Move(Color, PixelPtr^, BytePerPixel);
        Inc(PixelPtr, BytePerPixel);
      end;
      Inc(PixelRowPtr, RawImage.Description.BytesPerLine);
    end;
  finally
    Bitmap.EndUpdate(False);
  end;
end;

Examples of RawImage.Description values on various platforms:

Platform   Format Depth BitsPerPixel BitOrder    ByteOrder LineOrder   LineEnd       RedPrec RedShift GreenPrec GreenShift BluePrec BlueShift AlphaPrec AlphaShift
Windows    RGBA   24    24           ReverseBits LSBFirst  TopToBottom DWordBoundary 8       16       8         8          8        0         0         0
Windows    RGBA   15    16           ReverseBits LSBFirst  TopToBottom DWordBoundary 5       10       5         5          5        0         0         0
Linux GTK2 RGBA   24    32           BitsInOrder LSBFirst  TopToBottom DWordBoundary 8       16       8         8          8        0         0         0
OpenGL

OpenGL is mainly used for complex 3D modeling, but it can also be used for simple accelerated 2D graphics. We need to initialize OpenGL, create one textured rectangle with the screen resolution and set an orthogonal view. Then we are able to fill the texture with our custom converted bitmap data. However, this method is not significantly faster than the RawImage.Data method, because all image data is copied by the glTexImage2D function, which means a slow copy by the CPU.

uses
  ..., GL, OpenGLContext;

var
  TextureId: GLuint;
  OpenGLBitmap: Pointer; // buffer of Width * Height 32-bit pixels, must be allocated before use
  OpenGLControl1: TOpenGLControl;

procedure InitGL;
begin
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity;
  glOrtho(0, OpenGLControl1.Width, OpenGLControl1.Height, 0, 0, 1);
  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();
  glDisable(GL_DEPTH_TEST);
  glViewport(0, 0, OpenGLControl1.Width, OpenGLControl1.Height);
  glGenTextures(1, @TextureId);
  glTexEnvf(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE); 
end;

procedure FastBitmapToBitmap(FastBitmap: TFastBitmap; OpenGLControl: TOpenGLControl);
var
  X, Y: Integer;
  P: PInteger;
  R: PInteger;
const
  GL_CLAMP_TO_EDGE = $812F;
begin
  glClear(GL_COLOR_BUFFER_BIT or GL_DEPTH_BUFFER_BIT);

  P := OpenGLBitmap;
  with FastBitmap do
  for Y := 0 to Size.Y - 1 do begin
    R := P;
    for X := 0 to Size.X - 1 do begin
      R^  := FastPixelToTColor(Pixels[X, Y]) or $ff000000;
      Inc(R);
    end;
    Inc(P, Size.X);
  end;

  glLoadIdentity;
  glTranslatef(-OpenGLControl.Width div 2, -OpenGLControl.Height div 2, 0.0);

  glEnable(GL_TEXTURE_2D);
  glBindTexture(GL_TEXTURE_2D, TextureId);
    //glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    //glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, 4, OpenGLControl.Width, OpenGLControl.Height,
      0, GL_RGBA, GL_UNSIGNED_BYTE, OpenGLBitmap);

  glBegin(GL_QUADS);
    glColor3ub(255, 255, 255);
    glTexCoord2f(0, 0);
    glVertex3f(0, 0, 0);
    glTexCoord2f(OpenGLControl.Width div 2, 0);
    glVertex3f(OpenGLControl.Width, 0, 0);
    glTexCoord2f(OpenGLControl.Width div 2, OpenGLControl.Height div 2);
    glVertex3f(OpenGLControl.Width, OpenGLControl.Height, 0);
    glTexCoord2f(0, OpenGLControl.Height div 2);
    glVertex3f(0, OpenGLControl.Height, 0);
  glEnd();

  OpenGLControl.SwapBuffers;
end;

OpenGL PBO

This method uses an asynchronous DMA transfer to copy texture data, so the CPU is free to do further computations. It also eliminates one additional copy operation which is done by glTexImage2D in the previous method. The method requires the GL_ARB_pixel_buffer_object extension.

uses
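  ..., GL, GLext, OpenGLContext;

var
  TextureId: GLuint;
  OpenGLBitmap: Pointer; // buffer of Width * Height 32-bit pixels, must be allocated before use
  OpenGLControl1: TOpenGLControl;
  pboIds: array[0..1] of GLuint;
  Index: Integer = 0;      // PBO currently used as the texture source
  NextIndex: Integer = 1;  // PBO currently being filled with new data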

procedure InitGL;
var
  DataSize: Integer;
begin
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity;
  glOrtho(0, OpenGLControl1.Width, OpenGLControl1.Height, 0, 0, 1);
  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();
  glDisable(GL_DEPTH_TEST);
  glViewport(0, 0, OpenGLControl1.Width, OpenGLControl1.Height);
  glGenTextures(1, @TextureId);
  glTexEnvf(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE); 

  OpenGLControl1.MakeCurrent;
  DataSize := OpenGLControl1.Width * OpenGLControl1.Height * SizeOf(Integer);
  if Load_GL_ARB_vertex_buffer_object then begin
    glGenBuffersARB(2, @pboIds);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[0]);
    glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB, DataSize, Pointer(0), GL_STREAM_READ_ARB);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboIds[1]);
    glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB, DataSize, Pointer(0), GL_STREAM_READ_ARB);

  end else raise Exception.Create('GL_ARB_pixel_buffer_object not supported');

  glEnable(GL_TEXTURE_2D);
  glBindTexture(GL_TEXTURE_2D, TextureId);
    //glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    //glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, 4, OpenGLControl1.Width, OpenGLControl1.Height,
      0, GL_RGBA, GL_UNSIGNED_BYTE, OpenGLBitmap);
end;

procedure FastBitmapToBitmap(FastBitmap: TFastBitmap; OpenGLControl: TOpenGLControl);
var
  X, Y: Integer;
  P: PInteger;
  R: PInteger;
  Ptr: ^GLubyte;
  TextureShift: TPoint;
  TextureShift2: TPoint;
const
  GL_CLAMP_TO_EDGE = $812F;
begin
  // Index selects the PBO whose contents are copied into the texture
  // NextIndex selects the other PBO, which is filled with new pixel data
  Index := (Index + 1) mod 2;
  NextIndex := (Index + 1) mod 2;

  glLoadIdentity;

  // bind the texture and PBO
  glBindTexture(GL_TEXTURE_2D, TextureId);
  glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[index]);

  // copy pixels from PBO to texture object
  // Use offset instead of pointer.
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, OpenGLControl.Width, OpenGLControl.Height,
    GL_BGRA, GL_UNSIGNED_BYTE, Pointer(0));


  // bind PBO to update texture source
  glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[nextIndex]);

  // Note that glMapBufferARB() causes sync issue.
  // If GPU is working with this buffer, glMapBufferARB() will wait(stall)
  // until GPU to finish its job. To avoid waiting (idle), you can call
  // first glBufferDataARB() with NULL pointer before glMapBufferARB().
  // If you do that, the previous data in PBO will be discarded and
  // glMapBufferARB() returns a new allocated pointer immediately
  // even if GPU is still working with the previous data.
  glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, OpenGLControl.Width * OpenGLControl.Height * SizeOf(Integer), Pointer(0), GL_STREAM_DRAW_ARB);

  // map the buffer object into client's memory
  ptr := glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
  if Assigned(ptr) then begin
    // update data directly on the mapped buffer
    P := PInteger(Ptr);
    with FastBitmap do
    for Y := 0 to Size.Y - 1 do begin
      R := P;
      for X := 0 to Size.X - 1 do begin
        R^ := NoSwapBRComponent(FastPixelToTColor(Pixels[X, Y]) or $ff000000);
        Inc(R);
      end;
      Inc(P, Size.X);
    end;
    glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB); // must match the target used for mapping
  end;

  // it is good idea to release PBOs with ID 0 after use.
  // Once bound with 0, all pixel operations are back to normal ways.
  glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

  glClear(GL_COLOR_BUFFER_BIT or GL_DEPTH_BUFFER_BIT);
  glTranslatef(-OpenGLControl.Width / 2, -OpenGLControl.Height / 2, 0.0);
  glBindTexture(GL_TEXTURE_2D, TextureId);

  glBegin(GL_QUADS);
    glColor3ub(255, 255, 255);
    glTexCoord2f(0, 0);
    glVertex3f(0, 0, 0);
    glTexCoord2f(OpenGLControl.Width div 2, 0);
    glVertex3f(OpenGLControl.Width, 0, 0);
    glTexCoord2f(OpenGLControl.Width div 2, OpenGLControl.Height div 2);
    glVertex3f(OpenGLControl.Width, OpenGLControl.Height, 0);
    glTexCoord2f(0, OpenGLControl.Height div 2);
    glVertex3f(0, OpenGLControl.Height, 0);
  glEnd();                   

  OpenGLControl.SwapBuffers;
end;
