Make your own compiler, interpreter, parser, or expression analyzer

From Free Pascal wiki
Revision as of 11:43, 17 April 2019 by MarkMLl (talk | contribs) (→‎Anatomy of a compiler: New "ancient history" section)

FCL Passrc

FPC comes with a Pascal parser in library form in the fcl-passrc package. This is not the parser used by the compiler itself, but it is the one used by fpdoc and pas2js.

Other FPC parser packages

fcl-xml is an FPC package that contains SAX XML and HTML parsers.

FPC also contains two expression parsers: symbolic and TFPExpressionParser.

Lex and Yacc

Lex and Yacc are two of the oldest Unix tools. Lex generates lexical analysers (tokenizers), and Yacc is an LALR parser generator. Grammars are written in a BNF-like notation, a formal way to express context-free grammars. Code and grammar are mixed in the same file, so a grammar is tied to its implementation language.
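As a rough illustration of the lexer's half of that split, the sketch below does by hand what a generated scanner does: it turns a character stream into tokens using regular expressions, ready for a parser to consume. The token names and the `tokenize` function are assumptions made for this example, and Python is used only to keep the sketch short; this is not the output of Lex or Plex.

```python
import re

# Hypothetical token set for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/()]"),
    ("SKIP", r"\s+"),
]
# One combined pattern with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs; raise ValueError on bad input."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":  # whitespace is recognised but not emitted
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

A call such as `tokenize("x1 + 42")` yields the pairs `("IDENT", "x1")`, `("OP", "+")` and `("NUMBER", "42")`.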

Plex and Pyacc

Plex and Pyacc are Pascal implementations of Lex and Yacc, and they are part of the Free Pascal distribution.

Lazarus Lex and Yacc

The Lazarus Lex and Yacc project, unfortunately abandoned, can be found here.

Gold

Gold is a free parsing system that you can use to develop your own programming languages, scripting languages and interpreters. It uses LALR parsing, and a mix of BNF notation, character sets and regular expressions for terminals to define language grammars. Code and grammar are separated, so a grammar is not tied to any implementation language. This means that the same grammar can be loaded into engines written in different programming languages.
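The idea of separating the grammar from the engine can be sketched with a table-driven scanner: the tables are plain data, and the engine that walks them contains no language-specific rules. The table layout, the `TABLES` contents and the `next_token` function below are invented for this example; they are not Gold's compiled grammar format or any Gold engine's API.

```python
import string

# Hypothetical "compiled" tables: plain data that any host language could load.
TABLES = {
    "charsets": {
        "digit": string.digits,
        "letter": string.ascii_letters + "_",
    },
    "start": 0,
    # accepting state -> token kind
    "accept": {1: "Integer", 2: "Identifier"},
    # state -> list of (charset name, next state)
    "edges": {
        0: [("digit", 1), ("letter", 2)],
        1: [("digit", 1)],
        2: [("letter", 2), ("digit", 2)],
    },
}


def next_token(tables, text, pos):
    """Longest-match DFA scan from pos.

    Returns (kind, lexeme, new_pos), or None if no token starts at pos.
    """
    state = tables["start"]
    best = None
    i = pos
    while i < len(text):
        target = None
        for charset, nxt in tables["edges"].get(state, []):
            if text[i] in tables["charsets"][charset]:
                target = nxt
                break
        if target is None:
            break  # no transition: stop and report the last accepted prefix
        state = target
        i += 1
        if state in tables["accept"]:
            best = (tables["accept"][state], text[pos:i], i)
    return best
```

Because `next_token` only interprets the tables, changing the language means changing the data, not the engine, which is the property that lets one grammar serve engines written in different languages.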

Gold Parser Builder can be used to create, modify and test languages in a Windows IDE, which also runs under Wine. Command-line tools are also available.

Gold Parser Builder can also be used as a parser code generator using internal templates (Free Pascal included), and there are also third-party engines that process compiled grammars.

Gold Parser Builder has a grammar editor with syntax highlighting, a grammar-generating wizard, a test window for stepping through the parsing of a sample source, a templating system that can generate lexers/parsers or skeleton programs for various languages (including Delphi and Free Pascal), YACC/Bison import/export, XML and HTML export, and interactive inspection of the compiled DFA and LALR tables.

There is a subjective feature comparison table of several parsers on the Gold site, with special attention to the Gold vs Yacc comparison.

SynFacilSyn

SynFacilSyn is a cross-platform Lazarus library that includes a SynEdit highlighter which can also work as a lexer thanks to its flexible syntax definition file. It is well documented and has been used in several projects as both a highlighter and a lexer.

SynFacilSyn includes an adapter to be used in ATSynEdit.

The main advantage of using SynFacilSyn as a lexer is that you can see the tokens colored in the SynEdit editor without any extra work. So if you define a lexer, you are defining a source highlighter too.

To define the syntax you can use an XML file, in which the tokens are defined using tags and some RegEx constructions. You can also specify the token colors there if you need to.

The syntax can also be defined programmatically, so it is possible to change some syntax elements at runtime.

The SynFacilSyn highlighter is light and fast, and includes some useful methods to access the text content after the scan. It also includes features to define recursive blocks and sections, so it can be used as a parser too.

SynFacilSyn is the base library for other tools that together can be used to create powerful compilers, interpreters and IDEs:

  • SynFacilCompletion - Scriptable highlighter with code completion for the SynEdit component of Lazarus.
  • SynFacilUtils - Library with utilities to create editors using SynFacilSyn and SynFacilCompletion.
  • t-Xpres - Framework to create compilers/interpreters based on SynFacilSyn. Includes a lexer, parser, syntax tree, expression evaluator and support for implementing code generators or virtual machines.

The following programs have been created using these tools:

AntLR

TBD

Coco-R

Coco/R is a compiler generator based on L-attributed grammars that generates a scanner and a parser.

For more information:

http://www.ssw.uni-linz.ac.at/Coco/


Two chapters of the following book give an introduction to Coco/R and present some case studies:

Compilers and Compiler Generators - an introduction with C++

P.D. Terry, Rhodes University, 1996

http://www.cs.ru.ac.za/compilers/index.html


http://www.cs.ru.ac.za/compilers/pdfvers.pdf

http://www.cs.ru.ac.za/compilers/cocorp.zip

Anatomy of a compiler

Here is a graphical representation of the anatomy of a typical compiler:

[Figure: Anatomy of a compiler]

The parse tree is typically stored in RAM, where an optimiser can recognise and simplify idioms, e.g. to unroll loops. Some early compilers attempted to store the entire program's parse tree, which was only converted to lower-level code on completion of the pass that generated it; one notable example was the Pastel ("an off-colour Pascal") compiler[1] running on a DEC PDP-11, which famously did not become the basis of Stallman's GCC compiler:[2]

Hoping to avoid the need to write the whole compiler myself, I obtained the source code for the Pastel compiler, which was a multiplatform compiler developed at Lawrence Livermore Lab. It supported, and was written in, an extended version of Pascal, designed to be a system-programming language. I added a C front end, and began porting it to the Motorola 68000 computer. But I had to give that up when I discovered that the compiler needed many megabytes of stack space, and the available 68000 Unix system would only allow 64k.

I then realized that the Pastel compiler functioned by parsing the entire input file into a syntax tree, converting the whole syntax tree into a chain of “instructions”, and then generating the whole output file, without ever freeing any storage. At this point, I concluded I would have to write a new compiler from scratch. That new compiler is now known as GCC; none of the Pastel compiler is used in it, but I managed to adapt and use the C front end that I had written. But that was some years later; first, I worked on GNU Emacs.

The code generator takes (fragments of) the parse tree and generates either binary object files or assembler source. At this stage there is typically further optimisation, in particular recognising cases such as writes to variables that are never read.
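The two stages described above can be sketched on a toy scale. In the example below the parse tree is a nested tuple, the optimiser performs constant folding (a simpler optimisation than the loop unrolling mentioned above, chosen to keep the sketch short), and the code generator emits assembler-like text for a stack machine. The tree layout, the `fold` and `generate` functions and the instruction mnemonics are all assumptions made for this illustration.

```python
import operator

# Operator symbol -> (evaluation function, hypothetical stack-machine mnemonic).
OPS = {
    "+": (operator.add, "ADD"),
    "-": (operator.sub, "SUB"),
    "*": (operator.mul, "MUL"),
}

# Tree nodes: ("num", value), ("var", name) or ("bin", op, left, right).


def fold(node):
    """Optimiser pass: replace constant subtrees with their computed value."""
    if node[0] != "bin":
        return node
    _, op, left, right = node
    left, right = fold(left), fold(right)
    if left[0] == "num" and right[0] == "num":
        return ("num", OPS[op][0](left[1], right[1]))
    return ("bin", op, left, right)


def generate(node, out=None):
    """Code generator: emit postfix instructions for a stack machine."""
    if out is None:
        out = []
    kind = node[0]
    if kind == "num":
        out.append(f"PUSH {node[1]}")
    elif kind == "var":
        out.append(f"LOAD {node[1]}")
    else:
        _, op, left, right = node
        generate(left, out)   # operands first ...
        generate(right, out)
        out.append(OPS[op][1])  # ... then the operation
    return out
```

For the tree of `x + 2 * 3`, generating code directly gives five instructions, while running `fold` first reduces the constant subtree to `("num", 6)` and leaves three: `LOAD x`, `PUSH 6`, `ADD`.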

Ancient history

A wiki entry hosted by a Pascal compiler obviously has to start off with Niklaus Wirth, who was supervised by Harry Huskey at UC Berkeley; Wirth's doctoral work implemented a language named Euler on an IBM 704. After Berkeley, Wirth moved to Stanford, where he reimplemented Euler in ALGOL-60 on either a Burroughs B5000 or B5500 (drum- or disc-based respectively; the system was upgraded at about the same time). He then moved on to PL/360 and ALGOL-W, the latter of which he proposed as a successor to ALGOL-60. Broadly speaking, Wirth's early compilers used recursive ascent; later editions of his books introduced recursive descent as an alternative.
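For readers unfamiliar with recursive descent, the technique maps each grammar rule onto one procedure, and the procedures call each other in the same way the rules refer to each other. The sketch below evaluates integer expressions this way. The grammar, given in the comments, and the `evaluate` function are assumptions made for this example; it is not taken from any of Wirth's compilers or books.

```python
import re


def evaluate(text):
    """Evaluate an integer expression with one function per grammar rule."""
    # Toy scanner: characters outside the token set are silently ignored.
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():  # expr = term { ("+" | "-") term }
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # term = factor { ("*" | "/") factor }
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value //= factor()  # integer division, for simplicity
        return value

    def factor():  # factor = number | "(" expr ")"
        tok = advance()
        if tok == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

Operator precedence falls out of the rule nesting: `term` is called from `expr`, so multiplication binds tighter than addition without any explicit precedence table.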

Useful BNF and EBNF tools

See also