Make your own compiler, interpreter, parser, or expression analyzer

Revision as of 11:22, 17 April 2019

FCL Passrc

FPC comes with a Pascal parser in library form in the fcl-passrc package. This is not the main compiler's parser, but it is the one used by fpdoc and pas2js.
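As a rough sketch of how fcl-passrc is used: its ParseSource entry point needs a TPasTreeContainer descendant that it calls back to create tree nodes. The outline below follows the package's documented usage, but exact unit names and signatures may differ between FPC releases, and the file name demo.pp is just an illustration (the program writes the file itself so the example is self-contained).

```pascal
program PasrcDemo;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, PasTree, PScanner, PParser;

type
  { ParseSource requires a TPasTreeContainer descendant to build tree nodes }
  TSimpleEngine = class(TPasTreeContainer)
  public
    function CreateElement(AClass: TPTreeElement; const AName: String;
      AParent: TPasElement; AVisibility: TPasMemberVisibility;
      const ASourceFilename: String; ASourceLinenumber: Integer): TPasElement;
      override;
    function FindElement(const AName: String): TPasElement; override;
  end;

function TSimpleEngine.CreateElement(AClass: TPTreeElement; const AName: String;
  AParent: TPasElement; AVisibility: TPasMemberVisibility;
  const ASourceFilename: String; ASourceLinenumber: Integer): TPasElement;
begin
  Result := AClass.Create(AName, AParent);
  Result.Visibility := AVisibility;
  Result.SourceFilename := ASourceFilename;
  Result.SourceLinenumber := ASourceLinenumber;
end;

function TSimpleEngine.FindElement(const AName: String): TPasElement;
begin
  Result := nil;  { no cross-unit resolution in this sketch }
end;

var
  Engine: TSimpleEngine;
  Module: TPasModule;
begin
  { Write a trivial unit to disk so there is something to parse }
  with TStringList.Create do
  try
    Add('unit demo;');
    Add('interface');
    Add('implementation');
    Add('end.');
    SaveToFile('demo.pp');
  finally
    Free;
  end;

  Engine := TSimpleEngine.Create;
  try
    Module := ParseSource(Engine, 'demo.pp', 'linux', 'x86_64');
    WriteLn('Parsed module: ', Module.Name);
  finally
    Engine.Free;
  end;
end.
```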

Other FPC parser packages

fcl-xml is an FPC package that contains SAX-based XML and HTML parsers.

FPC also contains two expression parsers: symbolic and TFPExpressionParser.
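As a quick illustration, TFPExpressionParser (from the fpexprpars unit) can evaluate an expression in a few lines. This is a minimal sketch; the property and field names follow the documented API of recent FPC releases:

```pascal
program ExprDemo;
{$mode objfpc}{$H+}
uses
  SysUtils, fpexprpars;

var
  Parser: TFPExpressionParser;
  Res: TFPExpressionResult;
begin
  Parser := TFPExpressionParser.Create(nil);
  try
    Parser.BuiltIns := [bcMath];       { enable the built-in maths functions }
    Parser.Expression := '2 * (3 + 4)';
    Res := Parser.Evaluate;
    WriteLn(Res.ResInteger);           { integer arithmetic yields 14 }
  finally
    Parser.Free;
  end;
end.
```

The parser also supports identifiers and user-defined functions, which makes it a reasonable starting point for e.g. a spreadsheet-style formula evaluator.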

Lex and Yacc

Two of the oldest Unix tools. Lex generates lexical analysers (tokenizers), and Yacc is an LALR parser generator. BNF notation is used as a formal way to express context-free grammars. Code and grammar are mixed, so the grammar is tied to the implementation language.
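For example, the classic unambiguous grammar for arithmetic expressions can be written in BNF as:

```bnf
<expr>   ::= <expr> "+" <term> | <expr> "-" <term> | <term>
<term>   ::= <term> "*" <factor> | <term> "/" <factor> | <factor>
<factor> ::= NUMBER | "(" <expr> ")"
```

In a Yacc grammar, semantic actions such as `{ $$ = $1 + $3; }` are embedded directly in the rules alongside this notation, which is exactly what ties the grammar to the implementation language.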

Plex and Pyacc

Plex and Pyacc are Pascal implementations of Lex and Yacc; they are part of the Free Pascal distribution.

Lazarus Lex and Yacc

A Lazarus version of Lex and Yacc also exists, although it is unfortunately abandoned.

Gold

Gold is a free parsing system that you can use to develop your own programming languages, scripting languages, and interpreters. It uses LALR parsing and defines language grammars with a mix of BNF notation, character sets, and regular expressions for terminals. Code and grammar are separated, so the grammar is not tied to an implementation language. This means that the same grammar can be loaded into engines written in different programming languages.

Gold Parser Builder can be used to create, modify, and test languages in a Windows IDE, which also runs under Wine. Command-line tools are available as well.

Gold Parser Builder can also be used as a parser code generator using internal templates (Free Pascal included), but there are also third-party engines for processing compiled grammars.

Gold Parser Builder has a grammar editor with syntax highlighting, a grammar-generating wizard, a test window to step through the parsing of a sample source, a templating system that can generate lexers/parsers or skeleton programs for various languages (including Delphi and Free Pascal), YACC/Bison import/export, XML and HTML export, and interactive inspection of the compiled DFA and LALR tables.

There is a subjective feature comparison table of several parsers on the Gold site, with special attention to the Gold vs Yacc comparison.

SynFacilSyn

SynFacilSyn is a cross-platform Lazarus library providing a SynEdit highlighter that can also work as a lexer, thanks to its flexible syntax definition file. It is well documented and has been used in several projects as both highlighter and lexer.

SynFacilSyn includes an adapter to be used in ATSynEdit.

The main advantage of using SynFacilSyn as a lexer is that you can see the tokens coloured in the SynEdit editor without any extra work: by defining a lexer, you are defining a source highlighter too.

To define the syntax you can use an XML file, where the tokens are defined using tags and some regex constructions. There you can specify the token colours too, if you need to.

Syntax definition can also be done programmatically, so it is possible to change some syntax elements at runtime.

The highlighter of SynFacilSyn is light and fast, and includes some useful methods to access the text content after the scan. It also includes features for defining recursive blocks and sections, so it can be used as a parser too.

SynFacilSyn is the base library for other tools that together can be used to create powerful compilers, interpreters, and IDEs:

  • SynFacilCompletion - scriptable highlighter with code completion for the SynEdit component of Lazarus.
  • SynFacilUtils - library with utilities for creating editors using SynFacilSyn and SynFacilCompletion.
  • t-Xpres - framework for creating compilers/interpreters based on SynFacilSyn. Includes a lexer, parser, syntax tree, expression evaluator, and support for implementing code generators or virtual machines.

The following programs have been created using these tools:

AntLR

TBD

Coco-R

Coco/R is a compiler generator based on L-attributed grammars which generates a scanner and a parser.

For more information:

http://www.ssw.uni-linz.ac.at/Coco/


Two chapters of the following book give an introduction to Coco/R and present some sample case studies.

Compilers and Compiler Generators - an introduction with C++

P.D. Terry, Rhodes University, 1996

http://www.cs.ru.ac.za/compilers/index.html


http://www.cs.ru.ac.za/compilers/pdfvers.pdf

http://www.cs.ru.ac.za/compilers/cocorp.zip

Anatomy of a compiler

Here is a graphical representation of the anatomy of a typical compiler:

[Diagram: anatomy of a compiler]

The parse tree is typically stored in RAM, where an optimiser can recognise and simplify idioms, e.g. unrolling loops. Some early compilers attempted to store the entire program's parse tree, which was converted to lower-level code only on completion of the pass that generated it; one notable example was the Pastel ("an off-colour Pascal") compiler[1] running on a DEC PDP-11, which famously did not become the basis of Stallman's GCC compiler:[2]

Hoping to avoid the need to write the whole compiler myself, I obtained the source code for the Pastel compiler, which was a multiplatform compiler developed at Lawrence Livermore Lab. It supported, and was written in, an extended version of Pascal, designed to be a system-programming language. I added a C front end, and began porting it to the Motorola 68000 computer. But I had to give that up when I discovered that the compiler needed many megabytes of stack space, and the available 68000 Unix system would only allow 64k.

I then realized that the Pastel compiler functioned by parsing the entire input file into a syntax tree, converting the whole syntax tree into a chain of “instructions”, and then generating the whole output file, without ever freeing any storage. At this point, I concluded I would have to write a new compiler from scratch. That new compiler is now known as GCC; none of the Pastel compiler is used in it, but I managed to adapt and use the C front end that I had written. But that was some years later; first, I worked on GNU Emacs.

The code generator takes (fragments of) the parse tree and generates either binary object files or assembler source. At this stage there is typically further optimisation, in particular to recognise e.g. writes to variables that are never read.
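The stages above can be sketched end-to-end in a toy program: a recursive descent parser builds a tree for expressions over '+' and '*', a constant-folding pass simplifies the tree, and a code generator emits instructions for a hypothetical stack machine. All names and the instruction set here are illustrative, not taken from any real compiler:

```pascal
program MiniPipeline;
{$mode objfpc}{$H+}
uses SysUtils;

type
  { Parse tree node: Op = 'N' marks a number literal, otherwise an operator }
  TNode = class
    Op: Char;
    Value: Integer;
    Left, Right: TNode;
    constructor CreateNum(AValue: Integer);
    constructor CreateOp(AOp: Char; L, R: TNode);
  end;

constructor TNode.CreateNum(AValue: Integer);
begin Op := 'N'; Value := AValue; end;

constructor TNode.CreateOp(AOp: Char; L, R: TNode);
begin Op := AOp; Left := L; Right := R; end;

var
  Src: String;
  P: Integer;   { current position in Src }

function ParseExpr: TNode; forward;

function ParseFactor: TNode;
var N: Integer;
begin
  if Src[P] = '(' then begin
    Inc(P);                        { skip '(' }
    Result := ParseExpr;
    Inc(P);                        { skip ')' }
  end else begin
    N := 0;
    while (P <= Length(Src)) and (Src[P] in ['0'..'9']) do begin
      N := N * 10 + Ord(Src[P]) - Ord('0');
      Inc(P);
    end;
    Result := TNode.CreateNum(N);
  end;
end;

function ParseTerm: TNode;
begin
  Result := ParseFactor;
  while (P <= Length(Src)) and (Src[P] = '*') do begin
    Inc(P);
    Result := TNode.CreateOp('*', Result, ParseFactor);
  end;
end;

function ParseExpr: TNode;
begin
  Result := ParseTerm;
  while (P <= Length(Src)) and (Src[P] = '+') do begin
    Inc(P);
    Result := TNode.CreateOp('+', Result, ParseTerm);
  end;
end;

{ Optimiser: fold any subtree whose operands are both constants }
function Fold(N: TNode): TNode;
begin
  Result := N;
  if N.Op = 'N' then Exit;
  N.Left := Fold(N.Left);
  N.Right := Fold(N.Right);
  if (N.Left.Op = 'N') and (N.Right.Op = 'N') then
    if N.Op = '+' then
      Result := TNode.CreateNum(N.Left.Value + N.Right.Value)
    else
      Result := TNode.CreateNum(N.Left.Value * N.Right.Value);
end;

{ Code generator: emit instructions for a simple stack machine }
procedure Emit(N: TNode);
begin
  if N.Op = 'N' then
    WriteLn('PUSH ', N.Value)
  else begin
    Emit(N.Left);
    Emit(N.Right);
    if N.Op = '+' then WriteLn('ADD') else WriteLn('MUL');
  end;
end;

begin
  Src := '1+2*3';
  P := 1;
  Emit(Fold(ParseExpr));   { the whole tree folds to a constant: PUSH 7 }
end.
```

Without the Fold pass the generator would emit PUSH 1, PUSH 2, PUSH 3, MUL, ADD; with it, the tree collapses to a single constant, which is the essence of the "recognise and simplify idioms" step described above.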

Ancient history

A wiki entry hosted by a Pascal compiler project obviously has to start with Niklaus Wirth, who was supervised by Harry Huskey at UC Berkeley; Wirth's doctoral work implemented a language named Euler on an IBM 704. After Berkeley, Wirth moved to Stanford, where he reimplemented Euler in ALGOL-60 on either a Burroughs B5000 or B5500 (drum- or disc-based respectively; the system was upgraded at about the same time), then moved on to PL/360 and ALGOL-W, which he proposed as a successor to ALGOL-60. Broadly speaking, Wirth's early compilers used recursive ascent; later editions of his books introduced recursive descent as an alternative.

There is a wealth of information on compiler implementation in that era on Huub de Beer's website,[3] which is the source of some of the links that follow. In addition, Thomas Haigh has some very useful material on the history of ALGOL standardisation, in particular throwing light on the crisis caused by the ALGOL-68 committee's rejection of Wirth's ALGOL-W proposal:[4]

According to Hoare’s famous account, given years later in his Turing Award lecture, the group took a decisive wrong turn in 1965 when it inexplicably rejected an elegant and pragmatic draft design compiled by Niklaus Wirth in favor of an “incomplete and incomprehensible” alternative design for a radically different language submitted by Adriaan van Wijngaarden.

Huub de Beer investigates the earliest published algorithms for compiling ALGOL-like languages, and in effect demonstrates that the methods of Dijkstra and Grau used recursive ascent. Edgar ("Ned") Irons at Princeton appeared to be closer to recursive descent, but all of these are notable in that they used multiple stacks and the like, in effect extending Dijkstra's "shunting yard" algorithm to handle the entire language syntax.

At Burroughs, Richard Waychoff had the idea of using the run-time structure of ALGOL to hide much of the complication of a recursive descent compiler; he was apparently in informal contact with Irons, but the idea of using the implementation language's capability for recursion (i.e. with nested local variables etc.) appears to be his alone.[5] It is worth noting at this point that both Donald Knuth and Edsger Dijkstra cooperated to varying degrees with the Burroughs language and operating system developers; Knuth's copy of the document just cited is annotated "mostly true".

Separately, Val Schorre at UCLA wrote Meta-II for an IBM 1401, which both allows the syntax of a language to be defined in a BNF-like fashion and associates code fragments with each recognised syntactic structure.[6] This allows a fairly comprehensive ALGOL-like language to be specified in a few hundred lines, and while the output (typically assembler-like source for a dedicated interpreter) lacks any attempt at optimisation, it can be a useful tool when implementing e.g. an application-specific scripting language.

Associates of Schorre went on to design Tree Meta, which extended Meta-II by first generating a parse tree and then using the tree equivalent of a source statement to generate output code.[7] The advantage of this approach is that while the initial parse stage might be unable to distinguish between e.g. loading a constant and loading the result of an expression into a register, the trees generated by the two cases are recognisably different and can be optimised appropriately. Tree Meta was used by Douglas Engelbart's team to write specialist compilers for "the mother of all demos", which is widely cited as a landmark in the design of interactive systems.

Useful BNF and EBNF tools

See also