Functional Thread 14 : Parser Combinators

[)roi(]

Executive Member
Joined
Apr 15, 2005
Messages
6,282
Reaction score
405
Location

Parser Combinators
The goal of this thread will be to do a ground up build of a purely functional recursive descent parser that can be used to parse simple DSLs akin to regex use cases (e.g. email verification) to far more complex text formats like JSON, XML, HTML, CSS; binary file formats; network protocols and even for the creation of language parsers, ASTs, lexers suitable for Logo, Basic, C#, etc.

The complexity of this thread will vary from intermediate to advanced covering simple functional composition to advanced implementations of functional algebras like Functor, Applicative Functor, Monad, etc.

The end goal will not to build a production ready parser; but rather to demonstrate how something functionally complex can be built from simple functional concepts that compose together to tackle ever more complex and full featured parser challenges.

For production use cases; there are many production capable parser combinator libraries that all follow the same conceptual design that this thread will introduce, for example:
  1. F# : FParsec
  2. Haskell : Attoparsec
  3. C# : LanguageExt.Parsec
  4. Java : PetitParser

Note:
You should be able to find a parser combinator library (following a similar design) for the majority of the mainstream languages.

As for performance; a functional design approach is certainly not sluggish and can be quite performative, for example:
Here's a comparison with a hand-written C code parser for Node.js and a functionally written parser combinator using the AttoParsec library.

761658

What will be needed for anyone that would like to follow along:
  1. Language : F# (latest version 4.7)
  2. .Net framework: 3.0
  3. IDE: Visual Studio for Windows or Visual Studio for Mac, or alternatively Visual Studio Code (+Ionide-fsharp) on Linux / Mac or Windows.
Although I've chosen to use F# for this build; the code and the functional concepts should quite easily translate to any of the mainstream languages.

Working parser example:
As for what parser example I'll build at the end of thread is still undecided; options currently are:
 
Last edited:
Parse a single character
The logic of a simple character parser function is:
  • Input parameters; character stream; and the character we want to parse.
  • If the character stream is empty, return a Failed No more input.
  • If the 1st character matches our character parameter, return a Success of a tuple of matching character and the remaining characters (less the matching character).
  • if the 1st character does not match our character parameter; return a Failed Expecting 'matching character'. Found 'character found'.


Parse function type signature
A parser function is a logical disjunction between a Success or Failed state; this can be represented as a discriminated union type called ParserResult.

To further simplify the use of the ParserResult type; we create a type alias called Parser<'a>, because only the generic 'a type is needed as a variable between a function that parses a character versus a function that parses an integer.

In both cases the character input stream and Failed error message is a string type; meaning we don't need the flexibility to override it in the Parser type alias.
761680


Successive parsing result
Using a character parser for the 'J' character and the following string input, results in a Success tuple result of the 'J' character and the remainder of the string.
761674



Failed parsing result
Using a character parser for the 'J' character and the following string input, results in a Failed result with an error message detailing the cause of the failure.
761676


Implementation of pchar

The following implementation encapsulates the logic of the character parser function described above:
C#:
type ParserResult<'a> =
  | Success of 'a
  | Failed of string

type Parser<'a> = Parser of (string -> ParserResult<'a * string>)

[<AutoOpen>]
module Parse =
  let pchar c =
    let f s =
      if String.IsNullOrEmpty(s) then Failed "No more input"
      else
        let (head, foot) = (s.[0], s.[1..])
        if head = c then Success (c, foot)
        else Failed (sprintf "Expecting '%c'. Found '%c'" c head)
    Parser f

The type signature of the pchar function is recorded as follows:
C#:
val pchar :
    c: char
   -> Parser<char>
i.e. its a function that takes a char, then a character stream (a string) and returns a ParserResult<char * string>.
Note: string -> ParserResult<char * string> is type aliased as Parser<char>.
 
Last edited:
How to use the pchar function
The pchar function is a simple parser combinator function that we'll use to parse a single character from a string (a stream of characters).

To run our pchar function we'll need a helper function that will take a Parser function and apply an input string (character stream) to it.
C#:
  let run parser input =
    let (Parser f) = parser
    f input


Usage example
Let's recreate the parsing examples from the previous post:
C#:
[<EntryPoint>]
let main argv =
  let result1 = run (pchar 'J') "[email protected]"
  printfn "%A" result1

  let result2 = run (pchar 'J') "[email protected]"
  printfn "%A" result2

  let result3 = run (pchar 'J') ""
  printfn "%A" result3


The output from this is...
Code:
Success ('J', "[email protected]")
Failed "Expecting 'J'. Found 'M'"
Failed "No more input"

At this point, you're probably doubtful that such a simple concept could be expanded into a set of combinators that will allow us to construct a powerful recursive descent parser that is able to parse input as simple as an email to something as complex as the C# language syntax.

As with everything functional programming; complex ideas are built on very simple ones; what we're missing is an expanded set of parser combinators; which we'll start constructing in the next post.
 
Last edited:
Combining two parsers
Let's explore what's required to combine two parsers in sequence; for example; let's say we want to match not only against the 'J' prefix in the "[email protected]", but against a character sequence like 'J', 'A', 'C', 'K'.

Essentially we'd need a way to combine two or more character parsers in a sequence.


The combine logic is as follows:
  • Run p1 parser
  • If p1 Failed; return.
  • If its a Success; then run the second parser with the remaining input (foot).
  • If p2 Failed; return.
  • If both p1 and p2 are a Success, then return a tuple containing the result of both head values (head1 & head2).
Let's create a function called andThen which takes as input two parsers p1 and p2; and then using pattern matching we'll evaluate whether p1 is a Success, and use the foot1 result (second part of tuple) as input to p2, again checking if the result is a Success.

Similar to a logical conjunction; the final result is only a Success only if both p1 and p2 are a Success.

C#:
  let andThen p1 p2 =
    let f input =
      match run p1 input with
      | Failed err -> Failed err
      | Success (head1, foot1) ->
        match run p2 foot1 with
        | Failed err -> Failed err
        | Success (head2, foot2) -> Success ((head1, head2), foot2)
    Parser f

The type signature of the andThen function is recorded as follows:
C#:
val andThen :
    p1: Parser<'a> -> p2: Parser<'b> -> Parser<'a * 'b>
The output Parser is a tuple of generic types 'a and 'b.

Simplifying usage of the andThen combinator
To simplify the use of the andThen combinator function; we'll add a compositional infix operator as follows:
C#:
  let (.>>.) = andThen


Usage example
Let's recreate the parsing examples from the previous post:
C#:
[<EntryPoint>]
let main argv =
  // define a new parser that is the combination of 4 character parsers
  let jack_pchar = pchar 'J' .>>. pchar 'A' .>>. pchar 'C' .>>. pchar 'K'

  let result1 = run jack_pchar  "[email protected]"
  printfn "%A" result1

  let result2 = run jack_pchar "[email protected]"
  printfn "%A" result2


The output from this is...
Code:
Success (((('J', 'A'), 'C'), 'K'), "@SPRAT.COM")
Failed "Expecting 'C'. Found 'R'"
The output for the Success case is a little strange; it's a tuple embedded in a tuple embedded in another tuple; which is not ideal; however let's ignore that for now with the promise that we'll simplify that in due course.

The second example; Failed as expected because the 3rd character of the input string "[email protected]" is the character 'R' instead of the expected character 'C'
 
Last edited:
Logical disjunction between two parsers -- or else
Another way to combine two or more parsers is a orElse combinator, for situations when we want to state that a parser must match either character 'J' or 'M'.

The combine logic is as follows:
  • Run p1 parser
  • If p1 is a Success; return the parsed value for p1.
  • If it Failed; then run p2 parser with the original input (i.e. backtracking).
  • If p2 is a Success; return the parsed value for p2.
  • ...and if both Failed, return the error message for the second parser.
C#:
  let orElse p1 p2 =
    let f input =
      let r1 = run p1 input
      match r1 with
      | Success _ -> r1
      | Failed _ -> run p2 input
    Parser f
Unlike other parser implementation; combinator parsers have automatic builtin backtracking because of their functional design; i.e. we don't have to backtrack to run the p2 parser; we simply provide the original input to p2.


Simplifying usage of the orElse combinator
To simplify the use of the andThen combinator function; we'll add a compositional infix operator as follows:
C#:
  let (<|>) = orElse


Usage example
Let's recreate the parsing examples from the previous post:
C#:
[<EntryPoint>]
let main argv =
  // define a new parser that is the logical disjunction of either 'J' or 'M'
  let jma_pchar = pchar 'J' <|> pchar 'M' .>>. pchar 'A'

  let result1 = run jma_pchar  "[email protected]"
  printfn "%A" result1

  let result2 = run jma_pchar "[email protected]"
  printfn "%A" result2


The output from this is...
Code:
Success (('J', 'A'), "[email protected]")
Success (('M', 'A'), "[email protected]")
The combined orElse parser accepts either 'J' or 'M' and hence both parsing examples are a Success. To parse the remaining characters of the both emails is however not possible because our two parsers are expecting a string as the input; whereas if we continue with this concept and try to create a single combined parser for both "JACK" and "MARY", for example:
C#:
let jmacrky_pchar = pchar 'J' <|> pchar 'M' .>>. pchar 'A' .>>. pchar 'C' <|> pchar 'R' .>>. pchar 'K' <|> pchar 'Y'
We ending up with an error; because of the tuple wrapped within a tuple problem apparent in the previous andThen post example.

The problem is that the Parser type signature does not support a multi level tuple; ...so to fix this; we need a way to transform the result from:
Code:
Success (('J', 'A'), "[email protected]")
..to...
Code:
Success ("JA", "[email protected]")
 
Last edited:
Recap
Tackling the problem from the previous post will require us to flesh out a bit more helper functions to get there; which we'll tackle in this post. Let's first try to understand the challenge; which is about the compatibility of input and output function type signatures.

Our base character parser function pchar has the following signature:
C#:
char -> Parser<char>



The following combined parser:
C#:
 let jma_pchar = pchar 'J' <|> pchar 'M' .>>. pchar 'A'
has the following type signature
C#:
val jma_pchar = Parser<char * char>
i.e. the wrapped result value is compounded tuple char * char; which is incompatible with the input signature of above the pchar function; it expects only a single char as its input parameter.

Composition requires compatible input and output type signatures; which is not possible with the pchar character parser. What we need is instead to create a string -> Parser<string> parser and at the same time transform the previous tuple output signature to a single string value.

That will be goal of this post; to flesh out the functions required to build a pstring parser.


Steps to construct a pstring parser
A string parser will be built on a character parser; but to get there requires us to flesh out more flexibility with the character parser.


Any character of parsers
Building a powerful combinator parser with our current functionality is not practical; because to specify that a prefix character can be any letter or any digit would be too verbose with what we currently have, for example... an any digit parser would require the following:
C#:
let anyDigit = pchar '0' <|> pchar '1' <|> pchar '2' <|> pchar '3' <|> pchar '4' <|> pchar '5' <|> pchar '6' <|> pchar '7' <|> pchar '8' <|> pchar '9'
Now imagine what an any letter definition would look like.

To fix this we need a more flexible character combinator; well two of them; choice and anyOf:

Choice
This is where the power of combinators starts; using for example the orElse combinator, we can build even more powerful combinator that is a choice between a list of character parsers, rather than just two:
C#:
let choice ps =
  List.reduce (<|>) ps

The type signature of the choice function is:
C#:
val choice: Parser<'a> list -> Parser<'a>
It takes a list of Parsers and reduces (combines it) to a single Parser.


anyOf
This allows us to define an anyOf combinator; whose input is a list of characters; which using choice will be reduced in a single multi-character parser:
C#:
let anyOf cs =
   cs
   |> List.map pchar // convert character to a pchar
   |> choice // reduce the list of pchar to a single pchar


Usage Example
Let's recreate the previous anyDigit parser, and also create an anyLetter parser:
C#:
let pAnyDigit = ['0'..'9']
let pAnyLowerLetter = ['a'..'z']
let pAnyUpperLetter = ['A'..'Z']
let pAnyLetter = anyLowerLetter <|> anyUpperLetter

To parse the previous two prefix characters of our two email addresses, now requires only this:
C#:
[<EntryPoint>]
let main argv =
  let parseTwoPrefixes = pAnyLetter .>>.  pAnyLetter
  let result1 = run parseTwoPrefixes "[email protected]"
  printfn "%A" result1

  let result2 = run parseTwoPrefixes "[email protected]"
  printfn "%A" result2


The output from this is...
Code:
Success (('J', 'A'), "[email protected]")
Success (('M', 'A'), "[email protected]")



Tackling the transformation of the result
To transform the result we need a function that will allows us to transform the wrapped content of a Parser, because the output from running the above parsers is ParserResult<(char * char) * string> -- a tuple of two matching characters and a tuple with the remaining unparsed string of characters.

What we really need is ParserResult<string * string> in stead of ParserResult<(char * char) * string>; meaning the tuple of (char * char) must be transformed into a string. The function to do transformations in the functional world is map; let's build that for our Parser type.


Map (functor compliance)
C#:
let mapP f p =
  let innerF input =
    let result = run p input
    match result with
    | Success (value, remaining) -> Success (f value, remaining) // transform value
    | Failed err -> Failed err // propagate error condition
  Parser innerF // wrap inner function in a Parser
mapP is a curried function that takes in a value transforming function f and a parser p and then has an internal function innerF that requires some input; a string; we then run our parser p on our curried input, and pattern match; if it's a Success we apply our transform function f to the value, otherwise if it Failed, we propagate the error err.

The type signature of our mapP function is:
C#:
val mapP: f:('a -> 'b) -> Parser<'a> -> Parser<'b>
The generic variable 'a refers to value part in the type alias of:
C#:
type Parser<'a> = Parser of (string -> ParserResult<'a * string>)
i.e. our mapP function allows us to transform the generic 'a parser result portion; the string which represents the remaining unparsed characters is untouched.


Map operators
We'll create two operators to simplify the use of the map function:
C#:
let (<!>) = mapP
let (<&>) x f = mapP f x // inverted parameter version of <!> operator


Usage Example
Let's revisit our previous parseTwoPrefixes parser to transform the result from a tuple of two characters to a string:
C#:
let parseTwoPrefixes =
  let twoLetterParser = pAnyLetter .>>. pAnyLetter
  let transform (c1, c2) = String [|c1; c2|]
  mapP transform twoLetterParser

// or more succinctly using our map operator
let parseTwoPrefixes = (pAnyLetter .>>. pAnyLetter) <&> fun (c1, c2) -> String [|c1; c2|]

Let's now parse our two emails with this.
C#:
[<EntryPoint>]
let main argv =
  let parseTwoPrefixes =  (pAnyLetter .>>. pAnyLetter) <&> fun (c1, c2) -> String [|c1; c2|]
  let result1 = run parseTwoPrefixes "[email protected]"
  printfn "%A" result1

  let result2 = run parseTwoPrefixes "[email protected]"
  printfn "%A" result2


The output from this is...
Code:
Success ("JA", "[email protected]")
Success ("MA", "[email protected]")
Looks good so far; but we're still missing quite a few functions to be able to create a parser that matches a char list (a string). In the next post we'll continue adding more combinator functions that are required to achieve our goal of a pstring parser.
 
Last edited:
Lifting functions into the world of Parsers
The applicative functor; is an algebra like map (functor); except that it lifts functions into the world of parsers; which will enable us to perform more powerful transformations within the Parser world.

The conform with the applicative functor algebra we need to create two functions:
  • a returnP function which allows us to lift a value into a Parser; keep in mind that in the functional world; a function is also treated as a value (i.e. a first class function)
  • an applyP function that like map allows us to apply a transform to a function; except in the case of applyP; the function is a value in the Parser world.

returnP implementation
C#:
let returnP x =
  let f input = Success (x, input)
  Parser f

The type signature of returnP is:
C#:
val returnP: 'a -> Parser<'a>
Confirming that it takes any generic value 'a and lifts it into the Parser world.


applyP implementation
C#:
let applyP fP xP = (fP .>>. xP) |> mapP (fun (f, x) -> f x)
We've used the .>>. operator; which is the andThen operator to achieve this by converting fP and xP into a tuple; mapP then unwraps f and x and applies f transform function to x;
Note:
later we'll look into redefining both mapP and applyP in terms of the monad algebra as part of simplifying our codebase.

The type signature of applyP is:
C#:
val applyP: fP: Parser('a -> 'b) -> xP: Parser<'a> -> Parser<'b>
Allowing to apply a transform function in the Parser world to a value in the Parser world.


Apply operator
We'll create an operator to simplify the use of the applyP function:
C#:
let (<*>) = applyP

So why do we need the applicative functor? ... because whilst a functor (map) allows us to apply a transform function to a value in the Parser world; it can only do this for a single parameter function. The applicative functor allows us to apply a transform function with 1 or more parameters.


liftA2 function
To simplify the lifting of a 2 parameter function into the Parser world to apply it to 2 values in the Parser world; we'll need a function that makes this easier.
C#:
let liftA2 f aP bP = returnP f <*> aP <*> bP
returnP lifts our 2 parameter function into the Parser world, where the applicative functor allows us to apply a value to a function in a curried fashion; meaning we have to use the applicative operator <*> twice for a 2 parameter function; once for Parser with the 'a value and again for the Parser with the 'b value.

The type signature of liftA2 is:
C#:
val liftA2: f: ('a -> 'b -> 'c) -> aP: Parser<'a> -> bP: Parser<'b> -> Parser<'c>


Functions with more than 2 parameters
As required we can define more liftA functions; for functions with more than 2 parameters, for example:
C#:
let liftA f aP = returnP f <*> aP // equivalent to mapP
let liftA2 f aP bP = liftA f aP <*> bP
let liftA3 f aP bP cP = liftA2 f aP bP <*> cP
let liftA4 f aP bP cP dP = liftA3 f aP bP cP <*> dP
...
let liftA9 f aP bP cP dP eP fP gP hP iP = liftA8 f aP bP cP dP eP fP gP hP <*> iP


So what can we do with the Applicative Functor?
Let's look at whether we can convert one of the existing multi parameter string methods to use in the Parser world.

First let's create a startsWith function to wrap the method call:
C#:
let startsWith (s: string) prefix = s.StartsWith(prefix)

Then let's lift the startsWith function it into the Parser world:
C#:
let startsWithP = liftA2 startsWith

Let's now examine the type signatures of these two functions:
C#:
val startsWith: s: string -> prefix: string -> bool
val startsWithP: Parser<string> -> Parser<string> -> Parser<bool>
Hopefully you're starting to see that the Applicative Functor can enable a lot of powerful transformations that weren't previously possible.

We need to continue down this track; taking a step closer to achieving our goal re pstring.
 
Last edited:
The choice and anyOf combinators allowed us to merge multiple character parsers into a single Parser that allowed us to create a pAnyLetter or pAnyDigit parser. What we need now is the ability to tie a sequence of character parsers together into a single parser, like a string parser.

Sequence
To reduce a sequence of character parsers into a single parser we'll use the "cons" :: operator; which constructs a list by appending an element (parser) onto the front of an existing list.
C#:
let rec sequence ps =
  let cons head tail = head :: tail // create a function to wrap the cons operator
  let consP = liftA2 cons // lift the cons function into the world of Parsers
  match ps with
  | [] -> returnP []
  | head::tail -> consP head (sequence tail) // recursively process all the Parsers

The signature of sequence is:
C#:
val sequence: ps: Parser<'a> list -> Parser<'a list>
Confirming that sequence takes a list of generic 'a type Parsers and concatenates them to a single Parser of list of generic 'a types


Usage Example
C#:
[<EntryPoint>]
let main argv =
  let parseTwoPrefixes =  sequence [pAnyLetter;  pAnyLetter]
  let result1 = run parseTwoPrefixes "[email protected]"
  printfn "%A" result1

  let result2 = run parseTwoPrefixes "[email protected]"
  printfn "%A" result2

The output from this is...
C#:
Success (['J'; 'A'], "[email protected]")
Success (['M'; 'A'], "[email protected]")
It returns a list of characters as opposed to a tuple of tuple, which makes it far easier to turn it into a string; because a string is essentially an list of characters.


The pstring parser
Finally we have all the functional tools to implement a parser that matches a string, which we'll call pstring:
C#:
  let pstring s =
    s
    |> List.ofSeq // convert to a list of char
    |> List.map pchar // convert to char parsers
    |> sequence // convert to Parser<char list>
    |> mapP (fun cs -> String(List.toArray cs)) // convert to Parser<string>


Usage Example
Let's define a string parser for "JACK"
C#:
[<EntryPoint>]
let main argv =
  let parseJACK = pstring "JACK"
  let result1 = run parseJACK "[email protected]"
  printfn "%A" result1

  let result2 = run parseJACK "[email protected]"
  printfn "%A" result2

The output from this is...
C#:
Success ("JACK", "@SPRAT.COM")
Failed "Expecting 'J'. Found 'M'"
Which works as expected for JACK...; and also fails for MARY... as expected.


Next Post
Whilst its nice to have a pstring parser; it's not exactly what we need to validate both the JACK and MARY emails. What we really need is a parser that can match against zero, one or more parsers, for example:
  • zero, one or more pAnyDigit to parse a number
  • zero, one or more pAnyLetter to parse a name like JACK or MARY
In the next post we'll build combinators for many and many1; which are combinators that either match zero or more parsers, or 1 or more parsers.
 
Last edited:
Monad?
So far we have built everything without the functional powerhouse algebra; the monad (flatmap). So let's go ahead and conform our Parser type to the monad algebra -- considering that we'll need this for some more powerful combinators + it can help us streamline some of our existing code.

Monad (bind / flatmap)
C#:
let flatmapP f p =
  let f input =
    match run p input with
    | Failed err -> Failed err
    | Success (head, tail) -> run (f head) tail
  Parser f

The type signature of flatmapP is:
C#:
val flatmapP: f: 'a -> Parser<'b> -> p: Parser<'a> -> Parser<'b>


Let's define Bind
C#:
let bindP = flatmapP
bindP is just an alternative function name for flatmapP

FlatMap operator
We'll create one operator to simplify the use of the flatmapP function:
C#:
let (>>=) p f = flatmapP f p


Refactor combinators in terms of our Monad
To streamline our codebase and to simplify maintenance; let's refactor our existing combinators in terms of flatmapP (the monad):

C#:
let mapP f = flatmapP (f >> returnP)
let (<!>) = mapP
let (<&>) x f = mapP f x

let applyP fP xP = fP >>= (fun f -> xP >>= (fun x -> returnP (f x)))
let (<*>) = applyP

let andThen p1 p2 = p1 >>= (fun r1 -> p2 >>= (fun r2 -> returnP (r1, r2)))
let (.>>.) = andThen


Matching a parser multiple times
From the last post; a common requirement is the ability to recursively process a parser whilst it's successful, to:
  • zero, one or more pAnyDigit to parse a number
  • zero, one or more pAnyLetter to parse a name like JACK or MARY
Some variations exist:
  • Some matches are optional e.g. whitespace; hence we need a zero or more matcher.
  • Some matches require 1 or more; e.g. digit(s) of a number.

We need a helper function to
Run the parser; return an empty list when the parser fails (the recursive exit point); and when the parser succeeds we keep recursively calling the helper function zeroOrMore with the remaining values, concatenating the successfully parsed values.

C#:
let rec zeroOrMore p input =
  let result = run p input
  match result with
  | Failed err -> ([], input) // if parse fails, return empty list
  | Success (value, remainder) ->
    // if Success, recurse for all subsequent values
    let (nextValue, nextRemainder) = zeroOrMore p remainder
    (value::nextValue, nextRemainder)


many combinator
Define a parser that can match zero or more times
C#:
let many parser =
  let rec innerFn input = Success (zeroOrMore parser input)
  Parser innerFn


many1 combinator
Define a parser that must match at least once or more times
C#:
let many1 p = p >>= (fun head -> many p >>= (fun tail -> returnP (head::tail)))


Usage Example
Let's define an pAnyString parser for our emails:
C#:
[<EntryPoint>]
let main argv =
  let pAnyUpperLetter = anyOf ['A'..'Z']
  let pAnyLowerLetter = anyOf ['a'..'z']
  let pAnyLetter = pAnyLowerLetter <|> pAnyUpperLetter
  let pAnyString = many pAnyLetter <&> (fun cs -> String(List.toArray cs))

  let result1 = run pAnyString "[email protected]"
  printfn "%A" result1

  let result2 = run pAnyString "[email protected]"
  printfn "%A" result2


The output from this is...
Code:
Success ("JACK", "@SPRAT.COM")
Success ("MARY", "@SPRAT.COM")


In the next post
In the next post we'll explore a few more uses of the many and many1 combinators and use these together with mapP to parse an integer value. We'll also implement a few more combinators like:
  • Optional (opt) -- characters that can optionally appear in the input e.g. negative sign for an integer.
  • Throwing away parsed results (.>> and >>.) -- for characters we expect to find in the parsed text, but characters that we want to exclude from the parsed result. e.g. quotes around a string; a semicolon at the end of a C# line, whitespace characters, etc.
  • between; values in between characters like quotes
  • Parsing lists delimited by separators (sepBy and sepBy1).
 
Last edited:
Optional
For situations where a character could optionally appear in the input; we need a parser to match zero or once; for example: to match against the negative sign of an number.


opt Combinator
C#:
let opt p = p <&> Some <|> returnP None
If the parser is a Success; we wrap the result in a Some, otherwise we return a None.
Note:
Some and None are type constructors for F#'s Option type


The type signature of opt is:
C#:
val opt: p: Parser<'a> -> p: Parser<'a option>


Usage Example
To demonstrate the opt combinator; let's create a pint parser to parse Integers.
C#:
let pint =
  let resultToInt (sign, cs) =
    let value = String(List.toArray cs) |> int
    match sign with
    | None -> value
    | Some _ -> -value
  opt (pchar '-') .>>. many pAnyDigit <&> resultToInt

  let input5 = "-12345"
  let result5 = run pint input5
  printfn "%A" result5


The output from this is...
Code:
Success (-12345, "")


Throwing away parsed results
For situations where we want match against something in the input, but we don't care about retaining the parsed value; we need the ability to selectively discard some parsed values. This is useful for:
  • Quoted string; we required the quotes, but we don't need the quotes in the parsed result.
  • For language syntax statements that end in a semicolon; it's important that the semicolon exists, but we don't need the semicolon to build the language AST result.
To cope with these situations; we need to define a few combinators that make it easy to specify which parts of the parsed values we want to discard.


Discard Left / Right parsers
C#:
/// Keep only the result of the left side parser
let (.>>) p1 p2 =
  // create a pair
  p1 .>>. p2
  // then only keep the first value in the tuple
  |> mapP (fun (a, b) -> a)

/// Keep only the result of the right side parser
let (>>.) p1 p2 =
  // create a pair
  p1 .>>. p2
  // then only keep the second value in the tuple
  |> mapP (fun (a, b) -> b)


Parse only a value between delimiters
A common use case is to create a parser for a value between a set of delimiters.
C#:
/// Keep only the result of the middle parser
let between p1 p2 p3 = p1 >>. p2 .>> p3


Usage Example
C#:
let pars3 = pstring "\"" >>. pAnyString .>> pstring "\""
let input3 = "\"Banana\""
let result3 = run pars3 input3
printfn "%A" result3
    
let pars4 = between (pstring "\"") pAnyString (pstring "\"")
let input4 = "\"Banana\""
let result4 = run pars4 input4
printfn "%A" result4


The output from this is...
Code:
Success ("Banana", "")
Success ("Banana", "")
 
Parsing values separated by delimiters
For situations where we want to parse values separated by a delimiter; we need some combinators like many and many1 that will allow us to parse one or more values separated by a delimiter.


sepBy and sepBy1
C#:
/// Parses one or more occurrences of p separated by sep
let sepBy1 p sep =
  let sepThenP = sep >>. p      
  p .>>. many sepThenP
  <&> fun (p, pList) -> p::pList

/// Parses zero or more occurrences of p separated by sep
let sepBy p sep = sepBy1 p sep <|> returnP []


Usage Example
C#:
let input6 = "abc;def;ghi"
let pars6 = sepBy (many pAnyLetter <&> fun cs -> String(List.toArray cs)) (pchar ';')
let result6 = run pars6 input6
printfn "%A" result6


The output from this is...
Code:
Success (["abc"; "def"; "ghi"], "")



In the next post
At this stage we have the following set of combinators, most of what we need to parse document types like CSV, JSON, etc.
  • andThen / .>>. applies two parses in sequence.
  • orElse / <|> applies the first parser; if it fails it applies the second parser.
  • choice extends orElse to allow a selection of parses.
  • flatmapP / >>= chains the result of 1 parser to another
  • mapP <!> and <&> transform a parser's result.
  • returnP lifts a value into the world of parsers
  • applyP <*> similar to mapP; enables the use of multi parameter functions by using currying.
  • liftA2, liftA3, ... lifts a mult parameter function into the world of parsers and computes it using Parsers as the parameters.
  • sequence converts a list of Parsers into a Parser containing a list
  • many / many1 matches zero or more occurrences of a parser.
  • opt matches and optional occurrence of a parser.
  • .>> keeps only the result of the left side parser.
  • >>. keeps only the result of the right side parser.
  • between keeps only the result of the middle parser.
  • sepBy / sepBy1 parses zero or more occurrence of a parser with a separator.
However before we build of a complete parser example with an AST; we going to first focus on enhancing our combinators to improve:
  • The error messaging of parsers
  • Enhance the flexibility of parsers by tackling the tight coupling between other parsers and the pchar parser.
 
Last edited:
Error Messaging with Parser Combinators
The current parser combinators are tied to the character parser pchar and it's error messaging -- this results in less than ideal error messaging, for example:

Let's create a parser to validate if the first 5 characters of a string are letters (a to z or A to Z):
C#:
let pAnyUpperLetter = anyOf ['A'..'Z']
let pAnyLowerLetter = anyOf ['a'..'z']
let pAnyLetter = pAnyLowerLetter <|> pAnyUpperLetter
let pPrefix5Letters = sequence [ pAnyLetter; pAnyLetter; pAnyLetter; pAnyLetter; pAnyLetter ]
We've created a parser pPrefix5Letters that is a sequence of five pAnyLetter parsers.

Let's validate an email with this parser:
C#:
[<EntryPoint>]
let main argv =
  let result = run pPrefix5Letters "[email protected]."
  printfn "%A" result


The output from this is...
Code:
Failed "Expecting 'Z'. Found '@'"
This error is a bit confusing because of the Expecting 'Z' bit.
Whilst it's correct, because pAnyLetter is a logical disjunction between all the letters (lowercase / uppercase).

We check each letter in turn, moving onto the next letter parser if a check fails i.e. orElse, until we've exhausted all parsers that fit into the composition we've called pAnyLetter -- the last check in that orElse conjunction is the uppercase Z; hence the error message.

A more appropriate and informative error message would be:
Code:
Error parsing 5 letter prefix at line 0 column 4
[email protected].
    ^Unexpected '@'
To achieve this we need to be able to add a custom label to our parser compositions, and we need to build offset tracking in our combinators to track at which line and column a failure occurred.

We'll flesh this out in the next set of posts.
 
Last edited:
Error Messaging : Implementation
To enable custom labelling of our parsers; we need to modify our Parser type, which is currently defined as follows:
C#:
type Parser<'a> = Parser of (string -> ParserResult<'a * string>)

To enabled support for a custom label we need to change the Parser to a record type:
C#:
type ParserLabel = string
type Parser<'a> = {
  fn: (string -> ParserResult<'a * string>)
  label: ParserLabel
}

The record type has two properties:
  • fn for a parser function
  • label for a custom parser label
Another problem is that our change has only ensured that the label is part of the Parser but not the ParserResult; so at this point the label will not be available to any ParserResult. To fix this with need to add support for the label in the ParserResult; more specifically for Failed.


Here is the complete type change for label support:
C#:
type ParserLabel = string
type ParserError = string
type ParserResult<'a> =
  | Success of 'a
  | Failed of ParserLabel * ParserError

type Parser<'a> = {
  fn: (string -> ParserResult<'a * string>)
  label: ParserLabel
}


Printing a parser's result
We can create a print function to simplify the printing of a Parser result with the label.
C#:
[<RequireQualifiedAccess>]
module Parser =
  ...

  /// Print a parsed result
  let public print result =
    match result with
    | Success (value, input) -> printfn "%A" value
    | Failed (label, error) -> printfn "Error parsing %s\n%s" label error


Changing a parser's label
When defining a new Parser; we need a way to change a parser's label to accurately assign the role of the Parser.

C#:
[<RequireQualifiedAccess>]
module Parser =
  ...

  /// Update the label in the parser
  let setLabel p label =
    let f input =
      match p.func input with
      | Success s -> Success s
      | Failed (oldLabel, err) -> Failed (label, err)
    { func = f; label = label }


Infix operator to update the label
C#:
/// Infix version of setLabel 
let (<?>) = setLabel


Usage example
C#:
let pDigit = anyOf [ '0'..'9' ]
let pDigitWithLabel = anyOf [ '0'..'9' ] <?> "Digit"  // assign a custom label for this parser

// Parse without label
run pDigit "A123"
|> Parse.print

// Parse with custom label
run pDigitWithLabel "A123"
|> Parse.print


The output from this is...
Code:
Error parsing 9
Unexpected 'A'
Error parsing Digit
Unexpected 'A'
The output from the pDigitWithLabel more appropriately describes the parsing error.
 
Error Messaging : Implementation continued.

Code Breakages
The custom label change breaks some of our existing code; because we've changed the Parser to a record type. Below is a complete listing of all the combinators including the changes needed to support the record type.

The combinators have also now been encapsulated in a module called Parser that has the RequireQualifiedAccess attribute. This attribute mandates the Parser. prefix when using of any combinators; this is standard design approach in F#, e.g. List.map

The Parser operators have also been encapsulated in a separate Operators module to enable these to be flagged with the AutoOpen attribute i.e. which automatically make the operators available when the Endofunk.Parser namespace is opened.

C#:
namespace Endofunk.Parser
open System

type public ParserLabel = string
type public ParserError = string

type public ParserResult<'a> =
  | Success of 'a
  | Failed of ParserLabel * ParserError

type public Parser<'a> = {
  func: (string -> ParserResult<'a * string>)
  label: ParserLabel
}

[<RequireQualifiedAccess>]
module Parser =
  /// Print a parsed result
  let public print result =
    match result with
    | Success (value, input) -> printfn "%A" value
    | Failed (label, error) -> printfn "Error parsing %s\n%s" label error

  /// Run a parser with some input
  let public run p input = p.func input

  /// Update the label in the parser
  let setLabel p label =
    let f input =
      match p.func input with
      | Success s -> Success s
      | Failed (oldLabel, err) -> Failed (label, err)
    { func = f; label = label }

  /// Infix version of setLabel
  let private (<?>) = setLabel

  /// Monad conformance
  let public flatmap f p =
    let label = "Undefined"
    let f input =
      match run p input with
      | Failed (label, err) -> Failed (label, err)
      | Success (head, tail) -> run (f head) tail
    { func = f; label = label }
  /// Alternative function for flatmapP
  let public bind = flatmap
  /// Infix of flatmapP
  let private (>>=) p f = flatmap f p
  /// Lift a value into a Parser
  let public rtn x =
    let f input = Success (x, input)
    { func = f; label = "" }

  /// Functor conformance
  let public map f = flatmap (f >> rtn)
  /// Infix of mapP
  let private (<!>) = map
  /// Infix Reverse of mapP (piping)
  let private (<&>) x f = map f x

  /// Applicative Functor conformance
  let public apply fP xP = fP >>= fun f -> xP >>= fun x -> rtn (f x)
  /// Infix of applyP
  let private (<*>) = apply
  /// Lift a single parameter function into the Parser world and apply
  let public liftA fn aP = rtn fn <*> aP
  /// Lift a two parameter function into the Parser world and apply
  let public liftA2 fn aP bP = liftA fn aP <*> bP
  /// Lift a three parameter function into the Parser world and apply
  let public liftA3 fn aP bP cP = liftA2 fn aP bP <*> cP
  /// Lift a four parameter function into the Parser world and apply
  let public liftA4 fn aP bP cP dP = liftA3 fn aP bP cP <*> dP
  /// Lift a five parameter function into the Parser world and apply
  let public liftA5 fn aP bP cP dP eP = liftA4 fn aP bP cP dP <*> eP
  /// Lift a six parameter function into the Parser world and apply
  let public liftA6 fn aP bP cP dP eP fP = liftA5 fn aP bP cP dP eP <*> fP

  /// Combine two parsers (logical conjunction)
  let public andThen p1 p2 =
    let label = sprintf "%s andThen %s" p1.label p2.label
    p1 >>= fun r1 -> p2 >>= fun r2 -> rtn (r1, r2) <?> label
  /// Infix of andThen
  let private (.>>.) = andThen
  /// Keep only the result of the left side parser
  let andThenFst p1 p2 = p1 .>>. p2 |> map (fun (a, b) -> a)
  let private (.>>) = andThenFst
  /// Keep only the result of the right side parser
  let andThenSnd p1 p2 = p1 .>>. p2 |> map (fun (a, b) -> b)
  let private (>>.) = andThenSnd
  /// Keep only the result of the middle parser
  let public between p1 p2 p3 = p1 >>. p2 .>> p3

  /// Combine two parsers (logical disjunction)
  let public orElse p1 p2 =
    let label = sprintf "%s orElse %s" p1.label p2.label
    let f input =
      let r1 = run p1 input
      match r1 with
      | Success _ -> r1
      | Failed _ -> run p2 input
    { func = f; label = label }
  /// Infix of orElse
  let private (<|>) = orElse
  /// Choose from a list of Parsers
  let public choice pss = List.reduce (<|>) pss

  /// Higher Order Parser
  let public satisfy pred label =
    let f input =
      if String.IsNullOrEmpty(input) then Failed (label, "End of input")
      else
        let head = input.[0]
        if pred head then Success (head, input.[1..])
        else Failed (label, sprintf "Unexpected %c" head)
    { func = f; label = label }

  /// Parse a char
  let public pchar c =
    let label = sprintf "%c" c
    let f s =
      if String.IsNullOrEmpty(s) then Failed (label, "No more input")
      else
        let (head, tail) = (s.[0], s.[1..])
        if head = c then Success (head, tail)
        else Failed (label, (sprintf "Unexpected '%c'" head))
    { func = f; label = label }

  /// Parse a char
  let public pchar2 c =
    let pred ch = (ch = c)
    let label = sprintf "%c" c
    satisfy pred label

  /// Choose from a list of characters
  let public anyOf css =
    let label = sprintf "anyOf %A" css
    css |> List.map pchar |> choice <?> label

  let rec public sequence ps =
    let cons head tail = head::tail
    let consP = liftA2 cons
    match ps with
    | [] -> rtn []
    | head::tail -> consP head (sequence tail)

  let public pstring s =
    s
    |> List.ofSeq // convert to a list of char
    |> List.map pchar // convert to char parsers
    |> sequence // convert to Parser<char list>
    |> map (fun cs -> String(List.toArray cs)) // convert to Parser<string>

  /// Match zero or more occurences of the specified parser
  let rec private zeroOrMore p input =
    let result = run p input
    match result with
    | Failed (_, _) -> ([], input)
    | Success (value, remainder) ->
      // if Success, recurse for all subsequent values
      let (nextValue, nextRemainder) = zeroOrMore p remainder
      (value::nextValue, nextRemainder)

  /// Matches zero or more occurences of the specified parser
  let public many p =
    let label = sprintf "many %s" p.label
    let rec f input = Success (zeroOrMore p input)
    { func = f; label = label}
  /// Matches one or more occurences of the specified parser
  let public many1 p =
    let label = sprintf "many1 %s" p.label
    p >>= fun head -> many p >>= fun tail -> rtn (head::tail) <?> label

  /// Parses an optional occurrence of p and returns an option value.
  let public opt p =
    let label = sprintf "opt %s" p.label
    p <&> Some <|> rtn None <?> label

  /// Parses one or more occurrences of p separated by sep
  let public sepBy1 p sep =
    let sepThenP = sep >>. p
    p .>>. many sepThenP <&> fun (p, pList) -> p::pList
  /// Parses zero or more occurrences of p separated by sep
  let public sepBy p sep = sepBy1 p sep <|> rtn []

[<AutoOpen>]
module Operators =
  /// Infix version of setLabel
  let (<?>) = Parser.setLabel
  /// Infix of flatmapP
  let (>>=) p f = Parser.flatmap f p
  /// Reverse Infix of flatmapP
  let (=<<) = Parser.flatmap
  /// Infix of mapP
  let (<!>) = Parser.map
  /// Reverse Infix of mapP (piping)
  let (<&>) x f = Parser.map f x
  /// Infix of andThen
  let (.>>.) = Parser.andThen
  /// Keep only the result of the left side parser
  let (.>>) = Parser.andThenFst
  /// Keep only the result of the right side parser
  let (>>.) = Parser.andThenSnd
  /// Infix of orElse
  let (<|>) = Parser.orElse


Next Post
In the next post we'll look at how we undo the tight coupling in the base Parser combinator function; namely pchar.
 
Last edited:
Eliminating the tight coupling in pchar
All the combinators except from pchar are loosely coupled with their input; meaning they could easily be adapted to support the parsing of a binary format or a stream of tokens, etc.

To recap
The pchar combinator currently looks like this:
C#:
[<RequireQualifiedAccess>]
module Parser =
  ...

  /// Parse a char
  let public pchar c =
    let label = sprintf "%c" c
    let f s =
      if String.IsNullOrEmpty(s) then Failed (label, "No more input")
      else
        let (head, tail) = (s.[0], s.[1..])
        if head = c then Success (head, tail)
        else Failed (label, (sprintf "Unexpected '%c'" head))
    { func = f; label = label }


Decouple with a higher order combinator
The above pchar base combinator has character equality check baked in; tightly coupled to the input object. Considering the name pchar; a parser of characters; one would expect this.

However having pchar as our base combinator limits the input models that we can parse -- to address this limitation we're going to create a new loosely coupled base combinator.


satisfy combinator
C#:
module Parser =
  ...

  /// Higher Order Parser
  let public satisfy pred label =
    let f input =
      if String.IsNullOrEmpty(input) then Failed (label, "End of input")
      else
        let head = input.[0]
        if pred head then Success (head, input.[1..])
        else Failed (label, sprintf "Unexpected %c" head)
    { func = f; label = label }
The satisfy combinator unlike pchar takes in a pred (predicate function) which it internal uses to determine a parsed match. The second parameter is the custom label for the parser.


Rewriting pchar with satisfy
Using the satisfy combinator, we can simplify pchar:
C#:
module Parser =
  ...

  /// Parse a char
  let public pchar c =
    let pred ch = (ch = c)  // predicate
    let label = sprintf "%c" c  // custom label
    satisfy pred label

The satisfy higher order combinator also lets us write more computationally efficient versions of other parsers, for example:


Digit parser
Up till now we have defined a digit parser as logical disjunction of pchar parsers:
C#:
let pDigit = Parser.anyOf [ '0'..'9' ] <?> "Digit"

With satisfy we can make it more computationally efficient by rewriting it with a predicate function.
C#:
  let pDigit = Parser.satisfy Char.IsDigit "Digit"
Char.IsDigit is a predicate function that indicates whether a character is categorised as a decimal digit.


Whitespace, Letter and LetterOrDigit parser
The Char.IsWhiteSpace, Char.IsLetter, Char.IsLetterOrDigit, Char.IsLower and Char.IsUpper are predicate functions like Char.IsDigit.
C#:
  let pWhitespace = Parser.satisfy Char.IsWhiteSpace "Whitespace"
  let pLetter = Parser.satisfy Char.IsLetter "Letter"
  let pLetterOrDigit = Parser.satisfy Char.IsLetterOrDigit "LetterOrDigit"
  let pLower = Parser.satisfy Char.IsLower "Lowercase"
  let pUpper = Parser.satisfy Char.IsUpper "Uppercase"


Next Post
In the next post we'll look at how we can enhance the Parser to track the parsing offset (line / column) in order to make the reporting of errors more valuable.
 
Last edited:
Parser offset tracking
To track the parsing offset we'll have to replace the string input for record type that:
  • Internally tracks the offset
  • Incrementally serves up characters

Define an offset record type
First thing we need is a type to keep track of an input's X (column) and Y (row) offset.
C#:
type Offset = { X : int; Y : int }

Offset.initial, Offset.incX and Offset.incY
Second we need helper functions to set the initial state, and to increment the X and Y offsets.
C#:
module Offset =
  let initial = { X = 0; Y = 0 }
  let incX offset = { offset with X = offset.X + 1 }
  let incY offset = { X = 0; Y = offset.Y + 1 }


Define an ParserInput record type
We need an input type that merges both the input string and the parsing offset.
C#:
type ParserInput = { Offset : Offset; Lines : string [] }

Offset.create an input / request ParserInput.nextChar
We need helper functions to create an Input from a string and to request the next char from the Input.
C#:
module ParserInput =
  let create s =
    if String.IsNullOrEmpty(s) then { Lines = [||]; Offset = Offset.initial }
    else { Lines = s.Split([| "\r\n"; "\n" |], StringSplitOptions.None); Offset = Offset.initial }
  let current input = if input.Offset.Y < input.Lines.Length then input.Lines.[input.Offset.Y] else "EOF"
  let nextChar input =
    if input.Offset.Y >= input.Lines.Length then input, None
    else
      let line = current input
      if input.Offset.X < line.Length then { input with Offset = input.Offset |> Offset.incX }, Some line.[input.Offset.X]
      else { input with Offset = input.Offset |> Offset.incY }, Some '\n'

ParserInput.create function
The create function:
  • Takes a string and splits lines into a string[] delimited by CRLF or just LF; an empty string results in an empty string[] stored in the Lines property.
  • Assigns the initial offset to the Offset property.
ParserInput.current function
The current function retrieves the current line for processing; and returns "EOF" is the Y index is out of bounds i.e. end of file.

ParserInput.nextChar function
The nextChar function:
  • Checks if Offset.Y is still within bounds; if not it returns a tuple where the second value is None.
  • Checks if the Offset.X is less than the length of the line currently being processed; if so it increments the X offset; otherwise it increments the Y offset and assigns 0 to the X offset.
Note:
The return type of the nextChar function is a tuple; where the:
  • First value is the Input i.e. the type includes string[] of Lines and the parsing Offset in respect of Lines.
  • Second value is the char that is currently being served up for for parsing. When Input's Offset reaches the end of a line; the LF character that was removed during the split is injected and returned.


Testing our ParserInput type
C#:
open System
open Endofunk.Parser

[<EntryPoint>]
let main argv =
  let rec parseAll input =
    [
      match input |> ParserInput.nextChar with
      | _, None -> ()
      | tail, Some ch ->
        yield ch
        yield! parseAll tail
    ]

  "abcdefghi"
  |> ParserInput.create
  |> parseAll
  |> printfn "%A"

  "abc\ndef\r\nghi"
  |> ParserInput.create
  |> parseAll
  |> printfn "%A"

  ""
  |> ParserInput.create
  |> parseAll
  |> printfn "%A"
To test iterating over every char in our Input type; we've created a helper function called parseAll; which is a recursive function that yield's every char in Input in sequence.

The second code block pipes the string "abcdefghi" to the Input.create function; it then pipes the resulting Input type to the parseAll function and then the result of that iteration is piped to a printfn for output to the terminal / console. The same applies to the third and forth blocks; which test the parsing of a multiline and empty string.

The output from this is...
Code:
['a'; 'b'; 'c'; 'd'; 'e'; 'f'; 'g'; 'h'; 'i'; '\010']
['a'; 'b'; 'c'; '\010'; 'd'; 'e'; 'f'; '\010'; 'g'; 'h'; 'i'; '\010']
[]
Note:
The '\010' is the unicode hexadecimal char value for LF.


Next Post
We'll focus on adapting the Parser type to use both the new ParserInput type and to continue supporting the current string type input.

After the next post we'll have done enough to, for example, start writing a parser for either the Logo language and / or JSON parsing.
 
Last edited:
Progress Update
I'll be wrapping up the Parser combinator library in the next day or two; whilst it won't have everything to compete with a commercially ready library like FParsec; it is for the most part adequate for building parsers for a lot of the popular encoding formats, like: CSV, JSON, XML, HTML, etc.

The design approach of Parser mirrors a lot of the design concepts in FParsec including the naming of functions; which will make the transition to using FParsec or any other parser combinator that follows Haskell's Parsec library (on which FParsec was based), for example:

Example Parser
To close out this thread I'll be building an example parser(s); current options are either to build a JSON parser or a Logo language parser (including GUI in either Monogame or Xamarin.Forms) or both.

If you have any preference for the above (or something else in mind), let me know; naturally the work to build a JSON parser is more complex than building a simple parser for a limited Logo turtle graphics command set; especially if with JSON we try to more strictly follow the ECMA design specification, similarly summarised on json.org

The JSON Parser build would therefore focus primarily on deserialization to an AST and a simple print render of that; whereas the Logo parser would focus on parsing the Logo Turtle graphics syntax and rendering that to a GUI; syntax would be limited to the more frequently used commands, for example:
  • forward (fd), left (lt), right (rt), and repeat;
  • Which could easily be enhanced to cover back (bk), penup (pu), pendown (pd), setpensize, set-random-position
  • ...and even to support named procedures.

A future follow on to this thread could look at how to build an end to end custom language parser, interpreter and compiler, e.g. parsing a language file to an AST, and then converting that AST into either Java byte code or Microsoft's CIL code, and finally executing it.
 
Last edited:
Integrating ParserInput
The following is a complete listing of the Parser library including adaption of the Parser combinators to use our the new ParserInput type:
C#:
namespace Endofunk.Parser
open System

type Offset = { X : int; Y : int }

module Offset =
  let initial = { X = 0; Y = 0 }
  let incX offset = { offset with X = offset.X + 1 }
  let incY offset = { X = 0; Y = offset.Y + 1 }

type ParserInput = { Offset : Offset; Lines : string[] }

module ParserInput =
  let create s =
    if String.IsNullOrEmpty(s) then { Lines = [||]; Offset = Offset.initial }
    else { Lines = s.Split([| "\r\n"; "\n" |], StringSplitOptions.None); Offset = Offset.initial }
  let current input = if input.Offset.Y < input.Lines.Length then input.Lines.[input.Offset.Y] else "EOF"
  let nextChar input =
    if input.Offset.Y >= input.Lines.Length then input, None
    else
      let line = current input
      if input.Offset.X < line.Length then { input with Offset = input.Offset |> Offset.incX }, Some line.[input.Offset.X]
      else { input with Offset = input.Offset |> Offset.incY }, Some '\n'

type ParserOffset = { Line : string; X : int; Y : int }

module ParserOffset =
  let fromInput input = { Line = ParserInput.current input; X = input.Offset.X; Y = input.Offset.Y }
    
type public ParserLabel = string
type public ParserError = string

type public ParserResult<'a> =
  | Success of 'a
  | Failed of ParserLabel * ParserError * ParserOffset

type public Parser<'a> = { Func : ParserInput -> ParserResult<'a * ParserInput>; Label : ParserLabel }
  
[<RequireQualifiedAccess>]
module Parser =
  /// Print a parsed result
  let public print result =
    match result with
    | Success(value, input) -> printfn "%A" value
    | Failed(label, error, offset) ->
      printfn "Error parsing %s at line %i column %i\n%s\n%s" label offset.Y offset.X offset.Line (sprintf "%*s^%s" offset.X "" error)

  /// Run a parser with some input
  let public runInput p input = p.Func input
  /// Run the parser with a string as input
  let public run p str = runInput p (ParserInput.create str)

  /// Update the label in the parser
  let setLabel p label =
    let f input =
      match p.Func input with
      | Success s -> Success s
      | Failed(oldLabel, err, offset) -> Failed(label, err, offset)
    { Func = f; Label = label }
  /// Infix version of setLabel
  let private (<?>) = setLabel

  /// Monad conformance
  let public flatmap f p =
    let label = "Undefined"
    let f input =
      match runInput p input with
      | Failed(label, err, offset) -> Failed(label, err, offset)
      | Success(head, tail) -> runInput (f head) tail
    { Func = f; Label = label }
  /// Alternative function for flatmap
  let public bind = flatmap
  /// Infix of flatmap
  let private (>>=) p f = flatmap f p
  /// Lift a value into a Parser
  let public rtn x =
    let label = sprintf "%A" x
    let f input = Success(x, input)
    { Func = f; Label = label }

  /// Functor conformance
  let public map f = flatmap (f >> rtn)
  /// Infix of map
  let private (<!>) = map
  /// Infix Reverse of map (piping)
  let private (<&>) x f = map f x
  /// ALternative Infix Reverse of map (piping)
  let private (|>>) = (<&>)

  /// Applicative Functor conformance
  let public apply fP xP = fP >>= fun f -> xP >>= fun x -> rtn (f x)
  /// Infix of apply
  let private (<*>) = apply
  /// Lift a single parameter function into the Parser world and apply
  let public liftA fn aP = rtn fn <*> aP
  /// Lift a two parameter function into the Parser world and apply
  let public liftA2 fn aP bP = liftA fn aP <*> bP
  /// Lift a three parameter function into the Parser world and apply
  let public liftA3 fn aP bP cP = liftA2 fn aP bP <*> cP
  /// Lift a four parameter function into the Parser world and apply
  let public liftA4 fn aP bP cP dP = liftA3 fn aP bP cP <*> dP
  /// Lift a five parameter function into the Parser world and apply
  let public liftA5 fn aP bP cP dP eP = liftA4 fn aP bP cP dP <*> eP
  /// Lift a six parameter function into the Parser world and apply
  let public liftA6 fn aP bP cP dP eP fP = liftA5 fn aP bP cP dP eP <*> fP

  /// Combine two parsers (logical conjunction)
  let public andThen p1 p2 =
    let label = sprintf "%s andThen %s" p1.Label p2.Label
    p1 >>= fun r1 -> p2 >>= fun r2 -> rtn (r1, r2) <?> label
  /// Infix of andThen
  let private (.>>.) = andThen
  /// Keep only the result of the left side parser
  let andThenFst p1 p2 = p1 .>>. p2 |> map (fun (a, b) -> a)
  /// Infix of andThenFst
  let private (.>>) = andThenFst
  /// Keep only the result of the right side parser
  let andThenSnd p1 p2 = p1 .>>. p2 |> map (fun (a, b) -> b)
  /// Infix of andThenSnd
  let private (>>.) = andThenSnd
  /// Keep only the result of the middle parser
  let public between p1 p2 p3 = p1 >>. p2 .>> p3

  /// Combine two parsers (logical disjunction)
  let public orElse p1 p2 =
    let label = sprintf "%s orElse %s" p1.Label p2.Label
    let f input =
      let r1 = runInput p1 input
      match r1 with
      | Success _ -> r1
      | Failed _ -> runInput p2 input
    { Func = f; Label = label }
  /// Infix of orElse
  let private (<|>) = orElse
  /// Choose from a list of Parsers
  let public choice pss = List.reduce (<|>) pss

  /// Higher Order Parser
  let public satisfy pred label =
    let f input =
      match ParserInput.nextChar input with
      | _, None -> Failed(label, "End of Input", ParserOffset.fromInput input)
      | tail, Some head ->
        if pred head then Success(head, tail)
        else Failed(label, sprintf "Unexpected '%c'" head, ParserOffset.fromInput input)
    { Func = f; Label = label}
  
  let rec public sequence ps =
    let cons head tail = head :: tail
    let consP = liftA2 cons
    match ps with
    | [] -> rtn []
    | head :: tail -> consP head (sequence tail)

  /// Match zero or more occurrences of the specified parser
  let rec private zeroOrMore p input =
    match runInput p input with
    | Failed(_, _, _) -> [], input
    | Success(head, tail) ->
      let (nextHeads, remainingTail) = zeroOrMore p tail
      (head :: nextHeads, remainingTail)

  /// Matches zero or more occurrences of the specified parser
  let public many p =
    let label = sprintf "many %s" p.Label
    let rec f input = Success(zeroOrMore p input)
    { Func = f; Label = label }
  /// Matches one or more occurrences of the specified parser
  let public many1 p =
    let label = sprintf "many1 %s" p.Label
    p >>= fun head -> many p >>= fun tail -> rtn (head :: tail) <?> label

  /// Parses an optional occurrence of p and returns an option value.
  let public opt p =
    let label = sprintf "opt %s" p.Label
    p <&> Some <|> rtn None <?> label

  let optToString opt f =
    match opt with
    | None -> ""
    | Some x -> f x

  /// Parses one or more occurrences of p separated by sep
  let public sepBy1 p sep =
    let sepThenP = sep >>. p
    p .>>. many sepThenP <&> fun (p, pList) -> p :: pList
    <?> "sepBy1"
  /// Parses zero or more occurrences of p separated by sep
  let public sepBy p sep = sepBy1 p sep <|> rtn [] <?> "sepBy"

  // Type parsers
  // -------------------------------------------------------------------
  /// Parse a char
  let public pchar c =
    let pred ch = (ch = c)
    let label = sprintf "%c" c
    satisfy pred label

  /// Choose from a list of characters
  let public anyOf cs =
    let label = sprintf "anyOf %A" cs
    cs
    |> List.map pchar
    |> choice
    <?> label

  /// Convert char list to string
  let private charListToString cs = String(List.toArray cs)

  /// Parses a sequence of zero or more chars
  let manyChars p = many p <&> charListToString <?> "manyChars"
  /// Parses a sequence of one or more chars
  let many1Chars p = many1 p <&> charListToString <?> "many1Chars"

  /// Parse a specific string
  let public pstring s =
    s
    |> List.ofSeq
    |> List.map pchar
    |> sequence
    |> map charListToString
    <?> s

[<AutoOpen>]
module Operators =
  /// Infix version of setLabel
  let (<?>) = Parser.setLabel
  /// Infix of flatmap
  let (>>=) p f = Parser.flatmap f p
  /// Reverse Infix of flatmap
  let (=<<) = Parser.flatmap
  /// Infix of map
  let (<!>) = Parser.map
  /// Reverse Infix of map (piping)
  let (<&>) x f = Parser.map f x
  // Alternative Reverse Infix of <&>
  let (|>>) = (<&>)
  /// Infix of andThen
  let (.>>.) = Parser.andThen
  /// Keep only the result of the left side parser
  let (.>>) = Parser.andThenFst
  /// Keep only the result of the right side parser
  let (>>.) = Parser.andThenSnd
  /// Infix of orElse
  let (<|>) = Parser.orElse

Adjustments were made to the following functions:
  • Changed ParserResult to include the ParserOffset for failure tracking.
  • Changed the print function to support the printing of ParserOffset for failures.
  • Created a new runInput function to run a Parser that uses the new ParserInput type.
  • Adjusted the run function to call the runInput function by taking the string input parameter and converting it to the new ParserInput.
  • Changed the setLabel function to support the ParserOffset
  • Changed flatmap to include the ParserOffset for failures, and to use runInput for the success computation.
  • Changed the orElse function to use runInput for failure computation.
  • Changed the satisfy function to use ParserInput as input and to include the ParserOffset for failures.
  • Changed zeroOrMore function to match against runInput.
 
Last edited:
Library Overview
The Parser combinators are currently a total of 159 lines excluding the comments and blank lines. This will get longer once we start to build the JSON parser, because a number of the sub parsers that will be required are considered general use operations i.e. we're more likely to reuse these when parsing another format, for example:
  • parsing of spaces, whitespace. hex numbers, digits
  • etc...

Next Post
The next post will focus on building a JSON parser that adheres to the ECMA-404 The JSON Data Interchange Standard; which is also summarised on json.org. I'll also btw be using the JSON examples on the same site to validate the completeness of the parser.

The build of the JSON parser will focus only on parsing from a JSON string to an abstract syntax tree (AST). Writing an interpreter of the AST and / or using .Net's reflection library to load this into another data structure (e.g. record / class / struct) is left up the reader.

I instead plan to use Logo turtle graphics language to demonstrate the end to end process of parsing to an AST and then writing an interpreter of that AST to render the result to the screen.
 
Last edited:
JSON Parser
The JSON format is summarised by the following syntax diagram for value i.e. a value can be any one of the following:
  • string
  • number
  • object (a collection of keys and their associated values)
  • array (a collection of other values)
  • true / false (a bool)
  • null
  • ...and whitespace is permissible surrounding any values.

770108



Defining an AST for a JSON value
The JSON value type is a recursive type; because both the array and the object type's value recursively refer back the JSON value type.

An abstract syntax tree is the obvious choice for storing this; i.e. F#'s discriminated union type, a sum type similar to C#'s enum except that it provides for storing of associated values with each of the enum's member values.

C#:
type JValue =
  | JString of string
  | JNumber of float
  | JObject of Map<string, JValue>
  | JArray of JValue list
  | JBool of bool
  | JNull

Our build process will follow this structure i.e. we'll build JString first, then JNumber and so on.


JString parser
From json.org
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.

The following syntax diagram summarises the specification for the JString parser.
770106
In typical FP fashion; we'll break this seemingly complex syntax diagram into its constituent parts; representing each as a sub parsing function that'll we composite together to create the JString parser:


Any codepoint except " or \
A quote or reverse solidus is not acceptable on its own; hence we define a parser that negates any match to either a single reverse solidus (unicode 005c) or a single quote (unicode 0022); i.e. we don't want a match for quote or reverse solidus. Any other unicode character is acceptable and returns a ParserResult of Success.
C#:
module JSON =
  // --- Parse a JString ---
  let private unescapedChars = ['\u005c'; '\u0022']
  let private pUnescapedChar = Parser.satisfy (fun c -> unescapedChars |> List.contains c |> not) "Unescaped Character"


Escaped Characters
We define a parser for escaped characters i.e. control characters preceded by a reverse solidus. We have defined an list of tuples where the first value is what we expect to find in JSON input string, and the second value is what we want the parser to record. The Parser.choice combinator turns this into a composition of orElse parsers.
C#:
module JSON =
...
  let private escapedChars = ["\\\"", '\u0022'; "\\\\", '\u005c'; "\\/", '\u002f'; "\\b", '\u0008'; "\\f", '\u000c'; "\\n", '\u000a'; "\\r", '\u000d'; "\\t", '\u0009']
  let private pEscapedChar = escapedChars |> List.map (fun (c, r) -> Parser.pstring c >>% r) |> Parser.choice <?> "Escaped Character"
The >>% infix operator is used to replaced the first value in the tuple with the second (i.e. c -> r). This operator is defined as follows as will be added to the Parser library's Operator module.
C#:
[<AutoOpen>]
module Operators =
...
  /// Applies parser p ignores the result and return x
  let (>>%) p x = p <&> fun _ -> x
Basically we have an operator that will ignore search value returned by the parser and replace it with unicode character we want to record, for example:
  • Our parser searches for "\\\""; an escape quote, returning that as result value; which we replace with '\u0022'; the unicode hex value for a quote character.


Unicode Characters
A unicode character is preceded by a reverse solidus \ and the character u and then 4 hexadecimal digits. The pId parser defines the prefix.
The pHex parser defines a orElse parser of the acceptable hexadecimals digits and letters.

convert is a helper function to transform the result of the following parser composition pId >>. pHex .>>. pHex .>>. pHex .>>. pHex to a character i.e. a set of tuples wrapped in a tuple wrapped in another tuple. The (((h1, h2), h3), h4) parameter syntax serves to splat the tupled values as variables h1 to h4; which we then glue together using sprintf and parse to a Int32 using the globalization string style for a hex number; finally we pipe that through to char; which convert an Int32 to a character.
C#:
module JSON =
...
  let private pUnicodeChar =
    let pId = Parser.pchar '\\' >>. Parser.pchar 'u'
    let pHex = Parser.anyOf (['0'..'9'] @ ['A'..'F'] @ ['a'..'f'])
    let convert (((h1, h2), h3), h4) = Int32.Parse(sprintf "%c%c%c%c" h1 h2 h3 h4, Globalization.NumberStyles.HexNumber) |> char
    pId >>. pHex .>>. pHex .>>. pHex .>>. pHex <&> convert <?> "Unicode Character"


Quote Encapsulated Composition of Characters
Next we create a parser that composites together the previous three parsers using the orElse combinator. Finally we return a parser prefixed and suffix by a quote character that we ignore -- the >>. ignores the quote on the left and the .>> ignores the quote on the right; we only return the result of Parser.manyChars pCharacters. The Parser.manyChars matches zero or more occurrences of the pCharacters parser.
C#:
module JSON =
...
  let private pQuotedString =
    let pQuote = Parser.pchar '\"' <?> "Quote"
    let pCharacters = pUnescapedChar <|> pEscapedChar <|> pUnicodeChar
    pQuote >>. Parser.manyChars pCharacters .>> pQuote


JString parser tied to JValue
Finally we take our pQuotedString and return it as a JString encapsulation the parsed string value i.e. a member value of the JValue discriminated union type we declared above.
C#:
module JSON =
...
  let private jString = pQuotedString <&> JString <?> "JString"
 
Last edited:
Top
Sign up to the MyBroadband newsletter
X