Discussion:
Is cl-yacc going to cut it?
Matthew D. Swank
2011-02-04 06:33:24 UTC
I suppose this is only marginally related to Common Lisp, but everything
I'm talking about is written in Common Lisp.

I use cl-yacc for a lot of parsing, but one thing that has always seemed
harder than it needs to be is writing lexers to feed it. One thing I've
found helpful is creating a custom lexer for each parser state by making
the action-table entry for that state available to the lexer. This
provides the terminals the parser is looking for, and narrows the tokens
the lexer has to consider at each step. However, this means I am also
maintaining my own fork of cl-yacc.

It seems (from my admittedly limited search) that this is not a common
modification of yacc. Before I start bugging the maintainer about my
changes, I want to know: am I abusing yacc?

I do like the separation between low-level tokenization and the
higher-level symbol parse, but is there another general parsing
technique, available as a Lisp library of course, that either works at a
lower level than yacc usually does or allows the lexer to access more
context about the parse?

Matt
Raymond Wiker
2011-02-04 07:21:46 UTC
I've used cl-yacc exactly once, and chose to implement "start
conditions" in my lexer: I use a closed-over variable to restrict the
set of patterns that I want to match in the lexer. I set and test the
start condition in the lexer only, but it would certainly be possible
to modify the start condition from parser rules, too.
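
In outline, something like this (illustrative only; the conditions and
the token set here are made up):

(defun make-conditional-lexer (input)
  "Return a lexer closure whose closed-over CONDITION restricts which
patterns it tries, in the spirit of lex/flex start conditions."
  (let ((pos 0)
        (condition :code))           ; the closed-over start condition
    (lambda ()
      (loop
        (when (>= pos (length input))
          (return (values nil nil))) ; end of input
        (let ((ch (char input pos)))
          (ecase condition
            (:code
             (cond ((char= ch #\")   ; opening quote: switch condition
                    (incf pos)
                    (setf condition :string))
                   ((alpha-char-p ch) ; an identifier token
                    (let* ((end (or (position-if-not #'alphanumericp input
                                                     :start pos)
                                    (length input)))
                           (tok (subseq input pos end)))
                      (setf pos end)
                      (return (values 'id tok))))
                   (t (incf pos))))  ; skip anything else
            (:string                 ; inside a string: different patterns
             (let* ((end (or (position #\" input :start pos)
                             (length input)))
                    (tok (subseq input pos end)))
               (setf pos (min (length input) (1+ end)))
               (setf condition :code)
               (return (values 'string tok))))))))))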

I'll definitely use cl-yacc again, if the need arises - it's another
of those good-quality, low-profile Lisp libraries that "just work"
(tm).

[I use cl-yacc to implement a filter parser for trivial-ldap; the one
supplied with trivial-ldap is somewhat unconventional.]
Post by Matthew D. Swank
I suppose this is only marginally related to Common Lisp, but everything
I'm talking about is written in Common Lisp.
I use cl-yacc for a lot of parsing, but one thing that has always seemed
harder than it needs to be is writing lexers to feed it. One thing I've
found helpful is creating a custom lexer for each parser state by making
the action-table entry for that state available to the lexer. This
provides the terminals the parser is looking for, and narrows the tokens
the lexer has to consider at each step. However, this means I am also
maintaining my own fork of cl-yacc.
It seems (from my admittedly limited search) that this is not a common
modification of yacc. Before I start bugging the maintainer about my
changes, I want to know: am I abusing yacc?
I do like the separation between low-level tokenization and the
higher-level symbol parse, but is there another general parsing
technique, available as a Lisp library of course, that either works at a
lower level than yacc usually does or allows the lexer to access more
context about the parse?
Matt
Matthew D. Swank
2011-02-05 03:55:56 UTC
Yeah, that is basically what I'm doing: driving start conditions from
the parser. One hint that unnecessary code duplication was going on was
that the state in some of my older lexers started looking like shift
entries in the parser's action table.


Matt
Post by Raymond Wiker
I've used cl-yacc exactly once, and chose to implement "start
conditions" in my lexer
Matthew D. Swank
2011-02-04 07:33:56 UTC
It does look nice, and parser combinators are very cool, but I scrape a
lot of semi-structured documents. Sometimes it's hard to hold all of a
document in memory, which is a current requirement of
cl-parser-combinators.

Thank you though,

Matt
Post by Cyrus Harmon
I've been very pleased with cl-parser-combinators. Not sure what you're
trying to parse, but it's pretty flexible and powerful. I've used it for
parsing a printed representation of molecules (SMILES strings) and have
found it to be a pleasure to work with.
Cyrus
Post by Matthew D. Swank
I suppose this is only marginally related to Common Lisp, but everything
I'm talking about is written in Common Lisp.
I use cl-yacc for a lot of parsing, but one thing that has always seemed
harder than it needs to be is writing lexers to feed it. One thing I've
found helpful is creating a custom lexer for each parser state by making
the action-table entry for that state available to the lexer. This
provides the terminals the parser is looking for, and narrows the tokens
the lexer has to consider at each step. However, this means I am also
maintaining my own fork of cl-yacc.
It seems (from my admittedly limited search) that this is not a common
modification of yacc. Before I start bugging the maintainer about my
changes, I want to know: am I abusing yacc?
I do like the separation between low-level tokenization and the
higher-level symbol parse, but is there another general parsing
technique, available as a Lisp library of course, that either works at a
lower level than yacc usually does or allows the lexer to access more
context about the parse?
Matt
Cyrus Harmon
2011-02-04 07:14:45 UTC
I've been very pleased with cl-parser-combinators. Not sure what you're
trying to parse, but it's pretty flexible and powerful. I've used it for
parsing a printed representation of molecules (SMILES strings) and have
found it to be a pleasure to work with.

Cyrus
Post by Matthew D. Swank
I suppose this is only marginally related to Common Lisp, but everything
I'm talking about is written in Common Lisp.
I use cl-yacc for a lot of parsing, but one thing that has always seemed
harder than it needs to be is writing lexers to feed it. One thing I've
found helpful is creating a custom lexer for each parser state by making
the action-table entry for that state available to the lexer. This
provides the terminals the parser is looking for, and narrows the tokens
the lexer has to consider at each step. However, this means I am also
maintaining my own fork of cl-yacc.
It seems (from my admittedly limited search) that this is not a common
modification of yacc. Before I start bugging the maintainer about my
changes, I want to know: am I abusing yacc?
I do like the separation between low-level tokenization and the
higher-level symbol parse, but is there another general parsing
technique, available as a Lisp library of course, that either works at a
lower level than yacc usually does or allows the lexer to access more
context about the parse?
Matt
Paul Tarvydas
2011-02-04 14:39:24 UTC
Post by Matthew D. Swank
symbol parse, but is there another general parsing technique, available
as a Lisp library of course, that either works at a lower level than
yacc usually does or allows the lexer to access more context about the
parse?
The relatively new PEG packrat parser technologies make it possible to
use just one universal description for both scanning and parsing. I see
that cl-peg exists, but I haven't tried it out.

[I have also built for myself a variant of Holt's S/SL (Syntax/Semantic
Language) that allows preserving whitespace and changing scanners on the
fly (which allows parsing different languages embedded in one another),
but I have not taken the time to polish it enough for release.]

pt
Nikodemus Siivola
2011-02-04 15:20:39 UTC
Post by Paul Tarvydas
The relatively new PEG packrat parser technologies make it possible
to use just one universal description for both scanning and
parsing. I see that cl-peg exists, but I haven't tried it out.
Esrap is another packrat parser for CL:

https://github.com/nikodemus/esrap

I had to parse some semi-structured text and wrote Esrap for that. Its
primary limitations are a lack of support for parsing from streams (it
wants a string) and very little documentation.
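
The rules themselves are just s-expressions; a toy example (the grammar
is made up for illustration):

(esrap:defrule whitespace (+ (or #\space #\tab #\newline))
  (:constant nil))

(esrap:defrule natural (+ (digit-char-p character))
  (:lambda (digits)
    (parse-integer (esrap:text digits))))

;; (esrap:parse 'natural "123") => 123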

Cheers,

-- Nikodemus
Thomas M. Hermann
2011-02-04 15:31:14 UTC
I am absolutely biased towards meta-sexp:

"A META parser generator using LL(1) grammars with s-expressions."

https://github.com/vy/meta-sexp

It seems dirt simple to use, at least to me, and the performance has
been acceptable.

Regards,

~ Tom
----------------------------------------------------------------
Thomas M. Hermann
Odonata Research LLC
http://www.odonata-research.com/
http://www.linkedin.com/in/thomasmhermann


On Fri, Feb 4, 2011 at 9:20 AM, Nikodemus Siivola
Post by Nikodemus Siivola
Post by Paul Tarvydas
The relatively new PEG packrat parser technologies make it possible
to use just one universal description for both scanning and
parsing. I see that cl-peg exists, but I haven't tried it out.
https://github.com/nikodemus/esrap
I had to parse some semi-structured text and wrote Esrap for that. Its
primary limitations are a lack of support for parsing from streams (it
wants a string) and very little documentation.
Cheers,
-- Nikodemus
Matthew D. Swank
2011-02-05 04:22:34 UTC
Meta does seem like a parser for people allergic to parsing. Of course
it has a great Lisp pedigree as well. I am biased by early exposure to
Ken Thompson's work on regular expressions, and to LALR parsers.

Matt
Post by Thomas M. Hermann
"A META parser generator using LL(1) grammars with s-expressions."
https://github.com/vy/meta-sexp
It seems dirt simple to use, at least to me, and the performance has
been acceptable.
Regards,
~ Tom
----------------------------------------------------------------
Thomas M. Hermann
Odonata Research LLC
http://www.odonata-research.com/
http://www.linkedin.com/in/thomasmhermann
On Fri, Feb 4, 2011 at 9:20 AM, Nikodemus Siivola
Post by Paul Tarvydas
The relatively new PEG packrat parser technologies make it possible
to use just one universal description for both scanning and
parsing. I see that cl-peg exists, but I haven't tried it out.
https://github.com/nikodemus/esrap
I had to parse some semi-structured text and wrote Esrap for that. Its
primary limitations are a lack of support for parsing from streams (it
wants a string) and very little documentation.
Cheers,
-- Nikodemus
Matthew D. Swank
2011-02-05 04:16:20 UTC
I had heard about cl-peg, but not Esrap. Most of the time I can get
away with the space requirements needed by both parser combinators and
packrat parsers, but there are some large documents where that isn't
practical.

One thing I would like to eventually see is a mature library that can
take a spec like a PEG and generate different types of parsers for it.

Matt
Post by Nikodemus Siivola
Post by Paul Tarvydas
The relatively new PEG packrat parser technologies make it possible
to use just one universal description for both scanning and
parsing. I see that cl-peg exists, but I haven't tried it out.
https://github.com/nikodemus/esrap
I had to parse some semi-structured text and wrote Esrap for that. Its
primary limitations are a lack of support for parsing from streams (it
wants a string) and very little documentation.
Cheers,
-- Nikodemus
Pascal J. Bourguignon
2011-02-04 16:06:04 UTC
Post by Paul Tarvydas
Post by Matthew D. Swank
symbol parse, but is there another general parsing technique, available
as a Lisp library of course, that either works at a lower level than
yacc usually does or allows the lexer to access more context about the
parse?
The relatively new PEG packrat parser technologies make it possible to
use just one universal description for both scanning and parsing. I see
that cl-peg exists, but I haven't tried it out.
Well, if we may distract the OP from cl-yacc, I'll note that Zebu also
contains an (unoptimized) lexer (it just uses regexps naively to match
tokens).

I would also note that, given that context-free languages include regular
languages, there's little theoretical point in distinguishing a lexer
from a parser: you can describe the tokens using normal grammar rules.

space-or-comment := { space | comment } .
comment          := '#' '|' { comment-chars } '|' '#' .
comment-chars    := '|' not-sharp | not-pipe .
not-sharp        := space | letter | digit | '|' | '+' | '-' | ... .
not-pipe         := space | letter | digit | '#' | '+' | '-' | ... .
identifier       := space-or-comment letter { digit | letter } .

etc., so basically the only lexer you need is READ-CHAR.
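
For instance, the identifier rule above (minus the space-or-comment
prefix) can be parsed straight off a stream, with READ-CHAR as the
entire lexer:

(defun parse-identifier (stream)
  "identifier := letter { digit | letter }"
  (let ((first (read-char stream nil nil)))
    (unless (and first (alpha-char-p first))
      (error "Expected a letter, got ~S" first))
    (with-output-to-string (name)
      (write-char first name)
      ;; Keep consuming letters and digits; stop at anything else.
      (loop for next = (peek-char nil stream nil nil)
            while (and next (alphanumericp next))
            do (write-char (read-char stream) name)))))

;; (with-input-from-string (s "foo42 bar") (parse-identifier s))
;; => "foo42"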
--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.
Scott L. Burson
2011-02-05 21:29:39 UTC
On Fri, Feb 4, 2011 at 8:06 AM, Pascal J. Bourguignon
Post by Pascal J. Bourguignon
I would also note that, given that context-free languages include regular
languages, there's little theoretical point in distinguishing a lexer
from a parser: you can describe the tokens using normal grammar rules.
space-or-comment := { space | comment } .
comment          := '#' '|' { comment-chars } '|' '#' .
comment-chars    := '|' not-sharp | not-pipe .
not-sharp        := space | letter | digit | '|' | '+' | '-' | ... .
not-pipe         := space | letter | digit | '#' | '+' | '-' | ... .
identifier       := space-or-comment letter { digit | letter } .
etc., so basically the only lexer you need is READ-CHAR.
Except that as a matter of notational convenience, we normally allow
lexical grammars to be ambiguous, and apply the additional
disambiguating rule that the longest possible match at each point in
the input is to be preferred over shorter ones (thus the token "ifx"
is not the token "if" followed by the token "x"). This rule is not
directly expressible in the context-free formalism. Also, as shown by
your example, expressing negation is tedious -- you have to enumerate
the complement of the set of characters you wish to exclude.

Boolean grammars solve both these problems. Let's look at how they work.

The key insight is that we consider nonterminal definitions to be
operations on predicates over strings. The definition

A ::= B | C

is fundamentally a disjunction: a string is an A if it's a B, or if
it's a C. Why not also allow the other boolean operations of
conjunction and negation? This is what boolean grammars do. So you
can have rules like

A ::= B & ~C

which says that a string is an A if it's a B _and_ it's not a C.

If you consider the input to consist not of single characters but of
pairs of (a character and its following character), you can express
the longest-match rule. In practice, rather than making you do this
explicitly, implementations like SBP provide a "followed by" notation
that allows you to constrain the success of a production on the
immediately following character. Thus you can write a production for
"`if' followed by a token break character" fairly succinctly.

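In a rough notation (not SBP's actual syntax), the example might be
written:

token-break ::= space | '(' | ')' | ';' | ... .
if-keyword  ::= "if" [followed by token-break] .

where the bracketed constraint examines the next character without
consuming it, so "ifx" fails to parse as if-keyword because 'x' is not
a token break.
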
See Megacz's paper, which you can download from here:
http://www.cs.berkeley.edu/~megacz/research/

-- Scott
Samium Gromoff
2011-02-14 10:43:16 UTC
Post by Scott L. Burson
Boolean grammars solve both these problems. Let's look at how they work.
The key insight is that we consider nonterminal definitions to be
operations on predicates over strings. The definition
A ::= B | C
is fundamentally a disjunction: a string is an A if it's a B, or if
it's a C. Why not also allow the other boolean operations of
conjunction and negation? This is what boolean grammars do. So you
can have rules like
A ::= B & ~C
which says that a string is an A if it's a B _and_ it's not a C.
If you consider the input to consist not of single characters but of
pairs of (a character and its following character), you can express
the longest-match rule. In practice, rather than making you do this
explicitly, implementations like SBP provide a "followed by" notation
that allows you to constrain the success of a production on the
immediately following character. Thus you can write a production for
"`if' followed by a token break character" fairly succinctly.
I'm surprised that nobody has mentioned OMeta/ometa2 yet.

It has this boolean functionality you speak of:

http://subvert-the-dominant-paradigm.net/blog/?p=38

It's unfortunate that the author wasn't very experienced with CL, so his
implementation doesn't feel idiomatic and the code isn't always clean.

Besides this, and the somewhat idiosyncratic syntax of OMeta grammars
(which is, to a degree, by necessity), it's a very interesting framework.

I've got modifications adding match tracing and source-location support,
which I've never gotten around to submitting to the author.

I'd guess the next step would be converting it to use sexps for grammar
descriptions.
--
regards,
Samium Gromoff
--
"Actually I made up the term 'object-oriented', and I can tell you I
did not have C++ in mind." - Alan Kay (OOPSLA 1997 Keynote)
Scott L. Burson
2011-02-14 20:26:03 UTC
On Mon, Feb 14, 2011 at 2:43 AM, Samium Gromoff
Post by Samium Gromoff
I'm surprised that nobody has mentioned OMeta/ometa2 yet.
I wasn't familiar with it. Based on a quick glance I don't think it
can handle boolean grammars in their full generality. It's PEG-based,
and as I understand, PEGs cannot handle all context-free grammars.
The full boolean grammar formalism, on the other hand, is a strict
superset of context-free.

The big difference in practice is what is happening at runtime. PEGs
are deterministic; SBP, on the other hand, is GLR-based, meaning that
it is constructing multiple possible parse trees simultaneously,
discarding the ones that don't work out. If a string is ambiguous
under the given grammar, GLR will actually return all the possible
parses (there's a reasonably compact representation for this called a
"parse forest"). PEG by definition will only ever give one parse.

Of course, either of those could be a desirable property, depending on
what you're trying to parse and why.

-- Scott
William Halliburton
2011-02-14 22:26:00 UTC
I know I'm late to the parade, but if you need to write grammars that
can be easily augmented by other people without their needing to know
much, if any, Lisp, I can recommend the EBNF parser written by Daniel
Herring. Its onion of macros also expands into pretty understandable
code.

http://git.androdna.com/?p=lisp/ebnf-parser
Daniel Herring
2011-02-15 03:27:37 UTC
Scott L. Burson
2011-02-04 20:39:40 UTC
Post by Matthew D. Swank
It seems (from my admittedly limited search) that this is not a common
modification of yacc.  Before I start bugging the maintainer about my
changes, I want to know: am I abusing yacc?
I've had to do that kind of thing for parsing languages like Cobol
that were designed before the advent of formal parsing theory.

It is an abuse in the sense that it makes it harder to say formally
exactly what language you're parsing, but hey, you do what you have to
do in this business :-)

My own pet parser generator project is a CL reimplementation of Adam
Megacz's Scannerless Boolean Parser:
http://research.cs.berkeley.edu/project/sbp/

Scannerless parsing obviates the kind of games you're having to play
by integrating the lexer into the grammar. Boolean grammars are more
expressive than context-free grammars. Both of these things are cool.
What you don't get in this framework, though, is a proof that your
grammar is unambiguous.

My reimplementation is not far enough along to release, alas, nor do I
really have any time to work on it. Maybe later this year...

-- Scott
Daniel Weinreb
2011-02-04 20:59:25 UTC
So the next time anyone says that there aren't any
libraries for Common Lisp, we can reply that
there are so many good parser libraries that
one must compare notes to figure out which is
best for which situation. So there, ye of little
faith! :)

-- Dan
Post by Scott L. Burson
Post by Matthew D. Swank
It seems (from my admittedly limited search) that this is not a common
modification of yacc. Before I start bugging the maintainer about my
changes, I want to know: am I abusing yacc?
I've had to do that kind of thing for parsing languages like Cobol
that were designed before the advent of formal parsing theory.
It is an abuse in the sense that it makes it harder to say formally
exactly what language you're parsing, but hey, you do what you have to
do in this business :-)
My own pet parser generator project is a CL reimplementation of Adam
Megacz's Scannerless Boolean Parser:
http://research.cs.berkeley.edu/project/sbp/
Scannerless parsing obviates the kind of games you're having to play
by integrating the lexer into the grammar. Boolean grammars are more
expressive than context-free grammars. Both of these things are cool.
What you don't get in this framework, though, is a proof that your
grammar is unambiguous.
My reimplementation is not far enough along to release, alas, nor do I
really have any time to work on it. Maybe later this year...
-- Scott
Matthew D. Swank
2011-02-05 04:32:30 UTC
In fact I should have consolidated my replies, but by the time I
realized I was probably spamming the list, I was too far in. I used to
get the impression that writing parsers in Lisp was like moonshining:
lots of people had rough-and-ready parsers out in the backwoods, but
didn't really like talking about them.

I am grateful for, and a little overwhelmed by, all the responses here.

Thank you,

Matt
Post by Daniel Weinreb
So the next time anyone says that there aren't any
libraries for Common Lisp, we can reply that
there are so many good parser libraries that
one must compare notes to figure out which is
best for which situation.
Matthew D. Swank
2011-02-05 04:26:27 UTC
I would be very interested to see that if it ever becomes something
you'd be comfortable releasing.
Post by Scott L. Burson
My own pet parser generator project is a CL reimplementation of Adam
Megacz's Scannerless Boolean Parser:
http://research.cs.berkeley.edu/project/sbp/
Matthew D. Swank
2011-02-07 03:15:36 UTC
Post by Scott L. Burson
It is an abuse in the sense that it makes it harder to say formally
exactly what language you're parsing, but hey, you do what you have to
do in this business :-)
I ended up writing a lexer start-condition generator that derives its
state machine from the compiled yacc parser. So the point is moot, I
guess.
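
In outline (a sketch; ACTION-TABLE-ROWS is a stand-in for the accessor
my fork exposes on the compiled parser):

(defun parser-start-conditions (parser)
  "Map each parser state to the terminals on which it has an action
entry; each such set becomes a start condition for the generated lexer."
  (loop for state from 0
        for row in (action-table-rows parser) ; hypothetical accessor
        ;; ROW is the state's alist of (terminal . action) entries.
        collect (cons state (mapcar #'car row))))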

Matt