



© 2006-2013 Andrew Cooke (site) / post authors (content).

Representing Indentations for Parsing

From: andrew cooke <andrew@...>

Date: Sun, 30 Aug 2009 13:06:31 -0400

What's the best way to represent indentation information for parsing?
There are two sides to this question.  First, from a UI perspective -
how best to support clear, concise specification of parsers for a wide
variety of different approaches.  Second, from an implementation
perspective - how to do that efficiently within LEPL.

Taking the second first, LEPL now supports filtering of streams, but
these are costly in terms of memory (apart from an initial filtering
on the entire stream).  That means that context-sensitive filtering of
streams is possible, but better avoided.  A more efficient, but less
flexible, approach is to alter "global" state in a way that depends on
scope - this can be done with "monitors" and has constant time and
space costs.
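A minimal sketch of the "monitor" idea follows: scoped mutation of "global" indentation state with constant time and space cost.  The class and method names are illustrative only, not LEPL's actual monitor API.

```python
# Scoped "global" state for indentation: each scope entry/exit is O(1),
# and no copy of the stream is ever made.  Illustrative names only.

class IndentMonitor:
    def __init__(self):
        self.level = 0            # the "global" indentation level

    def push(self):
        # entering an indented scope: O(1)
        self.level += 1

    def pop(self):
        # leaving the scope restores the previous level: O(1)
        self.level -= 1

monitor = IndentMonitor()
monitor.push()
assert monitor.level == 1
monitor.pop()
assert monitor.level == 0
```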

One issue with UI seems to revolve around finding a good
representation of the data in the stream.  In particular, avoiding
having tokens related to indentation in the stream where they are not
needed (for example, in Python, between parentheses indentation is not
significant).  If - as in that example - this is restricted to a small
fraction of the stream, then filtering is a reasonable approach (since
memory usage is not a problem).
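As a rough illustration of that kind of filtering, here is a generator that drops indentation tokens whenever the stream is inside parentheses.  The token kinds and (kind, value) pairs are invented for the sketch; they are not LEPL's stream representation.

```python
# Drop INDENT tokens while inside parentheses, where (as in Python)
# indentation is not significant.  Token names are hypothetical.

def drop_indents_in_parens(tokens):
    depth = 0
    for kind, value in tokens:
        if kind == 'OP' and value == '(':
            depth += 1
        elif kind == 'OP' and value == ')':
            depth -= 1
        if kind == 'INDENT' and depth > 0:
            continue              # indentation not significant here
        yield (kind, value)

stream = [('INDENT', '    '), ('OP', '('), ('INDENT', '        '),
          ('WORD', 'x'), ('OP', ')')]
assert list(drop_indents_in_parens(stream)) == [
    ('INDENT', '    '), ('OP', '('), ('WORD', 'x'), ('OP', ')')]
```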

Another issue involves the difference between spaces/tabs and "depth".
One scheme might assign meaningful indentation to exact multiples of
4 spaces; another might only require that spaces "be consistent"; in
either case tabs may or may not be acceptable, and might have some
fixed size, or shift to some fixed set of columns.
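For concreteness, one such user-defined depth function might look like the sketch below.  The 4-space unit and 8-column tab stops are arbitrary assumptions for the example, and the function is not part of LEPL.

```python
# Map a line's leading whitespace to a "depth": columns are counted
# (tabs advance to the next tab stop), then required to be an exact
# multiple of the indent unit.  Both constants are assumptions.

def depth(prefix, tab_size=8, indent_size=4):
    cols = 0
    for ch in prefix:
        if ch == '\t':
            # advance to the next multiple of tab_size
            cols = (cols // tab_size + 1) * tab_size
        else:
            cols += 1
    if cols % indent_size:
        raise ValueError('indentation not a multiple of %d' % indent_size)
    return cols // indent_size

assert depth('        ') == 2
assert depth('\t') == 2           # one tab = 8 columns = depth 2
```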

Yet another issue is blank lines.  If these are significant, they need
to be present in the stream.  If not, they are a nuisance and should
be filtered.

Putting all this together, my first reaction was to calculate a
"depth" for each indentation (using some user defined function of the
space matched at the start of the line) and to place that value in the
(tagged token) stream.  Furthermore, blank lines would be discarded
and a global (but stack-aware) "current indentation level" maintained
as monitor state.

A NoIndent matcher (token) would then match an indentation token with
a depth equal to the global indentation level.  This would allow a
line with no indentation to be matched.

In addition, an Indent matcher would exist, which takes a list of
matchers (tokens) as an argument.  This matcher: matches an indent
"one more" than the global value; increments the global value;
evaluates the internal matchers; on exit, decrements the global value.

Within the scope of the first Indent matcher, then, a NoIndent matcher
would match lines indented by "one".
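A toy model of the NoIndent / Indent behaviour described above, run against a plain list of (depth, line) pairs instead of LEPL's token stream.  The names follow the post's proposal; none of this is LEPL's API.

```python
# NoIndent matches a line at the current level; Indent requires a line
# "one deeper", raises the level while its inner matcher runs, then
# restores it.  Toy model only, over (depth, line) pairs.

current = [0]                     # stack-aware "global" indentation level

def no_indent(stream):
    # match a line whose depth equals the current level
    depth, line = stream[0]
    if depth == current[-1]:
        return line, stream[1:]
    return None

def indent(inner, stream):
    # require a line "one deeper", raise the level while the inner
    # matcher runs (it consumes the line), then restore the level
    depth, _ = stream[0]
    if depth != current[-1] + 1:
        return None
    current.append(current[-1] + 1)
    try:
        return inner(stream)
    finally:
        current.pop()

stream = [(0, 'def f():'), (1, 'body')]
assert no_indent(stream) == ('def f():', [(1, 'body')])
assert indent(no_indent, stream[1:]) == ('body', [])
```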

That has at least two problems: first, contiguous nested Indents fail
(the second indent has nothing to match because the indent token was
consumed by the first); second, it assumes it's possible to calculate
the depth before matching.

The first problem can be addressed by making the Indent match but not
consume the token if some indentation "remains".  The second can be
fixed by using "number of spaces" rather than depth (and having Indent
either consume some fixed number of spaces (FixedIndent) or simply
take what it's given (VariableIndent)).
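The difference can be sketched numerically: FixedIndent consumes a known number of spaces and leaves the remainder for nested matchers, while VariableIndent takes whatever it is given.  These are the post's proposed names, not LEPL's API.

```python
# FixedIndent: consume exactly n spaces, reporting how many remain
# unconsumed for any nested matcher.  VariableIndent: consume all.

def fixed_indent(n, spaces):
    if spaces >= n:
        return spaces - n         # remaining, unconsumed indentation
    return None                   # not enough indentation to match

def variable_indent(spaces):
    return 0                      # take whatever is given; nothing remains

assert fixed_indent(4, 8) == 4    # a nested Indent still sees 4 spaces
assert variable_indent(8) == 0
```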

There's also the case of wanting to ignore indentation.  That is, I
think, going to require a filter.  And NoIndent might be the right
name for that matcher, in which case perhaps Line would be a good
choice for what I was calling NoIndent?  Or Sol?

After writing the above I tried writing a parser (without having any
implementation) and learnt that defining what a line is, is important,
but using a separate Eol() is messy.  It seems better to have various
ways of defining lines.  Related to that, one example use case I had
missed is automatic continuation over multiple lines (indicated by,
say, a trailing backslash).

So perhaps the basic grouping is Line(), which has a related
Multiline() for automatic continuation.  This can be combined with
Indent(), NoIndent() and AnyIndent() (the last ignores indentation).
But Multiline and AnyIndent are both so closely related to Line that
perhaps they should be options on Line.  And if we do that, shouldn't
Indent be, too?

Is there some other syntax we can use?  I don't think so - we really
need this to group matches, so it has to have a (...) form, which
means a function invocation.   Perhaps a single letter function name,
plus flags?  There's always _(...), although that's traditionally
reserved for internationalisation.
Maybe it's better to separate the main parser from the structure, and
then have some kind of complex layout expression that is itself
parsed?  I'm thinking of nested lists.  But as soon as you get any
complexity (alternatives for example), you're better using functions
and the standard approach (ie Or()).

Here's a later attempt.  I'm now using Line... to make things appear related.

        word = Token('[a-z_][a-z0-9_]*')
        number = Token(Integer)
        symbol = Token('[^a-z0-9_]')

        # any indent, entire line
        comment = LineAny(symbol('#') + Star(Any()))[:]

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + LineIgnore(atom[:,symbol(',')]) + symbol(')')
        expr = ... # this includes nested function definitions

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') + expr)
        func = Line(word('def') + word + args + symbol('=') +
                    LineIndent((expr|func|func1)[:]))

        program = (func|func1)[:]

Inline comments would require separating out the "#..." part and
adding it to the end of an expr.

But there's still a problem with nested definitions.  If LineIndent
has consumed the indentation token, what does the Line() outside a
function definition match?

I guess this indicates that LineIndent doesn't consume indentation, it
just modifies the current level.  So perhaps should be called Indent.
And perhaps LineIgnore is better called Freeform?

        # any indent, entire line
        comment = symbol('#') + Star(Any())

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + Freeform(atom[:,symbol(',')]) + symbol(')')
        simple_expr = ...
        expr = Line(simple_expr + Opt(comment))

        line_comment = LineAny(comment)

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') + expr +
                     Opt(comment))
        func = Line(word('def') + word + args + symbol('=') + Opt(comment)
                    Indent((expr|func|func1)[:]))

        program = (func|func1)[:]

And it looks like it would be useful for a user to be able to easily
define something that was like Line but added an optional comment (and
in practice this should indeed be possible).

I haven't used LineMulti() above, but that would support implicit
continuation through indented lines.


More Indentation

From: andrew cooke <andrew@...>

Date: Sun, 30 Aug 2009 18:57:52 -0400

So looking at that I noticed that there's an inconsistent mess in how
Line and Indent work together.  Above I had:

        func = Line(word('def') + word + args + symbol('=') + Opt(comment)
                    Indent((expr|func|func1)[:]))

There's no operator between the comment and the Indent.  My initial
reaction was that I had missed a &, but then Line would extend past
end of line (dropping Eol and inferring it from the next indentation
doesn't solve the underlying problem, I think, which is that this is
conflating several distinct concerns).

Looking at the problem again, there are several different "things
being done":
1 - Defining what constitutes a single line.
2 - Grouping lines with the same indentation together.
3 - Indicating how one *group* is indented relative to another.
The confusion above comes from conflating (3) with indicating how
*lines* indent relative to each other.

Using this new approach is going to be more verbose, but also
clearer.  On the positive side, it separates two distinct tasks -
the definition of lines and of blocks - and might even allow more
useful combinators to be defined for specific cases.

It's worth clarifying the difference between line and block.  A line
starts with an empty indentation and ends with the end of line marker.
There is an exception with Freeform(), which discards indentation and
eol markers - if that is used inside a line then the line can extend
over several lines of text.

A block contains one or more matchers that consume lines (these can
be simple lines, or they may be combined using Or, And etc).  Because
they work by incrementing/decrementing the number of spaces "used"
they do not alter the stream flow in any way.  This means that there
is very little constraint on how lines inside a block are combined.
If a block contains several lines, separated by commas, then they are
joined by And().
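A toy version of that bookkeeping, with invented names: the block raises the number of spaces "used" while its line matchers run, and consumes nothing from the stream itself, so its lines can be combined freely.

```python
# A block only adjusts the required indentation around its matchers;
# the stream is untouched.  Toy model over (spaces, text) pairs.

required = [0]                    # spaces "used" by enclosing blocks

def line(text, stream):
    # match a line at exactly the required indentation
    spaces, content = stream[0]
    if spaces == required[-1] and content == text:
        return content, stream[1:]
    return None

def block(indent, matcher, stream):
    # raise the requirement (e.g. by a fixed increment), run the
    # enclosed matcher, then restore - constant cost, no stream copy
    required.append(required[-1] + indent)
    try:
        return matcher(stream)
    finally:
        required.pop()

stream = [(0, 'def f():'), (4, 'body')]
assert line('def f():', stream) == ('def f():', [(4, 'body')])
assert block(4, lambda s: line('body', s), [(4, 'body')]) == ('body', [])
```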

Given the above, Indent() seems like it is really an initial parameter
to a block, which sets the space to consume (perhaps by reading the
indentation, perhaps by simply incrementing by some fixed amount).

So the syntax would be Block(indent, Line(), Line()...) where indent
is some value (number of spaces, or some function)?  But this value is
likely global, so we can set it in config and use a keyword argument
for exceptional cases.

The example then becomes:

        word = Token('[a-z_][a-z0-9_]*')
        number = Token(Integer)
        symbol = Token('[^a-z0-9_]')

        # any indent, entire line
        comment = symbol('#') + Star(Any())

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + Freeform(atom[:,symbol(',')]) + symbol(')')
        simple_expr = ...
        expr = Line(simple_expr + Opt(comment))

        line_comment = LineAny(comment)

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') + \
                    expr + Opt(comment))
        func = Line(word('def') + word + args + symbol('=') + \
                  Opt(comment)) + Block((expr|func|func1)[:])

        program = (func|func1)[:]

