| Andrew Cooke | Contents | Latest | RSS | Twitter | Previous | Next


Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Lepl parser for Python.

Colorless Green.

Photography around Santiago.

SVG experiment.

Professional Portfolio

Calibration of seismometers.

Data access via web services.

Cache rewrite.

Extending OpenSSH.

C-ORM: docs, API.

Last 100 entries

Can't Help Thinking It's Thoughtcrime; Mefi Quotes; Spray Painting Bike Frame; Weeks 10 + 11; Change: No Longer Possible To Merge Metadata; Books on Old Age; Health Tree Maps; MRA - Men's Rights Activists; Writing Good C++14; Risk Assessment - Fukushima; The Future of Advertising and Surveillance; Travelling With Betaferon; I think I know what I dislike so much about Metafilter; Weeks 8 + 9; More; Pastamore - Bad Italian in Vitacura; History Books; Iraq + The (UK) Governing Elite; Answering Some Hard Questions; Pinochet: The Dictator's Shadow; An Outsider's Guide To Julia Packages; Nobody gives a shit; Lepton Decay Irregularity; An Easier Way; Julia's BinDeps (aka How To Install Cairo); Good Example Of Good Police Work (And Anonymity Being Hard); Best Santiago Burgers; Also; Michael Emmerich (Vibrator Translator) Interview (Japanese Books); Clarice Lispector (Brazillian Writer); Books On Evolution; Looks like Ara (Modular Phone) is dead; Index - Translations From Chile; More Emotion in Chilean Wines; Week 7; Aeon Magazine (Science-ish); QM, Deutsch, Constructor Theory; Interesting Talk Transcripts; Interesting Suggestion Of Election Fraud; "Hard" Books; Articles or Papers on depolarizing the US; Textbook for "QM as complex probabilities"; SFO Get Libor Trader (14 years); Why Are There Still So Many Jobs?; Navier Stokes Incomplete; More on Benford; FBI Claimed Vandalism; Architectural Tessellation; Also: Go, Blake's 7; Delusions of Gender (book); Crypto AG DID work with NSA / GCHQ; UNUMS (Universal Number Format); MOOCs (Massive Open Online Courses); Interesting Looking Game; Euler's Theorem for Polynomials; Weeks 3-6; Reddit Comment; Differential Cryptanalysis For Dummies; Japanese Graphic Design; Books To Be Re-Read; And Today I Learned Bugs Need Clear Examples; Factoring a 67 bit prime in your head; Islamic Geometric Art; Useful Julia Backtraces from Tasks; Nothing, however, is lost with less discomfort than that which, when lost, cannot be missed; Article on Didion; Cost of Living by City; British Slavery; Derrida on Metaphor; African SciFi; Traits in Julia; Alternative Japanese Lit; Pulic Key as Address (Snow); Why Information Grows; The Blindness Of The Chilean Elite; Some Victoriagate Links; This Is Why I Left StackOverflow; New TLS Implementation; Maths for Physicists; How I Am 8; 1000 Word Philosophy; Cyberpunk Reading List; Detailed Discussion of Message Dispatch in ParserCombinator Library for Julia; FizzBuzz in Julia w Dependent Types; kokko - Design Shop in Osaka; Summary of Greece, Currently; LLVM and GPUs; See Also; Schoolgirl Groyps (Maths); Japanese Lit; Another Example - Modular Arithmetic; Music from United; Python 2 and 3 compatible alternative.; Read Agatha Christie for the Plot; A Constructive Look at TempleOS; Music Thread w Many Recommendations; Fixed Version; A Useful Julia Macro To Define Equality And Hash; k3b cdrom access, OpenSuse 13.1; Week 2; From outside, the UK looks less than stellar

© 2006-2015 Andrew Cooke (site) / post authors (content).

Next Step for RXPY/Lepl integration

From: andrew cooke <andrew@...>

Date: Sun, 29 May 2011 11:22:09 -0400

RXPY is a Python 2 project that implements the re2 approach to regexps in pure
Python.  Lepl is a recursive descent parser that can delegate to a regular
expression library in various areas (for example, it can compile some
sub-parsers to regular expressions).

Lepl contains two regexp implementations - NFA and DFA - but they have some
 - They don't implement all regexp features
 - They don't implement the standard Python interface for regular
   expressions (re package)

RXPY addresses both of these issues - all the engines are feature complete and
have the Python interface (note that the docs for RXPY are very incomplete -
the package is in a much better state than described).

Both Lepl and RXPY support both backtracking and "stable" algorithms (stable
in that they don't suffer from combinatorial explosion on certain patterns).
However, the Lepl backtracking engine is particular slow at handling at
pre-compiling a common pattern (for float values).  RXPY has not been tested
on this.

However, Lepl needs more than the standard Python interface.  It requires:
 1 Input to be wrapped in a "stream" abstraction.  This allows "infinite"
   data to be handled with finite resources.
 2 Python 3 support
 3 Multiple and ordered matches
 4 Fast matching of entire groups and lazy matching of sub-groups

The last two points need more explanation:

(3) Multiple and ordered matches are used by the tokenizer.  A set of
different regular expressions are matched against the input and the longest
match wins.  This match must be associated with input pattern (ie we must know
which pattern won); if multiple patterns tie with the longest match we want to
know all patterns; sometimes we may want to tie this back to the ordering of
patterns (in particular, we want the "first" pattern that matches).

This can be achieved by combining named groups with "or" (the | symbol), but
named groups require a slower approach in RXPY.  It may be that the special
case of groups that match the entire string can be optimised into the faster
approach (the main RXPY matcher uses a fast initial match and the re-processes
for groups).  For example, the group name can be embedded in the final (exit)
node(s) of the graph that describe the regular expression.

(4) People tend to use "(...)" to group patterns in regular expressions, even
when they do not want to access the results.  Above I described how RXPY uses
two approaches - a faster, group-free stage and then re-processing for groups.
Currently the second stage is triggered by the expression itself, as it
matches.  Instead, I believe that we can wait until the user requests a named
group.  Since, in many situations, the fast stage can tell whether the
expression matched, this avoids the second stage unless it is "really needed".

One final aim: I would like to make the code particularly well documented.
According to my website logs there is quite a lot of interest in implementing
regular expressions (from students, I guess).  So a clear, well documented
implementation could be useful.


More on Lepl + RXPY

From: andrew cooke <andrew@...>

Date: Sun, 26 Jun 2011 20:09:47 -0400

None of the work above is done yet; I am still moving code from RXPY to Lepl
(I'm modifying it slightly as I go and have been slowed considerably by
the switch to Intellij Idea / PyCharm).

One change I need to make to the code is to make it run with Python 3.  RXPY
was developed under Python 2.  The difference between P 2 and 3 is significant
for code that deals with text because the string/Unicode types changed

RXPY contains the idea of "alphabets" (also present in Lepl).  These define
which "characters" can be matched in the input.  This is not restricted to
text - you could use RXPY to match lists of integers, for example.  But these
are *decoupled* from the input.  RXPY assumes that the input is text - one of
the tasks of an alphabet is to translate from the text representation to the
types that will be matched.

The closes that Python 2's re package has to this is the Unicode/ASCII flag.
So the old RXPY code had ASCII and Unicode alphabets.  But in Pythin 3 there
is a new, and orthogonal twist - re works with both strings and bytes.

More confusingly, the ASCII/Unicode distinction is separate from the
bytes/string distinction.  ASCII/Unicode only affects (as far as I can see)
how character classes are interpreted (things like upper and lower case, or
what is a digit, and what not).

Also, Python 3's re package expects the input expression to match the input
data.  So if you want to match bytes, you give a byte string regexp.

How can this fit with RXPY?  In my opinion RXPY's approach is much cleaner and
I don't want to lose that.  At the same time, compatability is relatively

One easy change is to align alphabets with byte/string rather than
ASCII/Unicode.  So I am going to change RXPY so that it has alphabets for
Unicode and for bytes.  At the same time, the ASCII/Unicode flag is going to
be just a flag - something that is read by both alphabets and changes their
behaviour accordingly.

But what about the coupling between the data format and the format used to
describe the regexp?  That seems completely dumb to me.  What I will do, I
think, is add a special test that duplicates Python's behaviour, but which can
be disabled via an option.  If enabled, it will require that alphabets match
the regexp.  If not, it will handle both unicode and bytes (where the bytes
are representing ASCII characters, where necessary).


Yet More...

From: andrew cooke <andrew@...>

Date: Mon, 27 Jun 2011 00:19:11 -0400

OK, after thinking some more:

- Alphabets will define the type of the input they expect.  All but the bytes
  alphabet will inherit from a base class that expects normal strings
  (unicode).  The bytes alphabet will expect bytes.

- Alphabets will also define the constants used in the parsing (things like
  "?" and "*").  The parser will make comparisons against these.

- Alphabets will be extended, where necesary, to take the ASCII/Unicode flag
  and act accordingly.

This makes RXPY more general, while bringing it closer to Python.


Comment on this post