Andrew Cooke | Contents | Latest | RSS | Twitter | Previous | Next


Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Lepl parser for Python.

Colorless Green.

Photography around Santiago.

SVG experiment.

Professional Portfolio

Calibration of seismometers.

Data access via web services.

Cache rewrite.

Extending OpenSSH.

C-ORM: docs, API.

Last 100 entries

Copier Quotes for Cat Soft LLC; [Book] Galileo's Middle Finger; VOIP quote for Cat Soft LLC; [Bike] Chinese Carbon Rims; Collection Agencies for Cat Soft LLC; Get Coffee Quotes for Cat Soft LLC; [Bike] Servicing Shimano XT Front Hub HB-M8010; [Bike] Aliexpress Cycling Tops; Now Is Cat Soft LLC's Chance To Save Up To 32% On Mail; Call Center Services for Cat Soft LLC; [Computing] Change to ssh handling of multiple identities?; [Bike] Endura Hummvee Lite II; [Computing] Marble Based Logic; [Link, Politics] Sanity Check For Nuclear Launch; [Link, Science] Entropy and Life; [Link, Bike] Cheap Cycling Jerseys; [Link, Music] Music To Steal 2017; [Link, Future] Simulated Brain Drives Robot; [Link, Computing] Learned Index Structures; Solo Air Equalization; Update: Higher Pressures; Psychology; [Bike] Exercise And Fuel; Continental Race King 2.2; Removing Lowers; Mnesiacs; [Maths, Link] Dividing By Zero; [Book, Review] Ray Monk - Ludwig Wittgenstein: The Duty Of Genius; [Link, Bike, Computing] Evolving Lacing Patterns; [Jam] Strawberry and Orange Jam; [Chile, Privacy] Biometric Check During Mail Delivery; [Link, Chile, Spanish] Article on the Chilean Drought; [Bike] Extended Gear Ratios, Shimano XT M8000 (24/36 Chainring); [Link, Politics, USA] The Future Of American Democracy; Mass Hysteria; [Review, Books, Links] Kazuo Ishiguro - Never Let Me Go; [Link, Books] David Mitchell's Favourite Japanese Fiction; [Link, Bike] Rear Suspension Geometry; [Link, Cycling, Art] Strava Artwork; [Link, Computing] Useful gcc flags; [Link] Voynich Manuscript Decoded; [Bike] Notes on Servicing Suspension Forks; [Links, Computing] Snap, Flatpack, Appimage; [Link, Computing] Oracle is leaving Java (to die); [Link, Politics] Cubans + Ultrasonics; [Book, Link] Laurent Binet; VirtualBox; [Book, Link] No One's Ways; [Link] The Biggest Problem For Cyclists Is Bad Driving; [Computing] Doxygen, Sphinx, Breathe; [Admin] Brokw Recent Permalinks; [Bike, Chile] Buying Bearings in Santiago; [Computing, Opensuse] Upgrading to 42.3; [Link, Physics] First Support for a Physics Theory of Life; [Link, Bike] Peruvian Frame Maker; [Link] Awesome Game Theory Tit-For-Tat Thing; [Food, Review] La Fabbrica - Good Italian Food In Santiago; [Link, Programming] MySQL UTF8 Broken; [Link, Books] Latin American Authors; [Link, Computing] Optimizatin Puzzle; [Link, Books, Politics] Orwell Prize; [Link] What the Hell Is Happening With Qatar?; [Link] Deep Learning + Virtual Tensor Machines; [Link] Scaled Composites: Largest Wingspan Ever; [Link] SCP Foundation; [Bike] Lessons From 2 Leading 2 Trailing; [Link] Veg Restaurants in Santiago; [Link] List of Contemporary Latin American Authors; [Bike] FTHR; [Link] Whoa - NSA Reduces Collection (of US Residents); [Link] Red Bull's Breitbart; [Link] Linux Threads; [Link] Punycode; [Link] Bull / Girl Statues on Wall Street; [Link] Beautiful Chair Video; Update: Lower Pressures; [Link] Neat Python Exceptions; [Link] Fix for Windows 10 to Avoid Ads; [Link] Attacks on ZRTP; [Link] UK Jazz Invasion; [Review] Cuba; [Link] Aricle on Gender Reversal of US Presidential Debate; {OpenSuse] Fix for Network Offline in Updater Applet; [Link] Parkinson's Related to Gut Flora; Farellones Bike Park; [Meta] Tags; Update: Second Ride; Schwalbe Thunder Burt 2.1 v Continental X-King 2.4; Mountain Biking in Santiago; Books on Ethics; Security Fail from Command Driven Interface; Everything Old is New Again; Interesting Take on Trump's Lies; Chutney v6; References on Entropy; Amusing "Alexa.." broadcast; The Shame of Chile's Education System; Playing mp4 gifs in Firefox on Opensuses Leap 42.2; Concurrency at Microsoft; Globalisation: Uk -> Chile; OpenSuse 42.2 and Synaptics Touch-Pads

© 2006-2017 Andrew Cooke (site) / post authors (content).

Next Step for RXPY/Lepl integration

From: andrew cooke <andrew@...>

Date: Sun, 29 May 2011 11:22:09 -0400

RXPY is a Python 2 project that implements the re2 approach to regexps in pure
Python.  Lepl is a recursive descent parser that can delegate to a regular
expression library in various areas (for example, it can compile some
sub-parsers to regular expressions).

Lepl contains two regexp implementations - NFA and DFA - but they have some
 - They don't implement all regexp features
 - They don't implement the standard Python interface for regular
   expressions (re package)

RXPY addresses both of these issues - all the engines are feature complete and
have the Python interface (note that the docs for RXPY are very incomplete -
the package is in a much better state than described).

Both Lepl and RXPY support both backtracking and "stable" algorithms (stable
in that they don't suffer from combinatorial explosion on certain patterns).
However, the Lepl backtracking engine is particular slow at handling at
pre-compiling a common pattern (for float values).  RXPY has not been tested
on this.

However, Lepl needs more than the standard Python interface.  It requires:
 1 Input to be wrapped in a "stream" abstraction.  This allows "infinite"
   data to be handled with finite resources.
 2 Python 3 support
 3 Multiple and ordered matches
 4 Fast matching of entire groups and lazy matching of sub-groups

The last two points need more explanation:

(3) Multiple and ordered matches are used by the tokenizer.  A set of
different regular expressions are matched against the input and the longest
match wins.  This match must be associated with input pattern (ie we must know
which pattern won); if multiple patterns tie with the longest match we want to
know all patterns; sometimes we may want to tie this back to the ordering of
patterns (in particular, we want the "first" pattern that matches).

This can be achieved by combining named groups with "or" (the | symbol), but
named groups require a slower approach in RXPY.  It may be that the special
case of groups that match the entire string can be optimised into the faster
approach (the main RXPY matcher uses a fast initial match and the re-processes
for groups).  For example, the group name can be embedded in the final (exit)
node(s) of the graph that describe the regular expression.

(4) People tend to use "(...)" to group patterns in regular expressions, even
when they do not want to access the results.  Above I described how RXPY uses
two approaches - a faster, group-free stage and then re-processing for groups.
Currently the second stage is triggered by the expression itself, as it
matches.  Instead, I believe that we can wait until the user requests a named
group.  Since, in many situations, the fast stage can tell whether the
expression matched, this avoids the second stage unless it is "really needed".

One final aim: I would like to make the code particularly well documented.
According to my website logs there is quite a lot of interest in implementing
regular expressions (from students, I guess).  So a clear, well documented
implementation could be useful.


More on Lepl + RXPY

From: andrew cooke <andrew@...>

Date: Sun, 26 Jun 2011 20:09:47 -0400

None of the work above is done yet; I am still moving code from RXPY to Lepl
(I'm modifying it slightly as I go and have been slowed considerably by
the switch to Intellij Idea / PyCharm).

One change I need to make to the code is to make it run with Python 3.  RXPY
was developed under Python 2.  The difference between P 2 and 3 is significant
for code that deals with text because the string/Unicode types changed

RXPY contains the idea of "alphabets" (also present in Lepl).  These define
which "characters" can be matched in the input.  This is not restricted to
text - you could use RXPY to match lists of integers, for example.  But these
are *decoupled* from the input.  RXPY assumes that the input is text - one of
the tasks of an alphabet is to translate from the text representation to the
types that will be matched.

The closes that Python 2's re package has to this is the Unicode/ASCII flag.
So the old RXPY code had ASCII and Unicode alphabets.  But in Pythin 3 there
is a new, and orthogonal twist - re works with both strings and bytes.

More confusingly, the ASCII/Unicode distinction is separate from the
bytes/string distinction.  ASCII/Unicode only affects (as far as I can see)
how character classes are interpreted (things like upper and lower case, or
what is a digit, and what not).

Also, Python 3's re package expects the input expression to match the input
data.  So if you want to match bytes, you give a byte string regexp.

How can this fit with RXPY?  In my opinion RXPY's approach is much cleaner and
I don't want to lose that.  At the same time, compatability is relatively

One easy change is to align alphabets with byte/string rather than
ASCII/Unicode.  So I am going to change RXPY so that it has alphabets for
Unicode and for bytes.  At the same time, the ASCII/Unicode flag is going to
be just a flag - something that is read by both alphabets and changes their
behaviour accordingly.

But what about the coupling between the data format and the format used to
describe the regexp?  That seems completely dumb to me.  What I will do, I
think, is add a special test that duplicates Python's behaviour, but which can
be disabled via an option.  If enabled, it will require that alphabets match
the regexp.  If not, it will handle both unicode and bytes (where the bytes
are representing ASCII characters, where necessary).


Yet More...

From: andrew cooke <andrew@...>

Date: Mon, 27 Jun 2011 00:19:11 -0400

OK, after thinking some more:

- Alphabets will define the type of the input they expect.  All but the bytes
  alphabet will inherit from a base class that expects normal strings
  (unicode).  The bytes alphabet will expect bytes.

- Alphabets will also define the constants used in the parsing (things like
  "?" and "*").  The parser will make comparisons against these.

- Alphabets will be extended, where necesary, to take the ASCII/Unicode flag
  and act accordingly.

This makes RXPY more general, while bringing it closer to Python.


Comment on this post