Andrew Cooke | Contents | Latest | RSS | Previous | Next

C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Choochoo Training Diary

Last 100 entries

[Programming] React Leaflet; AliExpress Independent Sellers; Applebaum - Twilight of Democracy; [Politics] Back + US Elections; [Programming,Exercise] Simple Timer Script; [News] 2019: The year revolt went global; [Politics] The world's most-surveilled cities; [Bike] Hope Freehub; [Restaurant] Mama Chau's (Chinese, Providencia); [Politics] Brexit Podcast; [Diary] Pneumonia; [Politics] Britain's Reichstag Fire moment; install cairo; [Programming] GCC Sanitizer Flags; [GPU, Programming] Per-Thread Program Counters; My Bike Accident - Looking Back One Year; [Python] Geographic heights are incredibly easy!; [Cooking] Cookie Recipe; Efficient, Simple, Directed Maximisation of Noisy Function; And for argparse; Bash Completion in Python; [Computing] Configuring Github Jekyll Locally; [Maths, Link] The Napkin Project; You can Masquerade in Firewalld; [Bike] Servicing Budget (Spring) Forks; [Crypto] CIA Internet Comms Failure; [Python] Cute Rate Limiting API; [Causality] Judea Pearl Lecture; [Security, Computing] Chinese Hardware Hack Of Supermicro Boards; SQLAlchemy Joined Table Inheritance and Delete Cascade; [Translation] The Club; [Computing] Super Potato Bruh; [Computing] Extending Jupyter; Further HRM Details; [Computing, Bike] Activities in ch2; [Books, Link] Modern Japanese Lit; What ended up there; [Link, Book] Logic Book; Update - Garmin Express / Connect; Garmin Forerunner 35 v 230; [Link, Politics, Internet] Government Trolls; [Link, Politics] Why identity politics benefits the right more than the left; SSH Forwarding; A Specification For Repeating Events; A Fight for the Soul of Science; [Science, Book, Link] Lost In Math; OpenSuse Leap 15 Network Fixes; Update; [Book] Galileo's Middle Finger; [Bike] Chinese Carbon Rims; [Bike] Servicing Shimano XT Front Hub HB-M8010; [Bike] Aliexpress Cycling Tops; [Computing] Change to ssh handling of multiple identities?; [Bike] Endura Hummvee Lite II; [Computing] Marble Based Logic; [Link, Politics] Sanity Check For Nuclear Launch; [Link, Science] Entropy and Life; [Link, Bike] Cheap Cycling Jerseys; [Link, Music] Music To Steal 2017; [Link, Future] Simulated Brain Drives Robot; [Link, Computing] Learned Index Structures; Solo Air Equalization; Update: Higher Pressures; Psychology; [Bike] Exercise And Fuel; Continental Race King 2.2; Removing Lowers; Mnesiacs; [Maths, Link] Dividing By Zero; [Book, Review] Ray Monk - Ludwig Wittgenstein: The Duty Of Genius; [Link, Bike, Computing] Evolving Lacing Patterns; [Jam] Strawberry and Orange Jam; [Chile, Privacy] Biometric Check During Mail Delivery; [Link, Chile, Spanish] Article on the Chilean Drought; [Bike] Extended Gear Ratios, Shimano XT M8000 (24/36 Chainring); [Link, Politics, USA] The Future Of American Democracy; Mass Hysteria; [Review, Books, Links] Kazuo Ishiguro - Never Let Me Go; [Link, Books] David Mitchell's Favourite Japanese Fiction; [Link, Bike] Rear Suspension Geometry; [Link, Cycling, Art] Strava Artwork; [Link, Computing] Useful gcc flags; [Link] Voynich Manuscript Decoded; [Bike] Notes on Servicing Suspension Forks; [Links, Computing] Snap, Flatpack, Appimage; [Link, Computing] Oracle is leaving Java (to die); [Link, Politics] Cubans + Ultrasonics; [Book, Link] Laurent Binet; VirtualBox; [Book, Link] No One's Ways; [Link] The Biggest Problem For Cyclists Is Bad Driving; [Computing] Doxygen, Sphinx, Breathe; [Admin] Brokw Recent Permalinks; [Bike, Chile] Buying Bearings in Santiago; [Computing, Opensuse] Upgrading to 42.3; [Link, Physics] First Support for a Physics Theory of Life; [Link, Bike] Peruvian Frame Maker; [Link] Awesome Game Theory Tit-For-Tat Thing; [Food, Review] La Fabbrica - Good Italian Food In Santiago; [Link, Programming] MySQL UTF8 Broken; [Link, Books] Latin American Authors

© 2006-2017 Andrew Cooke (site) / post authors (content).

Next Step for RXPY/Lepl integration

From: andrew cooke <andrew@...>

Date: Sun, 29 May 2011 11:22:09 -0400

RXPY is a Python 2 project that implements the re2 approach to regexps in pure
Python.  Lepl is a recursive descent parser that can delegate to a regular
expression library in various areas (for example, it can compile some
sub-parsers to regular expressions).

Lepl contains two regexp implementations - NFA and DFA - but they have some
problems:
 - They don't implement all regexp features
 - They don't implement the standard Python interface for regular
   expressions (re package)

RXPY addresses both of these issues - all the engines are feature complete and
have the Python interface (note that the docs for RXPY are very incomplete -
the package is in a much better state than described).

Both Lepl and RXPY support both backtracking and "stable" algorithms (stable
in that they don't suffer from combinatorial explosion on certain patterns).
However, the Lepl backtracking engine is particular slow at handling at
pre-compiling a common pattern (for float values).  RXPY has not been tested
on this.

However, Lepl needs more than the standard Python interface.  It requires:
 1 Input to be wrapped in a "stream" abstraction.  This allows "infinite"
   data to be handled with finite resources.
 2 Python 3 support
 3 Multiple and ordered matches
 4 Fast matching of entire groups and lazy matching of sub-groups

The last two points need more explanation:

(3) Multiple and ordered matches are used by the tokenizer.  A set of
different regular expressions are matched against the input and the longest
match wins.  This match must be associated with input pattern (ie we must know
which pattern won); if multiple patterns tie with the longest match we want to
know all patterns; sometimes we may want to tie this back to the ordering of
patterns (in particular, we want the "first" pattern that matches).

This can be achieved by combining named groups with "or" (the | symbol), but
named groups require a slower approach in RXPY.  It may be that the special
case of groups that match the entire string can be optimised into the faster
approach (the main RXPY matcher uses a fast initial match and the re-processes
for groups).  For example, the group name can be embedded in the final (exit)
node(s) of the graph that describe the regular expression.

(4) People tend to use "(...)" to group patterns in regular expressions, even
when they do not want to access the results.  Above I described how RXPY uses
two approaches - a faster, group-free stage and then re-processing for groups.
Currently the second stage is triggered by the expression itself, as it
matches.  Instead, I believe that we can wait until the user requests a named
group.  Since, in many situations, the fast stage can tell whether the
expression matched, this avoids the second stage unless it is "really needed".

One final aim: I would like to make the code particularly well documented.
According to my website logs there is quite a lot of interest in implementing
regular expressions (from students, I guess).  So a clear, well documented
implementation could be useful.

Andrew

More on Lepl + RXPY

From: andrew cooke <andrew@...>

Date: Sun, 26 Jun 2011 20:09:47 -0400

None of the work above is done yet; I am still moving code from RXPY to Lepl
(I'm modifying it slightly as I go and have been slowed considerably by
the switch to Intellij Idea / PyCharm).

One change I need to make to the code is to make it run with Python 3.  RXPY
was developed under Python 2.  The difference between P 2 and 3 is significant
for code that deals with text because the string/Unicode types changed
significantly.

RXPY contains the idea of "alphabets" (also present in Lepl).  These define
which "characters" can be matched in the input.  This is not restricted to
text - you could use RXPY to match lists of integers, for example.  But these
are *decoupled* from the input.  RXPY assumes that the input is text - one of
the tasks of an alphabet is to translate from the text representation to the
types that will be matched.

The closes that Python 2's re package has to this is the Unicode/ASCII flag.
So the old RXPY code had ASCII and Unicode alphabets.  But in Pythin 3 there
is a new, and orthogonal twist - re works with both strings and bytes.

More confusingly, the ASCII/Unicode distinction is separate from the
bytes/string distinction.  ASCII/Unicode only affects (as far as I can see)
how character classes are interpreted (things like upper and lower case, or
what is a digit, and what not).

Also, Python 3's re package expects the input expression to match the input
data.  So if you want to match bytes, you give a byte string regexp.

How can this fit with RXPY?  In my opinion RXPY's approach is much cleaner and
I don't want to lose that.  At the same time, compatability is relatively
important.

One easy change is to align alphabets with byte/string rather than
ASCII/Unicode.  So I am going to change RXPY so that it has alphabets for
Unicode and for bytes.  At the same time, the ASCII/Unicode flag is going to
be just a flag - something that is read by both alphabets and changes their
behaviour accordingly.

But what about the coupling between the data format and the format used to
describe the regexp?  That seems completely dumb to me.  What I will do, I
think, is add a special test that duplicates Python's behaviour, but which can
be disabled via an option.  If enabled, it will require that alphabets match
the regexp.  If not, it will handle both unicode and bytes (where the bytes
are representing ASCII characters, where necessary).

Andrew

Yet More...

From: andrew cooke <andrew@...>

Date: Mon, 27 Jun 2011 00:19:11 -0400

OK, after thinking some more:

- Alphabets will define the type of the input they expect.  All but the bytes
  alphabet will inherit from a base class that expects normal strings
  (unicode).  The bytes alphabet will expect bytes.

- Alphabets will also define the constants used in the parsing (things like
  "?" and "*").  The parser will make comparisons against these.

- Alphabets will be extended, where necesary, to take the ASCII/Unicode flag
  and act accordingly.

This makes RXPY more general, while bringing it closer to Python.

Andrew

Comment on this post