Wrestling PCRE

In my opinion Perl revolutionized programming by making certain routines into daily tools, two of the most important ones are hash table and regular expression(regex). Both hash table and regex are mature algorithms well before Perl, but Perl managed to fit these tools into programmers' tool belt. The convenience eventually alters thinking -- with hammer at hand, every nail is a pleasure; with Perl, we start to see everywhere calling for hash and regex, and programming become easier or higher-level.

There are mature libraries for both hash table and regex in C. In fact, there are many of them. However, none can honestly fit to the tool belt. For hash tables, there are string key management issues and bucket size considerations. For regex, there is the process of compiling, execution and capture. However, I think it is possible to program in C in a high-level style as Perl. All we needed is an extra meta-layer to help us transform our high-level syntax into lower level C code. In my last post, I demonstrated how to do that with MyDef for hash table. In this post, let's do something about regex.

Using PCRE

For hash table, I essentially rolled a custom hash table implementation because it is relatively simple and it is not unusual for applications that require low-level tweaking for optimum performance. Regex is not difficult to implement for original basic form, but we will not be satisfied with basic in this case, don't we? We will need a "Perl-compatible-regular-expression" -- all the fancy features we have been used to. For that, it is complicated and it makes sense to re-use other people's wheel. We'll just use PCRE.

It is illusionary to think PCRE makes regex as easy as in Perl. It takes one minute to learn Perl's regex and start using it. I spent full 10 minutes on pcre.org(which is not my first time) before I went on to google "PCRE tutorial". Nevertheless, I'll assume readers are familiar with both regex and PCRE. Let's start writing code.

Sample Program

For convenience, we'll write the following sample program. First,


$ wget -O t.html www.reddit.com
                                
Then we'll program in C to extract the links from "t.html". We'll use this regex:

/class="title [^"]*" href="(http[^"]*)"/g
                                
Let's not worry that it may not be the perfect regex. It worked when I tested it. And this is the point of having regex on your tool belt: we can write high-level code without worrying about too much details -- we always can spend time later to tune the code -- and we should.

Without any fuss, I have the following working code:


#include <pcre.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

int main(int argc, char** argv){
    char * s_file;
    struct stat fstat;
    int tn_ret;
    int n;
    char * s;
    FILE * file_in;
    int tn_pcre_pos = 0;
    pcre * re1;
    const char * pcre_err_msg;
    int pcre_err_pos;
    int pn_pcre_off[60];
    char * ts_url;

    re1 = pcre_compile("class=\"title [^\"]*\" href=\"(http[^\"]*)\"", 0, &pcre_err_msg, &pcre_err_pos, NULL);
    if(!re1){
        puts("pcre_compile error: class=\"title [^\"]*\" href=\"(http[^\"]*)\"");
    }
    s_file = "t.html";
    tn_ret = stat(s_file, &fstat);
    n = fstat.st_size;
    s=(char*)malloc((n+1)*sizeof(char));
    memset(s, 0, (n+1)*sizeof(char));
    file_in = fopen(s_file, "rb");
    if(file_in == NULL){
        fprintf(stderr, "Can't open %s\n", s_file);
        exit(1)
    }
    else{
        fread(s, 1, n, file_in);
        fclose(file_in);
    }
    while(pcre_exec(re1, NULL, s, strlen(s), tn_pcre_pos, 0, pn_pcre_off, 60)>0 && (tn_pcre_pos=pn_pcre_off[1])){
        ts_url = s + pn_pcre_off[2];
        printf("link: %.*s\n", pn_pcre_off[3]-pn_pcre_off[2], ts_url);
    }
    free(s);
    free(re1);
    return 0;
}
                            

We can compile and run to see it works:


$ gcc -o test test.c -lpcre && ./test
link: http://bigstory.ap.org/...
link: http://i.imgur.com/...
...
                            

So far so good, but ...

I'm Not Happy

It is too much struggle to write such a simple program, and the code looks like noise.

I have this theory of readable code -- A readable code should only contain two parts: the part we know to ignore and the part that can fit into our working memory, which are 5-7 items, and it should mostly consists of the latter. In the C code above, through training, I easily can ignore the #include ... lines, and I skip over the main declaration, then my brain kicks into the reading mode, which get lost after 5-7 lines. So I back out, ignore the variable declaration part, then again get lost after a few lines...

Well, I lied. Actually I know what the code does, and I can read through all the code lines, but it is nevertheless a struggle, constantly trying to sort out the contexts.

I need do something about it ... and this article is not about C ...

MyDef -- level 0

The idea is to have a meta layer above the code we normally write, so we can implement higher-level intelligence (in common language, conventions) to translate from higher level syntax to the precise code that runs. And this layer is more of an ad-hoc nature, just like our daily real life.

To illustrate, we'll start with level 0. I implemented this meta-layer in Perl and I called it MyDef; but in principle, you can implement your meta-layer in any languages and call it whatever. What is important is the idea.

At level 0, the layer simply spits out the code as is:


page: test
    #include <pcre.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>

    int main(int argc, char** argv){
        char * s_file;
        struct stat fstat;
        ...
                            

Exactly the same code as before except now they are tucked under a page directive. A page is simply a file (you can relate that to an html page). The immediate purpose is to mark up the code into context hierarchy (ideally to have 5-7 items at each context). To achieve that, we just need certain macro-level markup syntax. I chose to use indentation as it is the most readable/simple way. Again, it is OK to take other preference.

So here it is, at level 0, with one markup -- a page/file. We'll need an extra translation step now:


$ mydef_page -mc test.def
PAGE: test
    --> [./test.c]
$ gcc -o test test.c -lpcre && ./test
...
                            

mydef_page is the MyDef translator. -mcuses the C output module.

So far so good, but nothing really changed. So, what's the point?

MyDef -- Level 1

Now, let's try some markup:


page: test
    $call includes
    int main(int argc, char** argv){
        $call declare_variables
        $call read_in_file
        $call regex_compile
        while ($(regex_match)){
            $call regex_capture, ts_url
            printf("link: %.*s\n", $(ts_url_len), ts_url)
        }
        $call clean_up
    }

subcode: includes
    ...
subcode: declare_variables
    ...
subcode: read_in_file
    ...
subcode: regex_compile
    ...
    $(export:regex_match=...)
subcode: regex_capture(var)
    ...
subcode: clean_up
    ...
                            

For brevity, I omitted some of the code -- they are nothing but block of C code straight from the original.

There is nothing changed at the C output. But with the level-1 markup, we reorganized the code into a much more readable fashion. Within the first 10 lines of main code, we understand the program in its entirety. Then, we selectively navigate to individual code block. At each code block, we have a clear local context (such as the purpose of the code) as well as shorter size to work on.

In addition to subcode blocks, MyDef also provides in-line macros, which are those "$(...)". Again the choice was more of my ad-hoc decisions. We need a way to not to interfere with the native language (so that if we know how to write code in native language, we do not need re-learn). I have grown to like the ugly signature of these macros, as they give me clear indication of their context.

The subcode can carry parameters, and it seems to be a good idea for read_in_file:


page: test
    int main(){
        ...
        $call read_in_file, "t.html"
        ...
    }
subcode: read_in_file(file)
    s_file = $(file);
    stat(s_file, &fstat);
    n = fstat.st_size;
    s=(char*)malloc((n+1)*sizeof(char));
    memset(s, 0, (n+1)*sizeof(char));
    file_in = fopen(s_file, "rb");
    if(file_in == NULL){
        fprintf(stderr, "Can't open %s\n", s_file);
        exit(1)
    }
    else{
        fread(s, 1, n, file_in);
        fclose(file_in);
    }
                            

Assuming we want to conform to ANSI C89, which does not allow ad-hoc variable declarations, we still can do something with our meta-layer:


subcode:: declare_variables
    char * s_file
    struct stat fstat
    int n
    char * s
    FILE * file_in
subcode: read_in_file(file)
    s_file = $(file);
    stat(s_file, &fstat);
    ...
                            

The idea is to have concatenatable variable declaration blocks scattered where they are being used/referred.

We need some way to recognize that certain subcode should be concatenated (subcode declare_variables here), so I choose to use a double colon here. Again, ad-hoc, but simple and works.

The way we open file and then check its error is boilerplate code. They are necessary to function, but feels like noise or distraction anyway. -- "You know what I mean, now do it for me!" To answer, MyDef have implemented following functional style subcode:


subcode: open_r(file)
    file_in = fopen(s_file, "rb");
    if(file_in == NULL){
        fprintf(stderr, "Can't open %s\n", s_file);
        exit(1)
    }
    else{
        BLOCK
        fclose(file_in);
    }
subcode: read_in_file(file)
    s_file = $(file);
    stat(s_file, &fstat);
    n = fstat.st_size;
    s=(char*)malloc((n+1)*sizeof(char));
    memset(s, 0, (n+1)*sizeof(char));
    &call open_r, $(file)
        fread(s, 1, n, file_in);
                            

MyDef use &call to understand that there is indented code below to be inserted into the middle of the block.

When you program for long, you'll have lots of boiler plate code -- a set of your own slangs. It is only natural to tuck them into your tool box. Let's save open_r as well as file_stat into c/file.def, then we can simply $call them:


include: c/file.def
page: test
    ...
    $call file_stat, s_file
    n=$(file_size)
    ...
    &call open_r, s_file
        fread(s, 1, n, file_in);
    ...
                            

In fact, now the code is so succinct, I felt to be OK to leave them in the main code rather than separate them out.

In a same fashion, we can have trivial cases like:


&call while, $(regex_match)
    ...
                            

So we save the curly braces. Whether it is worthwhile is up to you.

MyDef -- Level 2

I hope you are following along so far. (If you are still looking for a tutorial on PCRE, I hope you realize now that this is not it.) The basic idea is to have this flexible meta-layer at your disposal so that you can achieve anything you fancy (or seriously demand). The general macro features described at level 1, although they are of the most important functions -- organizing code in the best semantic way -- they fall short to achieve high-level syntax. To reach that, we need two additional layers of customizations. At level 2, we will write plug-in code. At level 3, we'll write native output modules.

The plug-in code is still of ad-hoc nature, where we simply write custom translation routine in the code that implements our meta-layer -- Perl, in the case of MyDef. In the output module, we implement code that are specific to the language we are programming, and often take on a more systematic approach. In our example, it is C, and we have output_c.pm. In output_c module, we develop facilities to manage include files, variables, memory allocations, as well as curly braces and semicolons, etc. It will be difficult to describe either layer in detail because they inevitably requires knowledge in all aspects of MyDef, which, unless you are already sold, will have little interest in. So here I'll just do a whirlwind tour to illustrate the key ideas.

Level 2 plug-in allows for implementations of arbitrary syntax. Of course we are not talking about whole program-wise complicated syntax such as the entire C syntax set. Rather, we are talking about small domain-specific syntax, such as SQL, Regex, or the peculiar ones such as Einstein Notations, or sequence definition syntax.

We should not make these arbitrary syntax mix with global syntax. For example, if we want to introduce a template syntax or class syntax to C and have it blend with all other C syntax, not only it will be difficult to realize, but also the user will find the language become more complex. To accommodate arbitrary DSL, the solution is to have all the DSL syntaxes conform to a simple macro syntax. In MyDef, we require all special custom syntax start with $, followed by a verb. For level 2 plug-in syntax, we further require it all start with $call and then the verb. With this convention (more or less arbitrarily defined), custom syntaxes become very easy for programmer to handle: whenever we see the leading $, we understand this is a custom syntax, and the verb that follows defines the context of the syntax. It turns out we are excellent in handling this type of complexity -- think about the amount of vocabulary we handle daily.

In practice, we also embrace ambiguity. Within the narrow context of local DSL, it often become easy to make guesses of unfamiliar syntaxes. For example:


$while condition
    block of code
                            

$call define_sequence fib: 0, 1, a0+a1
                            

$if s=~/(while|for)/
    $call regex_capture, s_name
    $print $s_name loop!
                            

All above code works in my C programs. Do you have trouble understand them?

Well, maybe the first time it is more like a CogAt test, but do you think you will continuously having trouble with these syntaxes?

Oh, sorry, my writing style is like my life -- ad-hoc. Let me wrap this section with relevant level 2 code snippet (hint, skip it):


#---- file c/regex.def -------------------------------
include: c/string.def
subcode:: _autoload
    $plugin(regex) pcre_condition
    $plugin(regex) pcre_statement
    $plugin(regex_capture) pcre_plugin_capture

#=========================================================
perlcode: pcre_statement
    $call pcre_plugin_match, statement

perlcode: pcre_condition
    $call pcre_plugin_match, condition

#----------------------------------------
subcode: pcre_plugin_match(condition)
    my ($str, $pattern, $option)
    $if $param=~/^(\S+)\s*=~\s*\/(.*)\/(.*)/
        ($str, $pattern, $option)=($1, $2, $3)
    #----------------------------------------
    $if $pattern
        $pattern=~s/\\/\\\\/g
        $pattern=~s/"/\\"/g
        $cur_function->{regex_idx}++
        my $idx=$cur_function->{regex_idx}
        MyDef::compileutil::set_current_macro("re_var", $str)
        MyDef::compileutil::set_current_macro("re_pat", $pattern)

        my $pos=0
        $if $option=~/g/
            func_add_var("int tn_pcre_pos=0")
            $pos="tn_pcre_pos"
        $if $option=~/i/
            $option='PCRE_CASELESS'
        $else
            $option="0"
        MyDef::compileutil::set_current_macro("re_opt", $option)

        MyDef::compileutil::call_sub("pcre_match_compile, re$idx")
        my $var=find_var($str)
        my $strlen="strlen($str)"
        $if $var and $var->{strlen}
            $strlen=$var->{strlen}
        my $capsize=60
        func_add_var("int pn_pcre_off[$capsize]")

        $(if:condition=condition)
            $condition="pcre_exec(re$idx, NULL, $str, $strlen, $pos, 0, pn_pcre_off, $capsize)>0"
            $if $pos
                $condition.=" && ($pos=pn_pcre_off[1])"
        $(else)
            push @$out, "pcre_exec(re$idx, NULL, $str, $strlen, $pos, 0, pn_pcre_off, $capsize);"

#--------------------------------------------
#-- compile $(re_pat)
subcode: pcre_match_compile(re)
    $include pcre
    $uselib pcre
    $(block:fn_init)
        $local pcre * $(re)
        $set_var_attr $(re), exit=free
        $local const char * pcre_err_msg, int pcre_err_pos
        $(re)=pcre_compile("$(re_pat)", $(re_opt), &pcre_err_msg, &pcre_err_pos, NULL)
        $if !$(re)
            $print pcre_compile error: $(re_pat)

#=========================================================
perlcode: pcre_plugin_capture
    $if $param=~/(.*?)\s*=>\s*(.*\d)/
        my ($g1, $g2)=($1, $2)
        my @names=split /,\s*/, $g1
        my @indexes=split /,\s*/, $g2
        $for $i=0:@names
            my $name=$names[$i]
            my $i0=$indexes[$i]*2
            my $i1=$i0+1
            MyDef::compileutil::call_sub("pcre_capture, $name, $i0, $i1")
    $else
        my @names=split /,\s*/, $param
        $for $i=0:@names
            my $name=$names[$i]
            my $i0=($i+1)*2
            my $i1=$i0+1
            MyDef::compileutil::call_sub("pcre_capture, $name, $i0, $i1")

# --------------------------------------------
subcode: pcre_capture(name, i0, i1)
    $(set:off=pn_pcre_off[$(i0)])
    $(set:size=pn_pcre_off[$(i1)]-pn_pcre_off[$(i0)])
    $my char * $(name)
    $(name)=$(re_var)+$(off)
    $set_var_attr $(name), strlen=$(size)
                            

Don't worry if you can't understand the code. It requires reading documentation (which I haven't written, but once you do, you'll see it was actualy straight forward). For the purpose of this article, I just want to impress you with the basic idea how it works.

When we define the code using perlcode (instead of subcode), the code is translated as Perl code (the native code that implements MyDef), and run using Perl's eval, which translates the custom syntax into desired output. Since it is code, we can translate anything into anything -- in principle.

As complex as it seems, the above code did not nearly capture all perl regex options. Perl regex is truely a monster! In here, I choose to follow the 80-20 rule: use 20% effort to facilitate 80% of life. As time goes on, I may choose to make above code even more complex, but I'll adopt lazy evaluation strategy here -- wait until demands accumulates.

MyDef -- Level 3

Well, I already explained what is level 3. For C, this is MyDef/output_c.pm, and it is the -mc in the mydef_page command line we seen earlier. Like layer 2 or layer 3, this layer can be very simple. For example, you can use the general output module to write any code (where you only have level 1 and level 2 facility). In the Perl output module, it is also very thin except $for, $if/$elif/$else, $while, and $foreach to make braces and semicolons optional. In the case of C, I have made a monster in the C module. However, if you happen to enjoy vanilla C, you may opt to skip it. In fact, optional is the dominant character. You can simply write C statements with semicolons at the end and MyDef will pass them through. However, I find myself lately less and less require that pass through feature.

Obviously I can't explain my entire MyDef code base here. Let me present you the final code for our sample program:


include: c/files.def
include: c/regex.def

page: test, basic_frame
    s_file="t.html"
    $call stat_file, s_file
    n=$(fsize)
    $local_allocate(n+1, 0) s

    &call open_r, s_file
        fread(s, 1, n, file_in)

    $while s=~/class="title [^"]*" href="(http[^"]*)"/g
        $regex_capture ts_url
        $print link: $ts_url
                            

That's it. It compiles into exactly the same code as I shown at the very top. I didn't change a bit, only refactoring.

Yes, it does some type inference, simple scope based garbage collection, auto include standard c header files. Not much, just the 20% effort to facilitate 80% of life.

Now it is easier to read and write (and maintain), don't you think? (And for those who TL;DR, their loss ;) )

Let me wrap up by testing it:


$ mydef_page -mc test.def && gcc -o test test.c -lpcre && ./test
PAGE: test
    --> [./test.c]
link: http://bigstory.ap.org/...
link: http://i.imgur.com/...
...
                            

PS

If you have been a professional programmer for long, you may have frown a lot every time I use "ad-hoc". In corporate, we want robustness, we want stability, we want well-designed solutions...

Well, firstly, we need remember with my "ad-hoc" meta-layer solution, we do not get worse off than the original C code that we would write directly. Rather, we are constantly refining the code into a much maintainable or higher-level form, yet it still allows lower level tuning. Secondly, real life is of ad-hoc nature. And this is my understanding of programming: programming is not the solution; programming is the process of working out solution, and a process of refining solution.

And I argue that ideal libraries does not exist. The current obsession of relying on libraries encourages a breed of programmers who only know how to use the library but have no idea how the library works. I would like to promote a new culture, where each programmer maintains his own libraries. Or at least have several layers of library where each programmer choose the number of layers to customize and maintain on their own.

We all do this in our real life. We maintain our own dictionary, own conventions, preferences, and habits.

-- Wouldn't that result in "re-inventing" the wheel?

What do you think the wheel means? A physical wheel on your bike or an idea?

-- How would it scale in industry?

Probably can't. But I think we need both, the conveyor belt workers and engineers. At different stage we may need different proportion.

[ Try MyDef Demo ]

[X]

Recent Posts