Abstract Syntax Tree

complexity of compiler design
  1. Introduction: Writing Parser Imperatively
  2. Parsing CSV: Imperative Lexer
  3. Parsing HTML: Parsing Stack
  4. Expression: Operator Precedence Based Parsing
  5. REPL, Brackets, and Variables
  6. Statement: Top-down Context Based Parsing
  7. Full Javascript Parser - Abstract Syntax Tree

An abstract syntax tree (AST) is a term in computer science for a tree structure representing the syntactic structure of a source code. Any variable types that itself can hold a list of children variables are suitable for representing AST. In python, all variables are dynamic typed and can be either a scalar or list (array or tuple), which is perfect for the purpose. In this chapter, we'll rewrite our parser to return AST. All the parsing logic will remain the same. All we need change is on what to do when we reducing the parsing stack. For example, previously, when we reduce an operator "+", we reduce it by actually add the two numbers and reduce into

(value, "num").

This time, we will simply reduce it to

( ( (num1, "num"), (num2, "num") ), "+").

It is visually more complex, but not necessary in concept once we understand the principle: each node in the AST is a tuple of value and a type (in string) to tell what node it is. Depend on the node type, the value can be either a tuple (or array) holding more nodes or some simple value if it is simple type like "num". In fact, we'll expand our primitive "num" type into more varieties including "int", "float", "string". We will also have types for compound Javascript data such as "array", "object" and "function". Since we are simply using literal string for type specification, we may even add ad-hoc new types as we need.

There is no standard way to design AST nodes. For example the "+" node we mentioned early, alternatively, we may represent it as:

( ( "+", (num1, "num"), (num2, "num")), "binary" ) .

It is slightly more complex but it is more convenient for later evaluation or processing as we may have a common routine for all "binary" operations rather than having to differentiating "+" or any other myriads of operators.

Almost all parser implementations out there produce AST. If you want to be general and do not know what to do with particular token nodes immediately, AST or similar kind is the only approach. However I would like to point out that for particular applications where we know what to do with each syntax node specifically, going through AST is often just one-extra layer of complexity. That is why in this series I tried to delay the introduction of AST as late as I can by sticking to the numerical calculator context. If we just want a numerical answer and we don't want to keep the code (implementing functions), then reducing each node into a number is much simpler than producing AST which we have to differentiate all types and later evaluating the AST which need implement evaluation code on a case by case bases all over again. Many other simple parser applications such as code formatter/styler, syntax highlighter, simple code analyzer (such as counting functions or searching symbols) all can be directly implemented in the parser without producing AST at all. Of course AST is important. For complex projects such as optimizing compiler or full scale static analyser, where the logic is too complex to fit within the parser comfortably and potentially requires several passes of processing, producing AST would be a wise decision.

Expression Nodes

The code for reducing operators actually get much simpler now. We delay the logic of individual operators and simply differentiate types based on number of operands:


#---- parse_javascript.def: reduce ----
    subcode: reduce
        # - group operators ----------
        $if $(type:-2) == 'boc'
            $call @reduce_context
            break
        $elif $(type:-2) == '('
            cur = stack[-1]
            stack[-2:]=[]
            break
        $call @reduce_other
        # - normal operators ----------
        $elif $(type:-2) == ','
            $call reduce_list
        $elif $(type:-2) == "unary"
            t = ( ($(atom:-2), stack[-1]), "unary_exp")
            stack[-2:]=[t]
        $elif $(type:-2) == ':' and len(stack)>5 and $(type:-4)=='?'
            t = ( (stack[-5], stack[-3], stack[-1]), "conditional_exp")
            stack[-5:]=[t]
        $else
            t = ( ($(type:-2), stack[-3], stack[-1]), "binary_exp")
            stack[-3:]=[t]
                            

Postfix operators are processed at matching stage with a stub reduce_postfix:


#---- parse_javascript.def: reduce postfix ----
    subcode: reduce_postfix
        stack[-1] = ( (op, stack[-1]), "postfix_exp")
                            

List operators (comma) can be treated like binary operators; but semantically, they are more like arbitrary-operand cluster. To emphasize semantics, let's treat it differently:


#---- parse_javascript.def: reduce list ----
    subcode: reduce_list
        t=stack[-3]
        $if t[1]=="list_exp"
            t[0].append( stack[-1] )
        $else
            t = ( [stack[-3], stack[-1]], "list_exp")
        stack[-3:]=[t]
                            

Let's give it a quick test. Review the main code:


#---- parse_javascript.def: main ----
page: parse_javascript, basic_frame
    module: python

    &call quote, s
        a=1
        1+2*3;
        !a?2:3

    print parse_javascript(s)
                            

Let's try run it:


$ mydef_run parse_javascript.def
PAGE: parse_javascript
    --> [./parse_javascript.py]
python ./parse_javascript.py
(('=', ('a', 'identifier'), (1.0, 'num')), 'binary_exp')
(('+', (1.0, 'num'), (('*', (2.0, 'num'), (3.0, 'num')), 'binary_exp')), 'binary_exp')
(((('!', ('a', 'identifier')), 'unary_exp'), (2.0, 'num'), (3.0, 'num')), 'conditional_exp')
                            

It works, albeit a bit messy. We probably need a pretty printing function to make debugging more pleasant, but that is another topic.

More data types

Previously, to keep things simple, we only dealt with number literals. Now that we adopted AST, we can easily include all types of Javascript literals (this is because we don't have to decide on what to do with them). Let's recognize some more data types that is intrinsic in Javascript.


#---- parse_javascript.def: match primary ----
    subcode:: match
        $if $(type:-1) not in atom_type
            #- keyword data -----------
            &call if_match_break, true|false
                cur = (m.group(0), "bool")
            &call if_match_break, null
                cur = (m.group(0), "null")

            #- string ------------
            &call if_match_break, \"(([^\\\"]|\\.)*)\"
                s=js_double_string(m.group(1))
                cur = (s, "string")
            &call if_match_break, '(([^\\']|\\.)*)'
                s=js_single_string(m.group(1))
                cur = (s, "string")
            &call if_match_break, /([^\\/]|\\.)+/[igm]*
                cur = (m.group(0), "regex")

            #- numbers ------------
            &call if_match_break, r"0x[0-9a-f]+", re.I
                cur = (int(m.group(0), 16), "int")
            &call if_match_break, r"0[0-7]*", re.I
                cur = (int(m.group(0), 8), "int")
            &call if_match_break, (\d+)(\.\d+)?([eE][+-]?\d+)?
                $if m.group(2) or m.group(3)
                    cur = (float(m.group(0)), "float")
                $else
                    cur = (int(m.group(1)), "int")
                            

Note that we put the code block under a context of not following an atom token. In classical declarative lexing, introduction of context is an extra layer of complexity; but in imperative parsing, context simplifies the complexity. The introduction of context in imperative parsing does not introduce any changes else where. As a contrast, in declarative style, context means a full set of rules for every context. In imperative parsing, the introduction of context narrows down the possibility and simplifies logic. Here, as we putting the code under the context of not following another atom, we free our worry of potential collisions with an operator. In fact there is one if we don't use the context -- the regex syntax starts with a /, which is also used for the operator of division. If we don't narrow down the context, we will gobble up the entire source code between two division operators as if a big regex.

In fact in the first version, I didn't narrow down the context, then it quickly reveals as soon as I test my code against some real world code (I tested against jquery.min.js). It raises the exception telling me that "two adjacent atoms were not allowed". The fix is quick as soon as I sees the offending atom -- a half screen long "regex"! Just narrow down the context and we fixed it.

The common languages of today are designed with context free in mind. They do introduce ambiguity from time to time (like the division and regex) but typically limited. I think if the designer of the languages had imperative parsing in mind and do not associate context with cost, they will introduce more context dependent features where essentially the language will be made up with many domain specific sub-languages. The over all language model will be simpler as the sub-language is confined within context and thus does not need to be concerned with collision to other contexts; and the global language can be simpler as well, because it implements most of the desired feature to various sub-languages and globally can have less features. To an extreme, think about LISP, its global syntax is just S-expression and it allows arbitrary macros to realise specific yet powerful features within S-expression. Think about HTML, its global syntax is just opening and closing of tags, with a lot of specific tag dependent syntax within. But both cases are extreme. If we allow a more complex global syntax (statement and general operator precedence) and also allow multitudes of context for unique features, we may arrive at a language that both are simple and powerful to use.

We need translate from literal string to python's native values. For integer and floating point, we can do it directly with int and float built-in functions. Strings are a little trickier. Note that that \n is literally "\\n" (a slash followed with letter n) instead of newline. If Python and Javascript uses exact same string escaping rule, then we can simply pass it to python eval and let python handle them for us. Alas, there is always slight variations concerning with extra back slashes. To keep it simple (rather because of laziness), I used some hacks here:


#---- support.def: unescape literal strings ----
fncode: js_double_string(s)
    r_esc_extra = re.compile(r"\\([^0btnvfrxuX'\"\\])")
    s = r_esc_extra.sub(r'\1', s)
    return eval('"'+s+'"')

fncode: js_single_string(s)
    #--TODO: detect extra escapes e.g. \#
    r_double = re.compile('(?!\\)"')
    s = r.sub('\\"', s)
    return eval('"'+s+'"')
                            

Ideally, I would like to simply write a string parser. It is trivial, but I am concerned with Python's string being immutable. In the process of building up strings, I may have to build up many copies of partial strings. It may not be a big issue, but enough to keep me feel icky.

Brackets -- Function Call, Index, and Composite Literals (JSON?)

Javascript also specifies multitude of use of brackets (including parentheses, square brackets, and curly braces), and they may have different meanings. Parentheses may be simple grouping to override precedence or may be a function call; square brackets could be indexing an element or could be an array literal; curly braces could be the introduction of a compound statement or literals for an object. However, the logic to handle them are essentially the same as basic parentheses, so we know how to handle them. With multitude of contexts and in particular, different contexts potentially requiring different token parsing and error checking (e.g. "colon" in conditional expression and object literal), we will need a state variable to keep track:


#---- frame.def: context ----
    subcode:: setup
        context_stack = []
        cur_context = {'type':"global", 'statements':[]}
        last_context= None

    subcode: start_context(type, expect)
        context_stack.append: cur_context
        cur_context = {"type":"$(type)", "expect":"$(expect)"}
        stack.append: ("$(type)", "boc")

    subcode: restart_context(type, expect)
        # e.g. if (cond_context) if_block context 
        context_stack.append: cur_context
        cur_context = last_context
        cur_context["type"]   = "$(type)"
        cur_context["expect"] = "$(expect)"
        stack.append: ("$(type)", "boc")

    subcode: pop_context
        last_context= cur_context
        cur_context = context_stack.pop()
                            

All contexts are marked by a boc token having precedence of zero. It will be reduced by any operator that is also having a precedence of zero -- the closing bracket. Ideally, all brackets are matched so the right closing bracket will reduce the right boc token; but mistakes happens and we want to detect them. Therefore, each context also remembers which operator it is expecting as its closing one.

Now we are ready to handle the brackets:


#---- parse_javascript.def: match brackets ----
    subcode:: match
        &call if_match_continue, [\[\(\{]
            op = m.group(0)
            $if $(type:-1) in atom_type
                $if op == '('
                    $call start_context, fcall, )
                $elif op== '['
                    $call start_context, index, ]
                $else
                    $call error, "error opening bracket '"+op+"', forgot ';'?"
            $else
                $if op == '{'
                    $call start_context, object, }
                $elif op == '['
                    $call start_context, array, ]
                $else # plain old parenthesis!
                    stack.append: ('(', '(')

        &call if_match_break, [\]\)\}]
            op = m.group(0)
            cur = (op, ')')
            #- operators need to be separated by atom
            #-   empty brackets is a problem. Hack it.
            $if $(type:-1)=='(' or $(type:-1)=="boc"
                stack.append: ('', '')
                            

That is how we start the context. We wrap the context at closing brackets (review here):


#---- parse_javascript.def: reduce context ----
    subcode: reduce_context
        $if cur_context['expect']==cur[0]
            t_op      = cur
            t_context = $(atom:-2)
            t_atom    = stack[-1]
            cur = None
            stack[-2:]=[]
            $call pop_context
            $if 0
                pass
            $call @reduce_context_cases
        $elif cur_context['expect']==';'
            #- Javascript allows omitting ; before closing } (and before ')' in the "for" statement)
            #-   so we supply one here
            src_pos-=1
            cur=(';', ';')
        $else
            $call error, "Brackets mismatch, expect "+cur_context['expect']+", got "+cur[0]

    subcode:: reduce_context_cases
        $elif t_context == "fcall"
            stack[-1]=( (stack[-1], t_atom), "fcall")
        $elif t_context == "index"
            stack[-1]=( (stack[-1], t_atom), "index")
        $elif t_context == "array"
            $if t_atom[1]==""
                t_atom=[]
            cur = (t_atom, "array")
        $elif t_context == "object"
            cur = (t_atom, "object")
                            

Inside the array and object literal, the list is built up by the comma operator. The special case is when the brackets are empty. At matching, we inserted an "sentinel", ("", ""), which need be converted for various contexts. For array and objects, it is converted to empty lists. For object literal, we may opt to build up a dictionary (hash) rather than a list. Here we choose lists for simplicity and generality. For applications that want to do pretty printing or re-formating, the order of the object keys are important.

Oh, we still have one piece missing -- the colon in the object literal. Currently, it will be treated as a binary operator. That works. However, there are certain restrictions to the object literal syntax and the colon operator should only appear in a object context and after a string or an identifier; and the semantic meaning is an attribute rather than an operation. So it makes sense to do some special treatment here.


#---- parse_javascript.def: reduce object_item ----
    subcode:: reduce_other
        $elif $(type:-2)==':' and cur_context["type"]=="object" and ($(type:-4)==',' or $(type:-4)=="boc")
            # check $(type:-3) is identifier or string
            t_name = $(atom:-3)
            t = ( (t_name, stack[-1]), "object_item")
            stack[-3:]=[t]
                            

Here we pretty much parses the popular JSON as well. JSON is more restrictive than Javascript composite literals, so we may need more sanity checking. And since JSON is a pure data format, we probably want to directly build a dictionary rather than a list of object-items.

Statement and Compound Statement

We still haven't handled compound statement yet. As we are building up AST, we probably shouldn't throw away statements. It is tempted to treat ';' similar to they way we treated ',' operator -- building up lists. Alas, semantically, semicolons are not joining lists. Rather, it is packing/wrapping statement. To illustrate the subtle difference, for comma, there is no action until a second item appear after and need to be joined. For semicolons, we often need do some side-effect kind of action for the statement we are wrapping. For example, in the if statement, the semicolon in the end wraps the branch statement and also need to close the if context. Therefore, we treat semicolons very differently than a normal operators. They are processed in the post_process part and never gets shifted into stack.

How are we going to pack statements? They can't be packed onto the parsing stack, as it will interrupt the rest of the shift-reduce logic. We have to pack it into the context.


#---- parse_javascript.def: post process ----
    subcode: post_process
        $if $(cur_type) == ';'
            stmt = stack.pop()
            got_statement(stmt)
            continue

    fncode: got_statement(stmt)
        nonlocal src_pos, cur_context
        #- because ';' has precedence of 1
        #--  on wrapping statement, $(type:-1) has to be 'boc'
        $if "statements" in cur_context
            cur_context['statements'].append(stmt)
        $elif cur_context["expect"]!=';'
            $call error, "Context mismatch, not expecting statement"
        $else
            t_context = stack[-1][0]
            stack.pop()
            $call pop_context
            $if 0
                pass
            $call @statement_context_cases

        # print stmt
                            

And compound statement is handled as:


#---- parse_javascript.def: compound statement ----
    subcode:: reduce_context_cases
        $elif t_context =="compound"
            last_context["statements"].append(t_atom)
            stmt = (last_context["statements"], "compound")
            got_statement(stmt)
                            

But wait, we haven't covered where to start the compound statement yet. Compound statement starts with opening curly brace, which is the same as an object literal. In fact, this is very ambiguous and we need differentiate the context. Compound statement, as a statement, should only appear right after a boc token; but not all boc tokens, only those expecting statements (by this, we also forbid object literals to appear right after a statement boc):


#---- parse_javascript.def: match statements ----
    subcode:: setup
        non_stmt_boc = ["fcall", "index", "object", "array"]
        non_stmt_boc.extend: ["if_cond", "while_cond", "switch_cond", "with_cond", "catch_cond", "function_param"]

    subcode:: match
        # All statements start at 'boc'
        $if $(type:-1) == 'boc' and $(atom:-1) not in non_stmt_boc
            &call if_match_continue, {
                $call start_context, compound, }
                cur_context["statements"]=[]
            $call @match_statements
                            

We also place a stub for parsing all other statements. As a side effect, our parser will parse keyword in a non-statement context (parts of expression) fine. Some of the statement can only be parsed within the statement context. For example, the labeled statement, how is it possible to differentiate after a '?' operator? The statement context, in fact, the introduction of statement at all (against expression), allows more flexibility in creating the statement syntax. An expression is expected to appear anywhere and therefore has to be unambiguous globally -- more feature generally means more odd symbols. A statement on the other hand, creates its own unique context that free from the global context (other than the context boundaries -- the being of context keyword and end of context markers). Statements are essentially DSLs.

If, While, and the Other

We are getting a bit impatient here, so let's put all the rest of the Javascript syntax in.


#---- statements.def: parse statements ----
fncode:: parse_javascript
    #-- common subcode ---------
    subcode: match_cond(keyword)
        $if_match $(keyword)
            $call skip_space_wide
            $if_match (
                BLOCK
            $else
                $call error, "$(keyword) missing ("

    #-- if  ----------------
    subcode:: match_statements
        &call if_match_continue, if
            $call skip_space_wide
            $if_match (
                $call start_context, if_cond, )
                #- branch_list for
                #-    if cond1: block1
                #-    elif cond2: block2
                #-    ...
                #-    else (None): blockN
                cur_context["branch_list"]=[]
            $else
                $call error, "if missing ("

    subcode:: reduce_context_cases
        $elif t_context =="if_cond"
            $call restart_context, if_block, ;
            cur_context["condition"]=t_atom

    subcode:: statement_context_cases
        $elif t_context =="if_block"
            t = (last_context["condition"], stmt)
            last_context["branch_list"].append(t)
            $if_match else
                $call skip_space_wide
                &call match_cond, if
                    $call restart_context, if_cond, )
                $else
                    $call restart_context, if_block, ;
                    cur_context["condition"]=None
            $else
                stmt = (last_context["branch_list"], "if")
                got_statement(stmt)

    #-- while ----------------------------------
    subcode:: match_statements
        &call match_cond, while
            $call start_context, while_cond, )
            continue

    subcode:: reduce_context_cases
        $elif t_context == "while_cond"
            $call restart_context, while_block, ;
            cur_context["condition"]=t_atom

    subcode:: statement_context_cases
        $elif t_context == "while_block"
            stmt = ( (last_context["condition"], stmt), "while")
            got_statement(stmt)

    #-- do while -----------------------------
    subcode:: match_statements
        &call if_match_continue, do
            $call start_context, do_block, ;

    subcode:: statement_context_cases
        $elif t_context == "do_block"
            &call match_cond, while
                last_context["block"]=stmt
                $call restart_context, do_cond, )
            $else
                $call error, "do missing while"

    subcode:: reduce_context_cases
        $elif t_context == "do_cond"
            stmt = ( (t_atom, last_context["block"]), "do_while")
            $call skip_space_narrow
            $if_match [;\r\n]+
                got_statement(stmt)
            $elif src[src_pos]=='}'
                got_statement(stmt)
            $else
                $call error, "unterminated do-while statement"

    #-- for ------------------------
    subcode:: match_statements
        &call match_cond, for
            $call start_context, for_cond, )
            #- for conditions are made up from 3 statements
            #-    or single statement (for/in)
            cur_context["statements"]=[]
            continue

    subcode:: reduce_context_cases
        $elif t_context == "for_cond"
            $call restart_context, for_block, ;
            t_cond_list = cur_context["statements"]
            t_cond_list.append: t_atom
            cur_context["condition"]=t_cond_list
            del cur_context["statements"]

    subcode:: statement_context_cases
        $elif t_context == "for_block"
            stmt = ( (last_context["condition"], stmt), "for")
            got_statement(stmt)

    #-- with ----------------------------------
    subcode:: match_statements
        &call match_cond, with
            $call start_context, with_cond, )
            continue

    subcode:: reduce_context_cases
        $elif t_context == "with_cond"
            $call restart_context, with_block, ;
            cur_context["condition"]=t_atom

    subcode:: statement_context_cases
        $elif t_context == "with_block"
            stmt = ( (last_context["condition"], stmt), "with")
            got_statement(stmt)

    #-- switch ------------------------
    subcode:: match_statements
        &call match_cond, switch
            $call start_context, switch_cond, )
            continue

    subcode:: reduce_context_cases
        $elif t_context == "switch_cond"
            $call restart_context, switch_block, ;
            cur_context["condition"]=t_atom

    subcode:: statement_context_cases
        $elif t_context == "switch_block"
            stmt = ( (last_context["condition"], stmt), "switch")
            got_statement(stmt)

    #-- try/catch/finally ---------------
    #- try  ----------------
    subcode:: match_statements
        &call if_match_continue, try
            $call skip_space_wide
            $if_match {
                $call start_context, try_block, }
                cur_context["statements"]=[]
            $else
                $call error, "try missing {"

    subcode:: reduce_context_cases
        $elif t_context =="try_block"
            last_context["statements"].append(t_atom)
            last_context["try_block"]=last_context["statements"]
            del last_context["statements"]
            last_context["catch_cond"]=None
            last_context["catch_block"]=None
            last_context["finally_block"]=None
            &call match_cond, catch
                $call restart_context, catch_cond, )
            $else
                stmt = ( last_context, "try")
                got_statement(stmt)
        $elif t_context == "catch_cond"
            last_context["catch_cond"] = t_atom
            $if_match {
                $call restart_context, catch_block, }
                cur_context["statements"]=[]
            $else
                $call error, "catch missing {"
        $elif t_context == "catch_block"
            last_context["statements"].append(t_atom)
            last_context["catch_block"]=last_context["statements"]
            del last_context["statements"]
            $if_match finally
                $call skip_space_wide
                $if_match {
                    $call restart_context, finally_block, }
                    cur_context["statements"]=[]
                $else
                    $call error, "finally missing {"
            $else
                stmt = ( last_context, "try")
                got_statement(stmt)
        $elif t_context == "finally_block"
            last_context["statements"].append(t_atom)
            last_context["finally_block"]=last_context["statements"]
            del last_context["statements"]
            stmt = ( last_context, "try")
            got_statement(stmt)

    #-- var -------------------
    subcode:: match_statements
        $if_match var
            $call start_context, var, ;
            continue

    subcode:: statement_context_cases
        $elif t_context == "var"
            stmt = (stmt, "var")
            got_statement(stmt)

    #-- return/throw/break/continue -------------------
    subcode:: match_statements
        $if_match (return|throw|break|continue)\b
            $call start_context, return, ;
            cur_context["name"]=m.group(1)
            cur_context["expect_;"]=1
            continue

    subcode:: statement_context_cases
        $elif t_context == "return"
            stmt = (stmt, last_context["name"])
            got_statement(stmt)

    #-- case labels ----
    subcode:: match_statements
        &call if_match_continue, default
            $call skip_space_wide
            $if_match :
                stmt = (None, "case")
                got_statement(stmt)
            $else
                $call error, "default label missing :"

        &call if_match_continue, case
            $call start_context, case, ;

    subcode:: match_colon_special
        $if cur_context["type"]=="case" and not $(type:-2)=="?"
            # functionally equivallent to ';'
            cur = (';', ';')

    subcode:: statement_context_cases
        $elif t_context=="case"
            stmt = (stmt, "case")
            got_statement(stmt)

    #-- lablel ----
    subcode:: match_statements
        &call if_match_continue, ($(pat_identifier)):
            $call start_context, label, ;
            cur_context["label"]=m.group(1)

    subcode:: statement_context_cases
        $elif t_context == "label"
            t = (last_context["label"], stmt)
            stmt = (t, "labelled_statement")
            got_statement(stmt)
                            

Last, we need to deal with functions. Functions in Javascript are expressions rather than statements, except when it is a declaration which should be a statement. This is in fact as ambiguous as the case for compound statements and object literals. And similarly, it can only be resolved by checking the statement-expecting boc context:


#---- statements.def: function ----
fncode:: parse_javascript
    subcode:: match_keyword
        $if_match function
            $call @check_statement_context
            $call skip_space_wide
            #- anonymous ---------
            $if_match (
                $call start_context, function_param, )
                continue
            #- named  ---------
            $if_match $(pat_identifier)
                name = m.group(0)
                $call skip_space_wide
                $if_match (
                    $call start_context, function_param, )
                    cur_context["name"]=name
                    $if statement_context
                        cur_context["declaration"]=1
                    continue
                $else
                    $call error, "function missing (parameters)"
        # -----------------------------
        subcode: check_statement_context
            statement_context=0
            $if $(type:-1) == 'boc' and $(atom:-1) not in non_stmt_boc
                statement_context=1

    subcode:: reduce_context_cases
        $elif t_context=="function_param"
            last_context["param"]=t_atom
            $call skip_space_wide
            $if_match {
                $call restart_context, function_block, }
                cur_context["statements"]=[]
            $else
                $call error, "function missing {"
        $elif t_context == "function_block"
            last_context["statements"].append(t_atom)
            #- Whether we treat it as a statement or expression matters!
            $if "declaration" in last_context
                stmt = (last_context, "function")
                got_statement(stmt)
            $else
                cur = (last_context, "function")
                            

That is it for the parser! Next, debugging and do something with the AST. We'll leave that to the next chapter.

Check out the code in repository - python_8

Is it Top-down or Bottom-up

Readers who went through compiler course may try to think the parsing technique either to be top-down or bottom-up. So how about our imperative parser? The way we use operator precedence directed shift-reduce is obviously a bottom-up technique. However, if we re-examine the statement-parsing code above, that is very much top-down driven. In imperative parsing, we are not confined by either top-down or bottom-up, we organize code according to the nature of the problem. In expression syntax, every part of expression can appear any where separated by arbitrary operators and parentheses, a bottom-up approach is the most natural method. If we try to write a recursive descend parser, with 14-levels operator precedence, we have to invent 14 non-terminal terms and we have to go through 14 layers of function abstraction even when we are just parsing a single integer! With bottom-up approach, an integer is an integer, and we only grow our parsing stack as the actual parsing problem implies, no more no less. On the other hand, for statements, each defines its own context. If we use bottom up method, we will be mixing all statement contexts into one global context, and we will be dealing with ambiguities and we will be forced with unnecessary look-ahead in the attempt to resolve ambiguities. With top-down approach, we isolate the context and avoids the entire layer of complexity that is invented to deal with ambiguities. In top-down imperative parsing, there is no ambiguities. The language we parse is defined by the parser. Of course the parser could be parsing a wrong language that does not meet our intention. To write the correct parser, the right order of lexing and reduction and the right amount of context confinement may be necessary. It is an art rather than a science. But the problem we are dealing with -- languages, is an art. Programming is an art.

  1. Introduction: Writing Parser Imperatively
  2. Parsing CSV: Imperative Lexer
  3. Parsing HTML: Parsing Stack
  4. Expression: Operator Precedence Based Parsing
  5. REPL, Brackets, and Variables
  6. Statement: Top-down Context Based Parsing
  7. Full Javascript Parser - Abstract Syntax Tree

[ Try MyDef Demo ]

[X]

Recent Posts