Parsing HTML

complexity of compiler design
  1. Introduction: Writing Parser Imperatively
  2. Parsing CSV: Imperative Lexer
  3. Parsing HTML: Parsing Stack
  4. Expression: Operator Precedence Based Parsing
  5. REPL, Brackets, and Variables
  6. Statement: Top-down Context Based Parsing
  7. Full Javascript Parser - Abstract Syntax Tree

Level up from CSV, let's look at HTML. HTML contains nested (or hierarchical) elements. Any nested syntax cannot be parsed with regular expression. They require a stack to keep track of the context.

Stack is a rather abstract term in computer science. In practice, it is often simply an array or a list and always being accessed at the end, either append (push) or pop.

Without further ado, let's start.


include: python/parse.def
                            

Here we included a library python/parse.def (it is in your $MYDEFLIB), where we stores some common macros. For example, we will use parse_loop such as:


while src_pos<src_len:
    ... ...
                                
It is not difficult to write such code directly, but as we repeat it again and again, we'll find using a nickname even more intuitive:

&call parse_loop
    ... ...
                                
Where we define parse_loop in python/parse.def

subcode: parse_loop
    $while src_pos<src_len
        BLOCK
                                
Not only we give the loop a semantic meaning, it also let us add or modify the implementation of the underlying mechanism -- such as insert include files or add more checking and debugging -- without affecting the higher level code. &call parse_loop means a parse loop, details may vary, but we can understand its meaning regardless of its details.


#---- html.def: page ----
include: python/parse.def

page: html, basic_frame
    module: python

    s = "<!DOCTYPE html><html><head><title>Test</title></head><body><h1>Test</h1><p>Hello World!</p></body></html>"
    page = parse_html(s)
    debug_dom(page["dom"])
                            

Before anything, we first setup the problem -- we'll implement a function, parse_html, which parses the HTML string into an object whose main element is the DOM tree.


#---- html.def: parse_html ----
fncode: parse_html(src)
    tag_stack=[]
    cur_list=[]
    tag_stack.append: {'_name':'root', '_list':cur_list}

    $call parse_it

    return {"dom":tag_stack[0]}
                            

The key variable here is the tag_stack; it establishes the context during the parse loop. Whenever we encounter an opening tag, we'll push the current context onto the stack; when we meet the end tag, we restore the previous context by popping off the stack. In HTML, each tag (an element) may contain a list of child nodes, we use variable cur_list to hold the list as we parse.


#---- html.def: parse loop ----
    subcode: parse_it
        &call parse_loop
            $if_match <!--.*?-->
                #- comment ----
                continue
            $if_match r"<!DOCTYPE.*?>", re.I
                #- doctype ----
                continue
            $if_match [^<]+
                #- text node ----
                cur_list.append: m.group(0)
                continue
            $if_match <(\w+)(.*?)>
                #- start tag ----
                s_name=m.group(1)
                s_attr=m.group(2)
                $call match_start_tag
                continue
            $if_match </(\w+).*?>
                #- end tag ----
                s_name=m.group(1)
                $call match_end_tag
                continue
                            

HTML is designed to be easily parsed. It is either opening tag, end tag, or text content (or comment and doctype declaration). It is more complex at start tag, as we need parse attributes and handle some tags that do not have end tags.


#---- html.def: start_tag ----
    subcode: match_start_tag
        tag = {'_name':s_name}
        parse_attribute(tag, s_attr)

        #- append to the child node list 
        cur_list.append: tag

        $if '/' in tag or re.match(r"area|base|br|col|embed|hr|img|input|keygen|link|meta|param|source|track|wbr", s_name, re.I)
            #- self closed
            pass
        $elif re.match(r"script|style|textarea|title", s_name, re.I)
            #- raw text: let's grab it
            $call scan_raw_text
            $(if:0)
                $if s_name.lower() == "style"
                    $call @parse_style_sheet
                $elif s_name.lower() == "script"
                    $call @parse_script
        $else
            #- for the rest, push a new context
            tag_stack.append: tag
            cur_list=[]
            tag['_list'] = cur_list

    #- end_tag ----------------------------
    subcode: match_end_tag
        $if s_name == tag_stack[-1]['_name']
            tag_stack.pop()
            cur_list=tag_stack[-1]['_list']
        $else
            #- error --
            #-    drop missed tags or do nothing
            j=len(tag_stack)-2
            while j>0:
                if s_name == tag_stack[j]['_name']
                    while len(tag_stack)>=j
                        tag_stack.pop()
                    cur_list=tag_stack[-1]['_list']
                    break
                j-=1
                            

The end tag is simpler, just pop the context; unless there is some kind of tag mis-matching error where we need exercise heuristic intelligent guessing. To keep it simple, let's just drop any missed tags.


#---- html.def: end_tag ----
                            

We are almost done. Here is how we handle the tags with raw text (script, style, textarea, title). They cannot have child nodes other than the raw content. For script and style tags, they will need separate parsers.


#---- html.def: raw_text ----
    subcode: scan_raw_text
        i_start=src_pos
        &call sub_loop
            $if_match ([^<]|<[^/])+
                #- grab until a clsoing tag
                continue
            $if_match </(\w+).*?>
                #- check whether it is the right closing tag
                $if m.group(1)==s_name
                    break
                $else
                    continue
            #- skip any unforseen weird character
            src_pos+=1

        i_end=src_pos
        $(export:raw_text=src[i_start:i_end])
        tag['_text'] = src[i_start:i_end]
                            

Tag attributes require a separate parser. Like CSV, only lexing is needed.


#---- html.def: parse_attribute ----
fncode: parse_attribute(tag, src)
    &call parse_loop
        $if_match \s+
            #- skip spaces and newline --
            continue
        $if_match /
            #- self closing --
            tag['/']=1
            continue
        $if_match ([^\s'\"\\=<>`]+)
            #- got a name --
            s_attr_name=m.group(1)
            $if_match \s*=\s*
                #- has value
                $call parse_attr_value
            $else
                #- no value (set to 1)
                tag[s_attr_name]=1
            continue

    #- value can also be complicated ...
    subcode: parse_attr_value
        &call sub_loop
            $if_match \"((?:[^\\\"]+|\\.)*)\"
                #- double quoted value
                tag[s_attr_name] = m.group(1)
                break
            $if_match '((?:[^\\\']+|\\.)*)'
                #- single quoted value
                tag[s_attr_name] = m.group(1)
                break
            $if_match [^\s'\"=<>`]+
                #- without quote
                tag[s_attr_name] = m.group(0)
                break
            #- skip any unforseen error --
            i+=1
                            

Last, don't forget some simple debug routine


#---- html.def: debug_dom ----
fncode: debug_dom(node)
    $def print_node(node, level)
        $if isinstance(node, str)
            print "    " * level, node
        $else
            print "    " * level, node['_name']
            $if "_list" in node
                $for t in node['_list']
                    print_node(t, level+1)

    print_node(node, 0)
                            

The code is available at github.

  1. Introduction: Writing Parser Imperatively
  2. Parsing CSV: Imperative Lexer
  3. Parsing HTML: Parsing Stack
  4. Expression: Operator Precedence Based Parsing
  5. REPL, Brackets, and Variables
  6. Statement: Top-down Context Based Parsing
  7. Full Javascript Parser - Abstract Syntax Tree

[ Try MyDef Demo ]

[X]

Recent Posts