Building a COOL lexical analyzer generator using JLex

The reason for making this post, is to give an overview of the procedure I followed while making a lexical analyzer generator for the COOL programming language, under this Stanford course.

CHEATING:

Not exactly, but that's how I made the lexical analyzer generator (LAG). Why you ask? Yeah, coming on that topic. What would you call copying someone else's code your starting point of work? PLAGIARISM!

Yes, I definitely did it.

I copied whole of the code from here. "Boy you're a bad one!" might be your line of choice, but I'd rather ask you to wait and listen, that when I tried to submit this code (not to skip the work, just curious about the accuracy of the solution), and got a score of 11 out of 63.

"What's a score?" you ask? Well, it tells us about the successful recognition of tokens in cool program files. So, a score of 11 out of 63 would tell that there are a lot of cool files whose tokens, weren't recognized properly.

This, was good. The effort put to make this score better made me understand a plethora of stuff, that I didn't know about earlier.

So, here I'll discuss the modifications I made to the code, and the reasons for doing so. Some are general explanations on the things that I think I should've known before.

STRINGS:

Oh dear. I love strings. They are so, curiously built, have so much inside them, still remain under a single name. Ohh the unity.

1. How to recognize Strings?
Well, we all know that a string starts with a quote, that is the '\"' character. So, you'll be needed to recognize that when in the initial (YYINITIAL) state, and then change the state to a state that represents a string only. We can call it the STRING state.
The initialization state and it's code looks like this :

<YYINITIAL> \"  { 
  string_buf.delete(0, string_buf.length());
  yybegin(STRING); 
  stringTooLong=false;
  nullInString=false;
  stringError=false;
  skip = false ;
 }

2. What to do when a string has just started?
You'll definitely want to save the string and return it when you recognize one. So, in order to do that, you have two options:

Use a String object, or
Use the String buffer.

We're more used to using the String class, but it has a problem, that it is immutable. We can't change the stuff that has gone inside the string. But wouldn't we have to just add characters as we see and then return the string when the ending quote is seen? No. I wish it was that easy.

We would have to use the String buffer, as we need to make changes to characters already added to the buffer (will explain later you hungry beast).

So, when the string recognition phase has just started, we'll initialize a string buffer.

string_buf.delete(0, string_buf.length());

Here, this initialization is to delete the previous string saved (if any) inside the buffer.

The next step is to change the state from the YYINITIAL to STRING. This can be done by :

yybegin(STRING) ;

Next is to check for a "null" in a string, or if the string is too long. If the previous string had a "null" in it then the boolean variable nullInString would be true. So when we see a new string, we'll change it back to false. Same for the variable stringTooLong.

There is one more variable skip inside the initialization, but we'll discuss it later.
3. What to do when inside the string?This one is pretty complex. We'll reach this stage after the current state (which is given by yy_lexical_state ), has changed from YYINITIAL to STRING state. This can be divided into three major parts:

The ending quote.
All the stuff inside the string, except the newline, or a quote
The newline

The newline is considered separately because in COOL, we need to escape the newline. So, extra measures have to be taken when we see a newline inside a string.

A. The ending quote has been recognized while traversing the string

<STRING> \" {
 // If an escape character appears before a quote, we must put a quote into the string
  if(stringTooLong) {
        yybegin(YYINITIAL) ;
 
    return new Symbol(TokenConstants.ERROR, "String constant too long");

  } else if(nullInString) {
        yybegin(YYINITIAL) ;
       return new Symbol(TokenConstants.ERROR, "Null character appeared in string");
  } else if(stringError) {
        yybegin(YYINITIAL) ;
    return new Symbol(TokenConstants.ERROR, "Error occurred while parsing string");
  } 
  

  if(!skip && string_buf.length()>0 && string_buf.charAt(string_buf.length()-1)=='\\') {
    string_buf.setCharAt(string_buf.length()-1, '\"');
  } else {
    yybegin(YYINITIAL); 
    return new Symbol(TokenConstants.STR_CONST, AbstractTable.stringtable.addString(string_buf.toString())); 
  }
 }

When the ending quote has been recognized, then we are sure that the current string has ended. So, if some error ocurred while the recognition of the string, then we can reject the string after raising an error.

Note that I will not discuss Symbol, TokenConstants, and AbstractTable here, (because I just don't understand them properly yet). They will be added as a footnote later or as another blog post, as soon as I understand them.

We'll have to change the current state to YYINITIAL in order to signify that what comes next is code not string.

This is done for all the errors that could be raised.

Now did you realize, that a quote can also come inside a string, when escaped. Boom. That was an increase in complexity. So, how will you know that the current quote is THE end quote, or just some normal escaped quote inside the string?

This can be done by checking the last entry of the string buffer. That is, checking the

string_buf.charAt(string_buf.length()-1)

for '\\'. If it is '\\' then we know that the quote is escaped, and thus it simply should be added to the string buffer, as '\"'. Note that we replaced '\\' by '\"' inside the String buffer. The input doesn't take "\n" as '\n' but as '\\' and 'n'.

But when we know that the current quote is not escaped and is not to be skipped, then we can safely conclude that the current quote is THE ENDING QUOTE. And we can safely go back to the YYINITIAL state and add the string to the AbstractTable.

What about "skip"?
Wait, until normal character recognition comes in. Just wait. Please.

B. All the stuff inside the string, except the newline, or a quote

<STRING> [^\n\"] { // Code for handling other characters encountered in strings
  // including special cases for escaped characters
  if(string_buf.length()==0) 
  {
    string_buf.append(yytext());
  } else 
  {
    
    // Check if null character appears in input stream
    if(yytext().charAt(0)=='\0' || nullInString) 
 {
      nullInString=true;
    } else
    {
      int length=string_buf.length();
      if(string_buf.charAt(string_buf.length()-1)=='\\') 
   {
  if(skip)
        {
   string_buf.append(yytext()) ;
   skip = false ;
        }
  else
       {
        switch(yytext().charAt(0))
  {
  case 'b':
    string_buf.setCharAt(length-1, '\b');
    break;
  case 't':
    string_buf.setCharAt(length-1, '\t');
    break;
  case 'n':
    string_buf.setCharAt(length-1, '\n');
    break;
  case 'f':
    string_buf.setCharAt(length-1, '\f');
    break;
  case '\\':
    string_buf.setCharAt(length-1, '\\');
    skip = true ;
    break;
  default:
    string_buf.setCharAt(length-1,yytext().charAt(0)) ;
  }
       }
 
      }
     else
  {
  string_buf.append(yytext());
     }
    }

  }
    if(string_buf.length()>=MAX_STR_CONST)
  {
    stringTooLong=true;
  }
 }

This looks scary. It isn't. Simply, check if the buffer is empty, if yes then add whatever is coming in. Then some error checking. Bleh. Boring.

Remember when I said that "\n" is seen as '\\' and 'n'? Yeah, now we'll have to do that for all escaped characters. The special ones like a backspace and tabs are like "\b" and "\t", so they're actually '\\' with 'b' and '\\' with 't'. So, if the last character seen is a '\\' then we'll have to handle this escaping procedure.

Note that we can't have a newline here, so we don't have to worry about it much.

By "handling this escaping procedure" I mean that we'll replace the seen '\\' with the correct characters, like '\t' and '\r' et cetera.

What's a 'n' doing in there? Aren't newlines not allowed?

Yeah but that isn't a newline! That is just "\n" in a string. A newline is \n or "
"

Pressing "Enter" creates a newline. Simply writing "\n" won't simulate it$^1$. Though we can explicitly tell the receiver that my "\n" means a newline, then only it can simulate a newline. Even though without the said agreement, \n has no real meaning of it's own.

So, we'll do the same for a "\n" as we did for "\t" or "\b".

Also note the '\c' has no special significance, and is equivalent to 'c'.

What if the last '\\' seen, itself is an escaped '\\'?

Boom. Boom. Boom. High complexity levels. Existential crisis in here.

Yeah, what will you do now? If we didn't do anything, then strings like "Hello \\\kitty", which actually is "Hello \kitty" is converted to "Hello kitty", removing the '\\' altogether. This happens in this sequence:

"\\" is converted to "\"
"\k" is converted to "k"

This is problematic. The solution that seemed apt to me was to skip the next character from this "if the previous character was '\\' check" altogether, signifying that the '\\' is not to escape anything at all.

So, skip is initialized to false. It is made true when a '\\' is seen after a '\\' and is made false again, when a character is skipped.

We must skip the conversion of a escaped newline into a newline, if the last '\\' doesn't signify the newline's escape. Same for a quote.

We also must re-initialize skip to false when we see a new string.

Everything else in the code is self explanatory and I will not bother to explain them.

C. A newline is seen

<STRING> \n { // Code for newline characters
  // If a newline appears in a string, it must be escaped, else error
  if(string_buf.length()==0 || skip || string_buf.charAt(string_buf.length()-1)!='\\') {
    curr_lineno++;
    yybegin(YYINITIAL) ;
    return new Symbol(TokenConstants.ERROR, "Unterminated string constant");
  } else {
    curr_lineno++ ;
    // Replace '\' character in string buffer with newline
    string_buf.setCharAt(string_buf.length()-1, '\n');
  }
 }

A newline is seen in the string. This is a rare and auspicious occasion. No. Not really.

We just need to check if the current newline is legit or not. If it isn't then we need to return an error, otherwise we add it to the buffer after making some modifications.

COMMENTS:

Comments in cool are of the form (*...*). They can be nested as well. Again when a '(' is seen before a '*' while in the initial YYINITIAL state, then we can say that a comment has begun and thus go to the BLOCK-COMMENT state.

When in the BLOCK-COMMENT state if a '*' or a ')' or a '(' is seen then, it's okay! They shouldn't be together though. if we see a "(*" then we increase the nestedCommentCount and when a "*) " is seen then we decrease it. When it reaches zero, then we can simply change our current state back to YYINITIAL. Here is the code telling you the same thing, you dumbo:

<YYINITIAL> "(*" { yybegin(BLOCK_COMMENT); }
<BLOCK_COMMENT> \(\*  { nestedCommentCount++; }
<BLOCK_COMMENT> \n  { curr_lineno++; }
<BLOCK_COMMENT> [^\n\)\*] { /* Do Nothing */ }
<BLOCK_COMMENT> \*  { /* Do nothing */ }
<BLOCK_COMMENT> \)  { /* Do Nothing */ }
<BLOCK_COMMENT> \*\) {
  if(nestedCommentCount!=0) {
    nestedCommentCount--;
  } else {
    yybegin(YYINITIAL);
  }
 
}

Minor Points:

Pressing "Enter" in windows means "\r\n" not "\n" only. '\r' is carriage return, that brings back the cursor back to the initial point, and '\n' is simply the newline or the linefeed.
Know that [==] isn't the same as "==" or ==. It simply means a =.

These are the main points of the things I learned while building the LAG for COOL. I consciously excluded discussion over JLex' functioning, and the minor errors that I got. They aren't that important as they are either easy to learn or debug.

Here is my code for the LAG:

Blatherstrike - The strike of the nonsense

Tuesday, 20 March 2018