JLex Tutorial

Assignment 1
JLex manual

To demonstrate how JLex is used, we will show a JLex input file which generates a lexer for simple arithmetic expressions. The JLex input file below is structured such that the generated lexer will interface with a parser generated by CUP (which you will build in Assignment 2).

For simplicity, we will call the JLex input file exp.lex; JLex writes the generated lexer to exp.lex.java. The class defined in exp.lex.java is named Yylex. Consequently, you should rename this file to Yylex.java before compiling it; otherwise the compiler prints a non-fatal warning.

Here is exp.lex (color-coded in browsers that support colored table cells to make the following discussion easier):

import sym.*;               // definition of terminals returned by scanner

import java_cup.runtime.*;  // definition of scanner/parser interface

/* semantic value of token returned by scanner */
class TokenValue {          
  public int lineBegin;
  public int charBegin;
  public String text;
  public String filename;   

  TokenValue() {
  }

  TokenValue(String text, int lineBegin, int charBegin, String filename) {
    this.text = text; 
    this.lineBegin = lineBegin; 
    this.charBegin = charBegin;
    this.filename = filename;
  }

  public String toString() { 
    return text;
  }

  public boolean toBoolean() {
    return Boolean.valueOf(text).booleanValue();  
  }
  
  public int toInt() {
    return Integer.valueOf(text).intValue();
  }
}
%%
%implements java_cup.runtime.Scanner
%function next_token
%type java_cup.runtime.Symbol

%eofval{
  return new Symbol(sym.EOF, null);
%eofval}

%{
  public String sourceFilename;
%}

%line
%char
%state COMMENTS

ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+
WHITE_SPACE=([\ \n\r\t\f])+
%%
<YYINITIAL> {NUMBER} { 
  return new Symbol(sym.tNUMBER, new TokenValue(yytext(), yyline, yychar, sourceFilename)); 
}
<YYINITIAL> {WHITE_SPACE} { }

<YYINITIAL> "+" { 
  return new Symbol(sym.tPLUS, new TokenValue(yytext(), yyline, yychar, sourceFilename)); 
} 
<YYINITIAL> "-" { 
  return new Symbol(sym.tMINUS, new TokenValue(yytext(), yyline, yychar, sourceFilename));
} 
<YYINITIAL> "*" { 
  return new Symbol(sym.tTIMES, new TokenValue(yytext(), yyline, yychar, sourceFilename));
} 
<YYINITIAL> "/" { 
  return new Symbol(sym.tDIVIDE, new TokenValue(yytext(), yyline, yychar, sourceFilename));
} 
<YYINITIAL> ";" { 
  return new Symbol(sym.tSEMI, new TokenValue(yytext(), yyline, yychar, sourceFilename));
} 
<YYINITIAL> "//" {
  yybegin(COMMENTS);
}
<COMMENTS> [^\n] {
}
<COMMENTS> [\n] {
  yybegin(YYINITIAL);
}
<YYINITIAL> . {
  return new Symbol(sym.error, null);
}

There are three sections in the file, separated by %%: 1) user code (rose background), 2) JLex directives (yellow background), and 3) regular expression rules (lavender background). We will discuss each section in turn.

User Code (Rose Background)

The first section contains code that you want to be included at the top of exp.lex.java, such as what package you want Yylex to be a part of or what classes to import.

import sym.*;

This line imports the definition for the sym class. sym contains the manifest constants used by Yylex to tell the parser what type of token it is returning. sym looks something like:

public class sym {
  /* terminals */
  static final int EOF = 0;
  static final int tDIVIDE = 7;
  static final int tMINUS = 5;
  static final int tTIMES = 6;
  static final int tNUMBER = 2;
  static final int tPLUS = 4;
  static final int tIDENT = 3;
  static final int error = 1;
}

The Java constants use "t" as a prefix to denote a terminal symbol -- and to differentiate the name from a Java reserved word. The values of the constants are not important. The constants are defined in sym because the parser you will generate with CUP in Assignment 2 will use the same class.
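Because only the names matter, code that inspects tokens should compare against the constants rather than their numeric values. The sketch below shows one way to do that. TokenNames and nameOf are hypothetical names invented for illustration; they are not part of JLex, CUP, or the course files, and the sym class is repeated only so the example compiles on its own.

```java
// The sym class from the listing above, repeated so this sketch is self-contained.
class sym {
  /* terminals */
  static final int EOF = 0;
  static final int error = 1;
  static final int tNUMBER = 2;
  static final int tIDENT = 3;
  static final int tPLUS = 4;
  static final int tMINUS = 5;
  static final int tTIMES = 6;
  static final int tDIVIDE = 7;
}

// Hypothetical debugging helper (an assumption, not a course file).
class TokenNames {
  /** Maps a token number to a readable name by comparing against the sym constants. */
  static String nameOf(int tokenType) {
    switch (tokenType) {
      case sym.EOF:     return "EOF";
      case sym.error:   return "error";
      case sym.tNUMBER: return "tNUMBER";
      case sym.tIDENT:  return "tIDENT";
      case sym.tPLUS:   return "tPLUS";
      case sym.tMINUS:  return "tMINUS";
      case sym.tTIMES:  return "tTIMES";
      case sym.tDIVIDE: return "tDIVIDE";
      default:          return "unknown(" + tokenType + ")";
    }
  }

  public static void main(String[] args) {
    System.out.println(nameOf(sym.tPLUS));
  }
}
```

If CUP later assigns different numbers to the terminals, a helper written this way keeps working because it never mentions the numbers directly.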

import java_cup.runtime.*;

This line imports the definition for the Symbol class. When the parser asks an instance of Yylex for another token, the Yylex object returns an instance of the Symbol class. This is explained in more detail in the JLex Directives section.

class TokenValue {
  public int lineBegin;
  public int charBegin;
  public String text;
  public String filename;   
  .  .  .
}

The scanner returns a value of type Symbol to the parser when it is called. The Symbol value is an object with two fields: 1) a token number (i.e., one of the constants from sym) and 2) a semantic value (e.g., the identifier, number, character, or string constant). TokenValue is the type for the semantic value. Its fields, shown above, are lineBegin, charBegin, text, and filename.

The methods defined in TokenValue, such as toString(), are not used in assignment 1. They will be used later to access a semantic value as the appropriate type value.

Using the TokenValue class is explained in more detail in the Regular Expression Rules section.
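To give a feel for how the conversion methods might be used later, here is a small sketch that builds two TokenValue objects by hand and adds them. The TokenValue class is copied from the listing above (without the empty constructor); TokenValueDemo, add, and the file name example.j09 are assumptions made up for this illustration.

```java
/* semantic value of token returned by scanner (copied from exp.lex) */
class TokenValue {
  public int lineBegin;
  public int charBegin;
  public String text;
  public String filename;

  TokenValue(String text, int lineBegin, int charBegin, String filename) {
    this.text = text;
    this.lineBegin = lineBegin;
    this.charBegin = charBegin;
    this.filename = filename;
  }

  public String toString() {
    return text;
  }

  public boolean toBoolean() {
    return Boolean.valueOf(text).booleanValue();
  }

  public int toInt() {
    return Integer.valueOf(text).intValue();
  }
}

// Hypothetical consumer of semantic values, for illustration only.
class TokenValueDemo {
  /** Adds two number tokens by converting their text to int. */
  static int add(TokenValue left, TokenValue right) {
    return left.toInt() + right.toInt();
  }

  public static void main(String[] args) {
    TokenValue a = new TokenValue("56", 0, 4, "example.j09");
    TokenValue b = new TokenValue("4", 0, 9, "example.j09");
    System.out.println(a + " + " + b + " = " + add(a, b));
  }
}
```

Note that toInt() throws NumberFormatException if text is not a decimal integer, so it should only be called on tokens the scanner matched as numbers.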

JLex Directives (Yellow Background)

%type java_cup.runtime.Symbol

This specification tells JLex that Yylex should return an instance of the class Symbol when the parser asks for the next token. Symbol is part of the java_cup.runtime package, which explains why we had to import that package at the top of the file. Yylex returns a Symbol so that the parser generated by CUP can use Yylex.

Symbol has several constructors. The constructor we will use for the scanner is

Symbol(int tokenType, Object semanticValue)

The tokenType you pass is defined in the class sym. The semanticValue is an instance of TokenValue.

%eofval{
  return new Symbol(sym.EOF, null);
%eofval}

This directive tells the lexer to return sym.EOF when Yylex encounters the end of the file that it is scanning.
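A caller typically asks for tokens in a loop and stops when the token number is sym.EOF. The sketch below imitates that loop without JLex or CUP on the classpath: Symbol here is a simplified stand-in for java_cup.runtime.Symbol (only the sym and value fields), and FakeLexer is a hypothetical stand-in for the generated Yylex that hands out a fixed token sequence. Neither class is the real one.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for java_cup.runtime.Symbol (assumption: only these two fields).
class Symbol {
  public int sym;       // token number
  public Object value;  // semantic value

  Symbol(int sym, Object value) {
    this.sym = sym;
    this.value = value;
  }
}

// Hypothetical stand-in for Yylex: returns the given tokens, then EOF forever.
class FakeLexer {
  static final int EOF = 0, tNUMBER = 2, tPLUS = 4;  // same values as in sym
  private final Symbol[] tokens;
  private int next = 0;

  FakeLexer(Symbol... tokens) {
    this.tokens = tokens;
  }

  Symbol next_token() {
    return next < tokens.length ? tokens[next++] : new Symbol(EOF, null);
  }
}

class ScanLoop {
  /** Calls next_token() until EOF and returns the token numbers seen, EOF excluded. */
  static List<Integer> drain(FakeLexer lexer) {
    List<Integer> seen = new ArrayList<>();
    for (Symbol s = lexer.next_token(); s.sym != FakeLexer.EOF; s = lexer.next_token()) {
      seen.add(s.sym);
    }
    return seen;
  }

  public static void main(String[] args) {
    FakeLexer lexer = new FakeLexer(
        new Symbol(FakeLexer.tNUMBER, "5"),
        new Symbol(FakeLexer.tPLUS, "+"),
        new Symbol(FakeLexer.tNUMBER, "6"));
    System.out.println(drain(lexer));
  }
}
```

With the real Yylex the loop has the same shape; only the way the lexer object is constructed differs.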

%{
  public String sourceFilename;
%}

This adds an instance variable to the Yylex class, in this case, a public String called sourceFilename.

%line

This tells JLex to generate code for Yylex that will keep track of which line in the JO99 file Yylex is reading. Line 0 is the first line.

%char

This tells JLex to generate code for Yylex that will keep track of which character in the JO99 file Yylex is reading. Character 0 is the first character.

%state COMMENTS

This defines a lexical state called COMMENTS. A lexical state is like a state in a finite state automaton. It is used to control which regular expressions are evaluated. The lexical state YYINITIAL is implicitly defined. You will see how lexical states are used in the next section.

ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+
WHITE_SPACE=([\ \n\r\t\f])+

These are regular definitions, each of which gives a name to a regular expression. These names are used in the next section.
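If you want to experiment with these patterns outside JLex, they translate almost directly to java.util.regex. The sketch below is only an approximation for trying out inputs: the class RegularDefinitions and its helper methods are hypothetical, the digit pattern is named NUMBER as in the rules section, and JLex itself does not use java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical translation of the regular definitions to java.util.regex.
class RegularDefinitions {
  static final String ALPHA = "[A-Za-z_]";
  static final String DIGIT = "[0-9]";
  static final String ALPHA_NUMERIC = "(?:" + ALPHA + "|" + DIGIT + ")";

  static final Pattern IDENT = Pattern.compile(ALPHA + ALPHA_NUMERIC + "*");
  static final Pattern NUMBER = Pattern.compile(DIGIT + "+");
  static final Pattern WHITE_SPACE = Pattern.compile("[ \\n\\r\\t\\f]+");

  /** True if the whole string is an identifier. */
  static boolean isIdent(String s) {
    return IDENT.matcher(s).matches();
  }

  /** True if the whole string is a non-empty run of digits. */
  static boolean isNumber(String s) {
    return NUMBER.matcher(s).matches();
  }

  /** True if the whole string is a non-empty run of white space. */
  static boolean isWhiteSpace(String s) {
    return WHITE_SPACE.matcher(s).matches();
  }

  public static void main(String[] args) {
    System.out.println(isIdent("x_1"));  // identifier: letter or underscore first
    System.out.println(isIdent("1x"));   // not an identifier: starts with a digit
    System.out.println(isNumber("56"));
  }
}
```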

Regular Expression Rules (Lavender Background)

<YYINITIAL> {NUMBER} { 
  return new Symbol(sym.tNUMBER, new TokenValue(yytext(), yyline, yychar, sourceFilename)); 
}
<YYINITIAL> {WHITE_SPACE} { }

<YYINITIAL> "+" { 
  return new Symbol(sym.tPLUS, new TokenValue(yytext(), yyline, yychar, sourceFilename)); 
} 
<YYINITIAL> "-" { 
  return new Symbol(sym.tMINUS, new TokenValue(yytext(), yyline, yychar, sourceFilename));
} 
<YYINITIAL> "*" { 
  return new Symbol(sym.tTIMES, new TokenValue(yytext(), yyline, yychar, sourceFilename));
} 
<YYINITIAL> "/" { 
  return new Symbol(sym.tDIVIDE, new TokenValue(yytext(), yyline, yychar, sourceFilename));
} 
<YYINITIAL> ";" { 
  return new Symbol(sym.tSEMI, new TokenValue(yytext(), yyline, yychar, sourceFilename));
} 
<YYINITIAL> "//" {
  yybegin(COMMENTS);
}
<COMMENTS> [^\n] {
}
<COMMENTS> [\n] {
  yybegin(YYINITIAL);
}
<YYINITIAL> . {
  return new Symbol(sym.error, null);
}

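This section tells Yylex what to do when it recognizes a particular regular expression. For example, when Yylex recognizes a NUMBER, it will return a Symbol constructed with the token type sym.tNUMBER and a TokenValue that contains 1) the matched text (yytext()), 2) the line on which the token begins (yyline), 3) the character position at which the token begins (yychar), and 4) the name of the source file (sourceFilename).

The first three values are defined for you by the generated lexer.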

For example, suppose Yylex reads a JO99 file, which must have the file extension .j09 as per the language definition, that contains:

x + 56 / 4
y - 23 * f

When it reads the sequence of characters 23, it will recognize it as a number token. yytext() returns the string "23". yyline will be 1 (the second line, since numbering starts at 0), and yychar will be 15, counting from the start of the file and assuming a line break is one character. The scanner will return the value produced by the constructor call

new Symbol(sym.tNUMBER, new TokenValue("23", 1, 15, sourceFilename))
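The position of 23 in the sample input can be checked by counting characters in plain Java. This sketch only mimics the zero-based bookkeeping that %line and %char ask for; PositionDemo and locate are hypothetical names, and the real counters are maintained inside the generated Yylex.

```java
// Hypothetical helper that recomputes the line and character position of a lexeme.
class PositionDemo {
  /**
   * Returns {line, charOffset} of the first occurrence of lexeme in input.
   * Both are zero-based; charOffset counts from the start of the input.
   */
  static int[] locate(String input, String lexeme) {
    int offset = input.indexOf(lexeme);
    if (offset < 0) {
      throw new IllegalArgumentException("lexeme not found: " + lexeme);
    }
    int line = 0;
    for (int i = 0; i < offset; i++) {
      if (input.charAt(i) == '\n') {
        line++;  // every line break before the lexeme moves it one line down
      }
    }
    return new int[] { line, offset };
  }

  public static void main(String[] args) {
    String input = "x + 56 / 4\ny - 23 * f";
    int[] p = locate(input, "23");
    System.out.println("line " + p[0] + ", char " + p[1]);  // prints "line 1, char 15"
  }
}
```

The first line has 10 characters (positions 0 through 9), the line break is position 10, and the second line starts at position 11, which puts 23 at position 15.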

One group of rules is worth noting:

<YYINITIAL> "//" {
  yybegin(COMMENTS);
}
<COMMENTS> [^\n] {
}
<COMMENTS> [\n] {
  yybegin(YYINITIAL);
}

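This specification says: 1) when the scanner sees // in the YYINITIAL state, it switches to the COMMENTS state; 2) while in the COMMENTS state, it discards every character other than a newline; 3) when it reaches a newline, it switches back to YYINITIAL. The effect is that everything from // to the end of the line is skipped.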
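The state switching in these rules can be imitated by a small hand-written loop, which may help if lexical states are new to you. The sketch below is an assumption-laden stand-in, not what JLex generates: CommentStates and strip are hypothetical names, and the loop handles only the three comment rules, copying every other character through unchanged.

```java
// Hypothetical two-state machine mirroring the YYINITIAL / COMMENTS rules.
class CommentStates {
  enum State { YYINITIAL, COMMENTS }

  /**
   * Returns input with "//" comments removed. The newline that ends a comment
   * is consumed as well, because the <COMMENTS> [\n] rule returns no token.
   */
  static String strip(String input) {
    StringBuilder kept = new StringBuilder();
    State state = State.YYINITIAL;
    int i = 0;
    while (i < input.length()) {
      char c = input.charAt(i);
      if (state == State.YYINITIAL) {
        if (input.startsWith("//", i)) {
          state = State.COMMENTS;   // like yybegin(COMMENTS)
          i += 2;
        } else {
          kept.append(c);
          i++;
        }
      } else {
        if (c == '\n') {
          state = State.YYINITIAL;  // like yybegin(YYINITIAL)
        }
        i++;                        // comment characters are discarded
      }
    }
    return kept.toString();
  }

  public static void main(String[] args) {
    System.out.println(strip("1 + 2 // sum\n3"));
  }
}
```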

Also notice the very last rule:

<YYINITIAL> . {
  return new Symbol(sym.error, null);
}

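When more than one rule matches, JLex chooses the rule that matches the longest string; if several rules match strings of the same length, the rule that appears first in the file wins. The regular expression . matches any single character except a newline, so this last rule is used only when no rule above it matches. In this case, that means there is an error.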
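The selection policy JLex applies (longest match first, then the earliest rule on ties) can be sketched with java.util.regex. This is only an illustration under stated assumptions: RuleSelection, choose, and the rule names are hypothetical, only four of the rules are included, and JLex builds an automaton rather than trying patterns one by one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "longest match, earliest rule on ties".
class RuleSelection {
  // A subset of the rules above, in specification order.
  static final String[] NAMES = { "NUMBER", "DIVIDE", "COMMENT_START", "ERROR" };
  static final Pattern[] RULES = {
    Pattern.compile("[0-9]+"),
    Pattern.compile("/"),
    Pattern.compile("//"),
    Pattern.compile("."),
  };

  /** Returns the name of the rule chosen at position pos, or "none" if no rule matches. */
  static String choose(String input, int pos) {
    int best = -1;
    int bestLength = 0;
    for (int r = 0; r < RULES.length; r++) {
      Matcher m = RULES[r].matcher(input).region(pos, input.length());
      // Strictly longer matches replace the current best, so earlier rules win ties.
      if (m.lookingAt() && m.end() - pos > bestLength) {
        best = r;
        bestLength = m.end() - pos;
      }
    }
    return best < 0 ? "none" : NAMES[best];
  }

  public static void main(String[] args) {
    System.out.println(choose("//x", 0));  // "//" is longer than "/"
    System.out.println(choose("/4", 0));   // "/" and "." tie; the earlier rule wins
    System.out.println(choose("x", 0));    // only "." matches
  }
}
```

This is why the catch-all rule can safely sit at the bottom of the file: it never matches more than one character, and every other one-character rule appears before it.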

So what?

For assignment 1 you should copy the files sym.java, jo99.exp, and scantest.java from the directory ~cs164/jo99/a1 to your working directory and modify jo99.exp to create your scanner. The assignment write-up explains how to run the scanner generator and test program.



cs164@cory.eecs.berkeley.edu
Copyright © 1998-9 by the Regents of the University of California