2,680 reads

Build Your Own Programming Language Part III: Improving Lexical Analysis with Regex Lookaheads

by Alexander MakeevApril 25th, 2022

Too Long; Didn't Read

In this part of creating your own programming language we will improve our lexical analysis with lookaheads and lookbehinds. Lookaheads allow you to create your own `^` and `$` sub-expressions. With their help, you can set up a condition that will be met at the beginning and at the end of the line or won’t, and this condition will not be a part of the “matched” expression. Lookahead “looks” forward and accordingly should be placed at end.

featured image - Build Your Own Programming Language Part III: Improving Lexical Analysis with Regex Lookaheads

In this part of creating your own programming language, we will improve our lexical analysis with Regex lookaheads and lookbehinds (please check out Part I and Part II).

Basically, lookaheads and lookbehinds allow you to create your own ^ and $ sub-expressions (anchors). With their help, you can set up a condition that will be met at the beginning and at the end of the line or won’t, and this condition will not be a part of the “matched” expression.

Lookbehind can “look” back and should be placed at the beginning of the regular expression. Lookahead “looks” forward and accordingly should be placed at the end.

There are four types of lookaheads and lookbehinds with corresponding regex syntax:

Positive lookahead: (?=pattern)
Negative lookahead (?!pattern)
Positive lookbehind (?<=pattern)
Negative lookbehind (?<!pattern)

Let’s dive into code and see how we can improve lexical analysis for our programming language. We will go through the declared Token types with the corresponding regex value to define an appropriate lexeme:

Keyword lexeme if|then|end|print|input|struct|arg

Let’s suppose we defined a variable starting with a Keyword:

print_value = 5
print print_value

During the lexical analysis we will capture this print_value variable by the Keyword lexeme instead of being parsed as a variable. To read it properly, we need to be sure that after the keyword is placed either a space or the end of the line. It could be done with the help of a positive lookahead in the end of the Keyword expression:

(if|then|end|print|input|struct|arg)(?=\\s|$)

Logical lexeme true|false

The same rule is applied to the Logical lexeme:

true_value = true

To read the Logical lexeme properly, it should be separated either by space or by the end of the line:

(true|false)(?=\\s|$)

Numeric lexeme [0-9]+

For the Numeric values, we will introduce the ability to read floating-point numbers with a decimal point and numbers with a negative/positive sign in front of the number:

negative_value = -123
float_value = 123.456
leading_dot_float_value = .123
ending_dot_float_value = 123.

First, to match optional + or - we add a character class:

[+-]?[0-9]+

Then to read a float number, we introduce a dot character with arbitrary numbers set after it:

[+-]?[0-9]+[.]?[0-9]*

There could be a special case when our number starts with a leading dot, so we add an extra alternation:

[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)

We can also replace an alternation by applying a positive lookahead with a requirement to contain maximum 1 dot and at least 1 number after an arbitrary sign declaration. In this case, our main expression for the float number will have a greedy * quantifier:

[+-]?((?=[.]?[0-9])[0-9]*[.]?[0-9]*)

Lastly, to map floating-point numbers, we modify NumericValue to accept Double type as a generic Value type:

public class NumericValue extends Value<Double> {
    public NumericValue(Double value) {
        super(value);
    }

    @Override
    public String toString() {
        if ((getValue() % 1) == 0) //print integer without fractions
            return String.valueOf(getValue().intValue());
        return super.toString();
    }
}

Operator ([+]|[-]|[*]|[/]|[>]|[<]|[=]{1,2}|[!]|[:]{2}|[(]|[)]|new)

struct Object
	arg number
end

new_value = new Object [ 5 ]

For the new operator, we as well need to introduce a positive lookahead separating value with a space or with the end of the line:

([+]|[-]|[*]|[/]|[>]|[<]|[=]{1,2}|[!]|[:]{2}|[(]|[)]|new(?=\s|$))

In this part we slightly improved our lexical analyzer with positive lookaheads. Now our toy language is a little bit smarter and able to parse more complex variable expressions.