Introduction

Welcome to the Motoko Regex Engine Documentation, your go-to guide for leveraging the power of regular expressions in the Motoko programming language. This engine provides robust tools for pattern matching, searching, and text processing.

Inspired by other established regex libraries, this regex engine adapts their capabilities to meet the needs of Motoko.


Installation and Import

Install the Motoko Regex Engine using:

mops add regex

Import it into your project with:

import Regex "mo:regex";

What is a Regular Expression?

A regular expression (regex) is a sequence of characters defining a search pattern. Regex is widely used in text processing for tasks such as:

  • Searching for text patterns (e.g., keywords in a document).
  • Validating formats (e.g., email addresses or phone numbers).
  • Extracting data from structured text (e.g., logs or CSV files).

For example, the regex ^\d{3}-\d{2}-\d{4}$ matches a string formatted as a Social Security Number, such as 123-45-6789.


Key Features

Pattern Support

  • Anchors: ^ (start of string), $ (end of string).
  • Character classes: [a-z], [^0-9].
  • Quantifiers: *, +, ?, {m,n}.
  • Groups: (), (?:).
  • Alternation: | (logical OR).
  • Escapes: \d, \w, \s, etc.

Flags

Flags are optional boolean values that modify the behavior of regex matching. They are set during instantiation and cannot be changed afterward. The engine currently supports:

  • caseSensitive: Case-sensitive matching (default is true).
  • multiline: Enables multiline matching.

Example with null flags (default behavior):

let regex = Regex.Regex("\d{3}-\d{2}-\d{4}", null);

API Functions

  • match: Check for a full match of the pattern in the input text.
  • search: Locate the first occurrence of the pattern in the input.
  • findAll: Retrieve all matches for the pattern.
  • findIter: Iterate over matches.

Syntax


Supported Syntax

Motoko regex supports a variety of syntax features for defining patterns. These include:

  • Character matching (a, b, c, etc.)
  • Alternation (|)
  • Grouping (())
  • Character classes ([] with support for ranges like [a-z])
  • Quantifiers (*, +, ?, {n}, {n,m})
  • Anchors (^, $)

Quantifiers

Quantifiers specify how many times a preceding element must occur for a match.

Supported Quantifiers

QuantifierMeaningExample
*Match 0 or more timesa* matches "", "a", "aaa"
+Match 1 or more timesa+ matches "a", "aaa"
?Match 0 or 1 timea? matches "", "a"
{n}Match exactly n timesa{2} matches "aa"
{n,}Match at least n timesa{2,} matches "aa", "aaa"
{n,m}Match between n and m timesa{2,4} matches "aa", "aaa", "aaaa"

Quantifier Modes

Quantifiers can operate in different modes:

  • Greedy: Matches as many occurrences as possible.
  • Lazy (? after quantifier): Matches as few as possible. E.g., a+? matches fewer occurrences of "a".

Invalid Quantifiers

Certain quantifier patterns are not allowed:

  • Redundant modifiers, such as a{2}+ or a{2}*.
  • Empty quantifiers, e.g., {} or {,}.
  • Multiple commas in ranges, e.g., {2,,4}.

Metacharacters

Metacharacters represent special patterns or symbols.

MetacharacterMeaningExample
.Match any character except \na.b matches "acb"
\wMatch word characters (alphanumeric + _)\w+ matches "abc123"
\WMatch non-word characters\W matches "@"
\dMatch digits (0-9)\d+ matches "123"
\DMatch non-digits\D matches "a"
\sMatch whitespace\s+ matches " "
\SMatch non-whitespace\S matches "a"

Character Classes

Character classes allow matching sets of characters.

  • [abc]: Matches any character a, b, or c.
  • [^abc]: Matches any character except a, b, or c.
  • [a-z]: Matches any character in the range a to z.

Nested Quantifiers

Quantifiers inside character classes must be explicitly defined. Nested or redundant quantifiers, like [a-z]{2}+, are not allowed.


Anchors

Anchors specify positions in the text.

AnchorMeaningExample
^Start of the string^abc matches "abc" at the beginning
$End of the stringabc$ matches "abc" at the end
\bWord boundary\bword\b matches "word"
\BNon-word boundary\Bword matches "word" not at a boundary

Groups and Group Modifiers

Groups are enclosed in parentheses () and can be modified for specific behaviors.

Supported Group Modifiers

ModifierSyntaxMeaning
Non-capturing(?:...)Groups without capturing
Positive Lookahead(?=...)Asserts that what follows matches
Negative Lookahead(?!...)Asserts that what follows does not match
Positive Lookbehind(?<=...)Asserts that what precedes matches
Negative Lookbehind(?<!...)Asserts that what precedes does not match

Escaped Characters

Escape sequences represent special characters.

Escape SequenceMeaning
\\Literal backslash
\nNewline
\tTab
\w, \WWord/Non-word characters
\d, \DDigit/Non-digit
\s, \SWhitespace/Non-whitespace

Invalid escape sequences throw an error.


Prohibited Patterns

  • Invalid group modifiers: e.g., (?).
  • Empty groups: () is not allowed.
  • Empty character classes: [] results in an error.
  • Redundant or conflicting quantifiers: a{2}+.

Error Handling

The Motoko regex engine provides detailed error feedback to help developers identify and fix issues in their regular expressions. Below is a list of all possible errors, their meanings, and typical scenarios where they might occur.

Error Types

ErrorDescriptionCause
#UnexpectedCharacterAn invalid character was encountered during parsing.Using a character that is not allowed in regex syntax, such as unescaped special characters.
#UnexpectedEndOfInputThe regex input ended unexpectedly, leaving constructs incomplete.Omitting closing brackets, parentheses, or quantifier ranges.
#GenericErrorA generic error message providing additional context.Various syntax or logic errors not covered by specific error types.
#InvalidQuantifierRangeA malformed or invalid quantifier range was used.Using invalid quantifier syntax, e.g., {,}, {,3}, {a,b}.
#InvalidEscapeSequenceAn invalid escape sequence was encountered.Using unrecognized escape sequences like \q or \x without proper syntax.
#UnmatchedParenthesisA closing parenthesis ) does not match any preceding opening parenthesis (.Missing or extra closing parentheses in the regex pattern.
#MismatchedParenthesisParentheses do not form a valid pairing.Nested parentheses are incorrectly matched or unbalanced, e.g., ((a)b]).
#UnexpectedTokenAn unexpected token was encountered during parsing.Using misplaced or unrecognized tokens in the regex pattern.
#UnclosedGroupA group construct is not properly closed with a closing parenthesis ).Missing a closing parenthesis in a group definition.
#InvalidQuantifierA quantifier is malformed or applied in an invalid context.Using redundant or conflicting quantifiers, e.g., a{2}+.
#EmptyExpressionThe regex input is empty or contains no valid expressions.Providing an empty string or expression with no meaningful content.
#NotCompiledThe regex has not been compiled before attempting to use it.There was an error during compilation of the reject object, this may be due to any of the previous errors. That error will be specified in the #NotCompiled variant.

Flags

Flags in Motoko regex provide flexibility by modifying the behavior of the regex matching process. This section covers the available flags, their purpose, and how to use them effectively.


Overview

Flags are optional parameters that can alter how the regex engine interprets and processes a pattern. They enable features such as case-insensitive matching and handling multiline inputs.


Available Flags

1. CASE_SENSITIVE

  • Type: Bool

  • Default: true

  • Description: Determines whether the regex should consider case when matching characters.

  • Behavior:

    • When caseSensitive = true: Matches are case-sensitive (default behavior).
    • When caseSensitive = false: Matches are case-insensitive, treating uppercase and lowercase letters as equivalent.
  • Example:

    let regex = Regex.Regex("abc", ?{caseSensitive = false});
    assert(regex.search("ABC") == #FullMatch);
    

2. MULTILINE

  • Type: Bool

  • Default: false

  • Description: Alters the behavior of anchors (^ and $) to match at line boundaries rather than the start or end of the entire string.

  • Behavior:

    • When multiline = true:
      • ^ matches the start of any line.
      • $ matches the end of any line.
    • When multiline = false (default behavior):
      • ^ matches the beginning of the entire input.
      • $ matches the end of the entire input.
  • Example:

    let regex = Regex.Regex("^abc$", ?{multiline = true});
    assert(regex.search("abc\ndef\nabc") == #FullMatch);
    

Combining Flags

You can combine flags to fine-tune the regex engine's behavior. For example:

let regex = Regex.Regex("abc", ?{caseSensitive = false; multiline = true});
assert(regex.search("ABC\ndef\nabc") == #FullMatch);

In this example:

  • The pattern abc is matched regardless of case.
  • The engine processes each line independently due to multiline = true.

Default Behavior Without Flags

If no flags are specified, the engine uses the following defaults:

  • caseSensitive = true
  • multiline = false

This means:

  • Matching is case-sensitive.
  • Anchors (^ and $) match only at the start and end of the entire input.

Best Practices

  • Use caseSensitive = false for patterns that need to ignore case differences, such as matching user inputs in a case-insensitive manner.
  • Use multiline = true when processing multi-line text, such as logs or formatted documents, where each line might have independent matching requirements.

Functions

This section provides an overview of the (current) functions available in this library. Use the links below to navigate to detailed documentation for each function.


Overview

The search() function scans an input string for the first occurrence of the regex pattern. Unlike match(), which requires the pattern to span the entire input, search() identifies the first substring that satisfies the pattern.

Signature

public func search(text: Text): Result.Result<Match, RegexError>

Parameters

ParameterTypeDescription
textTextThe input string to search for the first match

Return Value

Type: Result.Result<Match, RegexError>

Success Case

Returns a Match object containing:

  • The matched substring (value)
  • The position of the match within the input string
  • Captured groups (if any)

No Match Case

Returns a Match object with:

  • status = #NoMatch
  • Empty value

Error Case

Returns RegexError (#NotCompiled) only if the pattern failed to compile during instantiation

Behavior

Input Validation

  • If the regex instantiation failed (due to an invalid pattern), returns RegexError (#NotCompiled)

Search Process

  1. Scans the input string character by character
  2. Identifies if a potential match could begin at the current position
  3. Delegates to match() for full matching starting from that position

Result Construction

  • On finding a match:
    • Returns a Match object with details of the match
  • If no match is found after scanning the string:
    • Returns a Match object with status = #NoMatch

Example Usage

1. Successful Match

Pattern: "a+" Input: "xxaaayy"

let pattern = Regex.Regex("a+", null);
let result = pattern.search("xxaaayy");
switch (result) {
    case (#ok(match)) Debug.print("First match: " # match.value);  // Output: "aaa"
    case (#err(error)) Debug.print("Error: " # debug_show(error));
};

Output:

First match: aaa

2. No Match Found

Pattern: "z+" Input: "xxaaaayy"

let pattern = Regex.Regex("z+", null);
let result = pattern.search("xxaaaayy");
switch (result) {
    case (#ok(match)) {
        switch (match.status) {
            case (#NoMatch) Debug.print("No match found.");
            case (#FullMatch) Debug.print("First match: " # match.value);
        };
    };
    case (#err(error)) Debug.print("Error: " # debug_show(error));
};

Output:

No match found.

3. Invalid Pattern

Scenario: Creating a regex with an invalid pattern

let pattern = Regex.Regex("[a-");
let result = pattern.search("xxaaaayy");
switch (result) {
    case (#ok(match)) Debug.print("First match: " # match.value);
    case (#err(error)) Debug.print("Error: " # debug_show(error)); // Output: #NotCompiled
};

Output:

Error: #NotCompiled

Overview

The match() function is a core API for performing regex-based matching. It takes an input string and matches it against a precompiled regex represented as an NFA. The function handles matching mechanics, including state transitions, greedy and lazy quantifiers, and group captures.


Signature

public func match(text: Text): Result.Result<Match, RegexError>

Parameters

ParameterTypeDescription
textTextThe input string to be matched against the compiled regex.

Return Value

Result.Result<Match, RegexError>:

  • On Success (Match):
    • Contains details of the match, such as the matched substring, captured groups, and spans.
  • On Failure (RegexError):
    • Indicates why the matching process failed (e.g., regex not compiled).

Behavior

  1. Input Validation:

    • Checks if the regex has been compiled.
    • Returns #NotCompiled error if the regex is unavailable.
  2. Matching Process:

    • Delegates the actual matching logic to the matcher.match function.
    • Traverses the NFA based on input characters.
    • Respects greedy and lazy quantifier modes.
    • Handles capture groups and anchors (e.g., ^, $).
  3. Result Construction:

    • Builds a Match object for successful matches.
    • Returns RegexError for failures.

Example Usage

1. Successful Match

let pattern = Regex.Regex("h.*o",null); 
let result = pattern.match("hello");

switch (result) {
  case (#ok(match)) {
    Debug.print("Matched value: " # match.value);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

Matched value: hello

2. No Match

let pattern = Regex.Regex("z+",null);
let result = pattern.match("hello");

switch (result) {
  case (#ok(match)) {
    Debug.print("Matched value: " # match.value);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

#ok: status = #NoMatch

Input Validation

  • Before matching, the function ensures the regex is compiled.

  • If nfa is null, the function returns:

    #err(#NotCompiled)
    

Delegation to matcher.match

  • The compiled NFA, input text, and optional flags are passed to matcher.match.
  • matcher.match performs:
    • State Transitions:
      • Moves between states in the NFA based on input characters.
    • Greedy and Lazy Quantifiers:
      • Greedy quantifiers consume as much input as possible.
      • Lazy quantifiers stop at the first valid match.
    • Capture Groups:
      • Tracks and extracts group matches.
    • Anchors:
      • Ensures patterns anchored to the start (^) or end ($) are respected.

Overview

The findAll() method returns an array of all non-overlapping matches of the regex pattern in the input text. Unlike findIter(), this method collects all matches at once into an array.

Signature

public func findAll(text: Text): Result.Result<[Match], RegexError>

Parameters

ParameterTypeDescription
textTextThe input string to search for matches

Return Value

Type: Result.Result<[Match], RegexError>

Success Case

Returns an array of Match objects, where each contains:

  • The matched substring (value)
  • The position of the match within the input string
  • Any captured groups
  • Match status (#FullMatch)

Error Case

Returns RegexError (#NotCompiled) if the pattern failed to compile during instantiation

Match Collection Process

  1. Starts from the beginning of the input string
  2. Collects all non-overlapping matches into an array
  3. Preserves the order of matches as they appear in the text
  4. Returns an empty array if no matches are found

Overview

The findIter() method returns an iterator that yields all non-overlapping matches of the regex pattern in the input text. This method is memory-efficient as it generates matches lazily instead of collecting them all at once like findAll().

Signature

public func findIter(text: Text): Result.Result<Iter.Iter<Match>, RegexError>

Parameters

ParameterTypeDescription
textTextThe input string to search for matches

Return Value

Type: Result.Result<Iter.Iter<Match>, RegexError>

Success Case

Returns an iterator that yields Match objects, where each contains:

  • The matched substring (value)
  • The position of the match within the input string
  • Any captured groups
  • Match status (#FullMatch)

Error Case

Returns RegexError (#NotCompiled) if the pattern failed to compile during instantiation

Iteration Process

  1. Starts from the beginning of the input string
  2. For each match found:
    • Yields a Match object
    • Advances to the position after the current match
  3. Continues until no more matches are found
  4. Automatically handles the internal state between iterations

Match Generation

  • Matches are generated one at a time as the iterator is consumed
  • Non-overlapping matches are guaranteed
  • The iteration order follows the text from left to right

Overview

The Match record represents the result of a pattern matching operation on text, typically used in regular expression or string matching operations. It contains detailed information about what was matched, where it was found, and any captured groups within the match.

Structure

public type Match = {
    string: Text;
    value: Text;
    status: {
        #FullMatch;
        #NoMatch;
    };
    position: (Nat, Nat);
    capturedGroups: ?[(Text,Nat)];
    spans: [(Nat, Nat)];
    lastIndex: Nat;
};

Field Descriptions

string: Text

The original input string that was searched for matches. This field preserves the complete text that was analyzed during the matching operation.

value: Text

The actual matched substring found in the original text. This represents the specific portion of text that satisfied the matching criteria.

status

An enumerated type that indicates the match result:

  • #FullMatch: Indicates a successful match was found
  • #NoMatch: Indicates no match was found in the input string

position: (Nat, Nat)

A tuple containing the start and end indices of the match in the original string:

  • First value: Starting index of the match
  • Second value: Ending index of the match

capturedGroups: ?[(Text,Nat)]

An optional array of tuples containing captured groups from the match:

  • Each tuple contains:
    • Text: The captured text
    • Nat: The index of the captured group
  • null if no groups were captured

spans: [(Nat, Nat)]

An array of tuples representing the character spans of the respective match

lastIndex: Nat

The index where the last match ended. This is particularly useful when performing multiple sequential matches on the same text.

Usage Examples

Basic Match Checking

if (matchResult.status == #FullMatch) {
    Debug.print("Found match: " # matchResult.value);
} else {
    Debug.print("No match found");
};

Working with Captured Groups

switch (matchResult.capturedGroups) {
    case (null) { Debug.print("No groups captured") };
    case (?groups) {
        for ((text, index) in groups.vals()) {
            Debug.print("Group " # debug_show(index) # ": " # text);
        };
    };
};

Regex Examples for Motoko Regex Engine

Table of Contents

Internet Computer Identifiers

Principal ID

Pattern to validate Principal ID format.

// Anonymous Principal will be rejected
let principalPattern = Regex.Regex("^[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{3}$", null);
public func validatePrincipalId(id: Text): Bool {
    switch(principalPattern.match(id)) {
        case (#ok(result)) {
            switch(result.status) {
                case (#FullMatch) true;
                case (#NoMatch) false;
            };
        };
        case (#err(_)) false;
    };
};

Account ID

Pattern to validate Account ID format.

// Account ID (32 bytes in hexadecimal)
let accountPattern = Regex.Regex("^[0-9a-f]{64}$", null);
public func validateAccountId(id: Text): Bool {
    switch(accountPattern.match(id)) {
        case (#ok(result)) {
            switch(result.status) {
                case (#FullMatch) true;
                case (#NoMatch) false;
            };
        };
        case (#err(_)) false;
    };
};

Contributing to Motoko Regex Engine

Thank you for your interest in contributing to the Motoko Regex Engine project! This guide will help you get started with contributing to the project.

Table of Contents

Getting Started

  1. Fork the repository:

  2. Clone your fork:

    git clone https://github.com/YOUR-USERNAME/motoko_regex_engine.git
    cd motoko_regex_engine
    
  3. Add the upstream repository:

    git remote add upstream https://github.com/Demali-876/motoko_regex_engine.git
    
  4. Create a new branch for your work:

    git checkout -b feat/your-feature-name
    

Development Process

  1. Set up your development environment:

    • Install the DFINITY SDK (dfx)
    • Install Node.js and npm
    • Run npm install to install dependencies
  2. Make your changes:

    • Keep your changes focused and concise
    • Update documentation if needed
  3. Keep your fork up to date:

    git fetch upstream
    git rebase upstream/main
    

Commit Guidelines

We use Conventional Commits for clear and standardized commit messages. Each commit message should be structured as follows:

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

Types:

  • feat: A new feature
  • fix: A bug fix
  • docs: Documentation only changes
  • style: Changes that do not affect the meaning of the code
  • refactor: A code change that neither fixes a bug nor adds a feature
  • perf: A code change that improves performance
  • test: Adding missing tests or correcting existing tests
  • chore: Changes to the build process or auxiliary tools

Examples:

feat(parser): add support for lookahead assertions

fix(matcher): resolve infinite loop in nested groups

docs: update API documentation for search method

refactor(compiler): simplify NFA construction logic

Pull Request Process

  1. Push your changes to your fork:

    git push origin feat/your-feature-name
    
  2. Create a Pull Request:

  3. PR Review Process:

    • Maintainers will review your PR
    • Address any requested changes
    • Once approved, your PR will be merged

Testing Requirements

While formal tests are not required, you must provide evidence that you've tested your changes. This can include:

  • Screenshots of the feature working
  • Console output showing successful execution
  • Example usage and results
  • Description of test cases you've tried

Example test evidence in PR:

Tested the new alternation operator with:
1. Simple patterns: "a|b" against "a" and "b"
2. Complex patterns: "(foo|bar)+" against "foofoobar"
3. Edge cases: "a||b" and "|a|b|"

Results:
- All patterns matched correctly
- No infinite loops or crashes
- Proper error handling for invalid patterns

Thank you for contributing to the Motoko Regex Engine project!