Introduction

Welcome to the Motoko Regex Engine documentation, a guide to using regular expressions in the Motoko programming language. The engine provides tools for pattern matching, searching, and text processing.

Inspired by other established regex libraries, this regex engine adapts their capabilities to meet the needs of Motoko.

Installation and Import

Install the Motoko Regex Engine using:

mops add regex

Import it into your project with:

import Regex "mo:regex";

What is a Regular Expression?

A regular expression (regex) is a sequence of characters defining a search pattern. Regex is widely used in text processing for tasks such as:

Searching for text patterns (e.g., keywords in a document).
Validating formats (e.g., email addresses or phone numbers).
Extracting data from structured text (e.g., logs or CSV files).

For example, the regex ^\d{3}-\d{2}-\d{4}$ matches a string formatted as a Social Security Number, such as 123-45-6789.

Key Features

Pattern Support

Anchors: ^ (start of string), $ (end of string).
Character classes: [a-z], [^0-9].
Quantifiers: *, +, ?, {m,n}.
Groups: (), (?:).
Alternation: | (logical OR).
Escapes: \d, \w, \s, etc.

Flags

Flags are optional boolean values that modify the behavior of regex matching. They are set during instantiation and cannot be changed afterward. The engine currently supports:

caseSensitive: Case-sensitive matching (default is true).
multiline: Enables multiline matching.

Example with null flags (default behavior):

let regex = Regex.Regex("\\d{3}-\\d{2}-\\d{4}", null);

API Functions

match: Check for a full match of the pattern in the input text.
search: Locate the first occurrence of the pattern in the input.
findAll: Retrieve all matches for the pattern.
findIter: Iterate over matches.

Syntax

Supported Syntax

Motoko regex supports a variety of syntax features for defining patterns. These include:

Character matching (a, b, c, etc.)
Alternation (|)
Grouping (())
Character classes ([] with support for ranges like [a-z])
Quantifiers (*, +, ?, {n}, {n,m})
Anchors (^, $)

Quantifiers

Quantifiers specify how many times a preceding element must occur for a match.

Supported Quantifiers

Quantifier	Meaning	Example
`*`	Match 0 or more times	`a*` matches "", "a", "aaa"
`+`	Match 1 or more times	`a+` matches "a", "aaa"
`?`	Match 0 or 1 time	`a?` matches "", "a"
`{n}`	Match exactly `n` times	`a{2}` matches "aa"
`{n,}`	Match at least `n` times	`a{2,}` matches "aa", "aaa"
`{n,m}`	Match between `n` and `m` times	`a{2,4}` matches "aa", "aaa", "aaaa"

Quantifier Modes

Quantifiers can operate in different modes:

Greedy: Matches as many occurrences as possible.
Lazy (? after quantifier): Matches as few occurrences as possible. E.g., searching "aaa" with a+? stops after the first "a", while the greedy a+ consumes all three.
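
As a rough illustration of the two modes, the sketch below runs a greedy and a lazy pattern over the same input, using only the constructor and search() described in this documentation. The variable names are illustrative, and the outputs mentioned afterwards follow conventional regex semantics rather than a recorded run.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// Same input, two quantifier modes.
let input = "aaa";
let greedy = Regex.Regex("a+", null);  // as many "a" as possible
let lazy = Regex.Regex("a+?", null);   // as few "a" as possible

switch (greedy.search(input)) {
  case (#ok(m)) Debug.print("greedy: " # m.value);
  case (#err(e)) Debug.print("error: " # debug_show(e));
};

switch (lazy.search(input)) {
  case (#ok(m)) Debug.print("lazy: " # m.value);
  case (#err(e)) Debug.print("error: " # debug_show(e));
};
```

Under conventional semantics the greedy search would report aaa and the lazy search a.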

Invalid Quantifiers

Certain quantifier patterns are not allowed:

Redundant modifiers, such as a{2}+ or a{2}*.
Empty quantifiers, e.g., {} or {,}.
Multiple commas in ranges, e.g., {2,,4}.

Metacharacters

Metacharacters represent special patterns or symbols.

Metacharacter	Meaning	Example
`.`	Match any character except `\n`	`a.b` matches "acb"
`\w`	Match word characters (alphanumeric + `_`)	`\w+` matches "abc123"
`\W`	Match non-word characters	`\W` matches "@"
`\d`	Match digits (`0-9`)	`\d+` matches "123"
`\D`	Match non-digits	`\D` matches "a"
`\s`	Match whitespace	`\s+` matches " "
`\S`	Match non-whitespace	`\S` matches "a"

Character Classes

Character classes allow matching sets of characters.

[abc]: Matches any character a, b, or c.
[^abc]: Matches any character except a, b, or c.
[a-z]: Matches any character in the range a to z.

Nested Quantifiers

A quantifier applied to a character class must be stated explicitly, and only one quantifier may follow the class. Stacked or redundant quantifiers, like [a-z]{2}+, are not allowed.

Anchors

Anchors specify positions in the text.

Anchor	Meaning	Example
`^`	Start of the string	`^abc` matches "abc" at the beginning
`$`	End of the string	`abc$` matches "abc" at the end
`\b`	Word boundary	`\bword\b` matches "word"
`\B`	Non-word boundary	`\Bword` matches "word" not at a boundary

Groups and Group Modifiers

Groups are enclosed in parentheses () and can be modified for specific behaviors.

Supported Group Modifiers

Modifier	Syntax	Meaning
Non-capturing	`(?:...)`	Groups without capturing
Positive Lookahead	`(?=...)`	Asserts that what follows matches
Negative Lookahead	`(?!...)`	Asserts that what follows does not match
Positive Lookbehind	`(?<=...)`	Asserts that what precedes matches
Negative Lookbehind	`(?<!...)`	Asserts that what precedes does not match

Escaped Characters

Escape sequences represent special characters.

Escape Sequence	Meaning
`\\`	Literal backslash
`\n`	Newline
`\t`	Tab
`\w`, `\W`	Word/Non-word characters
`\d`, `\D`	Digit/Non-digit
`\s`, `\S`	Whitespace/Non-whitespace

Invalid escape sequences throw an error.

Prohibited Patterns

Invalid group modifiers: e.g., (?).
Empty groups: () is not allowed.
Empty character classes: [] results in an error.
Redundant or conflicting quantifiers: a{2}+.
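
One way to see how these restrictions surface in practice is to compile a few prohibited patterns and print whatever error each one reports. This is a minimal sketch: it assumes that a failed compilation shows up as an #err value on the first call to search(), and it does not assert which error variant is printed for each pattern.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// Patterns taken from the prohibited list above.
let prohibited = ["(?)", "()", "[]", "a{2}+"];

for (p in prohibited.vals()) {
  let regex = Regex.Regex(p, null);
  // Any call on an uncompiled regex should return #err rather than a Match.
  switch (regex.search("aa")) {
    case (#ok(_)) Debug.print(p # " compiled");
    case (#err(e)) Debug.print(p # " -> " # debug_show(e));
  };
};
```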

Error Handling

The Motoko regex engine provides detailed error feedback to help developers identify and fix issues in their regular expressions. Below is a list of all possible errors, their meanings, and typical scenarios where they might occur.

Error Types

Error	Description	Cause
`#UnexpectedCharacter`	An invalid character was encountered during parsing.	Using a character that is not allowed in regex syntax, such as unescaped special characters.
`#UnexpectedEndOfInput`	The regex input ended unexpectedly, leaving constructs incomplete.	Omitting closing brackets, parentheses, or quantifier ranges.
`#GenericError`	A generic error message providing additional context.	Various syntax or logic errors not covered by specific error types.
`#InvalidQuantifierRange`	A malformed or invalid quantifier range was used.	Using invalid quantifier syntax, e.g., `{,}`, `{,3}`, `{a,b}`.
`#InvalidEscapeSequence`	An invalid escape sequence was encountered.	Using unrecognized escape sequences like `\q` or `\x` without proper syntax.
`#UnmatchedParenthesis`	A closing parenthesis `)` does not match any preceding opening parenthesis `(`.	Missing or extra closing parentheses in the regex pattern.
`#MismatchedParenthesis`	Parentheses do not form a valid pairing.	Nested parentheses are incorrectly matched or unbalanced, e.g., `((a)b])`.
`#UnexpectedToken`	An unexpected token was encountered during parsing.	Using misplaced or unrecognized tokens in the regex pattern.
`#UnclosedGroup`	A group construct is not properly closed with a closing parenthesis `)`.	Missing a closing parenthesis in a group definition.
`#InvalidQuantifier`	A quantifier is malformed or applied in an invalid context.	Using redundant or conflicting quantifiers, e.g., `a{2}+`.
`#EmptyExpression`	The regex input is empty or contains no valid expressions.	Providing an empty string or expression with no meaningful content.
`#NotCompiled`	The regex was not successfully compiled before an attempt to use it.	Compilation of the regex object failed, possibly because of any of the previous errors. That error will be specified in the `#NotCompiled` variant.

Flags

Flags in Motoko regex provide flexibility by modifying the behavior of the regex matching process. This section covers the available flags, their purpose, and how to use them effectively.

Overview

Flags are optional parameters that can alter how the regex engine interprets and processes a pattern. They enable features such as case-insensitive matching and handling multiline inputs.

Available Flags

1. CASE_SENSITIVE

Type: Bool
Default: true
Description: Determines whether the regex should consider case when matching characters.
Behavior:
- When caseSensitive = true: Matches are case-sensitive (default behavior).
- When caseSensitive = false: Matches are case-insensitive, treating uppercase and lowercase letters as equivalent.

Example:

let regex = Regex.Regex("abc", ?{caseSensitive = false});
let result = regex.search("ABC");
// Expected: #ok(m) with m.status == #FullMatch, since case is ignored.

2. MULTILINE

Type: Bool
Default: false
Description: Alters the behavior of anchors (^ and $) to match at line boundaries rather than the start or end of the entire string.
Behavior:
- When multiline = true:
  - ^ matches the start of any line.
  - $ matches the end of any line.
- When multiline = false (default behavior):
  - ^ matches the beginning of the entire input.
  - $ matches the end of the entire input.

Example:

let regex = Regex.Regex("^abc$", ?{multiline = true});
let result = regex.search("abc\ndef\nabc");
// Expected: #ok(m) with m.status == #FullMatch, since "abc" occupies a full line.

Combining Flags

You can combine flags to fine-tune the regex engine's behavior. For example:

let regex = Regex.Regex("abc", ?{caseSensitive = false; multiline = true});
let result = regex.search("ABC\ndef\nabc");
// Expected: #ok(m) with m.status == #FullMatch; the first line, "ABC", matches regardless of case.

In this example:

The pattern abc is matched regardless of case.
With multiline = true, the ^ and $ anchors would match at line boundaries; this pattern contains no anchors, so here the flag does not change which text matches.

Default Behavior Without Flags

If no flags are specified, the engine uses the following defaults:

caseSensitive = true
multiline = false

This means:

Matching is case-sensitive.
Anchors (^ and $) match only at the start and end of the entire input.

Best Practices

Use caseSensitive = false for patterns that need to ignore case differences, such as matching user inputs in a case-insensitive manner.
Use multiline = true when processing multi-line text, such as logs or formatted documents, where each line might have independent matching requirements.
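
The second recommendation can be sketched as follows: a line-anchored pattern is applied to a small multi-line log, and findAll() collects the hits. The log text and variable names are made up for illustration, and the count mentioned in the comment is what the documented multiline behavior implies, not a recorded result.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// Hypothetical log text; "\n" separates the lines.
let log = "ERROR disk full\nINFO started\nERROR timeout";

// With multiline = true, ^ may match at the start of every line.
let errorLine = Regex.Regex("^ERROR", ?{caseSensitive = true; multiline = true});

switch (errorLine.findAll(log)) {
  // Two lines begin with "ERROR", so two matches would be expected.
  case (#ok(matches)) Debug.print("matches: " # debug_show(matches.size()));
  case (#err(e)) Debug.print("error: " # debug_show(e));
};
```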

Functions

This section documents the functions currently available in this library. Each function has its own subsection below.

search()

Overview

The search() function scans an input string for the first occurrence of the regex pattern. Unlike match(), which requires the pattern to span the entire input, search() identifies the first substring that satisfies the pattern.
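
To make the contrast concrete, the following sketch applies one pattern to one input through both functions and prints the reported status. It relies only on the signatures given in this section; the variable names are illustrative. Given the description above, search() would be expected to report a match on the "aaa" substring, while match() would not, because the surrounding "xx" and "yy" are not covered by the pattern.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

let pattern = Regex.Regex("a+", null);
let input = "xxaaayy";

// search(): looks for the first substring that satisfies the pattern.
switch (pattern.search(input)) {
  case (#ok(m)) Debug.print("search: " # debug_show(m.status) # " " # m.value);
  case (#err(e)) Debug.print("search error: " # debug_show(e));
};

// match(): the pattern has to account for the whole input.
switch (pattern.match(input)) {
  case (#ok(m)) Debug.print("match: " # debug_show(m.status) # " " # m.value);
  case (#err(e)) Debug.print("match error: " # debug_show(e));
};
```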

Signature

public func search(text: Text): Result.Result<Match, RegexError>

Parameters

Parameter	Type	Description
text	Text	The input string to search for the first match

Return Value

Type: Result.Result<Match, RegexError>

Success Case

Returns a Match object containing:

The matched substring (value)
The position of the match within the input string
Captured groups (if any)

No Match Case

Returns a Match object with:

status = #NoMatch
Empty value

Error Case

Returns RegexError (#NotCompiled) only if the pattern failed to compile during instantiation

Behavior

Input Validation

If the regex instantiation failed (due to an invalid pattern), returns RegexError (#NotCompiled)

Search Process

Scans the input string character by character
Identifies if a potential match could begin at the current position
Delegates to match() for full matching starting from that position

Result Construction

On finding a match:
- Returns a Match object with details of the match
If no match is found after scanning the string:
- Returns a Match object with status = #NoMatch

Example Usage

1. Successful Match

Pattern: "a+" Input: "xxaaayy"

let pattern = Regex.Regex("a+", null);
let result = pattern.search("xxaaayy");
switch (result) {
    case (#ok(match)) Debug.print("First match: " # match.value);  // Output: "aaa"
    case (#err(error)) Debug.print("Error: " # debug_show(error));
};

Output:

First match: aaa

2. No Match Found

Pattern: "z+" Input: "xxaaaayy"

let pattern = Regex.Regex("z+", null);
let result = pattern.search("xxaaaayy");
switch (result) {
    case (#ok(match)) {
        switch (match.status) {
            case (#NoMatch) Debug.print("No match found.");
            case (#FullMatch) Debug.print("First match: " # match.value);
        };
    };
    case (#err(error)) Debug.print("Error: " # debug_show(error));
};

Output:

No match found.

3. Invalid Pattern

Scenario: Creating a regex with an invalid pattern

let pattern = Regex.Regex("[a-", null);
let result = pattern.search("xxaaaayy");
switch (result) {
    case (#ok(match)) Debug.print("First match: " # match.value);
    case (#err(error)) Debug.print("Error: " # debug_show(error)); // Output: #NotCompiled
};

Output:

Error: #NotCompiled

match()

Overview

The match() function is a core API for performing regex-based matching. It takes an input string and matches it against a precompiled regex represented as an NFA. The function handles matching mechanics, including state transitions, greedy and lazy quantifiers, and group captures.

Signature

public func match(text: Text): Result.Result<Match, RegexError>

Parameters

Parameter	Type	Description
`text`	`Text`	The input string to be matched against the compiled regex.

Return Value

Result.Result<Match, RegexError>:

On Success (Match):
- Contains details of the match, such as the matched substring, captured groups, and spans.
On Failure (RegexError):
- Indicates why the matching process failed (e.g., regex not compiled).

Behavior

Input Validation:
- Checks if the regex has been compiled.
- Returns #NotCompiled error if the regex is unavailable.
Matching Process:
- Delegates the actual matching logic to the matcher.match function.
- Traverses the NFA based on input characters.
- Respects greedy and lazy quantifier modes.
- Handles capture groups and anchors (e.g., ^, $).
Result Construction:
- Builds a Match object for successful matches.
- Returns RegexError for failures.

Example Usage

1. Successful Match

let pattern = Regex.Regex("h.*o", null);
let result = pattern.match("hello");

switch (result) {
  case (#ok(match)) {
    Debug.print("Matched value: " # match.value);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

Matched value: hello

2. No Match

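let pattern = Regex.Regex("z+", null);
let result = pattern.match("hello");

switch (result) {
  case (#ok(match)) {
    // A failed match is still returned as #ok; check the status field.
    switch (match.status) {
      case (#NoMatch) Debug.print("No match found.");
      case (#FullMatch) Debug.print("Matched value: " # match.value);
    };
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

No match found.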

Input Validation

Before matching, the function ensures the regex is compiled.
If nfa is null, the function returns:
```
#err(#NotCompiled)
```

Delegation to `matcher.match`

The compiled NFA, input text, and optional flags are passed to matcher.match.
matcher.match performs:
- State Transitions:
  - Moves between states in the NFA based on input characters.
- Greedy and Lazy Quantifiers:
  - Greedy quantifiers consume as much input as possible.
  - Lazy quantifiers stop at the first valid match.
- Capture Groups:
  - Tracks and extracts group matches.
- Anchors:
  - Ensures patterns anchored to the start (^) or end ($) are respected.

findAll()

Overview

The findAll() method returns an array of all non-overlapping matches of the regex pattern in the input text. Unlike findIter(), this method collects all matches at once into an array.

Signature

public func findAll(text: Text): Result.Result<[Match], RegexError>

Parameters

Parameter	Type	Description
text	Text	The input string to search for matches

Return Value

Type: Result.Result<[Match], RegexError>

Success Case

Returns an array of Match objects, where each contains:

The matched substring (value)
The position of the match within the input string
Any captured groups
Match status (#FullMatch)

Error Case

Returns RegexError (#NotCompiled) if the pattern failed to compile during instantiation

Match Collection Process

Starts from the beginning of the input string
Collects all non-overlapping matches into an array
Preserves the order of matches as they appear in the text
Returns an empty array if no matches are found
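
The collection process above might be used as in the sketch below, which gathers every lowercase word in a sentence and prints each value with its position. Only the findAll() signature and the Match fields documented here are used; the input text and variable names are illustrative.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

let word = Regex.Regex("[a-z]+", null);

switch (word.findAll("one two  three")) {
  case (#ok(matches)) {
    // Matches arrive in left-to-right order; the array is empty if nothing matched.
    Debug.print("count: " # debug_show(matches.size()));
    for (m in matches.vals()) {
      Debug.print(m.value # " at " # debug_show(m.position));
    };
  };
  case (#err(e)) Debug.print("error: " # debug_show(e));
};
```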

findIter()

Overview

The findIter() method returns an iterator that yields all non-overlapping matches of the regex pattern in the input text. This method is memory-efficient as it generates matches lazily instead of collecting them all at once like findAll().

Signature

public func findIter(text: Text): Result.Result<Iter.Iter<Match>, RegexError>

Parameters

Parameter	Type	Description
text	Text	The input string to search for matches

Return Value

Type: Result.Result<Iter.Iter<Match>, RegexError>

Success Case

Returns an iterator that yields Match objects, where each contains:

The matched substring (value)
The position of the match within the input string
Any captured groups
Match status (#FullMatch)

Error Case

Returns RegexError (#NotCompiled) if the pattern failed to compile during instantiation

Iteration Process

Starts from the beginning of the input string
For each match found:
- Yields a Match object
- Advances to the position after the current match
Continues until no more matches are found
Automatically handles the internal state between iterations

Match Generation

Matches are generated one at a time as the iterator is consumed
Non-overlapping matches are guaranteed
The iteration order follows the text from left to right
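
Because matches are produced on demand, a caller can stop early without paying for the remaining ones. The sketch below illustrates this under the signature given above; the input text, the label name, and the stopping rule are arbitrary choices made for the example.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

let digits = Regex.Regex("\\d+", null);

switch (digits.findIter("a1 b22 c333 d4444")) {
  case (#ok(matches)) {
    // Each pass through the loop asks the iterator for one more match.
    label scan for (m in matches) {
      Debug.print(m.value);
      // Stop at the first match with two or more characters; later matches are never generated.
      if (m.value.size() >= 2) break scan;
    };
  };
  case (#err(e)) Debug.print("error: " # debug_show(e));
};
```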

replace()

The replace() function substitutes matches in the input string with a specified replacement string. This function allows specifying a maximum number of replacements (maxReplacements).

Signature

public func replace(text: Text, replacement: Text, maxReplacements: ?Nat): Result.Result<Text, RegexError>

Parameters

Parameter	Type	Description
`text`	`Text`	The input string to perform replacements on.
`replacement`	`Text`	The string to replace matches with.
`maxReplacements`	`?Nat`	Optional limit on the number of replacements. If `null`, replaces all.

Return Value

Result.Result<Text, RegexError>:

On Success (Text):
- The updated string after performing replacements.
On Failure (RegexError):
- Indicates why the operation failed (e.g., invalid regex or replacement).

Behavior

Input Validation:
- Checks if the regex has been compiled.
- Returns #NotCompiled error if unavailable.
Replacement Process:
- Matches the regex pattern in the input string.
- Replaces each match with the specified string.
- Respects the maxReplacements limit if provided.
Result Construction:
- Returns the updated string on success.
- Returns RegexError for invalid inputs or uncompiled regex.

Example Usage

1. Replace All Matches

let replaceRegex = Regex.Regex("Hello", null);
let result = replaceRegex.replace("Hello world, Hello universe", "Hi", null);

switch (result) {
  case (#ok(updatedText)) {
    Debug.print("Updated text: " # updatedText);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

Updated text: Hi world, Hi universe

2. Replace with Limit

let replaceRegex = Regex.Regex("Hello", null);
let result = replaceRegex.replace("Hello world, Hello universe", "Hi", ?1);

switch (result) {
  case (#ok(updatedText)) {
    Debug.print("Updated text: " # updatedText);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

Updated text: Hi world, Hello universe

sub()

The sub() function substitutes matches in the input string with a specified replacement string. Unlike replace, it allows the use of regex patterns in the replacement string.

Signature

public func sub(text: Text, replacement: Text, maxSubstitutions: ?Nat): Result.Result<Text, RegexError>

Parameters

Parameter	Type	Description
`text`	`Text`	The input string to perform substitutions on.
`replacement`	`Text`	The string (can include regex) to replace matches with.
`maxSubstitutions`	`?Nat`	Optional limit on the number of substitutions. If `null`, substitutes all.

Return Value

Result.Result<Text, RegexError>:

On Success (Text):
- The updated string after performing substitutions.
On Failure (RegexError):
- Indicates why the operation failed (e.g., invalid regex or replacement).

Behavior

Input Validation:
- Checks if the regex has been compiled.
- Returns #NotCompiled error if unavailable.
Substitution Process:
- Matches the regex pattern in the input string.
- Substitutes each match with the specified replacement string.
- Respects the maxSubstitutions limit if provided.
Result Construction:
- Returns the updated string on success.
- Returns RegexError for invalid inputs or uncompiled regex.

Implementation

public func sub(text: Text, replacement: Text, maxReplacements: ?Nat): Result.Result<Text, RegexError> {
    switch (nfa) {
        case (null) #err(#NotCompiled);
        case (?compiledNFA) {
            matcher.sub(compiledNFA, text, replacement, maxReplacements)
        };
    }
};

Example Usage

1. Substituting All Matches

let subRegex = Regex.Regex("\\d+", null);
let result = subRegex.sub("I have 10 bananas and 20 apples", "many", null);

switch (result) {
  case (#ok(updatedText)) {
    Debug.print("Updated text: " # updatedText);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

Updated text: I have many bananas and many apples

2. Substituting with Limit

let subRegex = Regex.Regex("\\d+", null);
let result = subRegex.sub("I have 10 bananas and 20 apples", "many", ?1);

switch (result) {
  case (#ok(updatedText)) {
    Debug.print("Updated text: " # updatedText);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

Updated text: I have many bananas and 20 apples

split()

The split() function divides a string into substrings based on matches of the regex pattern, with an optional limit on the number of splits (maxSplit).

Signature

public func split(text: Text, maxSplit: ?Nat): Result.Result<[Text], RegexError>

Parameters

Parameter	Type	Description
`text`	`Text`	The input string to be split based on the regex pattern.
`maxSplit`	`?Nat`	Optional limit on the number of splits. If `null`, splits all.

Return Value

Result.Result<[Text], RegexError>:

On Success ([Text]):
- An array of substrings resulting from splitting the input string.
On Failure (RegexError):
- Indicates why the operation failed (e.g., invalid regex).

Behavior

Input Validation:
- Checks if the regex has been compiled.
- Returns #NotCompiled error if unavailable.
Splitting Process:
- Matches the regex pattern in the input string.
- Divides the string at each match, respecting the maxSplit limit if provided.
- Handles edge cases (e.g., no matches or empty input).
Result Construction:
- Returns an array of substrings on success.
- Returns RegexError for invalid inputs or uncompiled regex.

Example Usage

1. Splitting Without Limit

let splitRegex = Regex.Regex(",", null);
let result = splitRegex.split("one,two,three", null);

switch (result) {
  case (#ok(parts)) {
    Debug.print("Split result: " # debug_show(parts));
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

Split result: ["one", "two", "three"]

2. Splitting With Limit

let splitRegex = Regex.Regex(",", null);
let result = splitRegex.split("one,two,three", ?1);

switch (result) {
  case (#ok(parts)) {
    Debug.print("Split result: " # debug_show(parts));
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}

Output:

Split result: ["one", "two,three"]

inspectRegex()

Returns a detailed text representation of the NFA structure for a compiled regular expression pattern.

Overview

The inspectRegex() function allows you to examine the internal NFA (Non-deterministic Finite Automaton) structure of a compiled regular expression. This is particularly useful for debugging complex patterns and understanding how the regex engine processes matches.

Return Value

Type: Result.Result<Text, RegexError>
Success: Returns #ok(Text) with formatted NFA representation
Error: Returns #err(#NotCompiled) if regex is not compiled

// Create regex for basic SSN pattern: ^\d{3}-\d{2}-\d{4}$
let regex = Regex.Regex("^\\d{3}-\\d{2}-\\d{4}$", null);

// Inspect the NFA structure
switch(regex.inspectRegex()) {
    case (#ok(nfa)) Debug.print(nfa);
    case (#err(e)) Debug.print("Error: Not compiled");
};

Example Output

=== NFA State Machine ===
Initial State → 0
Accept States → [11]

=== Transitions ===
From State 0:
  #Range('0', '9') → 1 (#Greedy)

From State 1:
  #Range('0', '9') → 2 (#Greedy)

From State 2:
  #Range('0', '9') → 3 (#Greedy)

From State 3:
  #Char('-') → 4

From State 4:
  #Range('0', '9') → 5 (#Greedy)

From State 5:
  #Range('0', '9') → 6 (#Greedy)

From State 6:
  #Char('-') → 7

From State 7:
  #Range('0', '9') → 8 (#Greedy)

From State 8:
  #Range('0', '9') → 9 (#Greedy)

From State 9:
  #Range('0', '9') → 10 (#Greedy)

From State 10:
  #Range('0', '9') → 11 (#Greedy)

=== Assertions ===
Anchor: {aType = #StartOfString; position = 0}
Anchor: {aType = #EndOfString; position = 11}

Pattern Breakdown

The NFA structure represents the SSN pattern ^\d{3}-\d{2}-\d{4}$ as follows:

Start anchor at position 0
States 0-2: First three digits (\d{3}) with greedy quantifiers
State 3: Literal hyphen (-)
States 4-5: Next two digits (\d{2}) with greedy quantifiers
State 6: Second hyphen (-)
States 7-10: Final four digits (\d{4}) with greedy quantifiers
State 11: Accept state
End anchor at position 11

inspectState(state: Types.State)

Examines and returns the transitions for a specific state in the NFA (Non-deterministic Finite Automaton) structure.

Overview

The inspectState() function allows you to inspect the transitions from any particular state in the compiled regex pattern. This is useful for debugging specific parts of pattern matching behavior.

Parameters

state: Nat - The state number to inspect

Return Value

Type: Result.Result<Text, RegexError>
Success: Returns #ok(Text) with formatted transitions from the specified state
Error: Returns #err(#NotCompiled) if regex is not compiled

Using the same SSN pattern (^\d{3}-\d{2}-\d{4}$), let's inspect state 3, which handles the first hyphen:

let regex = Regex.Regex("^\\d{3}-\\d{2}-\\d{4}$", null);

switch(regex.inspectState(3)) {
    case (#ok(transitions)) Debug.print(transitions);
    case (#err(e)) Debug.print("Error: Not compiled");
};

Example Output

From State 3:
  #Char('-') → 4

This shows the transitions available from state 3 in the NFA. There is only one: the hyphen character, which leads to state 4.

enableDebug(b: Bool)

Toggles debug logging for the regex matcher. When enabled, it provides detailed logging of the pattern matching process.

Parameters

b: Bool - Set to true to enable debug logging, false to disable it

Match

Overview

The Match record represents the result of a pattern matching operation on text, typically used in regular expression or string matching operations. It contains detailed information about what was matched, where it was found, and any captured groups within the match.

Structure

public type Match = {
    string: Text;
    value: Text;
    status: {
        #FullMatch;
        #NoMatch;
    };
    position: (Nat, Nat);
    capturedGroups: ?[(Text,Nat)];
    spans: [(Nat, Nat)];
    lastIndex: Nat;
};

Field Descriptions

string: Text

The original input string that was searched for matches. This field preserves the complete text that was analyzed during the matching operation.

value: Text

The actual matched substring found in the original text. This represents the specific portion of text that satisfied the matching criteria.

status

An enumerated type that indicates the match result:

#FullMatch: Indicates a successful match was found
#NoMatch: Indicates no match was found in the input string

position: (Nat, Nat)

A tuple containing the start and end indices of the match in the original string:

First value: Starting index of the match
Second value: Ending index of the match

capturedGroups: ?[(Text,Nat)]

An optional array of tuples containing captured groups from the match:

Each tuple contains:
- Text: The captured text
- Nat: The index of the captured group
null if no groups were captured

spans: [(Nat, Nat)]

An array of tuples representing the character spans of the respective match

lastIndex: Nat

The index where the last match ended. This is particularly useful when performing multiple sequential matches on the same text.

Usage Examples

Basic Match Checking

if (matchResult.status == #FullMatch) {
    Debug.print("Found match: " # matchResult.value);
} else {
    Debug.print("No match found");
};

Working with Captured Groups

switch (matchResult.capturedGroups) {
    case (null) { Debug.print("No groups captured") };
    case (?groups) {
        for ((text, index) in groups.vals()) {
            Debug.print("Group " # debug_show(index) # ": " # text);
        };
    };
};

Unicode Support and Unicode Properties

Introduction

The Motoko Regex Engine supports Unicode properties, allowing users to match specific character categories using \p{Property} and \P{Property} syntax. This enhances pattern matching by enabling character classification based on Unicode properties.

Syntax

Unicode properties can be matched using the following syntax:

\p{Property}   // Matches a character with the specified Unicode property
\P{Property}   // Matches a character that does NOT have the specified Unicode property

Example

\p{L}   // Matches any letter
\p{N}   // Matches any number
\P{P}   // Matches any character except punctuation

Supported Unicode Properties

The engine supports a subset of Unicode properties:

L (Letter)
Ll (Lowercase Letter)
Lu (Uppercase Letter)
N (Number)
P (Punctuation)
Zs (Separator, Space)
Emoji (Emoji characters)
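
As a small sketch of the syntax in use, the block below collects the uppercase letters in a string with \p{Lu}. Note the doubled backslash needed inside a Motoko text literal. The input and variable names are illustrative, and the letters one would expect to see printed (H and W) follow from the property's definition rather than from a recorded run.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// \p{Lu} written as a Motoko text literal.
let upper = Regex.Regex("\\p{Lu}", null);

switch (upper.findAll("Hello World")) {
  case (#ok(matches)) {
    // Print each uppercase letter that was found.
    for (m in matches.vals()) Debug.print(m.value);
  };
  case (#err(e)) Debug.print("error: " # debug_show(e));
};
```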

Backreferences

Introduction

Backreferences allow a regular expression to match repeated substrings by referring back to a previously captured group. They enable patterns to enforce consistency within the matched text.

Syntax

Backreferences use the following syntax:

\1, \2, ...: Refer to numbered capturing groups in order of appearance; they can only be used in the replace and sub methods.
\k<name>: Refers to a named capturing group and can be used within matching.

Example

(?<greeting>hello)\s+\k<greeting>

This matches hello, followed by one or more whitespace characters, followed by hello again.
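
A minimal sketch of the named-backreference pattern in Motoko follows. It uses only the constructor and search() documented earlier; the input sentence and variable names are illustrative, and backslashes are doubled because the pattern is written as a Motoko text literal.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// (?<greeting>hello)\s+\k<greeting> as a Motoko text literal.
let repeated = Regex.Regex("(?<greeting>hello)\\s+\\k<greeting>", null);

switch (repeated.search("well, hello   hello there")) {
  // The status shows whether the repeated word was found; value holds the matched text.
  case (#ok(m)) Debug.print(debug_show(m.status) # ": " # m.value);
  case (#err(e)) Debug.print("error: " # debug_show(e));
};
```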

Lookaround Assertions

Introduction

Lookaround assertions allow for pattern matching based on surrounding text without including that text in the final match. They enable advanced pattern constraints while maintaining flexibility in regex processing. However, in many cases, using a stricter pattern without lookarounds can be a better approach, leading to simpler and more efficient regex expressions.

Lookaround Types

Lookaround assertions are divided into Lookahead and Lookbehind, and each has Positive and Negative variations. The following table summarizes them:

Type	Positive	Negative
Look-Ahead	`A(?=B)` - Match `A` if `B` follows	`A(?!B)` - Match `A` if `B` does not follow
Look-Behind	`(?<=B)A` - Match `A` if `B` precedes	`(?<!B)A` - Match `A` if `B` does not precede

Lookahead

Positive Lookahead (?=): Ensures a pattern exists after the current position without consuming it.
Negative Lookahead (?!): Ensures a pattern does not exist after the current position.

foo(?=bar)   // Matches "foo" only if followed by "bar"
foo(?!bar)   // Matches "foo" only if NOT followed by "bar"

Lookbehind

Positive Lookbehind (?<=): Ensures a pattern exists before the current position.
Negative Lookbehind (?<!): Ensures a pattern does not exist before the current position.

(?<=bar)foo   // Matches "foo" only if preceded by "bar"
(?<!bar)foo   // Matches "foo" only if NOT preceded by "bar"

Behavior and Considerations

Lookaround assertions do not consume characters; they only assert conditions.
Combining lookahead and lookbehind can create complex matching rules.
Negative lookaround can be used to enforce exclusions in matching.
Lookbehind patterns must have a fixed length.
In many cases, defining a stricter pattern instead of relying on lookaround assertions results in a more efficient and readable regex.
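
The non-consuming behavior can be sketched with a positive lookahead that picks out a number only when it is followed by a unit. The pattern, input, and variable names are illustrative. Under conventional lookahead semantics the reported value would be 100, with the px suffix left out of the match and 42 skipped because no px follows it.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// \d+(?=px): digits that are immediately followed by "px".
let pixels = Regex.Regex("\\d+(?=px)", null);

switch (pixels.search("count: 42, width: 100px")) {
  // The lookahead only asserts that "px" follows; it is not part of value.
  case (#ok(m)) Debug.print(debug_show(m.status) # ": " # m.value);
  case (#err(e)) Debug.print("error: " # debug_show(e));
};
```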

Conclusion

Lookaround assertions provide powerful matching capabilities without consuming characters. However, in most cases, defining a stricter pattern can lead to better performance and clarity. The Motoko Regex Engine supports both lookahead and lookbehind assertions, but users should consider whether they can achieve the same result with a more precise pattern before resorting to lookarounds.

Regex Examples for Motoko Regex Engine

Internet Computer Identifiers

Principal ID

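This pattern validates the textual Principal ID format made of four groups of five lowercase alphanumeric characters followed by a group of three, separated by hyphens.

// Principals of any other length, including the anonymous principal, are rejected.
let principalPattern = Regex.Regex("^[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{3}$", null);

// Returns true only when the whole input matches the pattern.
// Note: `public` is only valid inside an actor, module, or object body.
public func validatePrincipalId(id: Text): Bool {
    switch(principalPattern.match(id)) {
        case (#ok(result)) {
            switch(result.status) {
                case (#FullMatch) true;
                case (#NoMatch) false;
            };
        };
        // Treat engine errors as a failed validation.
        case (#err(_)) false;
    };
};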

Account ID

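This pattern validates the Account ID format: exactly 64 lowercase hexadecimal characters.

// Account ID (32 bytes written as 64 hexadecimal characters).
// With the default flags matching is case-sensitive, so uppercase A-F is rejected.
let accountPattern = Regex.Regex("^[0-9a-f]{64}$", null);

// Returns true only when the whole input matches the pattern.
public func validateAccountId(id: Text): Bool {
    switch(accountPattern.match(id)) {
        case (#ok(result)) {
            switch(result.status) {
                case (#FullMatch) true;
                case (#NoMatch) false;
            };
        };
        // Treat engine errors as a failed validation.
        case (#err(_)) false;
    };
};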
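
As a quick sanity check of the two patterns above, the following sketch runs a few sample inputs through them. The helper `isFullMatch` and all sample strings are illustrative assumptions, not part of the library and not real identifiers; the only engine calls used are `Regex.Regex(pattern, null)` and `match`, as in the validators above.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// Hypothetical helper (not part of the library): true only on a full match.
func isFullMatch(pattern : Text, text : Text) : Bool {
  switch (Regex.Regex(pattern, null).match(text)) {
    case (#ok(result)) {
      switch (result.status) {
        case (#FullMatch) true;
        case (#NoMatch) false;
      };
    };
    case (#err(_)) false;
  };
};

// The same patterns as in the validators above.
let principalShape = "^[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{3}$";
let accountShape = "^[0-9a-f]{64}$";

// Made-up sample: 16 hex characters repeated four times gives 64 characters.
let hex16 = "0123456789abcdef";
let sampleAccount = hex16 # hex16 # hex16 # hex16;

// Input with the 5-5-5-5-3 shape.
Debug.print(debug_show (isFullMatch(principalShape, "aaaaa-bbbbb-ccccc-ddddd-eee")));
// Input that is too short for the pattern.
Debug.print(debug_show (isFullMatch(principalShape, "abcde-fgh")));
// 64 lowercase hex characters.
Debug.print(debug_show (isFullMatch(accountShape, sampleAccount)));
// Only 16 characters, so the length requirement is not met.
Debug.print(debug_show (isFullMatch(accountShape, hex16)));
```

Keeping the pattern strings in one place and reusing a single helper makes it easy to add further sample inputs when a pattern changes.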

Contributing to Motoko Regex Engine

Thank you for your interest in contributing to the Motoko Regex Engine project! This guide will help you get started with contributing to the project.

Getting Started

Fork the repository:
- Visit https://github.com/Demali-876/motoko_regex_engine
- Click the "Fork" button in the top-right corner

Clone your fork:

git clone https://github.com/YOUR-USERNAME/motoko_regex_engine.git
cd motoko_regex_engine

Add the upstream repository:

git remote add upstream https://github.com/Demali-876/motoko_regex_engine.git

Create a new branch for your work:
```
git checkout -b feat/your-feature-name
```

Development Process

Set up your development environment:
- Install the DFINITY SDK (dfx)
- Install Node.js and npm
- Run npm install to install dependencies

Make your changes:
- Keep your changes focused and concise
- Update documentation if needed

Keep your fork up to date:

git fetch upstream
git rebase upstream/main

Commit Guidelines

We use Conventional Commits for clear and standardized commit messages. Each commit message should be structured as follows:

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

Types:

feat: A new feature
fix: A bug fix
docs: Documentation only changes
style: Changes that do not affect the meaning of the code (whitespace, formatting, and similar)
refactor: A code change that neither fixes a bug nor adds a feature
perf: A code change that improves performance
test: Adding missing tests or correcting existing tests
chore: Changes to the build process or auxiliary tools

Examples:

feat(parser): add support for lookahead assertions

fix(matcher): resolve infinite loop in nested groups

docs: update API documentation for search method

refactor(compiler): simplify NFA construction logic

Pull Request Process

Push your changes to your fork:
```
git push origin feat/your-feature-name
```

Create a Pull Request:
- Go to https://github.com/Demali-876/motoko_regex_engine/pulls
- Click "New Pull Request"
- Select your fork and branch
- Fill out the PR template with:
  - Clear description of changes
  - Any breaking changes
  - Evidence of testing (screenshots, console output, etc.)

PR Review Process:
- Maintainers will review your PR
- Address any requested changes
- Once approved, your PR will be merged

Testing Requirements

While formal tests are not required, you must provide evidence that you've tested your changes. This can include:

Screenshots of the feature working
Console output showing successful execution
Example usage and results
Description of test cases you've tried

Example test evidence in a PR:

Tested the new alternation operator with:
1. Simple patterns: "a|b" against "a" and "b"
2. Complex patterns: "(foo|bar)+" against "foofoobar"
3. Edge cases: "a||b" and "|a|b|"

Results:
- All patterns matched correctly
- No infinite loops or crashes
- Proper error handling for invalid patterns

Thank you for contributing to the Motoko Regex Engine project!