Introduction
Welcome to the Motoko Regex Engine Documentation, your go-to guide for leveraging the power of regular expressions in the Motoko programming language. This engine provides robust tools for pattern matching, searching, and text processing.
Inspired by other established regex libraries, this regex engine adapts their capabilities to meet the needs of Motoko.
Installation and Import
Install the Motoko Regex Engine using:
mops add regex
Import it into your project with:
import Regex "mo:regex";
What is a Regular Expression?
A regular expression (regex) is a sequence of characters defining a search pattern. Regex is widely used in text processing for tasks such as:
- Searching for text patterns (e.g., keywords in a document).
- Validating formats (e.g., email addresses or phone numbers).
- Extracting data from structured text (e.g., logs or CSV files).
For example, the regex ^\d{3}-\d{2}-\d{4}$
matches a string formatted as a Social Security Number, such as 123-45-6789
.
Key Features
Pattern Support
- Anchors:
^
(start of string),$
(end of string). - Character classes:
[a-z]
,[^0-9]
. - Quantifiers:
*
,+
,?
,{m,n}
. - Groups:
()
,(?:)
. - Alternation:
|
(logical OR). - Escapes:
\d
,\w
,\s
, etc.
Flags
Flags are optional boolean values that modify the behavior of regex matching. They are set during instantiation and cannot be changed afterward. The engine currently supports:
caseSensitive
: Case-sensitive matching (default is true).multiline
: Enables multiline matching.
Example with null flags (default behavior):
let regex = Regex.Regex("\d{3}-\d{2}-\d{4}", null);
API Functions
match
: Check for a full match of the pattern in the input text.search
: Locate the first occurrence of the pattern in the input.findAll
: Retrieve all matches for the pattern.findIter
: Iterate over matches.
Syntax
Supported Syntax
Motoko regex supports a variety of syntax features for defining patterns. These include:
- Character matching (
a
,b
,c
, etc.) - Alternation (
|
) - Grouping (
()
) - Character classes (
[]
with support for ranges like[a-z]
) - Quantifiers (
*
,+
,?
,{n}
,{n,m}
) - Anchors (
^
,$
)
Quantifiers
Quantifiers specify how many times a preceding element must occur for a match.
Supported Quantifiers
Quantifier | Meaning | Example |
---|---|---|
* | Match 0 or more times | a* matches "", "a", "aaa" |
+ | Match 1 or more times | a+ matches "a", "aaa" |
? | Match 0 or 1 time | a? matches "", "a" |
{n} | Match exactly n times | a{2} matches "aa" |
{n,} | Match at least n times | a{2,} matches "aa", "aaa" |
{n,m} | Match between n and m times | a{2,4} matches "aa", "aaa", "aaaa" |
Quantifier Modes
Quantifiers can operate in different modes:
- Greedy: Matches as many occurrences as possible.
- Lazy (
?
after quantifier): Matches as few as possible. E.g.,a+?
matches fewer occurrences of "a".
Invalid Quantifiers
Certain quantifier patterns are not allowed:
- Redundant modifiers, such as
a{2}+
ora{2}*
. - Empty quantifiers, e.g.,
{}
or{,}
. - Multiple commas in ranges, e.g.,
{2,,4}
.
Metacharacters
Metacharacters represent special patterns or symbols.
Metacharacter | Meaning | Example |
---|---|---|
. | Match any character except \n | a.b matches "acb" |
\w | Match word characters (alphanumeric + _ ) | \w+ matches "abc123" |
\W | Match non-word characters | \W matches "@" |
\d | Match digits (0-9 ) | \d+ matches "123" |
\D | Match non-digits | \D matches "a" |
\s | Match whitespace | \s+ matches " " |
\S | Match non-whitespace | \S matches "a" |
Character Classes
Character classes allow matching sets of characters.
[abc]
: Matches any charactera
,b
, orc
.[^abc]
: Matches any character excepta
,b
, orc
.[a-z]
: Matches any character in the rangea
toz
.
Nested Quantifiers
Quantifiers inside character classes must be explicitly defined. Nested or redundant quantifiers, like [a-z]{2}+
, are not allowed.
Anchors
Anchors specify positions in the text.
Anchor | Meaning | Example |
---|---|---|
^ | Start of the string | ^abc matches "abc" at the beginning |
$ | End of the string | abc$ matches "abc" at the end |
\b | Word boundary | \bword\b matches "word" |
\B | Non-word boundary | \Bword matches "word" not at a boundary |
Groups and Group Modifiers
Groups are enclosed in parentheses ()
and can be modified for specific behaviors.
Supported Group Modifiers
Modifier | Syntax | Meaning |
---|---|---|
Non-capturing | (?:...) | Groups without capturing |
Positive Lookahead | (?=...) | Asserts that what follows matches |
Negative Lookahead | (?!...) | Asserts that what follows does not match |
Positive Lookbehind | (?<=...) | Asserts that what precedes matches |
Negative Lookbehind | (?<!...) | Asserts that what precedes does not match |
Escaped Characters
Escape sequences represent special characters.
Escape Sequence | Meaning |
---|---|
\\ | Literal backslash |
\n | Newline |
\t | Tab |
\w , \W | Word/Non-word characters |
\d , \D | Digit/Non-digit |
\s , \S | Whitespace/Non-whitespace |
Invalid escape sequences throw an error.
Prohibited Patterns
- Invalid group modifiers: e.g.,
(?)
. - Empty groups:
()
is not allowed. - Empty character classes:
[]
results in an error. - Redundant or conflicting quantifiers:
a{2}+
.
Error Handling
The Motoko regex engine provides detailed error feedback to help developers identify and fix issues in their regular expressions. Below is a list of all possible errors, their meanings, and typical scenarios where they might occur.
Error Types
Error | Description | Cause |
---|---|---|
#UnexpectedCharacter | An invalid character was encountered during parsing. | Using a character that is not allowed in regex syntax, such as unescaped special characters. |
#UnexpectedEndOfInput | The regex input ended unexpectedly, leaving constructs incomplete. | Omitting closing brackets, parentheses, or quantifier ranges. |
#GenericError | A generic error message providing additional context. | Various syntax or logic errors not covered by specific error types. |
#InvalidQuantifierRange | A malformed or invalid quantifier range was used. | Using invalid quantifier syntax, e.g., {,} , {,3} , {a,b} . |
#InvalidEscapeSequence | An invalid escape sequence was encountered. | Using unrecognized escape sequences like \q or \x without proper syntax. |
#UnmatchedParenthesis | A closing parenthesis ) does not match any preceding opening parenthesis ( . | Missing or extra closing parentheses in the regex pattern. |
#MismatchedParenthesis | Parentheses do not form a valid pairing. | Nested parentheses are incorrectly matched or unbalanced, e.g., ((a)b]) . |
#UnexpectedToken | An unexpected token was encountered during parsing. | Using misplaced or unrecognized tokens in the regex pattern. |
#UnclosedGroup | A group construct is not properly closed with a closing parenthesis ) . | Missing a closing parenthesis in a group definition. |
#InvalidQuantifier | A quantifier is malformed or applied in an invalid context. | Using redundant or conflicting quantifiers, e.g., a{2}+ . |
#EmptyExpression | The regex input is empty or contains no valid expressions. | Providing an empty string or expression with no meaningful content. |
#NotCompiled | The regex has not been compiled before attempting to use it. | There was an error during compilation of the reject object, this may be due to any of the previous errors. That error will be specified in the #NotCompiled variant. |
Flags
Flags in Motoko regex provide flexibility by modifying the behavior of the regex matching process. This section covers the available flags, their purpose, and how to use them effectively.
Overview
Flags are optional parameters that can alter how the regex engine interprets and processes a pattern. They enable features such as case-insensitive matching and handling multiline inputs.
Available Flags
1. CASE_SENSITIVE
-
Type:
Bool
-
Default:
true
-
Description: Determines whether the regex should consider case when matching characters.
-
Behavior:
- When
caseSensitive = true
: Matches are case-sensitive (default behavior). - When
caseSensitive = false
: Matches are case-insensitive, treating uppercase and lowercase letters as equivalent.
- When
-
Example:
let regex = Regex.Regex("abc", ?{caseSensitive = false}); assert(regex.search("ABC") == #FullMatch);
2. MULTILINE
-
Type:
Bool
-
Default:
false
-
Description: Alters the behavior of anchors (
^
and$
) to match at line boundaries rather than the start or end of the entire string. -
Behavior:
- When
multiline = true
:^
matches the start of any line.$
matches the end of any line.
- When
multiline = false
(default behavior):^
matches the beginning of the entire input.$
matches the end of the entire input.
- When
-
Example:
let regex = Regex.Regex("^abc$", ?{multiline = true}); assert(regex.search("abc\ndef\nabc") == #FullMatch);
Combining Flags
You can combine flags to fine-tune the regex engine's behavior. For example:
let regex = Regex.Regex("abc", ?{caseSensitive = false; multiline = true});
assert(regex.search("ABC\ndef\nabc") == #FullMatch);
In this example:
- The pattern
abc
is matched regardless of case. - The engine processes each line independently due to
multiline = true
.
Default Behavior Without Flags
If no flags are specified, the engine uses the following defaults:
caseSensitive = true
multiline = false
This means:
- Matching is case-sensitive.
- Anchors (
^
and$
) match only at the start and end of the entire input.
Best Practices
- Use
caseSensitive = false
for patterns that need to ignore case differences, such as matching user inputs in a case-insensitive manner. - Use
multiline = true
when processing multi-line text, such as logs or formatted documents, where each line might have independent matching requirements.
Functions
This section provides an overview of the (current) functions available in this library. Use the links below to navigate to detailed documentation for each function.
Overview
The search()
function scans an input string for the first occurrence of the regex pattern. Unlike match()
, which requires the pattern to span the entire input, search()
identifies the first substring that satisfies the pattern.
Signature
public func search(text: Text): Result.Result<Match, RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to search for the first match |
Return Value
Type: Result.Result<Match, RegexError>
Success Case
Returns a Match
object containing:
- The matched substring (
value
) - The position of the match within the input string
- Captured groups (if any)
No Match Case
Returns a Match
object with:
status = #NoMatch
- Empty
value
Error Case
Returns RegexError
(#NotCompiled
) only if the pattern failed to compile during instantiation
Behavior
Input Validation
- If the regex instantiation failed (due to an invalid pattern), returns
RegexError
(#NotCompiled
)
Search Process
- Scans the input string character by character
- Identifies if a potential match could begin at the current position
- Delegates to
match()
for full matching starting from that position
Result Construction
- On finding a match:
- Returns a
Match
object with details of the match
- Returns a
- If no match is found after scanning the string:
- Returns a
Match
object withstatus = #NoMatch
- Returns a
Example Usage
1. Successful Match
Pattern: "a+"
Input: "xxaaayy"
let pattern = Regex.Regex("a+", null);
let result = pattern.search("xxaaayy");
switch (result) {
case (#ok(match)) Debug.print("First match: " # match.value); // Output: "aaa"
case (#err(error)) Debug.print("Error: " # debug_show(error));
};
Output:
First match: aaa
2. No Match Found
Pattern: "z+"
Input: "xxaaaayy"
let pattern = Regex.Regex("z+", null);
let result = pattern.search("xxaaaayy");
switch (result) {
case (#ok(match)) {
switch (match.status) {
case (#NoMatch) Debug.print("No match found.");
case (#FullMatch) Debug.print("First match: " # match.value);
};
};
case (#err(error)) Debug.print("Error: " # debug_show(error));
};
Output:
No match found.
3. Invalid Pattern
Scenario: Creating a regex with an invalid pattern
let pattern = Regex.Regex("[a-");
let result = pattern.search("xxaaaayy");
switch (result) {
case (#ok(match)) Debug.print("First match: " # match.value);
case (#err(error)) Debug.print("Error: " # debug_show(error)); // Output: #NotCompiled
};
Output:
Error: #NotCompiled
Overview
The match()
function is a core API for performing regex-based matching. It takes an input string and matches it against a precompiled regex represented as an NFA. The function handles matching mechanics, including state transitions, greedy and lazy quantifiers, and group captures.
Signature
public func match(text: Text): Result.Result<Match, RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to be matched against the compiled regex. |
Return Value
Result.Result<Match, RegexError>
:
- On Success (
Match
):- Contains details of the match, such as the matched substring, captured groups, and spans.
- On Failure (
RegexError
):- Indicates why the matching process failed (e.g., regex not compiled).
Behavior
-
Input Validation:
- Checks if the regex has been compiled.
- Returns
#NotCompiled
error if the regex is unavailable.
-
Matching Process:
- Delegates the actual matching logic to the
matcher.match
function. - Traverses the NFA based on input characters.
- Respects greedy and lazy quantifier modes.
- Handles capture groups and anchors (e.g.,
^
,$
).
- Delegates the actual matching logic to the
-
Result Construction:
- Builds a
Match
object for successful matches. - Returns
RegexError
for failures.
- Builds a
Example Usage
1. Successful Match
let pattern = Regex.Regex("h.*o",null);
let result = pattern.match("hello");
switch (result) {
case (#ok(match)) {
Debug.print("Matched value: " # match.value);
};
case (#err(error)) {
Debug.print("Error: " # debug_show(error));
};
}
Output:
Matched value: hello
2. No Match
let pattern = Regex.Regex("z+",null);
let result = pattern.match("hello");
switch (result) {
case (#ok(match)) {
Debug.print("Matched value: " # match.value);
};
case (#err(error)) {
Debug.print("Error: " # debug_show(error));
};
}
Output:
#ok: status = #NoMatch
Input Validation
-
Before matching, the function ensures the regex is compiled.
-
If
nfa
isnull
, the function returns:#err(#NotCompiled)
Delegation to matcher.match
- The compiled NFA, input
text
, and optionalflags
are passed tomatcher.match
. matcher.match
performs:- State Transitions:
- Moves between states in the NFA based on input characters.
- Greedy and Lazy Quantifiers:
- Greedy quantifiers consume as much input as possible.
- Lazy quantifiers stop at the first valid match.
- Capture Groups:
- Tracks and extracts group matches.
- Anchors:
- Ensures patterns anchored to the start (
^
) or end ($
) are respected.
- Ensures patterns anchored to the start (
- State Transitions:
Overview
The findAll()
method returns an array of all non-overlapping matches of the regex pattern in the input text. Unlike findIter()
, this method collects all matches at once into an array.
Signature
public func findAll(text: Text): Result.Result<[Match], RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to search for matches |
Return Value
Type: Result.Result<[Match], RegexError>
Success Case
Returns an array of Match
objects, where each contains:
- The matched substring (
value
) - The position of the match within the input string
- Any captured groups
- Match status (
#FullMatch
)
Error Case
Returns RegexError
(#NotCompiled
) if the pattern failed to compile during instantiation
Match Collection Process
- Starts from the beginning of the input string
- Collects all non-overlapping matches into an array
- Preserves the order of matches as they appear in the text
- Returns an empty array if no matches are found
Overview
The findIter()
method returns an iterator that yields all non-overlapping matches of the regex pattern in the input text. This method is memory-efficient as it generates matches lazily instead of collecting them all at once like findAll()
.
Signature
public func findIter(text: Text): Result.Result<Iter.Iter<Match>, RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to search for matches |
Return Value
Type: Result.Result<Iter.Iter<Match>, RegexError>
Success Case
Returns an iterator that yields Match
objects, where each contains:
- The matched substring (
value
) - The position of the match within the input string
- Any captured groups
- Match status (
#FullMatch
)
Error Case
Returns RegexError
(#NotCompiled
) if the pattern failed to compile during instantiation
Iteration Process
- Starts from the beginning of the input string
- For each match found:
- Yields a
Match
object - Advances to the position after the current match
- Yields a
- Continues until no more matches are found
- Automatically handles the internal state between iterations
Match Generation
- Matches are generated one at a time as the iterator is consumed
- Non-overlapping matches are guaranteed
- The iteration order follows the text from left to right
Overview
The Match
record represents the result of a pattern matching operation on text, typically used in regular expression or string matching operations. It contains detailed information about what was matched, where it was found, and any captured groups within the match.
Structure
public type Match = {
string: Text;
value: Text;
status: {
#FullMatch;
#NoMatch;
};
position: (Nat, Nat);
capturedGroups: ?[(Text,Nat)];
spans: [(Nat, Nat)];
lastIndex: Nat;
};
Field Descriptions
string: Text
The original input string that was searched for matches. This field preserves the complete text that was analyzed during the matching operation.
value: Text
The actual matched substring found in the original text. This represents the specific portion of text that satisfied the matching criteria.
status
An enumerated type that indicates the match result:
#FullMatch
: Indicates a successful match was found#NoMatch
: Indicates no match was found in the input string
position: (Nat, Nat)
A tuple containing the start and end indices of the match in the original string:
- First value: Starting index of the match
- Second value: Ending index of the match
capturedGroups: ?[(Text,Nat)]
An optional array of tuples containing captured groups from the match:
- Each tuple contains:
- Text: The captured text
- Nat: The index of the captured group
null
if no groups were captured
spans: [(Nat, Nat)]
An array of tuples representing the character spans of the respective match
lastIndex: Nat
The index where the last match ended. This is particularly useful when performing multiple sequential matches on the same text.
Usage Examples
Basic Match Checking
if (matchResult.status == #FullMatch) {
Debug.print("Found match: " # matchResult.value);
} else {
Debug.print("No match found");
};
Working with Captured Groups
switch (matchResult.capturedGroups) {
case (null) { Debug.print("No groups captured") };
case (?groups) {
for ((text, index) in groups.vals()) {
Debug.print("Group " # debug_show(index) # ": " # text);
};
};
};
Regex Examples for Motoko Regex Engine
Table of Contents
Internet Computer Identifiers
Principal ID
Pattern to validate Principal ID format.
// Anonymous Principal will be rejected
let principalPattern = Regex.Regex("^[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{3}$", null);
public func validatePrincipalId(id: Text): Bool {
switch(principalPattern.match(id)) {
case (#ok(result)) {
switch(result.status) {
case (#FullMatch) true;
case (#NoMatch) false;
};
};
case (#err(_)) false;
};
};
Account ID
Pattern to validate Account ID format.
// Account ID (32 bytes in hexadecimal)
let accountPattern = Regex.Regex("^[0-9a-f]{64}$", null);
public func validateAccountId(id: Text): Bool {
switch(accountPattern.match(id)) {
case (#ok(result)) {
switch(result.status) {
case (#FullMatch) true;
case (#NoMatch) false;
};
};
case (#err(_)) false;
};
};
Contributing to Motoko Regex Engine
Thank you for your interest in contributing to the Motoko Regex Engine project! This guide will help you get started with contributing to the project.
Table of Contents
Getting Started
-
Fork the repository:
- Visit https://github.com/Demali-876/motoko_regex_engine
- Click the "Fork" button in the top-right corner
-
Clone your fork:
git clone https://github.com/YOUR-USERNAME/motoko_regex_engine.git cd motoko_regex_engine
-
Add the upstream repository:
git remote add upstream https://github.com/Demali-876/motoko_regex_engine.git
-
Create a new branch for your work:
git checkout -b feat/your-feature-name
Development Process
-
Set up your development environment:
- Install the DFINITY SDK (dfx)
- Install Node.js and npm
- Run
npm install
to install dependencies
-
Make your changes:
- Keep your changes focused and concise
- Update documentation if needed
-
Keep your fork up to date:
git fetch upstream git rebase upstream/main
Commit Guidelines
We use Conventional Commits for clear and standardized commit messages. Each commit message should be structured as follows:
<type>[optional scope]: <description>
[optional body]
[optional footer(s)]
Types:
feat
: A new featurefix
: A bug fixdocs
: Documentation only changesstyle
: Changes that do not affect the meaning of the coderefactor
: A code change that neither fixes a bug nor adds a featureperf
: A code change that improves performancetest
: Adding missing tests or correcting existing testschore
: Changes to the build process or auxiliary tools
Examples:
feat(parser): add support for lookahead assertions
fix(matcher): resolve infinite loop in nested groups
docs: update API documentation for search method
refactor(compiler): simplify NFA construction logic
Pull Request Process
-
Push your changes to your fork:
git push origin feat/your-feature-name
-
Create a Pull Request:
- Go to https://github.com/Demali-876/motoko_regex_engine/pulls
- Click "New Pull Request"
- Select your fork and branch
- Fill out the PR template with:
- Clear description of changes
- Any breaking changes
- Evidence of testing (screenshots, console output, etc.)
-
PR Review Process:
- Maintainers will review your PR
- Address any requested changes
- Once approved, your PR will be merged
Testing Requirements
While formal tests are not required, you must provide evidence that you've tested your changes. This can include:
- Screenshots of the feature working
- Console output showing successful execution
- Example usage and results
- Description of test cases you've tried
Example test evidence in PR:
Tested the new alternation operator with:
1. Simple patterns: "a|b" against "a" and "b"
2. Complex patterns: "(foo|bar)+" against "foofoobar"
3. Edge cases: "a||b" and "|a|b|"
Results:
- All patterns matched correctly
- No infinite loops or crashes
- Proper error handling for invalid patterns
Thank you for contributing to the Motoko Regex Engine project!