Introduction
Welcome to the Motoko Regex Engine documentation, a guide to using regular expressions in the Motoko programming language. The engine provides tools for pattern matching, searching, and text processing.
Inspired by other established regex libraries, this regex engine adapts their capabilities to meet the needs of Motoko.
Installation and Import
Install the Motoko Regex Engine using:
mops add regex
Import it into your project with:
import Regex "mo:regex";
What is a Regular Expression?
A regular expression (regex) is a sequence of characters defining a search pattern. Regex is widely used in text processing for tasks such as:
- Searching for text patterns (e.g., keywords in a document).
- Validating formats (e.g., email addresses or phone numbers).
- Extracting data from structured text (e.g., logs or CSV files).
For example, the regex ^\d{3}-\d{2}-\d{4}$ matches a string formatted as a Social Security Number, such as 123-45-6789.
Key Features
Pattern Support
- Anchors: ^ (start of string), $ (end of string).
- Character classes: [a-z], [^0-9].
- Quantifiers: *, +, ?, {m,n}.
- Groups: (), (?:).
- Alternation: | (logical OR).
- Escapes: \d, \w, \s, etc.
Flags
Flags are optional boolean values that modify the behavior of regex matching. They are set during instantiation and cannot be changed afterward. The engine currently supports:
- caseSensitive: Case-sensitive matching (default is true).
- multiline: Enables multiline matching.
Example with null flags (default behavior):
let regex = Regex.Regex("\\d{3}-\\d{2}-\\d{4}", null); // backslashes must be doubled inside a Motoko Text literal
API Functions
- match: Check for a full match of the pattern in the input text.
- search: Locate the first occurrence of the pattern in the input.
- findAll: Retrieve all matches for the pattern.
- findIter: Iterate over matches.
Syntax
Supported Syntax
Motoko regex supports a variety of syntax features for defining patterns. These include:
- Character matching (a, b, c, etc.)
- Alternation (|)
- Grouping (())
- Character classes ([] with support for ranges like [a-z])
- Quantifiers (*, +, ?, {n}, {n,m})
- Anchors (^, $)
Quantifiers
Quantifiers specify how many times a preceding element must occur for a match.
Supported Quantifiers
Quantifier | Meaning | Example |
---|---|---|
* | Match 0 or more times | a* matches "", "a", "aaa" |
+ | Match 1 or more times | a+ matches "a", "aaa" |
? | Match 0 or 1 time | a? matches "", "a" |
{n} | Match exactly n times | a{2} matches "aa" |
{n,} | Match at least n times | a{2,} matches "aa", "aaa" |
{n,m} | Match between n and m times | a{2,4} matches "aa", "aaa", "aaaa" |
Quantifier Modes
Quantifiers can operate in different modes:
- Greedy: Matches as many occurrences as possible.
- Lazy (? after quantifier): Matches as few occurrences as possible. E.g., a+? matches a single "a" unless the rest of the pattern requires more.
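The difference between the two modes is easiest to see on a pattern that can match at more than one length. The following is a minimal sketch rather than an example from the library itself: showFirstMatch is a hypothetical helper, and it assumes the Regex.Regex(pattern, flags) constructor and the search() method described in this documentation. In a conventional engine the greedy pattern would extend to the last > and the lazy one would stop at the first; the output of this engine is not asserted here.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// Hypothetical helper (not part of the library): prints the first match
// of `pattern` in `input`, or the error if the pattern did not compile.
func showFirstMatch(pattern : Text, input : Text) {
  let regex = Regex.Regex(pattern, null);
  switch (regex.search(input)) {
    case (#ok(m)) Debug.print(pattern # " -> \"" # m.value # "\"");
    case (#err(error)) Debug.print("Error: " # debug_show(error));
  };
};

showFirstMatch("<.+>", "<a><b>");  // greedy quantifier
showFirstMatch("<.+?>", "<a><b>"); // lazy quantifier
```

Running both calls side by side is a quick way to confirm which mode a pattern is using.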
Invalid Quantifiers
Certain quantifier patterns are not allowed:
- Redundant modifiers, such as a{2}+ or a{2}*.
- Empty quantifiers, e.g., {} or {,}.
- Multiple commas in ranges, e.g., {2,,4}.
Metacharacters
Metacharacters represent special patterns or symbols.
Metacharacter | Meaning | Example |
---|---|---|
. | Match any character except \n | a.b matches "acb" |
\w | Match word characters (alphanumeric + _ ) | \w+ matches "abc123" |
\W | Match non-word characters | \W matches "@" |
\d | Match digits (0-9 ) | \d+ matches "123" |
\D | Match non-digits | \D matches "a" |
\s | Match whitespace | \s+ matches " " |
\S | Match non-whitespace | \S matches "a" |
Character Classes
Character classes allow matching sets of characters.
- [abc]: Matches any character a, b, or c.
- [^abc]: Matches any character except a, b, or c.
- [a-z]: Matches any character in the range a to z.
Nested Quantifiers
Quantifiers applied to character classes must be explicitly defined. Nested or redundant quantifiers, like [a-z]{2}+, are not allowed.
Anchors
Anchors specify positions in the text.
Anchor | Meaning | Example |
---|---|---|
^ | Start of the string | ^abc matches "abc" at the beginning |
$ | End of the string | abc$ matches "abc" at the end |
\b | Word boundary | \bword\b matches "word" |
\B | Non-word boundary | \Bword matches "word" not at a boundary |
Groups and Group Modifiers
Groups are enclosed in parentheses () and can be modified for specific behaviors.
Supported Group Modifiers
Modifier | Syntax | Meaning |
---|---|---|
Non-capturing | (?:...) | Groups without capturing |
Positive Lookahead | (?=...) | Asserts that what follows matches |
Negative Lookahead | (?!...) | Asserts that what follows does not match |
Positive Lookbehind | (?<=...) | Asserts that what precedes matches |
Negative Lookbehind | (?<!...) | Asserts that what precedes does not match |
Escaped Characters
Escape sequences represent special characters.
Escape Sequence | Meaning |
---|---|
\\ | Literal backslash |
\n | Newline |
\t | Tab |
\w , \W | Word/Non-word characters |
\d , \D | Digit/Non-digit |
\s , \S | Whitespace/Non-whitespace |
Invalid escape sequences throw an error.
Prohibited Patterns
- Invalid group modifiers: e.g., (?).
- Empty groups: () is not allowed.
- Empty character classes: [] results in an error.
- Redundant or conflicting quantifiers: a{2}+.
Error Handling
The Motoko regex engine provides detailed error feedback to help developers identify and fix issues in their regular expressions. Below is a list of all possible errors, their meanings, and typical scenarios where they might occur.
Error Types
Error | Description | Cause |
---|---|---|
#UnexpectedCharacter | An invalid character was encountered during parsing. | Using a character that is not allowed in regex syntax, such as unescaped special characters. |
#UnexpectedEndOfInput | The regex input ended unexpectedly, leaving constructs incomplete. | Omitting closing brackets, parentheses, or quantifier ranges. |
#GenericError | A generic error message providing additional context. | Various syntax or logic errors not covered by specific error types. |
#InvalidQuantifierRange | A malformed or invalid quantifier range was used. | Using invalid quantifier syntax, e.g., {,} , {,3} , {a,b} . |
#InvalidEscapeSequence | An invalid escape sequence was encountered. | Using unrecognized escape sequences like \q or \x without proper syntax. |
#UnmatchedParenthesis | A closing parenthesis ) does not match any preceding opening parenthesis ( . | Missing or extra closing parentheses in the regex pattern. |
#MismatchedParenthesis | Parentheses do not form a valid pairing. | Nested parentheses are incorrectly matched or unbalanced, e.g., ((a)b]) . |
#UnexpectedToken | An unexpected token was encountered during parsing. | Using misplaced or unrecognized tokens in the regex pattern. |
#UnclosedGroup | A group construct is not properly closed with a closing parenthesis ) . | Missing a closing parenthesis in a group definition. |
#InvalidQuantifier | A quantifier is malformed or applied in an invalid context. | Using redundant or conflicting quantifiers, e.g., a{2}+ . |
#EmptyExpression | The regex input is empty or contains no valid expressions. | Providing an empty string or expression with no meaningful content. |
#NotCompiled | The regex has not been compiled before attempting to use it. | There was an error during compilation of the regex object; this may be due to any of the previous errors. That error is specified in the #NotCompiled variant. |
Flags
Flags in Motoko regex provide flexibility by modifying the behavior of the regex matching process. This section covers the available flags, their purpose, and how to use them effectively.
Overview
Flags are optional parameters that can alter how the regex engine interprets and processes a pattern. They enable features such as case-insensitive matching and handling multiline inputs.
Available Flags
1. CASE_SENSITIVE
- Type: Bool
- Default: true
- Description: Determines whether the regex should consider case when matching characters.
- Behavior:
  - When caseSensitive = true: Matches are case-sensitive (default behavior).
  - When caseSensitive = false: Matches are case-insensitive, treating uppercase and lowercase letters as equivalent.
- Example:
let regex = Regex.Regex("abc", ?{caseSensitive = false; multiline = false});
// search returns a Result, so unwrap it before checking the status
switch (regex.search("ABC")) {
  case (#ok(m)) assert(m.status == #FullMatch);
  case (#err(_)) assert(false);
};
2. MULTILINE
- Type: Bool
- Default: false
- Description: Alters the behavior of anchors (^ and $) to match at line boundaries rather than the start or end of the entire string.
- Behavior:
  - When multiline = true:
    - ^ matches the start of any line.
    - $ matches the end of any line.
  - When multiline = false (default behavior):
    - ^ matches the beginning of the entire input.
    - $ matches the end of the entire input.
- Example:
let regex = Regex.Regex("^abc$", ?{caseSensitive = true; multiline = true});
// search returns a Result, so unwrap it before checking the status
switch (regex.search("abc\ndef\nabc")) {
  case (#ok(m)) assert(m.status == #FullMatch);
  case (#err(_)) assert(false);
};
Combining Flags
You can combine flags to fine-tune the regex engine's behavior. For example:
let regex = Regex.Regex("abc", ?{caseSensitive = false; multiline = true});
switch (regex.search("ABC\ndef\nabc")) {
  case (#ok(m)) assert(m.status == #FullMatch);
  case (#err(_)) assert(false);
};
In this example:
- The pattern abc is matched regardless of case.
- The engine processes each line independently due to multiline = true.
Default Behavior Without Flags
If no flags are specified, the engine uses the following defaults:
caseSensitive = true
multiline = false
This means:
- Matching is case-sensitive.
- Anchors (^ and $) match only at the start and end of the entire input.
Best Practices
- Use caseSensitive = false for patterns that need to ignore case differences, such as matching user inputs in a case-insensitive manner.
- Use multiline = true when processing multi-line text, such as logs or formatted documents, where each line might have independent matching requirements.
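As an illustration of the multiline recommendation, the sketch below collects every line of a log that starts with ERROR. It is a hedged example, not taken from the library: the log text and the names log and errorLines are made up, the flags record is assumed to take both documented fields, and it relies on the documented behavior that . does not match \n.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// Illustrative input: four log lines separated by newlines.
let log = "INFO boot\nERROR disk full\nINFO retry\nERROR timeout";

// With multiline = true, ^ and $ are documented to match at line boundaries.
// Assumption: the flags record requires both fields to be supplied.
let errorLines = Regex.Regex("^ERROR.*$", ?{ caseSensitive = true; multiline = true });

switch (errorLines.findAll(log)) {
  case (#ok(matches)) {
    for (m in matches.vals()) {
      Debug.print(m.value);
    };
  };
  case (#err(error)) Debug.print("Error: " # debug_show(error));
};
```

Without the multiline flag, the same pattern would be anchored to the whole input rather than to each line.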
Functions
This section provides an overview of the functions currently available in this library.
search()
Overview
The search() function scans an input string for the first occurrence of the regex pattern. Unlike match(), which requires the pattern to span the entire input, search() identifies the first substring that satisfies the pattern.
Signature
public func search(text: Text): Result.Result<Match, RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to search for the first match |
Return Value
Type: Result.Result<Match, RegexError>
Success Case
Returns a Match object containing:
- The matched substring (value)
- The position of the match within the input string
- Captured groups (if any)
No Match Case
Returns a Match object with:
- status = #NoMatch
- Empty value
Error Case
Returns RegexError (#NotCompiled) only if the pattern failed to compile during instantiation.
Behavior
Input Validation
- If the regex instantiation failed (due to an invalid pattern), returns RegexError (#NotCompiled)
Search Process
- Scans the input string character by character
- Identifies if a potential match could begin at the current position
- Delegates to match() for full matching starting from that position
Result Construction
- On finding a match:
  - Returns a Match object with details of the match
- If no match is found after scanning the string:
  - Returns a Match object with status = #NoMatch
Example Usage
1. Successful Match
Pattern: "a+"
Input: "xxaaayy"
let pattern = Regex.Regex("a+", null);
let result = pattern.search("xxaaayy");
switch (result) {
  case (#ok(match)) Debug.print("First match: " # match.value); // Output: "aaa"
  case (#err(error)) Debug.print("Error: " # debug_show(error));
};
Output:
First match: aaa
2. No Match Found
Pattern: "z+"
Input: "xxaaaayy"
let pattern = Regex.Regex("z+", null);
let result = pattern.search("xxaaaayy");
switch (result) {
  case (#ok(match)) {
    switch (match.status) {
      case (#NoMatch) Debug.print("No match found.");
      case (#FullMatch) Debug.print("First match: " # match.value);
    };
  };
  case (#err(error)) Debug.print("Error: " # debug_show(error));
};
Output:
No match found.
3. Invalid Pattern
Scenario: Creating a regex with an invalid pattern
let pattern = Regex.Regex("[a-", null); // unterminated character class
let result = pattern.search("xxaaaayy");
switch (result) {
  case (#ok(match)) Debug.print("First match: " # match.value);
  case (#err(error)) Debug.print("Error: " # debug_show(error)); // Output: #NotCompiled
};
Output:
Error: #NotCompiled
match()
Overview
The match() function is a core API for performing regex-based matching. It takes an input string and matches it against a precompiled regex represented as an NFA. The function handles matching mechanics, including state transitions, greedy and lazy quantifiers, and group captures.
Signature
public func match(text: Text): Result.Result<Match, RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to be matched against the compiled regex. |
Return Value
Result.Result<Match, RegexError>:
- On Success (Match):
  - Contains details of the match, such as the matched substring, captured groups, and spans.
- On Failure (RegexError):
  - Indicates why the matching process failed (e.g., regex not compiled).
Behavior
- Input Validation:
  - Checks if the regex has been compiled.
  - Returns #NotCompiled error if the regex is unavailable.
- Matching Process:
  - Delegates the actual matching logic to the matcher.match function.
  - Traverses the NFA based on input characters.
  - Respects greedy and lazy quantifier modes.
  - Handles capture groups and anchors (e.g., ^, $).
- Result Construction:
  - Builds a Match object for successful matches.
  - Returns RegexError for failures.
Example Usage
1. Successful Match
let pattern = Regex.Regex("h.*o", null);
let result = pattern.match("hello");
switch (result) {
  case (#ok(match)) {
    Debug.print("Matched value: " # match.value);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}
Output:
Matched value: hello
2. No Match
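let pattern = Regex.Regex("z+", null);
let result = pattern.match("hello");
switch (result) {
  case (#ok(match)) {
    // A failed match is still returned as #ok, with status = #NoMatch
    switch (match.status) {
      case (#NoMatch) Debug.print("No match found.");
      case (#FullMatch) Debug.print("Matched value: " # match.value);
    };
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}
Output:
No match found.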
Input Validation
- Before matching, the function ensures the regex is compiled.
- If nfa is null, the function returns: #err(#NotCompiled)
Delegation to matcher.match
- The compiled NFA, input text, and optional flags are passed to matcher.match.
- matcher.match performs:
  - State Transitions:
    - Moves between states in the NFA based on input characters.
  - Greedy and Lazy Quantifiers:
    - Greedy quantifiers consume as much input as possible.
    - Lazy quantifiers stop at the first valid match.
  - Capture Groups:
    - Tracks and extracts group matches.
  - Anchors:
    - Ensures patterns anchored to the start (^) or end ($) are respected.
findAll()
Overview
The findAll() method returns an array of all non-overlapping matches of the regex pattern in the input text. Unlike findIter(), this method collects all matches at once into an array.
Signature
public func findAll(text: Text): Result.Result<[Match], RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to search for matches |
Return Value
Type: Result.Result<[Match], RegexError>
Success Case
Returns an array of Match objects, where each contains:
- The matched substring (value)
- The position of the match within the input string
- Any captured groups
- Match status (#FullMatch)
Error Case
Returns RegexError (#NotCompiled) if the pattern failed to compile during instantiation.
Match Collection Process
- Starts from the beginning of the input string
- Collects all non-overlapping matches into an array
- Preserves the order of matches as they appear in the text
- Returns an empty array if no matches are found
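The collection process above can be sketched as follows. This is a minimal, hedged example built only from the findAll() signature and the Match fields documented here; the input text and the name digits are illustrative.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// Pattern for one or more digits; the backslash is doubled for the Text literal.
let digits = Regex.Regex("\\d+", null);

switch (digits.findAll("10 apples, 20 pears, 30 plums")) {
  case (#ok(matches)) {
    // matches is an array of Match records, in order of appearance.
    for (m in matches.vals()) {
      Debug.print("Found " # m.value # " at " # debug_show(m.position));
    };
  };
  case (#err(error)) Debug.print("Error: " # debug_show(error));
};
```

Because the whole array is built before the loop starts, this form suits inputs where every match is needed.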
findIter()
Overview
The findIter() method returns an iterator that yields all non-overlapping matches of the regex pattern in the input text. This method is memory-efficient as it generates matches lazily instead of collecting them all at once like findAll().
Signature
public func findIter(text: Text): Result.Result<Iter.Iter<Match>, RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to search for matches |
Return Value
Type: Result.Result<Iter.Iter<Match>, RegexError>
Success Case
Returns an iterator that yields Match objects, where each contains:
- The matched substring (value)
- The position of the match within the input string
- Any captured groups
- Match status (#FullMatch)
Error Case
Returns RegexError (#NotCompiled) if the pattern failed to compile during instantiation.
Iteration Process
- Starts from the beginning of the input string
- For each match found:
  - Yields a Match object
  - Advances to the position after the current match
- Continues until no more matches are found
- Automatically handles the internal state between iterations
Match Generation
- Matches are generated one at a time as the iterator is consumed
- Non-overlapping matches are guaranteed
- The iteration order follows the text from left to right
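Lazy generation matters most when the caller stops early. The sketch below, which assumes only the findIter() signature documented above, consumes the first two matches and then leaves the loop; the names words and shown and the input text are illustrative.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

let words = Regex.Regex("\\w+", null);

switch (words.findIter("alpha beta gamma delta")) {
  case (#ok(iter)) {
    var shown = 0;
    // Stop after two matches; the remaining matches are never requested.
    label scan for (m in iter) {
      if (shown == 2) break scan;
      Debug.print(m.value);
      shown += 1;
    };
  };
  case (#err(error)) Debug.print("Error: " # debug_show(error));
};
```

With findAll() the same early exit would still pay for collecting every match first.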
replace()
The replace() function substitutes matches in the input string with a specified replacement string. This function allows specifying a maximum number of replacements (maxReplacements).
Signature
public func replace(text: Text, replacement: Text, maxReplacements: ?Nat): Result.Result<Text, RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to perform replacements on. |
replacement | Text | The string to replace matches with. |
maxReplacements | ?Nat | Optional limit on the number of replacements. If null , replaces all. |
Return Value
Result.Result<Text, RegexError>:
- On Success (Text):
  - The updated string after performing replacements.
- On Failure (RegexError):
  - Indicates why the operation failed (e.g., invalid regex or replacement).
Behavior
- Input Validation:
  - Checks if the regex has been compiled.
  - Returns #NotCompiled error if unavailable.
- Replacement Process:
  - Matches the regex pattern in the input string.
  - Replaces each match with the specified string.
  - Respects the maxReplacements limit if provided.
- Result Construction:
  - Returns the updated string on success.
  - Returns RegexError for invalid inputs or uncompiled regex.
Example Usage
1. Replace All Matches
let replaceRegex = Regex.Regex("Hello", null);
let result = replaceRegex.replace("Hello world, Hello universe", "Hi", null);
switch (result) {
  case (#ok(updatedText)) {
    Debug.print("Updated text: " # updatedText);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}
Output:
Updated text: Hi world, Hi universe
2. Replace with Limit
let replaceRegex = Regex.Regex("Hello", null);
let result = replaceRegex.replace("Hello world, Hello universe", "Hi", ?1);
switch (result) {
  case (#ok(updatedText)) {
    Debug.print("Updated text: " # updatedText);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}
Output:
Updated text: Hi world, Hello universe
sub()
The sub() function substitutes matches in the input string with a specified replacement string. Unlike replace, it allows the use of regex patterns in the replacement string.
Signature
public func sub(text: Text, replacement: Text, maxSubstitutions: ?Nat): Result.Result<Text, RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to perform substitutions on. |
replacement | Text | The string (can include regex) to replace matches with. |
maxSubstitutions | ?Nat | Optional limit on the number of substitutions. If null , substitutes all. |
Return Value
Result.Result<Text, RegexError>:
- On Success (Text):
  - The updated string after performing substitutions.
- On Failure (RegexError):
  - Indicates why the operation failed (e.g., invalid regex or replacement).
Behavior
- Input Validation:
  - Checks if the regex has been compiled.
  - Returns #NotCompiled error if unavailable.
- Substitution Process:
  - Matches the regex pattern in the input string.
  - Substitutes each match with the specified replacement string.
  - Respects the maxSubstitutions limit if provided.
- Result Construction:
  - Returns the updated string on success.
  - Returns RegexError for invalid inputs or uncompiled regex.
Implementation
public func sub(text: Text, replacement: Text, maxReplacements: ?Nat): Result.Result<Text, RegexError> {
  switch (nfa) {
    case (null) #err(#NotCompiled);
    case (?compiledNFA) {
      matcher.sub(compiledNFA, text, replacement, maxReplacements)
    };
  }
};
Example Usage
1. Substituting All Matches
let subRegex = Regex.Regex("\\d+", null);
let result = subRegex.sub("I have 10 bananas and 20 apples", "many", null);
switch (result) {
  case (#ok(updatedText)) {
    Debug.print("Updated text: " # updatedText);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}
Output:
Updated text: I have many bananas and many apples
2. Substituting with Limit
let subRegex = Regex.Regex("\\d+", null);
let result = subRegex.sub("I have 10 bananas and 20 apples", "many", ?1);
switch (result) {
  case (#ok(updatedText)) {
    Debug.print("Updated text: " # updatedText);
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}
Output:
Updated text: I have many bananas and 20 apples
split()
The split() function divides a string into substrings based on matches of the regex pattern, with an optional limit on the number of splits (maxSplit).
Signature
public func split(text: Text, maxSplit: ?Nat): Result.Result<[Text], RegexError>
Parameters
Parameter | Type | Description |
---|---|---|
text | Text | The input string to be split based on the regex pattern. |
maxSplit | ?Nat | Optional limit on the number of splits. If null , splits all. |
Return Value
Result.Result<[Text], RegexError>:
- On Success ([Text]):
  - An array of substrings resulting from splitting the input string.
- On Failure (RegexError):
  - Indicates why the operation failed (e.g., invalid regex).
Behavior
- Input Validation:
  - Checks if the regex has been compiled.
  - Returns #NotCompiled error if unavailable.
- Splitting Process:
  - Matches the regex pattern in the input string.
  - Divides the string at each match, respecting the maxSplit limit if provided.
  - Handles edge cases (e.g., no matches or empty input).
- Result Construction:
  - Returns an array of substrings on success.
  - Returns RegexError for invalid inputs or uncompiled regex.
Example Usage
1. Splitting Without Limit
let splitRegex = Regex.Regex(",", null);
let result = splitRegex.split("one,two,three", null);
switch (result) {
  case (#ok(parts)) {
    Debug.print("Split result: " # debug_show(parts));
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}
Output:
Split result: ["one", "two", "three"]
2. Splitting With Limit
let splitRegex = Regex.Regex(",", null);
let result = splitRegex.split("one,two,three", ?1);
switch (result) {
  case (#ok(parts)) {
    Debug.print("Split result: " # debug_show(parts));
  };
  case (#err(error)) {
    Debug.print("Error: " # debug_show(error));
  };
}
Output:
Split result: ["one", "two,three"]
inspectRegex()
Returns a detailed text representation of the NFA structure for a compiled regular expression pattern.
Overview
The inspectRegex() function allows you to examine the internal NFA (Non-deterministic Finite Automaton) structure of a compiled regular expression. This is particularly useful for debugging complex patterns and understanding how the regex engine processes matches.
Return Value
- Type: Result.Result<Text, RegexError>
- Success: Returns #ok(Text) with formatted NFA representation
- Error: Returns #err(#NotCompiled) if regex is not compiled
Example: Social Security Number Pattern
// Create regex for basic SSN pattern: ^\d{3}-\d{2}-\d{4}$
let regex = Regex.Regex("^\\d{3}-\\d{2}-\\d{4}$", null);
// Inspect the NFA structure
switch (regex.inspectRegex()) {
  case (#ok(nfa)) Debug.print(nfa);
  case (#err(e)) Debug.print("Error: Not compiled");
};
Example Output
=== NFA State Machine ===
Initial State → 0
Accept States → [11]
=== Transitions ===
From State 0:
#Range('0', '9') → 1 (#Greedy)
From State 1:
#Range('0', '9') → 2 (#Greedy)
From State 2:
#Range('0', '9') → 3 (#Greedy)
From State 3:
#Char('-') → 4
From State 4:
#Range('0', '9') → 5 (#Greedy)
From State 5:
#Range('0', '9') → 6 (#Greedy)
From State 6:
#Char('-') → 7
From State 7:
#Range('0', '9') → 8 (#Greedy)
From State 8:
#Range('0', '9') → 9 (#Greedy)
From State 9:
#Range('0', '9') → 10 (#Greedy)
From State 10:
#Range('0', '9') → 11 (#Greedy)
=== Assertions ===
Anchor: {aType = #StartOfString; position = 0}
Anchor: {aType = #EndOfString; position = 11}
Pattern Breakdown
The NFA structure represents the SSN pattern ^\d{3}-\d{2}-\d{4}$ as follows:
- Start anchor at position 0
- States 0-2: First three digits (\d{3}) with greedy quantifiers
- State 3: Literal hyphen (-)
- States 4-5: Next two digits (\d{2}) with greedy quantifiers
- State 6: Second hyphen (-)
- States 7-10: Final four digits (\d{4}) with greedy quantifiers
- State 11: Accept state
- End anchor at position 11
inspectState(state: Types.State)
Examines and returns the transitions for a specific state in the NFA (Non-deterministic Finite Automaton) structure.
Overview
The inspectState()
function allows you to inspect the transitions from any particular state in the compiled regex pattern. This is useful for debugging specific parts of pattern matching behavior.
Parameters
- state: Nat - The state number to inspect
Return Value
- Type: Result.Result<Text, RegexError>
- Success: Returns #ok(Text) with formatted transitions from the specified state
- Error: Returns #err(#NotCompiled) if regex is not compiled
Example: Social Security Number Pattern State
Using the same SSN pattern (^\d{3}-\d{2}-\d{4}$), let's inspect state 3, which handles the first hyphen:
let regex = Regex.Regex("^\\d{3}-\\d{2}-\\d{4}$", null);
switch (regex.inspectState(3)) {
  case (#ok(transitions)) Debug.print(transitions);
  case (#err(e)) Debug.print("Error: Not compiled");
};
Example Output
From State 3:
#Char('-') → 4
This shows the transitions available for state 3 in the NFA. State 3 has a single transition: the hyphen character, which leads to state 4.
enableDebug(b: Bool)
Toggles debug logging for the regex matcher. When enabled, it provides detailed logging of the pattern matching process.
Parameters
- b: Bool - Set to true to enable debug logging, false to disable it
Match
Overview
The Match record represents the result of a pattern matching operation on text. It contains detailed information about what was matched, where it was found, and any captured groups within the match.
Structure
public type Match = {
  string: Text;
  value: Text;
  status: {
    #FullMatch;
    #NoMatch;
  };
  position: (Nat, Nat);
  capturedGroups: ?[(Text,Nat)];
  spans: [(Nat, Nat)];
  lastIndex: Nat;
};
Field Descriptions
string: Text
The original input string that was searched for matches. This field preserves the complete text that was analyzed during the matching operation.
value: Text
The actual matched substring found in the original text. This represents the specific portion of text that satisfied the matching criteria.
status
An enumerated type that indicates the match result:
- #FullMatch: Indicates a successful match was found
- #NoMatch: Indicates no match was found in the input string
position: (Nat, Nat)
A tuple containing the start and end indices of the match in the original string:
- First value: Starting index of the match
- Second value: Ending index of the match
capturedGroups: ?[(Text,Nat)]
An optional array of tuples containing captured groups from the match:
- Each tuple contains:
  - Text: The captured text
  - Nat: The index of the captured group
- null if no groups were captured
spans: [(Nat, Nat)]
An array of tuples representing the character spans of the respective match
lastIndex: Nat
The index where the last match ended. This is particularly useful when performing multiple sequential matches on the same text.
Usage Examples
Basic Match Checking
if (matchResult.status == #FullMatch) {
  Debug.print("Found match: " # matchResult.value);
} else {
  Debug.print("No match found");
};
Working with Captured Groups
switch (matchResult.capturedGroups) {
  case (null) { Debug.print("No groups captured") };
  case (?groups) {
    for ((text, index) in groups.vals()) {
      Debug.print("Group " # debug_show(index) # ": " # text);
    };
  };
};
Unicode Support and Unicode Properties
Introduction
The Motoko Regex Engine supports Unicode properties, allowing users to match specific character categories using \p{Property} and \P{Property} syntax. This enhances pattern matching by enabling character classification based on Unicode properties.
Syntax
Unicode properties can be matched using the following syntax:
\p{Property} // Matches a character with the specified Unicode property
\P{Property} // Matches a character that does NOT have the specified Unicode property
Example
\p{L} // Matches any letter
\p{N} // Matches any number
\P{P} // Matches any character except punctuation
Supported Unicode Properties
The engine supports a subset of Unicode properties:
- L (Letter)
- Ll (Lowercase Letter)
- Lu (Uppercase Letter)
- N (Number)
- P (Punctuation)
- Zs (Separator, Space)
- Emoji (Emoji characters)
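A property escape is written like any other escape inside a Motoko Text literal, with the backslash doubled. The sketch below is a hedged illustration that combines two of the properties listed above to look for a capitalized word; the pattern, the input, and the name capitalized are made up, and the result is not asserted.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// One uppercase letter (Lu) followed by one or more lowercase letters (Ll).
let capitalized = Regex.Regex("\\p{Lu}\\p{Ll}+", null);

switch (capitalized.search("say Hello to everyone")) {
  case (#ok(m)) {
    switch (m.status) {
      case (#FullMatch) Debug.print("Capitalized word: " # m.value);
      case (#NoMatch) Debug.print("No capitalized word found.");
    };
  };
  case (#err(error)) Debug.print("Error: " # debug_show(error));
};
```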
Backreferences
Introduction
Backreferences allow a regular expression to match repeated substrings by referring back to a previously captured group. They enable patterns to enforce consistency within the matched text.
Syntax
Backreferences use the following syntax:
- \1, \2, ...: Refers to a numbered capturing group in order of appearance and can only be used in replace and sub methods.
- \k<name>: Refers to a named capturing group and can be used within matching.
Example
(?<greeting>hello)\s+\k<greeting>
This matches hello, followed by one or more whitespace characters, followed by hello again.
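A named backreference can be used the same way from Motoko. The following sketch generalizes the pattern above to any repeated word; it assumes the named-group and \k<name> syntax shown in this section and the search() method, and the input text and the names repeated and word are illustrative.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// A word, whitespace, then the same word again via the named backreference.
let repeated = Regex.Regex("(?<word>\\w+)\\s+\\k<word>", null);

switch (repeated.search("this is is a typo")) {
  case (#ok(m)) {
    switch (m.status) {
      case (#FullMatch) Debug.print("Repeated word: " # m.value);
      case (#NoMatch) Debug.print("No repeated word found.");
    };
  };
  case (#err(error)) Debug.print("Error: " # debug_show(error));
};
```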
Lookaround Assertions
Introduction
Lookaround assertions allow for pattern matching based on surrounding text without including that text in the final match. They enable advanced pattern constraints while maintaining flexibility in regex processing. However, in many cases, using a stricter pattern without lookarounds can be a better approach, leading to simpler and more efficient regex expressions.
Lookaround Types
Lookaround assertions are divided into Lookahead and Lookbehind, and each has Positive and Negative variations. The following table summarizes them:
Type | Positive | Negative |
---|---|---|
Look-Ahead | A(?=B) - Match A if B follows | A(?!B) - Match A if B does not follow |
Look-Behind | (?<=B)A - Match A if B precedes | (?<!B)A - Match A if B does not precede |
Lookahead
- Positive Lookahead (?=): Ensures a pattern exists after the current position without consuming it.
- Negative Lookahead (?!): Ensures a pattern does not exist after the current position.
foo(?=bar) // Matches "foo" only if followed by "bar"
foo(?!bar) // Matches "foo" only if NOT followed by "bar"
Lookbehind
- Positive Lookbehind (?<=): Ensures a pattern exists before the current position.
- Negative Lookbehind (?<!): Ensures a pattern does not exist before the current position.
(?<=bar)foo // Matches "foo" only if preceded by "bar"
(?<!bar)foo // Matches "foo" only if NOT preceded by "bar"
Behavior and Considerations
- Lookaround assertions do not consume characters; they only assert conditions.
- Combining lookahead and lookbehind can create complex matching rules.
- Negative lookaround can be used to enforce exclusions in matching.
- Lookbehind patterns must have a fixed length.
- In many cases, defining a stricter pattern instead of relying on lookaround assertions results in a more efficient and readable regex.
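The point that lookarounds assert without consuming can be checked by printing the matched value. This is a minimal sketch assuming the lookahead syntax and the search() method documented here; the pattern, input, and the name weight are illustrative, and the output is not asserted.

```motoko
import Regex "mo:regex";
import Debug "mo:base/Debug";

// Digits that are followed by "kg"; the unit itself is asserted, not matched.
let weight = Regex.Regex("\\d+(?=kg)", null);

switch (weight.search("parcel: 12kg, 30cm")) {
  case (#ok(m)) {
    switch (m.status) {
      case (#FullMatch) Debug.print("Weight digits: " # m.value);
      case (#NoMatch) Debug.print("No weight found.");
    };
  };
  case (#err(error)) Debug.print("Error: " # debug_show(error));
};
```

If the unit is wanted in the result anyway, the stricter pattern \d+kg avoids the lookahead altogether, in line with the advice above.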
Conclusion
Lookaround assertions provide powerful matching capabilities without consuming characters. However, in most cases, defining a stricter pattern can lead to better performance and clarity. The Motoko Regex Engine supports both lookahead and lookbehind assertions, but users should consider whether they can achieve the same result with a more precise pattern before resorting to lookarounds.
Regex Examples for Motoko Regex Engine
Internet Computer Identifiers
Principal ID
Pattern to validate Principal ID format.
// Anonymous Principal will be rejected
let principalPattern = Regex.Regex("^[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{3}$", null);
public func validatePrincipalId(id: Text): Bool {
  switch (principalPattern.match(id)) {
    case (#ok(result)) {
      switch (result.status) {
        case (#FullMatch) true;
        case (#NoMatch) false;
      };
    };
    case (#err(_)) false;
  };
};
Account ID
Pattern to validate Account ID format.
// Account ID (32 bytes in hexadecimal)
let accountPattern = Regex.Regex("^[0-9a-f]{64}$", null);
public func validateAccountId(id: Text): Bool {
  switch (accountPattern.match(id)) {
    case (#ok(result)) {
      switch (result.status) {
        case (#FullMatch) true;
        case (#NoMatch) false;
      };
    };
    case (#err(_)) false;
  };
};
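Both validators repeat the same nested switch, so a small helper can factor it out. This is a sketch: isFullMatch is a hypothetical name, not part of the library, and its parameter is typed structurally (only the status field is required) so that it does not depend on how the library exports its Match and RegexError types.

```motoko
import Result "mo:base/Result";

// Hypothetical helper (not part of the regex library).
// Accepts any Result whose #ok payload has a `status` field, which the
// documented Match record satisfies through record subtyping.
func isFullMatch(result : Result.Result<{ status : { #FullMatch; #NoMatch } }, Any>) : Bool {
  switch (result) {
    case (#ok(m)) m.status == #FullMatch;
    case (#err(_)) false;
  };
};

// Self-contained checks using hand-built results instead of a real regex:
assert (isFullMatch(#ok({ status = #FullMatch })));
assert (not isFullMatch(#ok({ status = #NoMatch })));
assert (not isFullMatch(#err("regex not compiled")));
```

With such a helper, the body of a validator could reduce to a single call such as isFullMatch(accountPattern.match(id)).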
Contributing to Motoko Regex Engine
Thank you for your interest in contributing to the Motoko Regex Engine project! This guide will help you get started with contributing to the project.
Getting Started
- Fork the repository:
  - Visit https://github.com/Demali-876/motoko_regex_engine
  - Click the "Fork" button in the top-right corner
- Clone your fork:
git clone https://github.com/YOUR-USERNAME/motoko_regex_engine.git
cd motoko_regex_engine
- Add the upstream repository:
git remote add upstream https://github.com/Demali-876/motoko_regex_engine.git
- Create a new branch for your work:
git checkout -b feat/your-feature-name
Development Process
- Set up your development environment:
  - Install the DFINITY SDK (dfx)
  - Install Node.js and npm
  - Run npm install to install dependencies
- Make your changes:
  - Keep your changes focused and concise
  - Update documentation if needed
- Keep your fork up to date:
git fetch upstream
git rebase upstream/main
Commit Guidelines
We use Conventional Commits for clear and standardized commit messages. Each commit message should be structured as follows:
<type>[optional scope]: <description>
[optional body]
[optional footer(s)]
Types:
- feat: A new feature
- fix: A bug fix
- docs: Documentation only changes
- style: Changes that do not affect the meaning of the code
- refactor: A code change that neither fixes a bug nor adds a feature
- perf: A code change that improves performance
- test: Adding missing tests or correcting existing tests
- chore: Changes to the build process or auxiliary tools
Examples:
feat(parser): add support for lookahead assertions
fix(matcher): resolve infinite loop in nested groups
docs: update API documentation for search method
refactor(compiler): simplify NFA construction logic
Pull Request Process
- Push your changes to your fork:
git push origin feat/your-feature-name
- Create a Pull Request:
  - Go to https://github.com/Demali-876/motoko_regex_engine/pulls
  - Click "New Pull Request"
  - Select your fork and branch
  - Fill out the PR template with:
    - Clear description of changes
    - Any breaking changes
    - Evidence of testing (screenshots, console output, etc.)
- PR Review Process:
  - Maintainers will review your PR
  - Address any requested changes
  - Once approved, your PR will be merged
Testing Requirements
While formal tests are not required, you must provide evidence that you've tested your changes. This can include:
- Screenshots of the feature working
- Console output showing successful execution
- Example usage and results
- Description of test cases you've tried
Example test evidence in PR:
Tested the new alternation operator with:
1. Simple patterns: "a|b" against "a" and "b"
2. Complex patterns: "(foo|bar)+" against "foofoobar"
3. Edge cases: "a||b" and "|a|b|"
Results:
- All patterns matched correctly
- No infinite loops or crashes
- Proper error handling for invalid patterns
Thank you for contributing to the Motoko Regex Engine project!