In Regexes
The character classes mentioned so far are mostly for convenience; another approach is to use Unicode character properties. These come in the form <:property>, where property can be a short or long Unicode General Category name. These use pair syntax.
To match against a Unicode property you can use either smartmatch or uniprop:
"a".uniprop('Script');                 # OUTPUT: «Latin»
"a" ~~ / <:Script<Latin>> /;           # OUTPUT: «「a」»
"a".uniprop('Block');                  # OUTPUT: «Basic Latin»
"a" ~~ / <:Block('Basic Latin')> /;    # OUTPUT: «「a」»
These are the Unicode general categories used for matching:
Short | Long |
---|---|
L | Letter |
LC | Cased_Letter |
Lu | Uppercase_Letter |
Ll | Lowercase_Letter |
Lt | Titlecase_Letter |
Lm | Modifier_Letter |
Lo | Other_Letter |
M | Mark |
Mn | Nonspacing_Mark |
Mc | Spacing_Mark |
Me | Enclosing_Mark |
N | Number |
Nd | Decimal_Number or digit |
Nl | Letter_Number |
No | Other_Number |
P | Punctuation or punct |
Pc | Connector_Punctuation |
Pd | Dash_Punctuation |
Ps | Open_Punctuation |
Pe | Close_Punctuation |
Pi | Initial_Punctuation |
Pf | Final_Punctuation |
Po | Other_Punctuation |
S | Symbol |
Sm | Math_Symbol |
Sc | Currency_Symbol |
Sk | Modifier_Symbol |
So | Other_Symbol |
Z | Separator |
Zs | Space_Separator |
Zl | Line_Separator |
Zp | Paragraph_Separator |
C | Other |
Cc | Control or cntrl |
Cf | Format |
Cs | Surrogate |
Co | Private_Use |
Cn | Unassigned |
For example, <:Lu> matches a single uppercase letter.
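To find out which category a character belongs to before writing a regex, its General Category can be inspected with uniprop. The following is a minimal sketch; it assumes that uniprop called without an argument reports the General_Category short name, and the sample characters are chosen only for illustration.

```raku
# Assumption: uniprop with no argument returns the General_Category short name.
say 'A'.uniprop;    # Lu
say 'a'.uniprop;    # Ll
say '9'.uniprop;    # Nd

# Short and long names from the table select the same category.
say so 'A' ~~ / <:Lu> /;                 # True
say so 'A' ~~ / <:Uppercase_Letter> /;   # True
say so 'a' ~~ / <:Lu> /;                 # False
```

The short names printed by uniprop correspond to the Short column of the table above.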
Its negation is this: <:!property>. So, <:!Lu> matches a single character that is not an uppercase letter.
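As an illustration of a category and its negation side by side, here is a small sketch. The variable $text and its sample value are hypothetical and serve only to show which character each pattern picks up first.

```raku
my $text = 'Raku';   # hypothetical sample input

# <:Lu> matches the first uppercase letter.
say $text ~~ / <:Lu> /;        # 「R」

# <:!Lu> matches the first character that is not an uppercase letter.
say $text ~~ / <:!Lu> /;       # 「a」

# Collect every character that is not an uppercase letter.
say $text.comb(/ <:!Lu> /);    # (a k u)
```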
Categories can be used together, with an infix operator:
Operator | Meaning |
---|---|
+ | set union |
\- | set difference |
To match either a lowercase letter or a number, write <:Ll+:N> or <:Ll+:Number> or <+ :Lowercase_Letter + :Number>.
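Both operators can be sketched on a short string. The variable $s and its value are hypothetical; the difference example, <:L - :Lu>, is written with spaces around the operator to keep the two category names visually separate.

```raku
my $s = 'Ab1-cD2';   # hypothetical sample input

# Union: lowercase letters or numbers.
say $s.comb(/ <:Ll + :N> /);    # (b 1 c 2)

# Difference: letters that are not uppercase letters.
say $s.comb(/ <:L - :Lu> /);    # (b c)
```

The hyphen in the sample string is skipped by both patterns because it belongs to the Dash_Punctuation category, which is neither a letter nor a number.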
It's also possible to group categories and sets of categories with parentheses; for example:
say $0 if 'raku9' ~~ /\w+(<:Ll+:N>)/ # OUTPUT: «「9」»