Character Class

Qedit 5.7 for HP-UX

When you have to check for a fixed string of characters, it is easy enough to type it as the actual regexp. Entering "abc123" will find only exact matches. What if you want to find the string "abc" followed by a numeric digit? There are no specific metacharacters for digits or alphabetic characters. However, regular expressions have a concept called a character class to address these needs. In fact, a character class is more powerful and flexible than metacharacters for specific types of text would be.

A character class is enclosed between brackets. The closing bracket can be left out. However, it is good practice to code it explicitly to avoid ambiguity.

Note that most regexp metacharacters listed above lose their meaning inside a character class. The start-of-line anchor acquires a different definition and a new metacharacter, hyphen (-), appears.

A character class is a list of possible values for a specific position in the string. The character class can be as long as needed. A character class for numeric digits would be

[0123456789]

Note, the list does not have to be in sorted order. You could have entered the digits in reverse order or in random order and the character class would still be valid. It is just harder to verify that all digits are included. Similarly, a character class for lowercase letters would be

[abcdefghijklmnopqrstuvwxyz]

It is important to understand that a character class matches exactly one character of the text: a match occurs if the character at that position is any one of the characters in the class. Using the "abc" example above, if we want to find this string followed by a digit, we would enter

abc[0123456789]    {matches "abc0", "abc1", etc.  to "abc9"}

To further restrict the search, we could have used

abc[13579]         {matches "abc" followed by one odd digit}
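
The two searches above can also be tried outside Qedit. The following is a minimal sketch that uses Python's re module as a stand-in for Qedit's regexp engine (the two engines may differ in details), with made-up sample strings:

```python
import re

# Python's re module is only a stand-in here; it is not Qedit's regexp engine.
any_digit = re.compile(r"abc[0123456789]")
odd_digit = re.compile(r"abc[13579]")

samples = ["abc0", "abc7", "abc8", "abcx"]  # hypothetical test strings

any_hits = [s for s in samples if any_digit.search(s)]
odd_hits = [s for s in samples if odd_digit.search(s)]

print(any_hits)  # ['abc0', 'abc7', 'abc8']
print(odd_hits)  # ['abc7']
```

Only one character from the class is consumed, so "abcx" fails both searches and "abc8" fails the odd-digit search.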

Because a character class is only a list of possible values, you can mix and match all the characters in the ASCII code table.

p[imy246!.*]e      {matches "pie", "pme", "p4e", "p*e", etc.}

This example finds the letter p, followed by a single character from the class, followed by the letter e. The single character can be one of the letters i, m or y, one of the digits 2, 4 or 6, an exclamation mark (!), a period (.) or an asterisk (*). Note that the period and asterisk are not metacharacters inside the class.
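
A quick way to see that the period and asterisk are taken literally is to test the class against a few candidates. This sketch again assumes Python's re module as a stand-in for Qedit, and the candidate list is invented for illustration:

```python
import re

mixed = re.compile(r"p[imy246!.*]e")

# Hypothetical candidates; "pae" would match if "." still meant "any character".
candidates = ["pie", "pme", "p4e", "p*e", "p.e", "pae", "pe", "piie"]

matched = [s for s in candidates if mixed.fullmatch(s)]
print(matched)  # ['pie', 'pme', 'p4e', 'p*e', 'p.e']
```

"pe" and "piie" are rejected because the class stands for exactly one character.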

Of course, if the character class contains many possible values, it can be tedious and error-prone to enter each character. The hyphen is a character-class metacharacter that can be used as a range indicator. Simply specify the first and last characters in the range. Numeric digits could then be coded as [0-9]. Lowercase letters could be coded as [a-z]. You can also combine ranges with single values, as in

abc[0-9a-z!.*]
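
To check that the range form accepts the same characters as the long form, the two can be compared over the printable ASCII characters. This is a sketch under the assumption that Python's re module treats these ranges the way Qedit does:

```python
import re
import string

# Long form: every digit and lowercase letter spelled out, plus ! . *
explicit = re.compile("abc[" + string.digits + string.ascii_lowercase + "!.*]")
# Short form using ranges
ranged = re.compile(r"abc[0-9a-z!.*]")

printable = [chr(c) for c in range(32, 127)]
same = all(
    bool(explicit.fullmatch("abc" + ch)) == bool(ranged.fullmatch("abc" + ch))
    for ch in printable
)
accepted = [ch for ch in printable if ranged.fullmatch("abc" + ch)]

print(same)           # True
print(len(accepted))  # 39 (10 digits + 26 letters + 3 special characters)
```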

A character class range is based on the ASCII character set. You could specify a range of

[A-z]

and it would be perfectly valid. In this case, the range would include all uppercase letters, a series of special characters ([, \, ], ^, _ and `) and all lowercase letters. Typically, you would enter the character with the smallest ASCII value as the lower limit and the character with the largest value as the upper limit. Qedit accepts characters even if they are reversed (i.e., the largest value first) as in:

[Z-A]

Qedit detects this situation and swaps the values internally, so [A-Z] and [Z-A] are equivalent. To avoid ambiguity, it is recommended that you enter the smallest value first.
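
The extra characters picked up by [A-z] can be listed programmatically. The sketch below uses Python's re module, which is an assumption made for illustration and not Qedit's engine. Unlike Qedit, Python refuses a reversed range, which is one more reason to enter the smallest value first if a regexp may be reused in other tools:

```python
import re

wide = re.compile(r"[A-z]")

# Characters between "A" and "z" in ASCII that are matched but are not letters.
extras = [
    chr(c)
    for c in range(ord("A"), ord("z") + 1)
    if wide.fullmatch(chr(c)) and not chr(c).isalpha()
]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# Python (unlike Qedit, per the text above) rejects a reversed range.
try:
    re.compile(r"[Z-A]")
    reversed_ok = True
except re.error:
    reversed_ok = False
print(reversed_ok)  # False
```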

The hyphen is interpreted as a range indicator only when it sits between two characters that can form a range. Anywhere else in the class, such as first, last or right after a completed range, it is taken at face value.

[-a-z]      {hyphen and lowercase letters}
[a-z-]      {lowercase letters and hyphen}
[a-z-9]     {lowercase letters, hyphen and digit 9}
[a-z0-9]    {lowercase letters and digits 0 to 9}
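
The four classes above can be probed with a handful of characters to see where the hyphen is literal. This is a sketch that assumes Python's re module parses these classes the same way Qedit does; the probe characters are arbitrary:

```python
import re

classes = ["[-a-z]", "[a-z-]", "[a-z-9]", "[a-z0-9]"]
probes = ["q", "-", "9", "5"]  # a letter, a hyphen and two digits

# For each class, record which probe characters it accepts.
results = {
    cls: [p for p in probes if re.fullmatch(cls, p)]
    for cls in classes
}

for cls, accepted in results.items():
    print(cls, accepted)
# [-a-z] ['q', '-']
# [a-z-] ['q', '-']
# [a-z-9] ['q', '-', '9']
# [a-z0-9] ['q', '9', '5']
```

Only the last class treats the second hyphen as a range indicator, which is why it is the only one that accepts the digit 5.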

Negated Character Class.

The caret (^) takes on a different meaning inside a character class. It is used at face value anywhere in the class, except when it is the first character, in which case it negates the whole class. A negated class still matches exactly one character: a match is found if the character at that position is not one of the characters in the class.

p[246^]e           {matches "p2e", "p^e", etc.}
p[^246]e           {matches "pae", "p3e", etc.}

In the last example, the caret negates 2, 4 and 6. The regexp is true if the text starts with the letter p, ends with the letter e and encloses a single character that is not 2, 4 or 6.
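
The difference between a literal caret and a negating caret can be sketched as follows, again assuming Python's re module as a stand-in for Qedit and using invented sample words:

```python
import re

literal_caret = re.compile(r"p[246^]e")  # caret is just another member
negated = re.compile(r"p[^246]e")        # caret negates the class

words = ["p2e", "p^e", "pae", "p3e", "pe"]  # hypothetical samples

literal_hits = [w for w in words if literal_caret.fullmatch(w)]
negated_hits = [w for w in words if negated.fullmatch(w)]

print(literal_hits)  # ['p2e', 'p^e']
print(negated_hits)  # ['p^e', 'pae', 'p3e']
```

"pe" fails both patterns: a negated class still requires one character to be present between the p and the e.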

Repeating Character Class.

Because a character class is interpreted as a single character, you can use the optional (?) and quantifier (* and +) metacharacters to further qualify a character class. For example, if we want to allow one or more numeric digits after the "abc" string, we could use the following regexp:

abc[0-9]+
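
The effect of each of the three metacharacters on a character class can be compared side by side. This is a minimal sketch using Python's re module as an assumed stand-in for Qedit, with made-up sample strings:

```python
import re

one_or_more = re.compile(r"abc[0-9]+")   # at least one digit
zero_or_more = re.compile(r"abc[0-9]*")  # any number of digits, including none
optional = re.compile(r"abc[0-9]?")      # no digit or exactly one digit

samples = ["abc", "abc1", "abc123"]

plus_hits = [s for s in samples if one_or_more.fullmatch(s)]
star_hits = [s for s in samples if zero_or_more.fullmatch(s)]
opt_hits = [s for s in samples if optional.fullmatch(s)]

print(plus_hits)  # ['abc1', 'abc123']
print(star_hits)  # ['abc', 'abc1', 'abc123']
print(opt_hits)   # ['abc', 'abc1']
```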
