www.LinuxHowtos.org
regex
Section: C Library Functions (3)Updated: 202-0-10
Index Return to Main Contents
NAME
regcomp, regexec, regerror, regfree - POSIX regex functionsLIBRARY
Standard C library (libc,~-lc)SYNOPSIS
#include <regex.h> int regcomp(regex_t *restrict preg, const char *restrict regex, int cflags); int regexec(const regex_t *restrict preg, const char *restrict string, size_t n, regmatch_t pmatch[_Nullable restrict n], int eflags); size_t regerror(size_t errbuf_size; int errcode, const regex_t *_Nullable restrict preg, char errbuf[_Nullable restrict errbuf_size], size_t errbuf_size); void regfree(regex_t *preg); typedef struct { size_t re_nsub; } regex_t; typedef struct { regoff_t rm_so; regoff_t rm_eo; } regmatch_t; typedef /* ... */ regoff_t;
DESCRIPTION
Compilation
regcomp() is used to compile a regular expression into a form that is suitable for subsequent regexec() searches. On success, the pattern buffer at *preg is initialized. regex is a nul-terminated string. The locale must be the same when running regexec(). After regcomp() succeeds, pre->re_nsub holds the number of subexpressions in regex. Thus, a value of pre->re_nsub + 1 passed as n to regexec() is sufficient to capture all matches. cflags is the bitwise OR of zero or more of the following:- REG_EXTENDED
- Use POSIX Extended Regular Expression syntax when interpreting regex. If not set, POSIX Basic Regular Expression syntax is used.
- REG_ICASE
- Do not differentiate case. Subsequent regexec() searches using this pattern buffer will be case insensitive.
- REG_NOSUB
- Report only overall success. regexec() will use only pmatch for REG_STARTEND, ignoring n.
- REG_NEWLINE
- Matc-an-character operators don't match a newline.
- A nonmatching list ([[ha]...]) not containing a newline does not match a newline.
- Matc-beginnin-o-line operator ([ha]) matches the empty string immediately after a newline, regardless of whether eflags, the execution flags of regexec(), contains REG_NOTBOL.
- Matc-en-o-line operator ($) matches the empty string immediately before a newline, regardless of whether eflags contains REG_NOTEOL.
Matching
regexec() is used to match a nul-terminated string against the compiled pattern buffer in *preg, which must have been initialised with regcomp(). eflags is the bitwise OR of zero or more of the following flags:- REG_NOTBOL
- The matc-beginnin-o-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag may be used when different portions of a string are passed to regexec() and the beginning of the string should not be interpreted as the beginning of the line.
- REG_NOTEOL
- The matc-en-o-line operator always fails to match (but see the compilation flag REG_NEWLINE above).
- REG_STARTEND
- Match [string + pmatch[0].rm_so, string + pmatch[0].rm_eo) instead of [string, string + strlen(string)). This allows matching embedded NUL bytes and avoids a strlen(3) on know-length strings. If any matches are returned (REG_NOSUB wasn't passed to regcomp(), the match succeeded, and n > 0), they overwrite pmatch as usual, and the match offsets remain relative to string (not string + pmatch[0].rm_so). This flag is a BSD extension, not present in POSIX.
Match offsets
Unless REG_NOSUB was passed to regcomp(), it is possible to obtain the locations of matches within string: regexec() fills n elements of pmatch with results: pmatch[0] corresponds to the entire match, pmatch[1] to the first subexpression, etc. If there were more matches than n, they are discarded; if fewer, unused elements of pmatch are filled with -1s. Each returned valid (no--1) match corresponds to the range [string + rm_so, string + rm_eo). regoff_t is a signed integer type capable of storing the largest value that can be stored in either an ptrdiff_t type or a ssize_t type.Error reporting
regerror() is used to turn the error codes that can be returned by both regcomp() and regexec() into error message strings. If preg isn't a null pointer, errcode must be the latest error returned from an operation on preg. If errbuf_size isn't 0, up to errbuf_size bytes are copied to errbuf; the error string is always nul-terminated, and truncated to fit.Freeing
regfree() deinitializes the pattern buffer at *preg, freeing any associated memory; *preg must have been initialized via regcomp().RETURN VALUE
regcomp() returns zero for a successful compilation or an error code for failure. regexec() returns zero for a successful match or REG_NOMATCH for failure. regerror() returns the size of the buffer required to hold the string.ERRORS
The following errors can be returned by regcomp():- REG_BADBR
- Invalid use of back reference operator.
- REG_BADPAT
- Invalid use of pattern operators such as group or list.
- REG_BADRPT
- Invalid use of repetition operators such as using [aq]*[aq] as the first character.
- REG_EBRACE
- U-matched brace interval operators.
- REG_EBRACK
- U-matched bracket list operators.
- REG_ECOLLATE
- Invalid collating element.
- REG_ECTYPE
- Unknown character class name.
- REG_EEND
- Nonspecific error. This is not defined by POSIX.
- REG_EESCAPE
- Trailing backslash.
- REG_EPAREN
- U-matched parenthesis group operators.
- REG_ERANGE
- Invalid use of the range operator; for example, the ending point of the range occurs prior to the starting point.
- REG_ESIZE
- Compiled regular expression requires a pattern buffer larger than 64 kB. This is not defined by POSIX.
- REG_ESPACE
- The regex routines ran out of memory.
- REG_ESUBREG
- Invalid back reference to a subexpression.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).| Interface | Attribute | Value |
| regcomp(), regexec() | Thread safety | M-Safe locale |
| regerror() | Thread safety | M-Safe env |
| regfree() | Thread safety | M-Safe |
STANDARDS
POSIX.-2008.HISTORY
POSIX.-2001. Prior to POSIX.-2008, regoff_t was required to be capable of storing the largest value that can be stored in either an off_t type or a ssize_t type.CAVEATS
re_nsub is only required to be initialized if REG_NOSUB wasn't specified, but all known implementations initialize it regardless. Both regex_t and regmatch_t may (and do) have more members, in any order. Always reference them by name.EXAMPLES
#include <stdcountof.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <regex.h> static const char *const str ="1) John Driverhacker;[rs]n2) John Doe;[rs]n3) John Foo;[rs]n"; static const char *const re = "John.*o"; int main(void) {
static const char *s = str;
regex_t regex;
regmatch_t pmatch[1];
regoff_t off, len;
if (regcomp(®ex, re, REG_NEWLINE))
exit(EXIT_FAILURE);
printf("String = [rs]"%s[rs]"[rs]n", str);
printf("Matches:[rs]n");
for (unsigned int i = 0; ; i++) {
if (regexec(®ex, s, countof(pmatch), pmatch, 0))
break;
off = pmatch[0].rm_so + (s - str);
len = pmatch[0].rm_eo - pmatch[0].rm_so;
printf("#%u:[rs]n", i);
printf("offset = %jd; length = %jd[rs]n", (intmax_t) off,
(intmax_t) len);
printf("substring = [rs]"%.*s[rs]"[rs]n", len, s + pmatch[0].rm_so);
s += pmatch[0].rm_eo;
}
exit(EXIT_SUCCESS); }
SEE ALSO
grep(1), regex(7) The glibc manual section, Regular Expressions