NukeCops - How-to on REGEX, Regular Expressions via Linux Programmer Guide

How-to on REGEX, Regular Expressions via Linux Programmer Guide

NAME
regex - POSIX 1003.2 regular expressions

DESCRIPTION
Regular expressions (``RE''s), as defined in POSIX 1003.2, come in two forms: modern REs (roughly those of egrep; 1003.2 calls
these ``extended'' REs) and obsolete REs (roughly those of ed(1); 1003.2 ``basic'' REs). Obsolete REs mostly exist for back-
ward compatibility in some old programs; they will be discussed at the end. 1003.2 leaves some aspects of RE syntax and seman-
tics open; `(!)' marks decisions on these aspects that may not be fully portable to other 1003.2 implementations.

A (modern) RE is one(!) or more non-empty(!) branches, separated by `|'. It matches anything that matches one of the branches.

A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.

A piece is an atom possibly followed by a single(!) `*', `+', `?', or bound. An atom followed by `*' matches a sequence of 0
or more matches of the atom. An atom followed by `+' matches a sequence of 1 or more matches of the atom. An atom followed by
`?' matches a sequence of 0 or 1 matches of the atom.

A bound is `{' followed by an unsigned decimal integer, possibly followed by `,' possibly followed by another unsigned decimal
integer, always followed by `}'. The integers must lie between 0 and RE_DUP_MAX (255(!)) inclusive, and if there are two of
them, the first may not exceed the second. An atom followed by a bound containing one integer i and no comma matches a
sequence of exactly i matches of the atom. An atom followed by a bound containing one integer i and a comma matches a sequence
of i or more matches of the atom. An atom followed by a bound containing two integers i and j matches a sequence of i through
j (inclusive) matches of the atom.
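
A sketch of how bounds behave, including the rule that the first integer may not exceed the second. The helper ere_check is hypothetical; it assumes a POSIX <regex.h> and reports rejected patterns through regerror(3). The exact wording of the error message is implementation-specific.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 on a match, 0 on no match, and -1 if
 * regcomp() rejects `pattern`; in that case the implementation's error
 * message is printed to stderr. */
int ere_check(const char *pattern, const char *text)
{
    regex_t re;
    char msg[128];
    int rc;

    assert(pattern != NULL && text != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "bad RE '%s': %s\n", pattern, msg);
        return -1;
    }
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, ere_check("^a{2,3}$", "aaa") reports a match, ere_check("^a{2}$", "aaa") does not, and ere_check("a{3,2}", "aaa") returns -1 because the bound is invalid.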

An atom is a regular expression enclosed in `()' (matching a match for the regular expression), an empty set of `()' (matching
the null string)(!), a bracket expression (see below), `.' (matching any single character), `^' (matching the null string at
the beginning of a line), `$' (matching the null string at the end of a line), a `\' followed by one of the characters
`^.[$()|*+?{\' (matching that character taken as an ordinary character), a `\' followed by any other character(!) (matching
that character taken as an ordinary character, as if the `\' had not been present(!)), or a single character with no other
significance (matching that character). A `{' followed by a character other than a digit is an ordinary character, not the
beginning of a bound(!). It is illegal to end an RE with `\'.
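
When arbitrary text has to be matched literally, one approach is to put a backslash in front of each of the special characters listed above. The following is a sketch only; ere_escape is a hypothetical helper, not part of any regex library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `src` into `dst`, inserting a backslash before
 * every character that is special in a modern RE outside a bracket
 * expression. Returns `dst`, or NULL if `dst` (`dstsize` bytes) is too small. */
char *ere_escape(const char *src, char *dst, size_t dstsize)
{
    static const char special[] = "^.[$()|*+?{\\";
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        if (strchr(special, *src) != NULL) {
            if (n + 1 >= dstsize)
                return NULL;
            dst[n++] = '\\';
        }
        if (n + 1 >= dstsize)
            return NULL;
        dst[n++] = *src;
    }
    if (n >= dstsize)
        return NULL;
    dst[n] = '\0';
    return dst;
}
```

Escaping the text 1+1=2? this way yields the pattern 1\+1=2\?, which matches only that literal text.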

A bracket expression is a list of characters enclosed in `[]'. It normally matches any single character from the list (but see
below). If the list begins with `^', it matches any single character (but see below) not from the rest of the list. If two
characters in the list are separated by `-', this is shorthand for the full range of characters between those two (inclusive)
in the collating sequence, e.g. `[0-9]' in ASCII matches any decimal digit. It is illegal(!) for two ranges to share an end-
point, e.g. `a-c-e'. Ranges are very collating-sequence-dependent, and portable programs should avoid relying on them.

To include a literal `]' in the list, make it the first character (following a possible `^'). To include a literal `-', make
it the first or last character, or the second endpoint of a range. To use a literal `-' as the first endpoint of a range,
enclose it in `[.' and `.]' to make it a collating element (see below). With the exception of these and some combinations
using `[' (see next paragraphs), all other special characters, including `\', lose their special significance within a bracket
expression.

Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if it were a single
character, or a collating-sequence name for either) enclosed in `[.' and `.]' stands for the sequence of characters of that
collating element. The sequence is a single element of the bracket expression's list. A bracket expression containing a
multi-character collating element can thus match more than one character, e.g. if the collating sequence includes a `ch' col-
lating element, then the RE `[[.ch.]]*c' matches the first five characters of `chchcc'.

Within a bracket expression, a collating element enclosed in `[=' and `=]' is an equivalence class, standing for the sequences
of characters of all collating elements equivalent to that one, including itself. (If there are no other equivalent collating
elements, the treatment is as if the enclosing delimiters were `[.' and `.]'.) For example, if o and ô are the members of an
equivalence class, then `[[=o=]]', `[[=ô=]]', and `[oô]' are all synonymous. An equivalence class may not(!) be an endpoint of
a range.

Within a bracket expression, the name of a character class enclosed in `[:' and `:]' stands for the list of all characters
belonging to that class. Standard character class names are:

       alnum       digit       punct
       alpha       graph       space
       blank       lower       upper
       cntrl       print       xdigit

These stand for the character classes defined in wctype(3). A locale may provide others. A character class may not be used as
an endpoint of a range.

There are two special cases(!) of bracket expressions: the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at
the beginning and end of a word respectively. A word is defined as a sequence of word characters which is neither preceded nor
followed by word characters. A word character is an alnum character (as defined by wctype(3)) or an underscore. This is an
extension, compatible with but not specified by POSIX 1003.2, and should be used with caution in software intended to be
portable to other systems.
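
Because the word-boundary forms are an extension and not every regcomp(3) accepts them, a portable program might approximate them with ordinary bracket expressions. The following is a sketch of that idea, not an exact equivalent (unlike a null-string match, it consumes the neighbouring character). The helper contains_word is hypothetical and assumes that `word' consists only of word characters, so that it needs no escaping.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if `word` occurs in `text` as a whole
 * word, 0 if not, -1 on error. Assumes `word` contains only alnum
 * characters and underscores. */
int contains_word(const char *text, const char *word)
{
    char pattern[256];
    regex_t re;
    int n, rc;

    assert(text != NULL && word != NULL);
    /* A non-word character (or the line boundary) on each side of the word. */
    n = snprintf(pattern, sizeof pattern,
                 "(^|[^[:alnum:]_])%s($|[^[:alnum:]_])", word);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this, contains_word("a cat sat", "cat") is true, while contains_word("concatenate", "cat") and contains_word("my_cat", "cat") are false because the neighbouring characters are word characters.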

In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the
string. If the RE could match more than one substring starting at that point, it matches the longest. Subexpressions also
match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpres-
sions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take
priority over their lower-level component subexpressions.

Match lengths are measured in characters, not collating elements. A null string is considered longer than no match at all.
For example, `bb*' matches the three middle characters of `abbbc', `(wee|week)(knights|nights)' matches all ten characters of
`weeknights', when `(.*).*' is matched against `abc' the parenthesized subexpression matches all three characters, and when
`(a*)*' is matched against `bc' both the whole RE and the parenthesized subexpression match the null string.
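
The examples above can be checked with a short sketch that returns the length of the overall match or of a parenthesized subexpression. The helper group_len and the MAX_GROUPS limit are hypothetical; a POSIX <regex.h> is assumed.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: length of what subexpression number `group`
 * matched (0 means the whole RE), or -1 if there was no match or the
 * subexpression did not take part in it. */
long group_len(const char *pattern, const char *text, size_t group)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    long len = -1;

    assert(pattern != NULL && text != NULL);
    if (group >= MAX_GROUPS)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, MAX_GROUPS, m, 0) == 0 && m[group].rm_so != -1)
        len = (long)(m[group].rm_eo - m[group].rm_so);
    regfree(&re);
    return len;
}
```

Calling group_len("(wee|week)(knights|nights)", "weeknights", 0) gives 10 and group_len("bb*", "abbbc", 0) gives 3. Passing 1 or 2 as the last argument shows how the match was divided among the subexpressions; implementations do not always agree on those offsets, so that part is worth checking on the target system.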

If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the alphabet. When
an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases, e.g. `x' becomes `[xX]'. When it appears inside a bracket expres-
sion, all case counterparts of it are added to the bracket expression, so that (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes
`[^xX]'.

No particular limit is imposed on the length of REs(!). Programs intended to be portable should not employ REs longer than 256
bytes, as an implementation can refuse to accept such REs and remain POSIX-compliant.

Obsolete (``basic'') regular expressions differ in several respects. `|', `+', and `?' are ordinary characters and there is no
equivalent for their functionality. The delimiters for bounds are `\{' and `\}', with `{' and `}' by themselves ordinary
characters. The parentheses for nested subexpressions are `\(' and `\)', with `(' and `)' by themselves ordinary characters. `^'
is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesized subexpression, `$' is an
ordinary character except at the end of the RE or(!) the end of a parenthesized subexpression, and `*' is an ordinary character
if it appears at the beginning of the RE or the beginning of a parenthesized subexpression (after a possible leading `^').
Finally, there is one new type of atom, a back reference: `\' followed by a non-zero decimal digit d matches the same sequence
of characters matched by the dth parenthesized subexpression (numbering subexpressions by the positions of their opening
parentheses, left to right), so that (e.g.) `\([bc]\)\1' matches `bb' or `cc' but not `bc'.

AUTHOR
This page was taken from Henry Spencer's regex package.

isalnum() [:alnum:]
checks for an alphanumeric character; it is equivalent to (isalpha(c) || isdigit(c)).

isalpha() [:alpha:]
checks for an alphabetic character; in the standard "C" locale, it is equivalent to (isupper(c) || islower(c)). In some
locales, there may be additional characters for which isalpha() is true--letters which are neither upper case nor lower
case.

isascii() [:ascii:]
checks whether c is a 7-bit unsigned char value that fits into the ASCII character set. This function is a BSD exten-
sion and is also an SVID extension.

isblank() [:blank:]
checks for a blank character; that is, a space or a tab. This function is a GNU extension.

iscntrl() [:cntrl:]
checks for a control character.

isdigit() [:digit:]
checks for a digit (0 through 9).

isgraph() [:graph:]
checks for any printable character except space.

islower() [:lower:]
checks for a lower-case character.

isprint() [:print:]
checks for any printable character including space.

ispunct() [:punct:]
checks for any printable character which is not a space or an alphanumeric character.

isspace() [:space:]
checks for white-space characters. In the "C" and "POSIX" locales, these are: space, form-feed ('\f'), newline ('\n'),
carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v').

isupper() [:upper:]
checks for an uppercase letter.

isxdigit() [:xdigit:]
checks for hexadecimal digits, i.e. one of 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F.
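
The classification functions listed above can be passed around as ordinary function pointers. A small sketch (count_class is a hypothetical helper) that counts the characters of a string belonging to one class, assuming the default "C" locale:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the characters of `s` for which the
 * classification function `pred` (isalpha, isdigit, ...) is true. */
size_t count_class(const char *s, int (*pred)(int))
{
    size_t n = 0;

    assert(s != NULL && pred != NULL);
    for (; *s != '\0'; s++) {
        /* The is*() functions expect an unsigned char value (or EOF). */
        if (pred((unsigned char)*s))
            n++;
    }
    return n;
}
```

For the string "Hello, World 42!" this counts 10 characters for isalpha, 2 for isdigit, 2 for ispunct and 2 for isspace.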

NAME
locale - Description of multi-language support

SYNOPSIS
#include <locale.h>

DESCRIPTION
A locale is a set of language and cultural rules. These cover aspects such as language for messages, different character sets,
lexicographic conventions, etc. A program needs to be able to determine its locale and act accordingly to be portable to
different cultures.

The header <locale.h> declares data types, functions and macros which are useful in this task.

The functions it declares are setlocale() to set the current locale, and localeconv() to get information about number format-
ting.

There are different categories for local information a program might need; they are declared as macros. Using them as the
first argument to the setlocale() function, it is possible to set one of these to the desired locale:

LC_COLLATE
This is used to change the behaviour of the functions strcoll() and strxfrm(), which are used to compare strings in the
local alphabet. For example, the German sharp s is sorted as "ss".

LC_CTYPE
This changes the behaviour of the character handling and classification functions, such as isupper() and toupper(), and
the multi-byte character functions such as mblen() or wctomb().

LC_MONETARY
changes the information returned by localeconv() which describes the way numbers are usually printed, with details such
as decimal point versus decimal comma. This information is internally used by the function strfmon().

LC_MESSAGES
changes the language messages are displayed in and what an affirmative or negative answer looks like. The GNU C library
contains the rpmatch() function to ease the use of this information.

LC_NUMERIC
changes the information used by the printf() and scanf() family of functions, when they are advised to use the locale-
settings. This information can also be read with the localeconv() function.

LC_TIME
changes the behaviour of the strftime() function to display the current time in a locally acceptable form; for example,
most of Europe uses a 24-hour clock vs. the US' 12-hour clock.

LC_ALL All of the above.
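
A minimal sketch of using a category macro with setlocale() and then reading the result through localeconv(). The helper decimal_point_in is a hypothetical name. Only the "C" locale is guaranteed to exist; any other locale name is system-dependent.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: switch LC_NUMERIC to the locale `name` and copy
 * that locale's decimal-point string into `out`. Returns `out`, or NULL
 * if the locale is unavailable or `out` (`outsize` bytes) is too small.
 * The locale stays changed after the call. */
char *decimal_point_in(const char *name, char *out, size_t outsize)
{
    const struct lconv *lc;

    assert(name != NULL && out != NULL);
    if (setlocale(LC_NUMERIC, name) == NULL)
        return NULL;
    /* localeconv() returns a pointer to static data, so copy the string. */
    lc = localeconv();
    if (strlen(lc->decimal_point) >= outsize)
        return NULL;
    strcpy(out, lc->decimal_point);
    return out;
}
```

In the "C" locale the result is ".". Passing the empty string "" as the name selects the locale from the environment, in which case the result depends on the user's settings.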

Posted on Friday, September 05 @ 23:55:00 CEST by Zhen-Xjell
