=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.


=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters.  A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells perl to search a string for a match.  The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);        # matches
    "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat

Regexes are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.  To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class.  The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;  # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.
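
As a small illustration of the position rule for C<'-'>, the sketch
below compares a range with classes where the dash comes first or
last.  The sample list and the variable names are made up for this
example, and C<grep> is used only to collect the strings that match.

```perl
use strict;
use warnings;

my @samples = ('a', 'b', 'c', '-');

# '-' between two characters forms a range: a, b, or c
my @in_range = grep { /^[a-c]$/ } @samples;

# '-' in last position is an ordinary character: a, c, or '-'
my @dash_last = grep { /^[ac-]$/ } @samples;

# '-' in first position is an ordinary character as well
my @dash_first = grep { /^[-ac]$/ } @samples;

print "range:      @in_range\n";    # range:      a b c
print "dash last:  @dash_last\n";   # dash last:  a c -
print "dash first: @dash_first\n";  # dash first: a c -
```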

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.  Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;  # matches cat in 'catenates'
    $x =~ /cat\b/;  # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
C<dog|cat>.  As before, perl will try to match the regex at the
earliest possible point in the string.  At each character position,
perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.
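
To see which alternative was chosen, one option is to wrap the
alternation in parentheses and look at what they captured (grouping
and extraction are covered in the next two sections).  This is only a
sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

# In list context the match returns the text captured by the parentheses.
my ($short) = "cats" =~ /(c|ca|cat|cats)/;          # first alternative wins
my ($long)  = "cats" =~ /(cats|cat|ca|c)/;          # 'cats' is listed first
my ($early) = "cats and dogs" =~ /(dog|cat|bird)/;  # 'cat' occurs earlier

print "$short $long $early\n";   # prints "c cats cat"
```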

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit.  Parts of a regex are grouped by enclosing
them in parentheses.  The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match
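
The fallback to the null alternative can be observed by printing what
the group matched.  The sketch below adds the anchors C<^> and C<$>,
which the example above does not have, so that the whole string must
be consumed; the list of sample years is an assumption made for
illustration.

```perl
use strict;
use warnings;

my @results;
for my $year ('20', '1999', '2015') {
    if ($year =~ /^(19|20|)\d\d$/) {
        # $1 holds whichever alternative matched, possibly the empty one
        push @results, "$year: prefix '$1'";
    }
}
print "$_\n" for @results;
# 20: prefix ''
# 1999: prefix '19'
# 2015: prefix '20'
```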

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>.  So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34
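
The numbering can be checked by matching the nested regex in list
context, which returns the groups in order.  The two test strings
below are made up for this sketch; a group that did not take part in
the match comes back undefined.

```perl
use strict;
use warnings;

my $re = qr/(ab(cd|ef)((gi)|j))/;

# Groups are numbered by the position of their opening parenthesis.
my @first  = "abcdgi" =~ $re;   # ('abcdgi', 'cd', 'gi', 'gi')
my @second = "abefj"  =~ $re;   # ('abefj',  'ef', 'j',  undef)

for my $groups (\@first, \@second) {
    print join(", ", map { defined $_ ? $_ : 'undef' } @$groups), "\n";
}
```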

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ...  Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>,
C<\2>, ... only inside a regex.
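
A short sketch of the two forms working together: C<\1> is used
inside the regex, and C<$1> is read after the match succeeds.  The
sample sentence is an assumption chosen for the example.

```perl
use strict;
use warnings;

my $text = "Paris in the the spring";

my $word = '';
# \1 must match the same text that the first group just captured
if ($text =~ /(\w\w\w)\s\1/) {
    $word = $1;   # outside the regex the captured text is in $1
}
print "doubled word: $word\n";   # doubled word: the
```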

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/;  # better match; throw out 3 digit dates
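
A quantifier only constrains the part of the string that the regex
matches, so a check on a whole value also needs the anchors C<^> and
C<$>.  The following sketch applies two anchored year checks to a
made-up list of candidates.

```perl
use strict;
use warnings;

my @candidates = ('99', '123', '1999', '19999');

# 2 to 4 digits and nothing else
my @loose = grep { /^\d{2,4}$/ } @candidates;

# exactly 4 or exactly 2 digits, so '123' is rejected too
my @strict = grep { /^\d{4}$|^\d{2}$/ } @candidates;

print "loose:  @loose\n";    # loose:  99 123 1999
print "strict: @strict\n";   # strict: 99 1999
```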

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators.  In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate C<$pattern> each time through the loop.  If
C<$pattern> won't be changing, use the C<//o> modifier, to only
perform variable substitutions once.  If you don't want any
substitutions at all, use the special delimiter C<m''>:

    @pattern = ('Seuss');
    m/@pattern/; # matches 'Seuss'
    m'@pattern'; # matches the literal string '@pattern'
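
As a sketch of both forms, the loop below filters an in-memory list
instead of reading from C<< <> >>, so that it runs on its own.  The
list contents and variable names are assumptions made for this
example.

```perl
use strict;
use warnings;

my $pattern = 'Seuss';
my @lines   = ("Dr. Seuss\n", "Mother Goose\n", "Seussical\n");

my @hits;
for my $line (@lines) {
    # /o asks perl to substitute $pattern only the first time through
    push @hits, $line if $line =~ /$pattern/o;
}
print @hits;

# With m'' nothing is substituted, so '@pattern' is matched literally
my $literal = ('the array @pattern' =~ m'@pattern') ? 1 : 0;
print "literal match: $literal\n";
```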

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regex/gc>.
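
The difference the C<//c> modifier makes can be seen by checking
C<pos()> after a failed match, with and without it.  This is a
minimal sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $x = "cat dog house";

my $ok = $x =~ /\w+/g;        # matches 'cat', pos($x) is now 3

my $found = $x =~ /\d+/gc;    # fails, but /c keeps the position
my $kept  = pos $x;           # still 3

$found = $x =~ /\d+/g;        # fails again, this time without /c
my $reset = pos $x;           # undef: the position was reset

print "kept:  $kept\n";
print "reset: ", (defined $reset ? $reset : 'undef'), "\n";
```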

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex.  So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regex>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"
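
The return value of C<s///> can be tested directly, for instance to
find out whether anything was changed.  The strings and variable
names in this sketch are made up for illustration.

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";

my $one  = ($x =~ s/batted/hit/);     # one substitution made: returns 1
my $none = ($x =~ s/cricket/golf/);   # no match: returns false, $x unchanged

print "changed\n"   if $one;
print "unchanged\n" unless $none;
print "$x\n";                         # I hit 4 for 4
```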

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used,
as in C<s'''>, then the regex and replacement are treated as single
quoted strings.
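
A brief sketch of two of these delimiter forms follows.  The path and
the C<$price> text are assumptions made for the example; with
C<s'''> the C<$price> in the replacement is kept as literal text and
is not treated as a variable.

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";
$path =~ s{/usr/bin}{/usr/local/bin};   # no need to backslash the '/'

my $note = "total cost";
$note =~ s'cost'$price';                # single quotes: no substitution

print "$path\n";   # /usr/local/bin/perl
print "$note\n";   # total $price
```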

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list.  The regex determines the character sequence
that C<string> is split with respect to.  For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters.  If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regex, C<split> prepended
an empty initial element to the list.
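
The empty regex case has no example above, so here is a minimal
sketch of it, together with a second split that shows the empty
leading element without any groupings.  The input strings are made up
for illustration.

```perl
use strict;
use warnings;

# The empty regex splits the string into individual characters
my @chars = split //, "perl";          # ('p', 'e', 'r', 'l')

# A delimiter at the very start yields an empty first element
my @fields = split /:/, ":usr:bin";    # ('', 'usr', 'bin')

print join("|", @chars), "\n";         # p|e|r|l
print scalar(@fields), " fields\n";    # 3 fields
```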

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut

