=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters. A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells perl to search a string for a match. The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match. In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match. Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return. Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);        # matches
    "cat"        =~ /\143\x61\x74/; # matches, but a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.
To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside. Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class.
The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.'
matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex
C<dog|cat>. As before, perl will try to match the regex at the
earliest possible point in the string. At each character position,
perl will first try to match the first alternative, C<dog>. If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches.
Here, all the
alternatives match at the first string position, so the first matches.

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit. Parts of a regex are grouped by enclosing
them in parentheses. The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;      # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched. For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>. So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ...
Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>,
C<\2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match. Quantifiers are put immediately after the
character, character class, or grouping that we want to specify. They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /\d{2,4}/;      # make sure year is at least 2 but not more
                             # than 4 digits
    $year =~ /\d{4}|\d{2}/;  # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

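The greedy behaviour of quantifiers can be checked with a short script.
What follows is only a minimal sketch and not part of the original guide:
the helper C<captures> and the sample strings other than
C<'the cat in the hat'> are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the guide): in list context it returns
# the captured groups of a match, or an empty list if nothing matched.
sub captures {
    my ($string, $regex) = @_;
    return $string =~ $regex;
}

# The first .* is greedy: it takes everything it can while still
# leaving an 'at' for the second group, so the last .* gets nothing.
my @greedy = captures('the cat in the hat', qr/^(.*)(at)(.*)$/);
print join('|', @greedy), "\n";   # prints "the cat in the h|at|"

# {n,m} is greedy as well: \d{2,4} takes all four digits if present.
my ($year) = captures('in 2015 or so', qr/(\d{2,4})/);
print "$year\n";                  # prints "2015"

# With alternation, \d{4} is tried first and \d{2} only if it fails.
my ($short) = captures('year 99', qr/\b(\d{4}|\d{2})\b/);
print "$short\n";                 # prints "99"
```

The C<\b> anchors in the last regex are an addition for this sketch; they
keep a three-digit number from matching through its first two digits.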

=head2 More matching

There are a few more things you might want to know about matching
operators. In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate C<$pattern> each time through the loop. If
C<$pattern> won't be changing, use the C<//o> modifier to perform
variable substitutions only once. If you don't want any
substitutions at all, use the special delimiter C<m''>:

    @pattern = ('Seuss');
    m/@pattern/; # matches 'Seuss'
    m'@pattern'; # matches the literal string '@pattern'

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible. In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regex/gc>.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regex>. The operator C<=~> is
also used here to associate a string with C<s///>.
If matching
against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring. Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;  # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;      # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used
C<s'''>, then the regex and replacement are treated as single quoted
strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list. The regex determines the character sequence
that C<string> is split with respect to.
For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718, 3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters. If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of $x matched the regex, C<split> prepended
an empty initial element to the list.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide. For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut