Ruby / Regular expressions

From WhyNotWiki

Links

http://www.rubycentral.com/book/tut_stdtypes.html Programming Ruby: The Pragmatic Programmer's Guide

http://www.regular-expressions.info/ruby.html Ruby Regexp Class - Regular Expressions in Ruby

[] vs. match/=~

If you just want the (entire) matching text returned, you can do this (simpler but not as powerful):

irb -> "abcdef"[/bcd/]
    => "bcd"
irb -> "abcdef"[/bcda/]
    => nil

If you need more power, such as access to multiple match groups, then you may need to use match/=~:

irb -> "abcdef".match /bcd/
    => #<MatchData:0xb7f99144>
irb -> "abcdef".match /b(c)d(.+)/ ; "#{$1}#{$2}"
    => "cef"
irb -> "abcdef".=~ /b(c)d(.+)/ ; "#{$1}#{$2}"
    => "cef"
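
For the common case of a single capture group, String#[] also accepts a group index as its second argument, so a MatchData object is not always needed. A minimal sketch (the variable names are illustrative only):

```ruby
str = "abcdef"

whole  = str[/b(c)d(.+)/]      # the entire match
first  = str[/b(c)d(.+)/, 1]   # first capture group
second = str[/b(c)d(.+)/, 2]   # second capture group
missed = str[/x(y)z/, 1]       # nil when the pattern does not match

p whole    # "bcdef"
p first    # "c"
p second   # "ef"
p missed   # nil
```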

How to remove a substring

irb -> "aaabaa".sub(/b/, '')
    => "aaaaa"

How to remove a substring multiple times

str.gsub

Returns a copy of str with all occurrences of pattern replaced [...].
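
As a rough illustration of the description above, here is gsub applied to the same input used in the sub! experiments below. The variable names are made up for this sketch; String#delete is shown only as an alternative for the case where the thing to remove is a set of literal characters.

```ruby
input = "aaababaaba"

# gsub returns a new string with every match removed; input is untouched.
without_b = input.gsub(/b+/, '')
p without_b   # "aaaaaaa"
p input       # "aaababaaba"

# gsub! modifies the receiver in place.
copy = input.dup
copy.gsub!(/b+/, '')
p copy        # "aaaaaaa"

# For plain characters, String#delete needs no regular expression at all.
p input.delete('b')   # "aaaaaaa"
```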

Previously the only way I could think to do it:

Only removes the first occurrence:

irb -> input = "aaababaaba"
    => "aaababaaba"
irb -> input.sub!(regexp = /b+/, '')
    => "aaaabaaba"

Doesn't work at all!:

irb -> input = "aaababaaba"; regexp = nil
    => nil
irb -> input.sub!(regexp = /b+/, '') until input !~ regexp ; input
    => "aaababaaba"

Works, once the assignment is moved into the until condition:

irb -> input.sub!(regexp, '') until input !~ (regexp = /b+/) ; input
    => "aaaaaaa"

Without the earlier regexp = nil, though, the same line fails, because of the order in which Ruby parses variables:

irb -> input = "aaababaaba"
    => "aaababaaba"

irb -> input.sub!(regexp, '') until input !~ (regexp = /b+/) ; input
NameError: undefined local variable or method `regexp' for main:Object
        from (irb):2
        from :0

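You have to initialize regexp (even to nil works) before you can read from it. Ruby decides while parsing, in source order, whether a bare name is a local variable, so the regexp passed to sub! is parsed before any assignment to it has been seen and is treated as a method call. This holds even though, at run time, the regexp = /b+/ assignment in the until condition executes before the sub! command.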

irb -> input = "aaababaaba"; regexp = nil
    => nil
irb -> input.sub!(regexp, '') until input !~ (regexp = /b+/) ; input
    => "aaaaaaa"
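
The same behaviour can be reproduced outside irb. The sketch below is only a demonstration: late_regexp is a made-up name, chosen so that nothing earlier in the file has assigned to it.

```ruby
input = "aaababaaba"

# The parser reaches the sub! call first, has not yet seen an assignment to
# late_regexp, and therefore compiles it as a method call. At run time the
# until condition is evaluated first, but that does not change how the
# earlier name was parsed.
error = begin
  input.sub!(late_regexp, '') until input !~ (late_regexp = /b+/)
  nil
rescue NameError => e
  e
end
p error.class   # NameError
p input         # "aaababaaba" (sub! was never reached)

# Assigning to the variable earlier in the source makes it a local variable.
regexp = nil
input.sub!(regexp, '') until input !~ (regexp = /b+/)
p input         # "aaaaaaa"
```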

MatchData

I think it is better practice to use .match and MatchData objects rather than to use =~ and refer to funky global variables like $` and $2...


if (matches = "abcde".match(/.c./))
  puts matches.to_s
  puts matches[0]
  puts matches.pre_match
  puts matches.post_match
end
#outputs:
bcd
bcd
a
e


Can treat it like an array...

   m = /(.)(.)(\d+)(\d)/.match("THX1138.")
   m[0]       #=> "HX1138"
   m[1, 2]    #=> ["H", "X"]
   m[1..3]    #=> ["H", "X", "113"]
   m[-3, 2]   #=> ["X", "113"]

captures vs. to_a:

> match_data = 'foo.html'.match(/(.+)\.(\w+)/)
=> #<MatchData:0x7f1e62f2ba48>

> puts match_data.to_a
foo.html
foo
html

> puts match_data[0..-1]
foo.html
foo
html

> puts match_data[0..-1] == match_data.to_a
true

> puts match_data[0..-1] == match_data.captures
false

> puts match_data.captures
foo
html

But I don't want to create a temporary local variable -- especially not one with a long name like match_data!

You could use a shorter name, like matches or m or md.

Or you could bypass that temporary variable altogether, if all you need is, say, the captures...

> basename, extension = 'foo.html'.match(/(.+)\.(\w+)/).captures
=> ["foo", "html"]
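
One caveat with the one-liner: match returns nil when nothing matches, and calling captures on nil raises NoMethodError. A possible guard, assuming a Ruby version that has the safe navigation operator (&.):

```ruby
# Matching input: both variables are filled in.
basename, extension = 'foo.html'.match(/(.+)\.(\w+)/)&.captures
p [basename, extension]   # ["foo", "html"]

# Non-matching input: &. short-circuits to nil, so both variables end up nil
# instead of an exception being raised.
basename2, extension2 = 'README'.match(/(.+)\.(\w+)/)&.captures
p [basename2, extension2]   # [nil, nil]
```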

pre_match: Returns the portion of the original string before the current match. Equivalent to the special variable $`.
   m = /(.)(.)(\d+)(\d)/.match("THX1138.")
   m.pre_match   #=> "T"

post_match: Returns the portion of the original string after the current match. Equivalent to the special variable $'.
   m = /(.)(.)(\d+)(\d)/.match("THX1138: The Movie")
   m.post_match   #=> ": The Movie"

to_a: Returns the array of matches.
   m = /(.)(.)(\d+)(\d)/.match("THX1138.")
   m.to_a   #=> ["HX1138", "H", "X", "113", "8"]

to_s: Returns the entire matched string.
   m = /(.)(.)(\d+)(\d)/.match("THX1138.")
   m.to_s   #=> "HX1138"

Match groups

This example [1] uses match group number 2:

  /(.)(.)(.)/.match("abc")[2]   #=> "b"

This example shows how to extract the numerical prefix from a string:

irb -> "013_whatever".match(/[0-9]+/)[0]
    => "013"

irb -> "013_whatever".match(/([0-9]+)_([\w]+)/)[1]
    => "013"
irb -> "013_whatever".match(/([0-9]+)_([\w]+)/)[1..2]
    => ["013", "whatever"]

Character classes

Abbreviation   Short for        Meaning
\d             [0-9]            Digit character
\D             [^0-9]           Nondigit
\s             [ \t\r\n\f]      Whitespace character
\S             [^ \t\r\n\f]     Nonwhitespace character
\w             [A-Za-z0-9_]     Word character
\W             [^A-Za-z0-9_]    Nonword character

Anchors

http://www.rubycentral.com/book/tut_stdtypes.html

The patterns ^ and $ match the beginning and end of a line, respectively. The patterns \b and \B match word boundaries and nonword boundaries, respectively. Word characters are letters, numbers, and underscore.
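
A short sketch of these anchors in action. The sample strings are invented for illustration.

```ruby
text = "cat concat cats"

# \b requires a word boundary, \B requires the absence of one.
p text.scan(/\bcat\b/).length   # 1  (only the standalone "cat")
p text.scan(/\bcat/).length     # 2  ("cat" and the start of "cats")
p text.scan(/\Bcat/).length     # 1  (the "cat" inside "concat")

# ^ and $ match at line boundaries, not only at the ends of the string.
lines = "one\ntwo"
p lines =~ /^one$/   # 0
p lines =~ /^two$/   # 4
```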

Multi-line regular expressions

irb -> "line1\nline2"[/.*/]
    => "line1"

irb -> "line1\nline2"[/.*/m]
    => "line1\nline2"
irb -> "
     " <div>
     "   <div>
     "     Contents
     "   </div>
     " </div>
     " "[%r{<div>(.*)</div>}m, 1]
    => "\n  <div>\n    Contents\n  </div>\n"
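
What changes in the examples above is the m flag: in Ruby it makes . match newline characters as well (other languages often call this "dotall" mode). It does not affect ^ and $, which are line-based in Ruby with or without it. A small sketch:

```ruby
text = "line1\nline2"

# Without /m, "." refuses to match the newline between the two lines.
p text[/line1.line2/]    # nil

# With /m, "." matches the newline too.
p text[/line1.line2/m]   # "line1\nline2"

# ^ and $ already work per line, regardless of /m.
p text[/^line2$/]        # "line2"
```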

Greed

Example

irb -> 'prefix1-prefix2-main_filename.rb' =~ /^(.*)-(.*)/; [$1, $2]
    => ["prefix1-prefix2", "main_filename.rb"]

irb -> 'prefix1-prefix2-main_filename.rb' =~ /^(.*?)-(.*)/; [$1, $2]
    => ["prefix1", "prefix2-main_filename.rb"]

Example

irb -> "<div>Contents of 1st div</div><div>Contents of 2nd div</div>"[%r{<div>(.*)</div>}m, 1]
    => "Contents of 1st div</div><div>Contents of 2nd div"

irb -> "<div>Contents of 1st div</div><div>Contents of 2nd div</div>"[%r{<div>(.*?)</div>}m, 1]
    => "Contents of 1st div"
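
The difference matters even more when collecting every match. As a sketch (using scan, with an illustrative variable named html), the non-greedy version yields one entry per div, while the greedy version swallows both divs in a single match:

```ruby
html = "<div>Contents of 1st div</div><div>Contents of 2nd div</div>"

# scan returns an array of capture arrays; flatten gives a flat list.
non_greedy = html.scan(%r{<div>(.*?)</div>}m).flatten
greedy     = html.scan(%r{<div>(.*)</div>}m).flatten

p non_greedy   # ["Contents of 1st div", "Contents of 2nd div"]
p greedy       # ["Contents of 1st div</div><div>Contents of 2nd div"]
```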

Example: Removing an option from a command-line string

Let's say you want to remove option1=? from a list of command-line options, and you don't know what the value of that option ('?') will be.

One's first attempt might look like this (greedy version):

irb -> 'command option1=1 option2=2 option3=3'.gsub(/option1=(.*) /, '')
    => "command option3=3"

but notice how it also removed option2 from the list of options in addition to option1! That's not what we wanted!

Non-greed to the rescue!

irb -> 'command option1=1 option2=2 option3=3'.gsub(/option1=(.*?) /, '')
    => "command option2=2 option3=3"

Now it only matches the minimum necessary before the first space it encounters and then it stops matching. So it matches 'option1=1 '. Perfect. That's exactly what we want.
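
Another way to express the same idea, offered only as a sketch, is to say what the value consists of (non-space characters, \S*) rather than relying on non-greedy matching. Anchoring on the leading whitespace instead of the trailing space also covers the case where option1 comes last. The pattern name below is made up.

```ruby
# \s*   any whitespace before the option
# \b    make sure "option1" is not the tail of a longer name
# \S*   the value: everything up to the next whitespace
remove_option1 = /\s*\boption1=\S*/

p 'command option1=1 option2=2 option3=3'.gsub(remove_option1, '')
# "command option2=2 option3=3"

p 'command option2=2 option1=1'.gsub(remove_option1, '')
# "command option2=2"
```

Like the non-greedy version, this still assumes that values contain no spaces.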

Side note: this technique doesn't work well if an option's value may contain spaces, because the non-greedy match stops at the first space it finds:

irb -> "command option1='1 + 1' option2=2 option3=3".gsub(/option1=(.*?) /, '')
    => "command + 1' option2=2 option3=3"

In that case, you're better off using a full-featured command-line parser. I've used one; I just can't remember the name.

Or, if you're able to use ARGV, that would work too, since by then the command line has already been split into a list of arguments, with quoting taken into account. With ARGV you would use a different approach from the one described above: rather than using gsub to remove the substring, you would use reject to drop the elements of ARGV that don't suit your fancy.

Example:

  p ARGV
  puts ARGV.join(' ')

  new_args = ARGV.reject {|it| it =~ /option1=/}
  p new_args
  puts new_args.join(' ')

Example command line:

> command option1='1 + 1' option2=2 option3=3

Output:

["command", "option1=1 + 1", "option2=2", "option3=3"]
command option1=1 + 1 option2=2 option3=3

["command", "option2=2", "option3=3"]
command option2=2 option3=3
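
If the command line is only available as a single string rather than as ARGV, one option (an assumption of this sketch, not something the examples above used) is the Shellwords module from Ruby's standard library, which splits a string roughly the way a shell would:

```ruby
require 'shellwords'

command_line = "command option1='1 + 1' option2=2 option3=3"

# Shellwords.split honours the quotes, so the quoted value stays in one piece.
args = Shellwords.split(command_line)
p args   # ["command", "option1=1 + 1", "option2=2", "option3=3"]

# Now the option can be rejected as a whole element, spaces and all.
kept = args.reject { |arg| arg.start_with?('option1=') }
p kept             # ["command", "option2=2", "option3=3"]
puts kept.join(' ')   # command option2=2 option3=3
```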
 

scan

Example use: Say you copied and pasted a list of methods from an RDoc page and you want to convert that list--which is oddly spaced--from this:

the_oddly_spaced_list = "assert   assert_block   assert_equal   assert_in_delta   assert_instance_of   assert_kind_of   assert_match   assert_nil   assert_no_match   assert_not_equal   assert_not_nil   assert_not_same   assert_nothing_raised   assert_nothing_thrown   assert_operator   assert_raise   assert_raises   assert_respond_to   assert_same   assert_send   assert_throws"

to having one word per line.

the_oddly_spaced_list.scan(/\w+/) {|word| puts word}
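
Without a block, scan returns the matches as an array, which is handy if the words are needed for something other than printing. A sketch with a shortened, illustrative list:

```ruby
oddly_spaced = "assert   assert_block   assert_equal"

words = oddly_spaced.scan(/\w+/)
p words   # ["assert", "assert_block", "assert_equal"]

# puts prints each array element on its own line.
puts words

# The same text as a single string, one word per line.
one_per_line = words.join("\n")

# For purely whitespace-separated input, String#split gives the same array.
p oddly_spaced.split == words   # true
```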

Can interpolate strings into regular expressions

irb -> neat_regexp = Regexp.escape('*neat*')
    => "\\*neat\\*"      # a string

irb -> "What a *neat* idea!" =~ /a #{neat_regexp} idea/
    => 5

Converting between strings and regular expressions: Comparison

Expression                  Input type   Output type   Escaped?
Regexp.escape(s)            String       String        yes
String#to_re(s) (Facets)    String       Regexp        yes
Regexp.union(s)             String       Regexp        yes
String#to_rx(s) (Facets)    String       Regexp        no
/#{s}/                      String       Regexp        no
r.to_s                      Regexp       String        N/A (may not be spelled the same if you convert back, but should be equivalent)

Converting strings to regular expressions: Regexp.escape, Regexp.union, and String#to_re

Might be useful if you have some user-supplied input as a string and you need it to be treated as a literal in your regexp -- you want to inoculate the string and remove any special powers that it might otherwise have if it were inserted straight into a regular expression.

irb -> neat_regexp = Regexp.escape('*neat*')
    => "\\*neat\\*"
irb -> "What a *neat* idea!" =~ /a #{neat_regexp} idea/
    => 5
irb -> "What a *neat* idea!" =~ /a #{Regexp.escape("*neat*")} idea/
    => 5
irb -> "What a *neat* idea!" =~ /a \*neat\* idea/
    => 5
irb -> "What a *neat* idea!" =~ /a *neat* idea/
    => nil
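
To make the user-input case concrete, here is a minimal sketch. The lambda name standalone is hypothetical; it checks whether the given text appears as a whitespace-delimited token, which is the kind of larger pattern where escaping the interpolated part matters.

```ruby
# Builds a pattern around arbitrary input; Regexp.escape makes sure that
# characters such as * . $ in the input are matched literally.
standalone = lambda do |text, needle|
  text.match?(/(?:\A|\s)#{Regexp.escape(needle)}(?:\s|\z)/)
end

p standalone.call("What a *neat* idea!", "*neat*")     # true
p standalone.call("What a *neat*ly idea!", "*neat*")   # false
p standalone.call("total: $5.00", "$5.00")             # true
p standalone.call("total: $5x00", "$5.00")             # false ("." is literal)
```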

It's even useful if you have a string literal in your code (as opposed to from user input) that you want to treat as a regular expression without having to worry about the escaping rules!

This is a bit easier to type:

irb -> require 'facets/core/string/to_re'
irb -> 'Are you *sure*? *Really* sure?' =~ 'Are you *sure*?'.to_re
    => 0

than this:

irb -> 'Are you *sure*? *Really* sure?' =~ /Are you \*sure\*\?/
    => 0

, for example.

The difference between Regexp.escape and String#to_re is that Regexp.escape returns a string (which you'd then have to interpolate into a regular expression), whereas String#to_re skips that step and converts straight into a regular expression: more concise but not as flexible.

Notice:

irb -> 'Are you *sure*?'.to_re
    => /Are\ you\ \*sure\*\?/
irb -> Regexp.escape('Are you *sure*?')
    => "Are\\ you\\ \\*sure\\*\\?"

irb -> 'Are you *sure*? *Really* sure?' =~ Regexp.escape('Are you *sure*?')
TypeError: type mismatch: String given
        from (irb):5:in `=~'
irb -> 'Are you *sure*? *Really* sure?' =~ /#{Regexp.escape('Are you *sure*?')}.*\?$/
    => 0
irb -> 'Are you *sure*? *Really* sure?' =~ /#{'Are you *sure*?'.to_re.to_s}.*\?$/
    => 0

It looks like Regexp.union() actually does the same thing as String#to_re:

irb -> /#{Regexp.escape('*.*')}/
    => /\*\.\*/

irb -> Regexp.union('*.*')
    => /\*\.\*/
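
Regexp.union is not limited to one argument: given several strings it escapes each one and joins them as alternatives, and it also accepts Regexp objects, which are embedded unescaped. A sketch with invented sample patterns:

```ruby
# Several strings: each is escaped, then joined with "|".
pattern = Regexp.union('*.*', 'a+b')
p pattern.source                    # "\\*\\.\\*|a\\+b"
p pattern.match?('copy *.* here')   # true
p pattern.match?('a+b')             # true
p pattern.match?('aab')             # false ("+" is literal here)

# Strings and regexps can be mixed; only the strings are escaped.
mixed = Regexp.union('1+1', /\d{3}/)
p mixed.match?('1+1')   # true
p mixed.match?('123')   # true
p mixed.match?('11')    # false
```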

May not be spelled the same if you convert back, but should be equivalent

In general, converting a Regexp to a String does not produce a pretty result. Regexp#to_s has to record all of the flags, even the default ones, to make sure no information is lost during the conversion.

irb -> %r{#{ /(?-mix:.*)/.to_s }}
    => /(?-mix:.*)/

irb -> %r{#{ /(?i-mx:.*)/.to_s }}
    => /(?i-mx:.*)/

but:

irb -> %r{ #{ /.*/.to_s } }
    => / (?-mix:.*) /
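
For comparison, a sketch of the different string forms a Regexp can give you, and of why to_s keeps the flags: when a regexp is interpolated into another one, its own flags have to travel with it.

```ruby
re = /.*/i

p re.to_s      # "(?i-mx:.*)"   flags included, safe to embed
p re.source    # ".*"           just the pattern text, flags dropped
p re.inspect   # "/.*/i"        the literal as you would type it
p (re.options & Regexp::IGNORECASE) != 0   # true

# Interpolation uses to_s, so the /i applies only to the embedded part.
combined = /x#{/y/i}/
p combined.source   # "x(?i-mx:y)"
p "xY" =~ combined  # 0
p "XY" =~ combined  # nil
```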


Bug in Regexp#to_s ?

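At first glance, Regexp#to_s doesn't appear to round-trip properly. The culprit in the second example below, though, is String#to_re rather than Regexp#to_s: to_re escapes its input, so the "(?-mix:...)" wrapper produced by to_s is itself escaped and then has to be matched literally.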

irb -> 'Are you *sure*? *Really* sure?' =~ /#{'Are you *sure*?'.to_re.to_s}/
    => 0

but...

irb -> 'Are you *sure*? *Really* sure?' =~ 'Are you *sure*?'.to_re.to_s.to_re
    => nil
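
The same effect can be reproduced without Facets, which suggests that the escaping step is what breaks the round trip. This sketch assumes that to_re behaves like Regexp.new(Regexp.escape(...)), as the earlier output indicates; Regexp.new on its own does not escape.

```ruby
original = /Are\ you\ \*sure\*\?/
text     = 'Are you *sure*? *Really* sure?'

# Regexp.new does not escape, so the round trip through to_s still matches.
round_tripped = Regexp.new(original.to_s)
p text =~ round_tripped   # 0

# Escaping the to_s output turns "(?-mix:...)" into literal characters.
escaped_again = Regexp.new(Regexp.escape(original.to_s))
p text =~ escaped_again              # nil
p original.to_s =~ escaped_again     # 0 (it now matches only its own spelling)
```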

^ and $ can be used in subexpressions

Example

Say we had this input:

input = ["processor", "processing", "process", "process_with_fluff", "process_without_fluff"]

and want this as output:

["process", "process_with_fluff", "process_without_fluff"]

How would we do it?

These don't work:

irb -> input.grep /^process/
    => ["processor", "processing", "process", "process_with_fluff", "process_without_fluff"]

irb -> input.grep /^process_/
    => ["process_with_fluff", "process_without_fluff"]

irb -> input.grep /^process$/
    => ["process"]

irb -> input.grep /^(process_|process)$/
    => ["process"]

Ah, but this does!:

irb -> input.grep /^(process_|process$)/
    => ["process", "process_with_fluff", "process_without_fluff"]

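# For this input the parentheses can be dropped, because | has the lowest precedence.
# The two patterns are not equivalent in general, though: without the parentheses,
# process$ is no longer tied to ^, so a name like "preprocess" would match too.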
irb -> input.grep /^process_|process$/         
    => ["process", "process_with_fluff", "process_without_fluff"]

# But this is the best / most concise solution of them all...
irb -> input.grep /^process(_|$)/
    => ["process", "process_with_fluff", "process_without_fluff"]

We want all (method) names that either start with "process_" (a prefix) or are exactly "process".
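
The version without parentheses gives the same result for the input above, but only because nothing in that list ends in "process" without also starting with it. A sketch with one extra, made-up name ("preprocess") shows the difference:

```ruby
names = ["processor", "process", "process_with_fluff", "preprocess"]

# Parsed as (^process_) OR (process$): the second branch has no ^.
p names.grep(/^process_|process$/)
# ["process", "process_with_fluff", "preprocess"]

# With the ^ applied to both alternatives, "preprocess" is excluded.
p names.grep(/^(process_|process$)/)
# ["process", "process_with_fluff"]

p names.grep(/^process(_|$)/)
# ["process", "process_with_fluff"]
```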

Application: How to match an exact filename, which may be part of a larger path (an example of "^ and $ can be used in subexpressions").

# This means the filename has to come directly after a '/' character (\/) or has to be the beginning of the path (^).
# The dot is escaped (\.) so that it matches only a literal '.' rather than any character.
irb -> exact_filename_re = /(\/|^)filename\.rb$/
    => /(\/|^)filename\.rb$/

irb -> '/a/really/long/path/filename.rb' =~ exact_filename_re
    => 19

irb -> 'filename.rb' =~ exact_filename_re
    => 0

# But if it's part of a longer filename, it won't match, which is what we want.
irb -> 'a_longer_filename.rb' =~ exact_filename_re
    => nil

(This is useful in conjunction with an include/exclude pattern for FileList, since I've had problems with other matching methods, such as globbing and "plain strings".)

/^...$/ vs. /\A...\z/

Controller User Input Validation » Ruby on Rails Security Blog (http://www.rorsecurity.info/2007/05/29/controller-user-input-validation/) (2007-05-29). Retrieved on 2007-05-11 11:18.


# A file name may be alphanumerical and may contain .-+_
file = parseparam( params[:file], "", "str", nil, /^[\w\.\-\+]+$/)

The last example seems to validate for a valid file name, however it is prone to user agent injection, a file name with embedded JavaScript, such as file.txt\%0A<script>alert('hello')</script>, passes the filter. This is due to the widespread belief that ^ matches the beginning of a string and $ the end, as in other programming languages. In Ruby, however, these characters match the beginning and end of a line, so the above string passes the filter, as it contains a line break (%0A). The correct sequences for Ruby are \A and \z, so the expression from above should read /\A[\w\.\-\+]+\z/.

irb -> "line1
     " line2" =~
       /^line1$/
    => 0              # Matches

irb -> "line1
     " line2" =~
       /\Aline1\z/
    => nil            # Doesn't match
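
Applied to the file-name check quoted above, the difference looks like this. The sketch uses a real newline in place of the URL-encoded %0A, and the variable names are illustrative. It also shows \Z, which, unlike \z, tolerates a single trailing newline.

```ruby
line_anchored   = /^[\w.\-+]+$/
string_anchored = /\A[\w.\-+]+\z/

malicious = "file.txt\n<script>alert('hello')</script>"

# ^...$ is satisfied by the first line alone, so the payload slips through.
p malicious =~ line_anchored     # 0
# \A...\z must cover the whole string, so the check fails as intended.
p malicious =~ string_anchored   # nil

p "file.txt" =~ string_anchored  # 0

# \Z allows one trailing newline; \z does not.
p "file.txt\n" =~ /\A[\w.\-+]+\Z/   # 0
p "file.txt\n" =~ string_anchored   # nil
```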



