Ruby / Regular expressions
From WhyNotWiki
Links
http://www.rubycentral.com/book/tut_stdtypes.html Programming Ruby: The Pragmatic Programmer's Guide
http://www.regular-expressions.info/ruby.html Ruby Regexp Class - Regular Expressions in Ruby
[] vs. match/=~
If you just want the (entire) matching text returned, you can do this (simpler but not as powerful):
irb -> "abcdef"[/bcd/]
=> "bcd"
irb -> "abcdef"[/bcda/]
=> nil
If you need more power, such as access to multiple match groups, then you may need to use match/=~:
irb -> "abcdef".match /bcd/
=> #<MatchData:0xb7f99144>
irb -> "abcdef".match /b(c)d(.+)/ ; "#{$1}#{$2}"
=> "cef"
irb -> "abcdef".=~ /b(c)d(.+)/ ; "#{$1}#{$2}"
=> "cef"
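As a small sketch of the difference (the helper names whole_match and groups are made up for illustration, they are not part of Ruby): String#[] also accepts a capture index as a second argument, while match gives access to every group at once.

```ruby
# Hypothetical helpers, not part of Ruby itself.

# Returns the whole matching text, or nil when nothing matches.
def whole_match(str, re)
  str[re]
end

# Returns all capture groups as an array, or nil when nothing matches.
def groups(str, re)
  m = str.match(re)
  m && m.captures
end

p whole_match("abcdef", /bcd/)     # "bcd"
p "abcdef"[/b(c)d(.+)/, 2]         # "ef" (String#[] with a capture index)
p groups("abcdef", /b(c)d(.+)/)    # ["c", "ef"]
p groups("abcdef", /xyz/)          # nil
```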
How to remove a substring
irb -> "aaabaa".sub(/b/, '')
=> "aaaaa"
How to remove a substring multiple times
str.gsub - Returns a copy of str with all occurrences of pattern replaced [...].
Previously the only way I could think to do it:
Only removes the first occurrence:
irb -> input = "aaababaaba"
=> "aaababaaba"
irb -> input.sub!(regexp = /b+/, '')
=> "aaaabaaba"
Doesn't work at all!:
irb -> input = "aaababaaba"; regexp = nil
=> nil
irb -> input.sub!(regexp = /b+/, '') until input !~ regexp ; input
=> "aaababaaba"
irb -> input.sub!(regexp, '') until input !~ (regexp = /b+/) ; input
=> "aaaaaaa"
This unfortunately doesn't work, due to the order in which Ruby parses variables:
irb -> input = "aaababaaba"
=> "aaababaaba"
irb -> input.sub!(regexp, '') until input !~ (regexp = /b+/) ; input
NameError: undefined local variable or method `regexp' for main:Object
from (irb):2
from :0
You have to initialize regexp (even to nil works) before you can read from it. Observe that even though it comes after the sub! command, the regexp = /b+/ initialization in the until expression happens before the sub! command.
irb -> input = "aaababaaba"; regexp = nil
=> nil
irb -> input.sub!(regexp, '') until input !~ (regexp = /b+/) ; input
=> "aaaaaaa"
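For comparison, a minimal sketch of the gsub approach that the quoted documentation describes (the method name remove_all is hypothetical):

```ruby
# Hypothetical helper: removes every occurrence of a pattern.
# gsub returns a new string; gsub! would modify the receiver in place.
def remove_all(str, pattern)
  str.gsub(pattern, '')
end

input = "aaababaaba"
p remove_all(input, /b+/)   # "aaaaaaa"
p input                     # "aaababaaba" (the original is unchanged)
```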
MatchData
Better to use MatchData objects rather than funky globals like $` ...
if (matches = "abcde".match(/.c./))
puts matches.to_s
puts matches[0]
puts matches.pre_match
puts matches.post_match
end
#outputs:
bcd
bcd
a
e
Can treat it like an array...
m = /(.)(.)(\d+)(\d)/.match("THX1138.")
m[0] #=> "HX1138"
m[1, 2] #=> ["H", "X"]
m[1..3] #=> ["H", "X", "113"]
m[-3, 2] #=> ["X", "113"]
pre_match: Returns the portion of the original string before the current match. Equivalent to the special variable $`.
m = /(.)(.)(\d+)(\d)/.match("THX1138.")
m.pre_match #=> "T"
post_match: Returns the portion of the original string after the current match. Equivalent to the special variable $'.
m = /(.)(.)(\d+)(\d)/.match("THX1138: The Movie")
m.post_match #=> ": The Movie"
to_a: Returns the array of matches.
m = /(.)(.)(\d+)(\d)/.match("THX1138.")
m.to_a #=> ["HX1138", "H", "X", "113", "8"]
to_s: Returns the entire matched string.
m = /(.)(.)(\d+)(\d)/.match("THX1138.")
m.to_s #=> "HX1138"
Match groups
This example uses match group number 2:
/(.)(.)(.)/.match("abc")[2] #=> "b"
This example shows how to extract the numerical prefix from a string:
irb -> "013_whatever".match(/[0-9]+/)[0]
=> "013"
irb -> "013_whatever".match(/([0-9]+)_([\w]+)/)[1]
=> "013"
irb -> "013_whatever".match(/([0-9]+)_([\w]+)/)[1..2]
=> ["013", "whatever"]
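One thing the examples above gloss over: match returns nil when the pattern doesn't match, so indexing the result directly raises NoMethodError. A hedged sketch of guarding against that (the helper name numeric_prefix and the exact pattern are assumptions for illustration):

```ruby
# Hypothetical helper: returns the digits before the first underscore,
# or nil when the string has no such prefix.
def numeric_prefix(str)
  m = str.match(/\A([0-9]+)_/)
  m ? m[1] : nil
end

p numeric_prefix("013_whatever")   # "013"
p numeric_prefix("whatever")       # nil
```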
Character classes
| Abbreviation | Short for | Meaning |
|---|---|---|
| \d | [0-9] | Digit character |
| \D | [^0-9] | Nondigit |
| \s | [ \t\r\n\f] | Whitespace character |
| \S | [^ \t\r\n\f] | Nonwhitespace character |
| \w | [A-Za-z0-9_] | Word character |
| \W | [^A-Za-z0-9_] | Nonword character |
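A quick way to try the abbreviations out is to test single characters against each one. This is only a sketch; the helper name matching_classes is invented for the example.

```ruby
# Hypothetical helper: lists which of \d, \s, \w match a single character.
def matching_classes(char)
  { '\d' => /\d/, '\s' => /\s/, '\w' => /\w/ }
    .select { |_name, re| char.match?(re) }
    .keys
end

puts matching_classes("7").join(" ")   # \d \w
puts matching_classes("_").join(" ")   # \w
puts matching_classes(" ").join(" ")   # \s
```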
Anchors
http://www.rubycentral.com/book/tut_stdtypes.html
The patterns ^ and $ match the beginning and end of a line, respectively.
The patterns \b and \B match word boundaries and nonword boundaries, respectively. Word characters are letters, numbers, and underscore.
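The anchors described above can be tried out with a few small helpers (the helper names are hypothetical, chosen only for this sketch):

```ruby
# Hypothetical helpers illustrating the anchors described above.

# First word of every line: ^ matches at the start of each line.
def line_starts(text)
  text.scan(/^\w+/)
end

# Occurrences of "cat" as a standalone word: \b marks a word boundary.
def whole_word_cats(text)
  text.scan(/\bcat\b/)
end

# Occurrences of "cat" that continue a longer word: \B is a non-boundary.
def embedded_cats(text)
  text.scan(/\Bcat/)
end

p line_starts("first line\nsecond line")   # ["first", "second"]
p whole_word_cats("cat concat").length     # 1
p embedded_cats("cat concat").length       # 1
```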
Multi-line regular expressions
irb -> "line1\nline2"[/.*/]
=> "line1"
irb -> "line1\nline2"[/.*/m]
=> "line1\nline2"
irb -> "
" <div>
" <div>
" Contents
" </div>
" </div>
" "[%r{<div>(.*)</div>}m, 1]
=> "\n <div>\n Contents\n </div>\n"
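The /m option can also be passed as Regexp::MULTILINE when building a pattern with Regexp.new, which is handy when the option has to be chosen at run time. A minimal sketch (dot_star is a made-up helper name):

```ruby
# Hypothetical helper: builds the pattern .* with or without the /m option.
def dot_star(multiline)
  Regexp.new('.*', multiline ? Regexp::MULTILINE : 0)
end

text = "line1\nline2"
p text[dot_star(false)]   # "line1"
p text[dot_star(true)]    # "line1\nline2"
```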
Greed
Example
irb -> 'prefix1-prefix2-main_filename.rb' =~ /^(.*)-(.*)/; [$1, $2]
=> ["prefix1-prefix2", "main_filename.rb"]
irb -> 'prefix1-prefix2-main_filename.rb' =~ /^(.*?)-(.*)/; [$1, $2]
=> ["prefix1", "prefix2-main_filename.rb"]
Example
irb -> "<div>Contents of 1st div</div><div>Contents of 2nd div</div>"[%r{<div>(.*)</div>}m, 1]
=> "Contents of 1st div</div><div>Contents of 2nd div"
irb -> "<div>Contents of 1st div</div><div>Contents of 2nd div</div>"[%r{<div>(.*?)</div>}m, 1]
=> "Contents of 1st div"
Example: Removing an option from a command-line string
Let's say you want to remove option1=? from a list of command-line options, and you don't know what the value of that option ('?') will be.
One's first attempt might look like this (greedy version):
irb -> 'command option1=1 option2=2 option3=3'.gsub(/option1=(.*) /, '')
=> "command option3=3"
but notice how it also removed option2 from the list of options in addition to option1! That's not what we wanted!
Non-greed to the rescue!
irb -> 'command option1=1 option2=2 option3=3'.gsub(/option1=(.*?) /, '')
=> "command option2=2 option3=3"
Now it only matches the minimum necessary before the first space it encounters and then it stops matching. So it matches 'option1=1 '. Perfect. That's exactly what we want.
Side note: This method doesn't work very well if your options' values may contain spaces...
If you wanted to remove option1 including its value (which may include spaces), then that technique probably won't work for you...
irb -> "command option1='1 + 1' option2=2 option3=3".gsub(/option1=(.*?) /, '')
=> "command + 1' option2=2 option3=3"
In that case, you're better off using a full-featured command-line parser. I've used one; I just can't remember the name. (Ruby's standard library includes one, OptionParser.)
Or, if you're able to use ARGV, that would work too, as that only takes spaces into account and properly turns the command line into a list of arguments. If you can use ARGV, then you would use a different approach than described: you would use reject to remove those elements from ARGV that don't suit your fancy, rather than gsub to remove the respective substring.
Example:
p ARGV
puts ARGV.join(' ')
new_args = ARGV.reject {|it| it =~ /option1=/}
p new_args
puts new_args.join(' ')
> command option1='1 + 1' option2=2 option3=3
["command", "option1=1 + 1", "option2=2", "option3=3"]
command option1=1 + 1 option2=2 option3=3
["command", "option2=2", "option3=3"]
command option2=2 option3=3
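If neither a parser nor ARGV is available and you only have the command line as one string, a slightly more tolerant pattern can cope with single-quoted values. This is only a sketch under stated assumptions: the helper name remove_option is invented, and it handles neither double quotes nor escaped quotes.

```ruby
# Hypothetical helper: removes "name=value" from a command string, where the
# value is either single-quoted or a run of non-space characters.
def remove_option(command, name)
  pattern = /\s*\b#{Regexp.escape(name)}=(?:'[^']*'|\S*)/
  command.sub(pattern, '')
end

puts remove_option("command option1=1 option2=2 option3=3", "option1")
# command option2=2 option3=3
puts remove_option("command option1='1 + 1' option2=2 option3=3", "option1")
# command option2=2 option3=3
```

Using a negated class ([^']* and \S*) instead of a non-greedy .*? makes the stopping point explicit, so the match cannot run past the end of the value.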
scan
Example use: Say you copied and pasted a list of methods from an RDoc page and you want to convert that list--which is oddly spaced--from this:
the_oddly_spaced_list = "assert assert_block assert_equal assert_in_delta assert_instance_of assert_kind_of assert_match assert_nil assert_no_match assert_not_equal assert_not_nil assert_not_same assert_nothing_raised assert_nothing_thrown assert_operator assert_raise assert_raises assert_respond_to assert_same assert_send assert_throws"
to having one word per line.
the_oddly_spaced_list.scan(/\w+/) {|word| puts word}
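Without a block, scan returns the matches as an array, which can be more convenient than printing inside the block. A small sketch (one_per_line is a hypothetical helper, and the sample string is shortened):

```ruby
# Hypothetical helper: returns the words of a string, one per line.
def one_per_line(list)
  list.scan(/\w+/).join("\n")
end

puts one_per_line("assert   assert_block \t assert_equal")
# assert
# assert_block
# assert_equal

# With capture groups, scan returns one array per match.
p "a=1 b=2".scan(/(\w)=(\d)/)   # [["a", "1"], ["b", "2"]]
```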
Can interpolate strings into regular expressions
irb -> neat_regexp = Regexp.escape('*neat*')
=> "\\*neat\\*" # a string
irb -> "What a *neat* idea!" =~ /a #{neat_regexp} idea/
=> 5
Converting between strings and regular expressions: Comparison
| Method | input type | output type | escaped? | Notes |
|---|---|---|---|---|
| Regexp.escape(s) | String | String | yes | |
| String#to_re(s) (Facets) | String | Regexp | yes | |
| Regexp.union(s) | String | Regexp | yes | |
| String#to_rx(s) (Facets) | String | Regexp | no | |
| /#{s}/ | String | Regexp | no | |
| r.to_s | Regexp | String | N/A | May not be spelled the same if you convert back, but should be equivalent. |
Converting strings to regular expressions: Regexp.escape, Regexp.union, and String#to_re
Might be useful if you have some user-supplied input as a string and you need it to be treated as a literal in your regexp -- you want to inoculate the string and remove any special powers that it might otherwise have if it were inserted straight into a regular expression.
irb -> neat_regexp = Regexp.escape('*neat*')
=> "\\*neat\\*"
irb -> "What a *neat* idea!" =~ /a #{neat_regexp} idea/
=> 5
irb -> "What a *neat* idea!" =~ /a #{Regexp.escape("*neat*")} idea/
=> 5
irb -> "What a *neat* idea!" =~ /a \*neat\* idea/
=> 5
irb -> "What a *neat* idea!" =~ /a *neat* idea/
=> nil
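Wrapped up as a helper, the user-input case might look like the following sketch (literal_index is a made-up name; for a plain substring search String#index would do the same job, so the escaping matters mostly when the literal is only one part of a larger pattern):

```ruby
# Hypothetical helper: finds user-supplied text literally inside a larger
# string, returning the index of the match or nil.
def literal_index(haystack, user_input)
  haystack =~ /#{Regexp.escape(user_input)}/
end

p literal_index("What a *neat* idea!", "*neat*")   # 7
p literal_index("price: $5.00", "$5.00")           # 7
p literal_index("price: $5x00", "$5.00")           # nil (the "." is literal)
```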
It's even useful if you have a string literal in your code (as opposed to from user input) that you want to treat as a regular expression without having to worry about the escaping rules!
This is a bit easier to type:
irb -> require 'facets/core/string/to_re'
irb -> 'Are you *sure*? *Really* sure?' =~ 'Are you *sure*?'.to_re
=> 0
than this:
irb -> 'Are you *sure*? *Really* sure?' =~ /Are you \*sure\*\?/
=> 0
, for example.
The difference between Regexp.escape and String#to_re is that Regexp.escape returns a string (which you'd then have to interpolate into a regular expression), whereas String#to_re skips that step and converts straight into a regular expression: more concise but not as flexible.
Notice:
irb -> 'Are you *sure*?'.to_re
=> /Are\ you\ \*sure\*\?/
irb -> Regexp.escape('Are you *sure*?')
=> "Are\\ you\\ \\*sure\\*\\?"
irb -> 'Are you *sure*? *Really* sure?' =~ Regexp.escape('Are you *sure*?')
TypeError: type mismatch: String given
from (irb):5:in `=~'
irb -> 'Are you *sure*? *Really* sure?' =~ /#{Regexp.escape('Are you *sure*?')}.*\?$/
=> 0
irb -> 'Are you *sure*? *Really* sure?' =~ /#{'Are you *sure*?'.to_re.to_s}.*\?$/
=> 0
It looks like Regexp.union() actually does the same thing as String#to_re:
irb -> /#{Regexp.escape('*.*')}/
=> /\*\.\*/
irb -> Regexp.union('*.*')
=> /\*\.\*/
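Regexp.union also accepts several arguments, escaping each string and joining them with |. A brief sketch (the wrapper name any_literal is hypothetical):

```ruby
# Hypothetical helper: builds a regexp that matches any of the given
# strings literally.
def any_literal(*strings)
  Regexp.union(*strings)
end

re = any_literal("*.*", "a+b")
p re                 # /\*\.\*|a\+b/
p "x a+b y" =~ re    # 2
p "aab" =~ re        # nil (the "+" is literal, not a quantifier)
```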
May not be spelled the same if you convert back, but should be equivalent
In general, converting a Regexp to a String does not optimize for prettiness. Rather, it has to store all of the flags, even the default ones, to make sure no information is lost during the conversion.
irb -> %r{#{ /(?-mix:.*)/.to_s }}
=> /(?-mix:.*)/
irb -> %r{#{ /(?i-mx:.*)/.to_s }}
=> /(?i-mx:.*)/
but:
irb -> %r{ #{ /.*/.to_s } }
=> / (?-mix:.*) /
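If only the pattern text is wanted, Regexp#source returns it without the flag wrapper. The sketch below compares the two and checks that a round trip through to_s still matches the same input (round_trip is a hypothetical helper name):

```ruby
# Hypothetical helper: converts a Regexp to a String and back again.
def round_trip(re)
  Regexp.new(re.to_s)
end

re = /ab+c/i
p re.source                       # "ab+c" (pattern only, flags dropped)
p re.to_s                         # "(?i-mx:ab+c)" (flags embedded)
p round_trip(re) == re            # false: the spelling differs
p round_trip(re).match?("ABBC")   # true: it still matches the same strings
```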
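Bug in Regexp#to_s ?
At first glance, Regexp#to_s doesn't appear to work properly...
irb -> 'Are you *sure*? *Really* sure?' =~ /#{'Are you *sure*?'.to_re.to_s}/
=> 0
but...
irb -> 'Are you *sure*? *Really* sure?' =~ 'Are you *sure*?'.to_re.to_s.to_re
=> nil
This is probably not a bug in Regexp#to_s, though: String#to_re escapes its input, so the second to_re also escapes the "(?-mix:...)" wrapper that to_s produced, and the resulting regexp only matches that wrapper text literally.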
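The double escaping can be reproduced without Facets. In this sketch, to_re_like is a stand-in that assumes String#to_re behaves like Regexp.new(Regexp.escape(str)), which matches the output shown earlier on this page but is not taken from the Facets source.

```ruby
# Stand-in for the escaping String#to_re (an assumption about its behavior).
def to_re_like(str)
  Regexp.new(Regexp.escape(str))
end

text   = 'Are you *sure*? *Really* sure?'
first  = to_re_like('Are you *sure*?')
second = to_re_like(first.to_s)   # escapes the "(?-mix:...)" wrapper too

p text =~ first          # 0
p text =~ second         # nil
p first.to_s =~ second   # 0: it matches the wrapper text itself
```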
^ and $ can be used in subexpressions
Example
Say we had this input:
input = ["processor", "processing", "process", "process_with_fluff", "process_without_fluff"]
and want this as output:
["process", "process_with_fluff", "process_without_fluff"]
How would we do it?
These don't work:
irb -> input.grep /^process/
=> ["processor", "processing", "process", "process_with_fluff", "process_without_fluff"]
irb -> input.grep /^process_/
=> ["process_with_fluff", "process_without_fluff"]
irb -> input.grep /^process$/
=> ["process"]
irb -> input.grep /^(process_|process)$/
=> ["process"]
Ah, but this does!:
irb -> input.grep /^(process_|process$)/
=> ["process", "process_with_fluff", "process_without_fluff"]
# Notice how the order of precedence is such that this is equivalent (don't need the parentheses)...
irb -> input.grep /^process_|process$/
=> ["process", "process_with_fluff", "process_without_fluff"]
# But this is the best / most concise solution of them all...
irb -> input.grep /^process(_|$)/
=> ["process", "process_with_fluff", "process_without_fluff"]
We want all (method) names that either start with "process_" (a prefix) or are exactly "process".
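The same idea works with the string anchors \A and \z and a non-capturing group, which may be preferable if a name could ever contain a line break. A sketch (process_methods is a hypothetical helper name):

```ruby
# Hypothetical helper: keeps names that are exactly "process" or that
# start with "process_".
def process_methods(names)
  names.grep(/\Aprocess(?:_|\z)/)
end

input = ["processor", "processing", "process", "process_with_fluff", "process_without_fluff"]
p process_methods(input)
# ["process", "process_with_fluff", "process_without_fluff"]
```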
Application: How to match an exact filename, which may be part of a larger path (example of "^ and $ can be used in subexpressions")
# This means the filename has to come directly after a '/' character (\/) or has to be the beginning of the path (^).
irb -> exact_filename_re = /(\/|^)filename.rb$/
=> /(\/|^)filename.rb$/
irb -> '/a/really/long/path/filename.rb' =~ exact_filename_re
=> 19
irb -> 'filename.rb' =~ exact_filename_re
=> 0
# But if it's part of a longer filename, it won't match, which is what we want.
irb -> 'a_longer_filename.rb' =~ exact_filename_re
=> nil
(This is useful in conjunction with an include/exclude pattern for FileList, since I've had problems with other matching methods, such as globbing and "plain strings".)
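Note that the "." in the pattern above is unescaped, so it matches any character. A hedged variant that builds the pattern from a filename and escapes it (exact_filename_re is written here as a hypothetical method rather than the variable used above):

```ruby
# Hypothetical helper: builds a regexp matching an exact filename, either
# on its own or as the last component of a path.
def exact_filename_re(filename)
  /(\/|^)#{Regexp.escape(filename)}$/
end

re = exact_filename_re('filename.rb')
p '/a/really/long/path/filename.rb' =~ re   # 19
p 'filename.rb' =~ re                       # 0
p 'a_longer_filename.rb' =~ re              # nil
p 'filenameXrb' =~ re                       # nil (an unescaped "." would match)
```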
/^...$/ vs. /\A...\z/
Controller User Input Validation » Ruby on Rails Security Blog (http://www.rorsecurity.info/2007/05/29/controller-user-input-validation/) (2007-05-29).
# A file name may be alphanumerical and may contain .-+_
file = parseparam( params[:file], "", "str", nil, /^[\w\.\-\+]+$/)
The last example seems to validate for a valid file name, however it is prone to user agent injection, a file name with embedded JavaScript, such as file.txt\%0A<script>alert('hello')</script>, passes the filter. This is due to the widespread belief that ^ matches the beginning of a string and $ the end, as in other programming languages. In Ruby, however, these characters match the beginning and end of a line, so the above string passes the filter, as it contains a line break (%0A). The correct sequences for Ruby are \A and \z, so the expression from above should read /\A[\w\.\-\+]+\z/.
irb -> "line1
" line2" =~
/^line1$/
=> 0 # Matches
irb -> "line1
" line2" =~
/\Aline1\z/
=> nil # Doesn't match
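Applied to the file name check from the quoted article, the difference looks like this. It is a sketch only: valid_file_name? is an invented helper, and the character class is copied from the quote.

```ruby
# Hypothetical helper: accepts only names made of word characters and . - +
# across the WHOLE string, not just one line of it.
def valid_file_name?(name)
  name.match?(/\A[\w\.\-\+]+\z/)
end

evil = "file.txt\n<script>alert('hello')</script>"

p valid_file_name?("file.txt")    # true
p valid_file_name?(evil)          # false
p evil =~ /^[\w\.\-\+]+$/         # 0 (the line anchors let it through)
```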
