Regex in R with stringr
Simon Goring
May 9, 2017
Background on stringr
`stringr` is a wrapper, an implementation that sits on top of the `stringi` R package. In addition to these packages you can also use the base regular expression functions (`regexpr()`, `gsub()` and others). It’s critical to point out that the engines behind these functions differ: the base R functions use a POSIX-style engine (TRE) by default, or PCRE when called with `perl = TRUE`, while `stringi` and `stringr` use the ICU engine. The ICU implementation is much closer to the regular expression syntax seen in other programming languages.
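One place where the engines differ is lookahead. The sketch below assumes `stringr` is installed; the pattern and the input string are made up for illustration.

```r
library(stringr)

# A lookahead: match "a" only when it is followed by "b".
pattern <- "a(?=b)"

# Base R's default engine (TRE) is expected to reject this pattern,
# so the call is wrapped to return NA instead of stopping the script.
base_default <- tryCatch(
  grepl(pattern, "ab"),
  error = function(e) NA,
  warning = function(w) NA
)

# Base R with the PCRE engine, and stringr (ICU via stringi), both accept it.
base_perl <- grepl(pattern, "ab", perl = TRUE)
with_stringr <- str_detect("ab", pattern)
```

If a pattern copied from another language fails in `gsub()` or `regexpr()`, trying `perl = TRUE` or the `stringr` equivalent is a reasonable first step.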
Some critical resources
https://regex101.com/ - Online testing of regular expressions with some great tools and breakdowns of expressions.
https://regexr.com/ - Another utility for matching expressions.
https://extendsclass.com - Online visual regex tester.
Base R Regular Expressions, a fairly good resource.
Regular Expressions Info - A pretty big site dedicated to regular expressions, with lots of examples.
Handling and Processing Strings in R by Gaston Sanchez, a UC-Berkeley faculty member, is a big PDF with lots of great content.
Regular expressions
String matching is critical for making the internet work: password validation, checking URLs, email and credit card checks. In fact, regular expressions are associated with the earliest development of modern computers, with a formal notation developing in the mid-1950s. The term `grep` comes from the `ed` editor command `g/re/p` (globally search for a regular expression and print the matching lines), which is essentially what R’s `grep()` function does, and what `grep` does on almost any Linux-based system.
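As a small illustration of `grep()` next to its `stringr` counterpart, here is a sketch on three made-up lines (they are not taken from the tutorial's data file):

```r
library(stringr)

# Made-up lines, used only for illustration.
lines <- c("Goring wrote this.", "Nothing to see here.", "Ask Simon Goring.")

# grep() returns the indices of the elements that match ...
hits <- grep("Goring", lines)

# ... or the matching elements themselves with value = TRUE,
# which is closest to what command-line grep prints.
matched <- grep("Goring", lines, value = TRUE)

# str_detect() returns one TRUE/FALSE per element instead.
flags <- str_detect(lines, "Goring")
```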
For this tutorial I also wrote a small Shiny app to test out some of the functions. You can access the app from my Shiny Apps repo, and, as we continue working, you can clone and make pull requests from the RegularExpressionR GitHub repository. In particular, the “short story” I wrote is intended to throw some curveballs, so if you feel that you can think of some tricks, please go ahead and add them in.
Important Matches
Along with matching specific strings (like `Goring`, the default in the Shiny app), it’s possible to match classes of characters, for example all upper case characters, all lower case characters, or all numbers.
Expression | Explanation |
---|---|
`.` | Matches any single character except a newline (use `\.` to match a literal period). |
`\s` | Matches a whitespace character (space, tab or line break). |
`\d` | Matches a digit. |
`\w` | Matches a word character (letters, digits and the underscore). |
`[:alpha:]` | Matches alphabetical characters. |
`[:upper:]` | Matches upper case letters. |
`[:lower:]` | Matches lower case letters. |
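A short sketch of these classes in use, on a made-up sentence. Two things to keep in mind: inside an R string every regex backslash has to be doubled, and the POSIX classes are written here with an outer bracket (`[[:upper:]]`), which is the form base R requires.

```r
library(stringr)

# A made-up sentence, used only for illustration.
x <- "Call Simon at 555-0199."

# Inside an R string each regex backslash is doubled: "\\d", not "\d".
digits <- str_extract_all(x, "\\d")[[1]]

# \w+ grabs runs of letters, digits and underscores.
words <- str_extract_all(x, "\\w+")[[1]]

# POSIX class for upper case letters, wrapped in an outer bracket.
uppers <- str_extract_all(x, "[[:upper:]]")[[1]]

# An escaped period matches a literal "." rather than any character.
has_period <- str_detect(x, "\\.")
```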
Matching Multiples
Expression | Explanation |
---|---|
`*` | Zero or more repeats. |
`{n}` | Exactly n repeats, where n is a number. |
`{n,}` | n or more repeats. |
`{n,m}` | At least n repeats, but no more than m. |
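The repeat counts are easiest to see on a toy vector. This sketch uses made-up strings, and the `^` and `$` anchors (covered under Matching Locations) force each pattern to account for the whole string.

```r
library(stringr)

# Made-up strings with one to four repeats of "a".
x <- c("a", "aa", "aaa", "aaaa")

exactly_two  <- str_detect(x, "^a{2}$")    # FALSE TRUE FALSE FALSE
two_or_more  <- str_detect(x, "^a{2,}$")   # FALSE TRUE TRUE  TRUE
two_to_three <- str_detect(x, "^a{2,3}$")  # FALSE TRUE TRUE  FALSE

# "*" also accepts zero repeats, so "b*" matches even with no "b" present.
zero_repeats_ok <- str_detect("aaa", "^aaab*$")  # TRUE
```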
Matching Locations
Expression | Explanation |
---|---|
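`^` | At the start. |
`$` | At the end. |
`?` | The lazy question mark! Makes the preceding item optional (zero or one); placed after another quantifier (as in `*?`) it makes that quantifier lazy, matching as little as possible. |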
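Here is a sketch of the anchors and both uses of the question mark. The strings are made up for illustration.

```r
library(stringr)

# Made-up strings, used only for illustration.
x <- c("Goring, Simon", "Simon Goring", "Simon Gorings")

starts_with <- str_detect(x, "^Goring")    # TRUE  FALSE FALSE
ends_with   <- str_detect(x, "Goring$")    # FALSE TRUE  FALSE

# "s?" makes the trailing "s" optional.
optional_s  <- str_detect(x, "Gorings?$")  # FALSE TRUE  TRUE

# After a quantifier, "?" switches from greedy to lazy matching.
html   <- "<b>one</b> and <b>two</b>"
greedy <- str_extract(html, "<b>.*</b>")   # "<b>one</b> and <b>two</b>"
lazy   <- str_extract(html, "<b>.*?</b>")  # "<b>one</b>"
```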
Using stringr
So, given these options, let’s see what we can do. `stringr` has a number of key functions that we’ll explore here (there is also a much more extensive tutorial):
str_detect()
str_extract()
str_locate()
str_match()
str_replace()
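To give a feel for what each of these returns, here is a minimal sketch that runs all five against one made-up string. The variable names are only for illustration.

```r
library(stringr)

# A made-up string, used only for illustration.
x <- "Paid $25 in 2016"

found    <- str_detect(x, "\\d+")              # TRUE: is there a match at all?
first    <- str_extract(x, "\\d+")             # "25": the text of the first match
where    <- str_locate(x, "\\d+")              # matrix with start = 7, end = 8
groups   <- str_match(x, "(\\d+) in (\\d+)")   # full match plus one column per group
replaced <- str_replace(x, "\\d+", "XX")       # "Paid $XX in 2016": first match only
```

Most of these have an `_all` variant (for example `str_extract_all()` and `str_replace_all()`) that works on every match rather than just the first.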
There are a few exercises that I think we should try out. One thing that will be helpful as you’re testing is using the `htmltools::browsable()` function at the end of your call. It will post your text output to the browser window. You can further fix things up by adding tags or `<br>` elements for line breaks.
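A minimal sketch of that idea, assuming the `htmltools` package is installed. The two lines of text are made up.

```r
library(htmltools)

# Made-up lines standing in for the output of a stringr call.
lines <- c("First line", "Second line")

# Join the lines with <br>, mark the result as HTML, and flag it as browsable.
page <- browsable(HTML(paste(lines, collapse = "<br>")))

# In an interactive session, printing `page` opens it in the viewer or browser.
```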
Detecting:
- I made a mistake adding in personal details. When I showed this to my mom she got really mad at me and asked me to remove all mention of her. Let’s find all the lines that mention mom and then cut them. You can read the text file using the `readr` package’s `read_lines()` function, pulling from the same data file as the Shiny app:
file <- 'https://raw.githubusercontent.com/SimonGoring/RegularExpressionR/master/data/raw_file.txt'
text_file <- readr::read_lines(file) %>% ?????
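If you want a hint before trying it on the real file, here is one possible approach sketched on four made-up lines (the vector `story` is hypothetical and does not come from the data file). The word boundaries and the case-insensitive flag are choices of this sketch; the real text may need something different.

```r
library(stringr)

# Made-up lines, NOT the contents of raw_file.txt.
story <- c(
  "My mom called on Tuesday.",
  "The paper came out in 2014.",
  "Mom's number is unlisted.",
  "A moment later it rained."
)

# \\b word boundaries keep "moment" from matching;
# ignore_case = TRUE catches both "mom" and "Mom".
mentions_mom <- str_detect(story, regex("\\bmom\\b", ignore_case = TRUE))

# Keep only the lines that do not mention her.
kept <- story[!mentions_mom]
```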
Extraction
- Extract all the dollar amounts quoted in the story and plot the values. Did you get everything?
file <- 'https://raw.githubusercontent.com/SimonGoring/RegularExpressionR/master/data/raw_file.txt'
text_file <- readr::read_lines(file) %>% ?????
- Extract all the years that the papers were published. Be careful to craft the statement correctly so you don’t get all four character values!
file <- 'https://raw.githubusercontent.com/SimonGoring/RegularExpressionR/master/data/raw_file.txt'
text_file <- readr::read_lines(file) %>% ?????
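As a hint for both extraction exercises, here is a sketch on three made-up lines (again, `story` is hypothetical). It assumes dollar amounts look like `$1,250.50` and that publication years sit inside parentheses; neither assumption is guaranteed to hold for the real file, which is part of the exercise.

```r
library(stringr)

# Made-up lines, NOT the contents of raw_file.txt.
story <- c(
  "The grant was worth $1,250.50 in total.",
  "Goring et al. (2015) spent $40 on coffee.",
  "Sample 2041 was dated by Smith (1998)."
)

# A literal "$", then digits and commas, then optional cents.
dollars <- unlist(str_extract_all(story, "\\$[\\d,]+(\\.\\d{2})?"))

# Strip "$" and "," so the values can be treated as numbers (and plotted).
amounts <- as.numeric(str_replace_all(dollars, "[$,]", ""))

# Four digits directly between parentheses; lookarounds leave the
# parentheses out of the match and skip the bare "2041".
years <- unlist(str_extract_all(story, "(?<=\\()\\d{4}(?=\\))"))
```

Calling `plot(amounts)` would then give a quick look at the values.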
Replacing
In the Shiny app I do string matching, and then replace the text with the matched string, but also add an HTML tag (`<span>`) that includes the highlighting property (`background`). Along with my mom’s personal information, I accidentally added a bunch of phone numbers. Go in and replace all of them with `XXX-XXX-XXXX` so that they’re masked. Can you get them all?
Bonus Question: Can you pull everything out of quotes?
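One possible starting point for both tasks, sketched on two made-up lines (`story` is hypothetical). The phone pattern only covers a couple of common North American layouts, and the quote pattern only handles straight double quotes, so expect the real file to need more work.

```r
library(stringr)

# Made-up lines, NOT the contents of raw_file.txt.
story <- c(
  "Call 608-555-0134 or (608) 555 0199.",
  "She said \"hello\" and then \"goodbye\"."
)

# Optional parentheses around the area code, then optional
# space, period or hyphen separators between the digit groups.
phone <- "\\(?\\d{3}\\)?[ .-]?\\d{3}[ .-]?\\d{4}"
masked <- str_replace_all(story, phone, "XXX-XXX-XXXX")

# Lazy match between straight double quotes; column 2 of the
# str_match_all() result holds the text inside the quotes.
quoted <- str_match_all(story[2], "\"(.*?)\"")[[1]][, 2]
```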
Conclusion
This is just an introduction to some of the tools you can use with regular expressions. The book “Mastering Regular Expressions” is a 500+ page tome, and there are bigger volumes. I also want to point you to a really cool post called “The Greatest Regex Trick Ever”. It’s pretty great at walking you through something that is, on the face of it, pretty simple but is ultimately quite elegant and complicated.