Regex to the Rescue - Reinstalling R packages
Sometimes, when you update to a new version of R, download a project from GitHub, or copy a new project from a colleague, the code calls functions or packages that aren’t currently installed. This can be a real pain in the neck the first time you try to run the scripts. If the project has been set up well it’s often a matter of looking into a single file (perhaps a setup.R file), but in other cases, finding all the packages requires iteratively running the various scripts and installing things on a package-by-package basis.
I’ve encountered this often enough that I decided to write a bash script for Linux (and macOS) to (1) search through all R and Rmd files in a directory, (2) return the full list of packages, and then (3) install the packages one by one.
Finding R package calls
Let’s define the scope of the first problem. If we’re going to write something that will detect R packages in a file we need to know: How do people call packages in R? This is a pretty good question in general, and probably not that well defined. We can make an educated guess and choose several fairly common patterns:
library(ggplot2)
library(ggplot2, verbose = TRUE)
library(ggplot2, stringr)
library("ggplot2")
library("ggplot2", "yarrr")
require(stringr)
#' @import fields
#' @importFrom neotoma compile_taxa
neotoma::get_dataset()
dataset %>% dplyr::filter(value > 10) %>% DT::datatable()
These methods of loading a package make up a valid (and common) subset of commands for loading libraries within R. So, if we want to capture a full set of packages we need to be able to figure out how to express the commands above using my favorite tool: regular expressions.
Caveat: There are a number of options for the library() command, and people might call library() any number of ways. While the list above is not an exhaustive list, this reflects the way that I commonly call packages.
Defining our tools
Figure 1. A model program flow for the intended script. A project is downloaded or copied and placed into a folder. The bash script is executed, it checks the directory for package declarations, compares those against packages in the local library folders, and then downloads missing packages. Cloud Icon Created by Yo! Baba from the Noun Project.
This workflow is designed to work as a bash script within a Linux terminal. I am using Ubuntu 18.04, but it should work on a macOS terminal as well. I wrote it to be used as part of a workflow where I could do something like this:
git clone somegitrepowithRcode
bash installLib.sh -i
And then run RStudio or an editor of my choice (I’ve been using Atom more lately) without having to worry about hitting messages about missing packages.
If I wanted to go further with the command line I could edit my .bashrc file to add an alias. For now we’ll work on building the regular expressions and then putting them into a bash file.
Using sed
I use the program sed to perform my regular expression matching. I use sed rather than grep because sed is specifically designed to edit streams of text (the name is a contraction of Stream EDitor). The script will be processing lines of code and returning text to an array, directly interacting with a stream of text. sed also gives us some more tools to work with. For this project I will be using one particular flag with sed:
sed -n pattern source
The -n flag tells sed not to print each line of the input automatically. Without the -n flag sed will print all of the source to the screen and then also print out any matches. When we build the bash script we will want to do something a bit special: we don’t want to return the whole match, we want to write a regular expression that results in a substitution, so that our match to library(ggplot2) returns ggplot2 only. That way we will get a list of packages, and not the full declarations.
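To see what the -n flag changes, here is a minimal sketch that runs the same substitution with and without it. The three-line sample input is made up for illustration, and the pattern itself is the one built up in the sections that follow.

```shell
# Hypothetical sample input: three lines of R code.
sample='library(ggplot2)
x <- 1
library(stringr)'

# Without -n: every input line is printed, and each line where the
# substitution succeeded is printed a second time by the p flag.
without_n=$(printf '%s\n' "$sample" | sed 's/^library[(]\(.*\)[)]/\1/p')

# With -n: only the lines where the substitution succeeded are printed.
with_n=$(printf '%s\n' "$sample" | sed -n 's/^library[(]\(.*\)[)]/\1/p')

printf 'without -n:\n%s\n\nwith -n:\n%s\n' "$without_n" "$with_n"
```

With -n the result is just the two package names, which is the behaviour the script relies on.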
The Pattern
The general form of a sed expression is command/match/substitution/flags. To undertake substitution in sed you start with the command s/. Our assumption is that each call to a package will occur only once per line of code, but from the neotoma::get_dataset() example it should be clear that people can call nested functions or string multiple functions together on a single line using %>% pipes. Because of this we need to implement a global search for that pattern. To do this we add a trailing “global” flag, /g, so we write s/match/substitution/g in most of our sed commands below.
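As a quick illustration of the /g flag, the sketch below replaces :: with a placeholder character (~, chosen arbitrarily) on one of the example lines, first without and then with the flag.

```shell
line='dataset %>% dplyr::filter(value > 10) %>% DT::datatable()'

# Without /g only the first :: on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/::/~/')

# With /g every :: on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/::/~/g')

printf '%s\n%s\n' "$first_only" "$every"
```

Only the second command touches DT::datatable(), which is why the pipe-heavy lines need the global flag.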
Capturing a user’s R packages
Note: I am using regex101 for many of my code examples. It’s a very useful tool, and all complete regular expressions are linked to a page showing how they work in the context of the examples I provided above. I have another post on regular expressions in R that may also be of interest.
Capturing library calls
The first few cases above, where library() is used, should be relatively straightforward. Regex doesn’t just match complete strings, it allows you to use capture groups, specified elements within the full regex match. So for example the regex ^library\((.+)\) will capture (1) any occurrence of library() (2) at the beginning of a line (indicated by ^), (3) with literal brackets (escaped using \( or \)) that (4) enclose any character (.) (5) repeated one or more times (+). By putting the string .+ in brackets we tell the regular expression engine that this match is a special part of the regular expression, a capture group. Since these brackets are to tell the regex engine something special they are not escaped.
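In most regex engines, the capture groups can be returned using the notation either $1 or \1. So we could match library(ggplot2) with ^library\((.+)\) and return ggplot2 with \1. Note that sed’s default (basic) syntax reverses the escaping convention: \( and \) mark a capture group, and a literal bracket is written unescaped or, as here, inside a character class such as [(]. You can try this out in the terminal using: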
echo 'library(ggplot2)' | sed -n 's/^library[(]\(.*\)[)]/\1/p'
Process the sed output
The capture group still captures a variety of library calls, whether quoted (library("ggplot2")), a list of packages (library(ggplot2, cars)), or quoted lists. To manage these various outputs we need to use bash pipes and a program called tr to clean up any extraneous characters and turn the packages into an array that can be used in bash. Try this:
echo 'library("ggplot2", neotoma, "dplyr", verbose = FALSE)' | \
sed -n 's/^library[(]\(.*\)[)]/\1/p' | \
tr "," "\n" | \
tr -d "[\"\\']" | \
sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g"
The sed output (with the -n option) is piped (|) onward. With the first tr we translate all occurrences of , to a newline (\n). The second tr deletes (-d) every occurrence of a single or double quote. We have to escape the double quote, otherwise bash would think it was the end of the quoted text. Because tr takes a plain set of characters rather than a regular expression, the square brackets are not needed to say that either type of quote is acceptable; tr treats them as two more characters to delete, which is harmless here. The last sed command is used to remove the option verbose = FALSE or verbose = TRUE, which may or may not be present in the library command. You can see this line within the context of the final bash script in my GitHub Gist.
The code block above gives a list of packages, separated by a hard return (\n). In a bash script we assign the list to a variable; we can see that things are working by writing the bash file and then executing it from the command line:
#!/bin/bash
library=$(cat R/*.R | \
sed -n 's/^library[(]\(.*\)[)]/\1/p' | \
tr "," "\n" | \
tr -d "[\"\\']" | \
sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g")
echo $library
We can add a second command, replacing library with require, giving us the first set of match requirements.
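Since the library and require pipelines differ only in the function name, one way to avoid repeating them is a small helper. The sketch below is an illustration rather than part of the original script: extract_calls is a hypothetical name, the sample R code is made up, and GNU sed is assumed (for \s and \|).

```shell
# extract_calls is a hypothetical helper: $1 is the R function name
# (library or require); R code is read from standard input.
extract_calls() {
  sed -n "s/^$1[(]\(.*\)[)]/\1/p" |
    tr "," "\n" |
    tr -d "\"'" |
    sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g"
}

# Made-up sample input with one call of each kind.
code='library("ggplot2", verbose = FALSE)
require(stringr)'

libs=$(printf '%s\n' "$code" | extract_calls library)
reqs=$(printf '%s\n' "$code" | extract_calls require)

# Left unquoted on purpose so that bash collapses the leftover whitespace.
echo $libs $reqs
```

The final script keeps the two pipelines written out in full, which is easier to read top to bottom; the helper is only a convenience.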
Matching package imports in roxygen2
In roxygen2, and when people are building packages, it’s possible to call packages using the statement @import or @importFrom. Here we would either declare a single package, or a package and then subsequent functions from that package. For valid roxygen2 markup this needs to be preceded by #', so we can look for something like this:
^\s*#\+'\s\+@import\(From\)\?\s\+\([[:alnum:]]\+\)
We begin the line (^), possibly with space (\s* allows zero or more), followed by the special #' marker (allowing for one or more comment characters: #\+') with at least one space (\s\+) and then @import, which could be followed by From (a group in escaped parentheses followed by an escaped question mark to indicate an optional match: @import\(From\)\?). Neither # nor ' is escaped here, because in GNU sed \' is itself an anchor that matches the end of the text. The capture group, the second group in the pattern (\2), is defined only as [[:alnum:]]\+, one or more characters from the regex class of all alphanumeric values. This is different than the earlier request where we captured \(.*\) because in the library call we expected to potentially obtain comma separated lists, and we needed to account for the possibility of quoted package names. This then becomes the third match in the bash script.
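The roxygen2 match can be tried on its own. The sketch below is one way to write it in GNU sed’s basic regex syntax (an assumption; the final script uses a slightly looser pattern), run against made-up roxygen2 lines. The trailing .* is added so that the rest of the line is discarded.

```shell
# Made-up roxygen2 comment lines for illustration.
roxygen="#' @import fields
#' @importFrom neotoma compile_taxa
#' A plain documentation line"

# \2 is the package name; \1 is the optional From.
imports=$(printf '%s\n' "$roxygen" |
  sed -n "s/^\s*#\+'\s\+@import\(From\)\?\s\+\([[:alnum:]]\+\).*/\2/p")

printf '%s\n' "$imports"
```

Only the two import lines produce output; the plain documentation line is skipped because of -n.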
Matching from pipes (%>%)
The regular expression to capture calls within pipes also catches any call where a function is called with its package explicitly, using package::function(). The call requires the use of perl rather than sed, since perl allows the use of lazy (non-greedy) matches, where sed does not.
perl -pe 's/(.*?)([[:alnum:]]+)(::)(.*?)|./ \2/g' | \
sed '/^\s*$/d'
The regular expression ((.*?)([[:alnum:]]+)(::)(.*?)|.) matches any run of alphanumeric text that is followed by ::, indicating that it is the package whose function is being called. Both (.*?) groups are lazy matches, which try to match as few characters as possible: the first skips over whatever precedes the package name, and the second matches nothing, so the search resumes immediately after the ::. The alternative |. consumes, one character at a time, everything that is not part of a package call; for those matches \2 is empty, so each character is replaced by a single space.
We have to use perl in this case since sed does not recognize non-greedy matches, but the options here are similar. The perl -e flag executes the command in the quotes, and the -p flag wraps it in a loop that reads each line of input and prints the result. This winds up producing a lot of empty space, which is unfortunate, but the sed command then deletes any line that contains only spaces: ^\s*$. This completes the set of regex calls we need for the bash script.
Cleaning an array of packages
Each of these regular expression/perl/sed sequences will return a set of package names. In the bash file these are aggregated into a single long array by chaining them. In some cases the returns from these calls may be separated by only a single space. Passing the library array into tr and replacing spaces with hard returns (\n) gives an array of libraries that we can sort, keeping only unique values (sort -u), returning the unique set of packages.
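The clean-up step can be seen on a small, made-up input with duplicates and uneven spacing. This sketch uses printf and a pipe where the full script uses a here-string, purely so that it also runs in shells other than bash.

```shell
# Made-up aggregated output: duplicates, a newline and a double space.
library='ggplot2 dplyr
ggplot2  neotoma dplyr'

# One name per line, sorted with duplicates dropped, then back onto one line.
installs=$(printf '%s\n' "$library" | tr ' ' '\n' | sort -u | tr '\n' ' ')

# Left unquoted so that the blank entry from the double space disappears.
echo $installs
```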
So the bash file:
#!/bin/bash
library=$(cat $(find . -type f \( -name \*.R -o -name \*.Rmd \)) | \
sed -n 's/^library[(]\(.*\)[)]/\1/p' | \
tr "," "\n" | \
tr -d "[\"\\']" | \
sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g")
library+=$(cat $(find . -type f \( -name \*.R -o -name \*.Rmd \)) | \
sed -n 's/^require[(]\(.*\)[)]/ \1/p' | \
tr "," "\n" | \
tr -d "[\"\\']" | \
sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g")
library+=$(cat $(find . -type f \( -name \*.R -o -name \*.Rmd \)) | \
sed -n 's/^.*\@import\(From\)\?\s\([a-zA-Z]*\)\(\s.*\)\?$/ \2/p')
library+=$(cat $(find . -type f \( -name \*.R -o -name \*.Rmd \)) | \
perl -pe 's/(.*?)([[:alnum:]]+)(::)(.*?)|./ \2/g' | sed '/^\s*$/d')
installs=$(tr ' ' '\n' <<< "${library[@]}" | sort -u | tr '\n' ' ')
echo $installs
Returns Bchron dplyr fields ggplot2 gridExtra maps mgcv neotoma plyr purrr purrrlyr raster readr reshape2 rgdal rmarkdown svglite viridis if you locally clone a project I am currently working on. If the user is not interested in installing the packages the results from the bash script may look like this:
The package ggforce hasn't been installed.
The package ggmap hasn't been installed.
The package giphyR hasn't been installed.
The package gstat hasn't been installed.
The package hdf5 hasn't been installed.
The package highlight hasn't been installed.
Using the script without installing packages may be a good first step, since it will indicate how many packages are required, and also whether the bash script works with the current set of R scripts. For this reason, we use an installation flag.
Installing packages
In the bash script I allow the flag -i using a set of commands at the top of the bash file:
rinstall=0
while getopts "i" OPTION
do
case $OPTION in
i)
echo Running installLib with the option to install packages.
rinstall=1
;;
esac
done
This uses the bash getopts command, and echoes to the screen when a user chooses to install the packages. If they’ve chosen to install the packages then we need to check each package name against the set of currently installed packages.
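To try the flag handling without running the whole script, the same getopts loop can be wrapped in a function and called with and without -i. This is a sketch only: parse_flags is a hypothetical name, and the usage message for unknown flags is an addition that the original script does not have.

```shell
# parse_flags is a hypothetical wrapper around the getopts loop above.
parse_flags() {
  rinstall=0
  OPTIND=1  # reset so the function can be called more than once
  while getopts "i" OPTION; do
    case $OPTION in
      i) rinstall=1 ;;
      *) echo "Usage: installLib.sh [-i]" >&2; return 1 ;;
    esac
  done
}

parse_flags -i
echo "with -i:    rinstall=$rinstall"
parse_flags
echo "without -i: rinstall=$rinstall"
```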
Is the package installed?
First we need the current path for R libraries, which we obtain from the command .libPaths(). Assuming R can be called globally, we can execute rpath=$(Rscript -e "cat(.libPaths())") in the bash script. This sets an internal bash variable to the list of paths, separated by spaces. For each path element we test whether the package is already installed by looking for the directory using test -d "$paths/$onePkg". If the package is not present then install remains 0, otherwise it is changed to 1.
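The directory check can be sketched as a small function. Here is_installed is a hypothetical helper, and a temporary directory stands in for an R library so that the example runs without R; in the script itself the paths would come from rpath.

```shell
# is_installed is a hypothetical helper: $1 is the package name and the
# remaining arguments are the library directories to search.
is_installed() {
  onePkg=$1
  shift
  install=0
  for paths in "$@"; do
    test -d "$paths/$onePkg" && install=1
  done
  test "$install" -eq 1
}

# A temporary directory plays the role of one entry of .libPaths().
fakelib=$(mktemp -d)
mkdir "$fakelib/ggplot2"

is_installed ggplot2 "$fakelib" && echo "ggplot2 is installed."
is_installed giphyR "$fakelib" || echo "The package giphyR hasn't been installed."

rm -r "$fakelib"
```

With the real paths the call would look like is_installed ggplot2 $rpath, with rpath left unquoted so that each path becomes its own argument. That assumes no library path contains a space.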
From there we use the equalities to test whether to install the package or not. If the package isn’t installed and the flag has not been set, then simply print to the screen:
test $install -eq 0 && \
printf " The package %s hasn\'t been installed.\n" $onePkg
Otherwise, if the flag -i has been used, then install the package from the main CRAN repository:
test $install -eq 0 && \
test $rinstall -eq 1 && \
printf " * Will now install the package.\n" && \
Rscript -e "install.packages (\"$onePkg\", repos=\"http://cran.r-project.org/\")"
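The two tests can be read together as a single decision. The sketch below restates them as a hypothetical function, handle_pkg, and prints the Rscript command instead of running it, so it can be tried safely without R. The package names and flag values passed in are made up.

```shell
# handle_pkg is a hypothetical wrapper around the two tests above.
# $1 = package name, $2 = install (1 if already present),
# $3 = rinstall (1 if the -i flag was given).
handle_pkg() {
  onePkg=$1
  install=$2
  rinstall=$3
  if [ "$install" -eq 0 ]; then
    printf " The package %s hasn't been installed.\n" "$onePkg"
    if [ "$rinstall" -eq 1 ]; then
      printf " * Will now install the package.\n"
      # Dry run: the command is printed rather than executed.
      # Remove the leading echo to actually install.
      echo Rscript -e "install.packages(\"$onePkg\", repos=\"http://cran.r-project.org/\")"
    fi
  fi
}

handle_pkg ggforce 0 0   # missing, no -i: report only
handle_pkg ggmap 0 1     # missing, -i given: report and (dry-run) install
handle_pkg ggplot2 1 1   # already installed: nothing to do
```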
Wrapping it up
So, at this point, we can git clone, and copy our bash script (wherever it is) into the cloned directory:
git clone git@github.com:SimonGoring/RegularExpressionR.git
cp installLib.sh ./RegularExpressionR/installLib.sh
cd ./RegularExpressionR
bash installLib.sh -i
and we will have all of our packages installed. For me, this is a huge time saver. If you have suggestions, comments, or want to use the script, check it out on my GitHub Gist. Feel free to comment or edit anything you need.
Caveats
The whole script eventually runs through all R files and checks them all, pulling all the packages and then running the install.packages() command through Rscript. As mentioned before this will not work with every possible way of calling a package, but in most cases failures will either result in trying to install invalid packages (e.g., TRUE or =), or in failing to detect a package call. In addition, this will not install packages that are installed using devtools::install_github(). It will, however, install devtools and, subsequently, if install_github() is called explicitly within the scripts, then the package should be installed.