Finding Site Locations with GeoDeepDive
Simon Goring
July 16, 2017
GeoDeepDive
GeoDeepDive (GDD) mines publications using OCR and then applies several Natural Language Processing (NLP) utilities to the documents. The GDD output consists of two files. The first, input/bibjson, is a JSON file that contains bibliographic information about the publications. The second (really a set of files, but we’re looking at one in particular) is input/sentences_nlp352. The sentences file contains several elements: a link to the publication, the individual words in each sentence, an index of the ‘words’, and then several renderings of the text using NLP.
To begin, we want to load the packages we will be using, and then import the data:
library(jsonlite)
library(readr)
library(dplyr)
library(stringr)
library(leaflet)
library(purrr)
library(DT)
publications <- fromJSON('input/bibjson', flatten = TRUE)
nlp <- readr::read_tsv('input/sentences_nlp352',
trim_ws = TRUE,
col_names = c('_gddid', 'sentence', 'wordIndex',
'word', 'partofspeech', 'specialclass',
'wordsAgain', 'wordtype', 'wordmodified'))
From this we get an output object that includes a key for the publication (_gddid, linking to the publications variable), the sentence number within the parsed text, and then both the parsed text and some results from natural language processing.
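As a quick check (a minimal sketch, relying only on the columns named in the read_tsv() call above), we can peek at the parsed sentences and confirm that every sentence record links back to a bibliographic entry:
# Peek at the first few parsed sentences (sketch; columns as defined above)
nlp %>%
  select(`_gddid`, sentence, word) %>%
  head()

# Every sentence record should have a matching entry in `publications`
all(nlp$`_gddid` %in% publications$`_gddid`)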
We’re interested in trying to use GDD to obtain site coordinates for sites that are not currently in Neotoma. This would provide two services: (1) it helps Neotoma determine the extent to which its data acquisitions have covered the literature, and (2) it can help researchers searching for relevant sites for use in meta-analysis, or in comparing their results to results in similar geographic locations by providing relevant geocoded publications and links to the publications using DOIs.
Publication Overlap
It’s important to try to find the papers that exist both in the GeoDeepDive records and in Neotoma; these will be our test cases. To do this we match publications$title against the publications in the Neotoma database: we pull all the Neotoma datasets and then get their associated publications, so that site locations and publications are linked in our data model. Since this can be time consuming, we write the output to an rds file so that we can rapidly load and unload the data when knitting. Then I link the dataset ID to each publication.
if('neo_pubs.rds' %in% list.files('data')) {
neo_pubs <- readRDS('data/neo_pubs.rds')
neo_data <- readRDS('data/neo_data.rds')
} else {
neo_data <- neotoma::get_dataset()
neo_pubs <- neotoma::get_publication(neo_data)
saveRDS(neo_data, 'data/neo_data.rds')
saveRDS(neo_pubs, 'data/neo_pubs.rds')
}
for(i in 1:length(neo_pubs)) {
  for(j in 1:length(neo_pubs[[i]])) {
    # Attach the dataset ID to each publication's metadata record:
    neo_pubs[[i]][[j]]$meta$datasetid <- neo_data[[i]]$dataset.meta$dataset.id
  }
}
neo_bound <- lapply(neo_pubs,
function(x) {
lapply(x, function(y)y$meta) %>% bind_rows()
}) %>%
bind_rows() %>% filter(!is.na(id))
This generates a long table with the site and dataset ID (datasetid), the publication type, the year of publication, and the citation. Once we’ve generated this long table for the Neotoma publications we need to match it against the publications in the GeoDeepDive corpus (or partial corpus):
if(!'matches.rds' %in% list.files('data')) {
matches <- list()
for(i in 1:nrow(publications)) {
test <- agrep(publications$title[i],
neo_bound$citation,
ignore.case = TRUE)
if(length(test) == 0) {
matches[[i]] <- data.frame(gddid = publications$`_gddid`[i],
pub = publications$title[i],
year = publications$year[i],
neotoma = NA,
stringsAsFactors = FALSE)
} else {
matches[[i]] <- data.frame(gddid = publications$`_gddid`[i],
pub = publications$title[i],
pubid = neo_bound$id[test],
year = publications$year[i],
dsid = neo_bound$datasetid[test],
neotoma = neo_bound$citation[test],
stringsAsFactors = FALSE)
}
}
matches <- matches %>% bind_rows()
saveRDS(matches, 'data/matches.rds')
} else {
matches <- readRDS('data/matches.rds')
}
In this setup the agrep() command is a bit slow since it cycles through all of the Neotoma publications and matches them against the GeoDeepDive publications, for a total of 4,895,100 approximate agrep() matches. There is most likely a more efficient way of managing this (perhaps using the stringr package), but this is a simple short-term solution.
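One possible alternative (a sketch only: the stringdist package and the matching parameters below are assumptions, not part of this workflow) is to normalize the titles and citations first and then use stringdist::amatch() to vectorize the approximate matching rather than looping over every pair:
# Sketch of a vectorized approximate match; `stringdist` is not used elsewhere
# in this document, and the method/threshold would need tuning.
normalise <- function(x) tolower(gsub('[[:punct:]]', ' ', x))
title_hits <- stringdist::amatch(normalise(publications$title),
                                 normalise(neo_bound$citation),
                                 method = 'jw',
                                 maxDist = 0.25)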
From the matching we get 160 datasets from 25 papers in Neotoma that match 23 GeoDeepDive papers. This is a curious mismatch in paper numbers: we would expect the Neotoma:GeoDeepDive match to be 1:1, with each GeoDeepDive paper matching exactly one Neotoma paper. It appears that this is, in part, the result of a duplicate Neotoma publication entry for Fréchette et al., 2006 (pubids 7473 and 7820).
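A quick way to see where the one-to-many matches come from (a sketch built on the matches table above) is to look for GeoDeepDive papers linked to more than one Neotoma publication ID:
# GeoDeepDive papers that matched more than one Neotoma publication
matches %>%
  filter(!is.na(neotoma)) %>%
  group_by(gddid) %>%
  summarise(n_pubs = n_distinct(pubid)) %>%
  filter(n_pubs > 1)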
Coordinates in the matched papers
Neotoma can provide coordinates for the sites associated with the publications found in both Neotoma and the GDD corpus.
neo_datawide <- neo_data %>% map(function(x){
data.frame( lat = x$site.data$lat,
long = x$site.data$long,
dsid = x$dataset.meta$dataset.id,
snam = x$site.data$site.name,
dst = x$dataset.meta$dataset.type)}) %>%
bind_rows()
matched_loc <- neo_datawide %>% filter(dsid %in% matches$dsid)
leaflet(matched_loc) %>%
addTiles() %>%
addCircleMarkers(lat = ~lat, lng = ~long,
clusterOptions = markerClusterOptions(),
popup = paste0('<b>', matched_loc$snam, '</b><br><b>Type:</b> ',
matched_loc$dst,
'<br><a href=http://apps.neotomadb.org/explorer/?datasetid=',
matched_loc$dsid,'>Explorer Link</a>'))
Getting Coordinates
To obtain coordinates from the paper we must consider several potential issues. The first is that not all coordinates will necessarily refer to an actual pollen core. We may also, inadvertently, find numeric objects that appear to be coordinates but are in fact simply numbers. We must then identify what exactly we think coordinates might look like:
| Coordinates |
|---|
| 45°56’ W |
| 45°56’N |
| 45◦56 W |
| 45◦56’N |
| -45°56’ |
| 123.5° E |
| -123°23’12“ |
From this we can compose two regular expressions, one for DMS (degree-minute-second) coordinates and one for DD (decimal degree) coordinates, since the two formats need to be processed differently:
dms_regex <- "[\\{,]([-]?[1]?[0-9]{1,2}?)(?:(?:,[°◦o],)|(?:[O])|(?:,`{2},))([1]?[0-9]{1,2}(?:.[0-9]*)),[′'`]?[,]?([[0-9]{0,2}]?)[\"]?[,]?([NESWnesw]?),"
dd_regex <- "[\\{,][-]?[1]?[0-9]{1,2}\\.[0-9]{1,}[,]?[NESWnesw],"
These regular expressions allow for negative or positive coordinates whose degree value may start with a 1 and is then followed by one or two digits ({1,2}). From there the two patterns differ, reflecting the need to capture the degree symbols or, in the case of decimal degrees, the decimal component of the coordinates. We are more rigorous for the decimal degrees because plain decimal numbers leave too many other possibilities for false matches.
The regex commands were constructed using capture (and non-capture) groups to work with the stringr package, so that we obtain five elements from any match: the full match, the degrees, the minutes, the seconds (which may be an empty string), and the quadrant (N, E, S or W).
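As a rough sanity check (the string below is an invented fragment mimicking the comma-separated word column, not real GDD output), the decimal-degree pattern picks up a token run such as -123.456,W:
# Invented example of the comma-separated `word` column, not real GDD output
str_extract(",-123.456,W,", dd_regex)
# [1] ",-123.456,W,"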
degmin <- str_match_all(nlp$word, dms_regex)
decdeg <- str_match_all(nlp$word, dd_regex)
Since the documents are broken up into sentences, we should expect coordinates to be reported as pairs, so we are most interested in the records that contain pairs of coordinates. Let’s start by matching up the publications with sentences that have coordinate pairs:
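A minimal sketch of one way to tally which publications contain candidate coordinate matches (using the degmin and decdeg lists built above; the tallying approach itself is a sketch):
# Tally candidate DMS and decimal-degree matches per publication (sketch)
coord_hits <- data.frame(gddid = nlp$`_gddid`,
                         dms   = sapply(degmin, nrow),
                         dd    = sapply(decdeg, nrow),
                         stringsAsFactors = FALSE) %>%
  group_by(gddid) %>%
  summarise(dms = sum(dms), dd = sum(dd)) %>%
  filter(dms > 0 | dd > 0)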
Even here we can see that many of these matches work, but some of the matches are incomplete. The proportion of sites returned is much lower than we might otherwise expect: given that there are 296 articles in the NLP dataset, it is surprising that only 69 appear to contain regex matches to coordinate pairs.
In reality this is likely to be, in part, an issue with the OCR/regex processing. We need to go over the potential matches more thoroughly to find all the alternative ways of indicating the coordinate systems before we can commit to a full analysis.
Converting Coordinates
Given the coordinate strings, we need to be able to transform them into reliable lat/long pairs with sufficient trust to actually map the records. These two functions convert the GeoDeepDive (GDD) word elements pulled out by the regular expression searches into decimal degrees, accounting for the reported hemisphere (N/E/S/W).
convert_dec <- function(x, i) {
  # Strip the commas and the trailing hemisphere letter, then coerce to numeric.
  drop_comma <- gsub(',', '', x) %>%
    substr(., c(1,1), nchar(.) - 1) %>%
    as.numeric %>%
    unlist
  # Assign a sign based on hemisphere: N/E positive, S/W negative.
  domain <- (str_detect(x, 'N') * 1 +
             str_detect(x, 'E') * 1 +
             str_detect(x, 'W') * -1 +
             str_detect(x, 'S') * -1) *
    drop_comma
  # Link back to the publication record for this sentence.
  publ <- match(nlp$`_gddid`[i], publications$`_gddid`)
  point_pairs <- data.frame(sentence = nlp$word[i],
                            lat = domain[str_detect(x, 'N') | str_detect(x, 'S')],
                            lng = domain[str_detect(x, 'E') | str_detect(x, 'W')],
                            publications[publ,],
                            stringsAsFactors = FALSE)
  return(point_pairs)
}
convert_dm <- function(x, i) {
# We use the `i` index so that we can keep the coordinate outputs from the
# regex in a smaller list.
dms <- data.frame(deg = as.numeric(x[,2]),
min = as.numeric(x[,3]) / 60,
sec = as.numeric(x[,4]) / 60 ^ 2,
stringsAsFactors = FALSE)
dms <- rowSums(dms, na.rm = TRUE)
domain <- (str_detect(x[,5], 'N') * 1 +
str_detect(x[,5], 'E') * 1 +
str_detect(x[,5], 'W') * -1 +
str_detect(x[,5], 'S') * -1) *
dms
publ <- match(nlp$`_gddid`[i], publications$`_gddid`)
point_pairs <- data.frame(sentence = nlp$word[i],
lat = domain[x[,5] %in% c('N', 'S')],
lng = domain[x[,5] %in% c('E', 'W')],
publications[publ,],
stringsAsFactors = FALSE)
return(point_pairs)
}
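As a quick worked example (values invented for illustration), a reading of 45°56′30″ N converts to decimal degrees the same way convert_dm() does, by summing degrees, minutes/60 and seconds/3600:
# 45°56'30" N in decimal degrees (illustrative values only)
45 + 56/60 + 30/60^2
# [1] 45.94167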
Then, once we’ve done that, we need to apply those functions to the set of records we’ve pulled to build a composite table:
coordinates <- list()
coord_idx <- 1
for(i in 1:length(decdeg)) {
if((length(decdeg[[i]]) %% 2 == 0 |
length(degmin[[i]]) %% 2 == 0) & length(degmin[[i]]) > 0) {
if(any(str_detect(decdeg[[i]], '[NS]')) &
sum(str_detect(decdeg[[i]], '[EW]')) == sum(str_detect(decdeg[[i]], '[NS]'))) {
coordinates[[coord_idx]] <- convert_dec(decdeg[[i]], i)
coord_idx <- coord_idx + 1
}
if(any(str_detect(degmin[[i]], '[NS]')) &
sum(str_detect(degmin[[i]], '[EW]')) == sum(str_detect(degmin[[i]], '[NS]'))) {
coordinates[[coord_idx]] <- convert_dm(degmin[[i]], i)
coord_idx <- coord_idx + 1
}
}
}
coordinates <- coordinates %>% bind_rows %>%
mutate(sentence = gsub(',', ' ', sentence)) %>%
mutate(sentence = str_replace_all(sentence, '-LRB-', '(')) %>%
mutate(sentence = str_replace_all(sentence, '-RRB-', ')')) %>%
mutate(sentence = str_replace_all(sentence, '" "', ',')) %>%
mutate(matched = ifelse(X_gddid %in% matches$gddid[!is.na(matches$dsid)], "red", "black"))
leaflet(coordinates) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(popup = paste0('<b>', coordinates$title, '</b><br>',
'<a href=http://dx.doi.org/',
coordinates$title,'>Publication Link</a><br>',
'<b>Sentence:</b><br>',
'<small>',gsub(',', ' ', coordinates$sentence),
'</small>'),
color = ~matched)
So here are the sites that we pull out from GeoDeepDive. There are clear limitations to the current methods. First, the only records returned here are records that are currently in Neotoma (coloured red). Colour was used in the plot to differentiate between existing (red) and new (black) records, and it appears that no new records were found.
Error Checking
Given that we have a number of matched records, we can improve our regular expressions by looking at the publications that don’t show up in our pulled records.
missed_match <- matches %>% filter(!is.na(dsid) & !gddid %in% coordinates$X_gddid) %>% distinct(gddid)
This gives a set of 18 articles that ought to have paired locations, but appear to be missing when we apply the regex to the records. So what are they missing?
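One way to start (a sketch; the degree-symbol filter below is only a guess at which sentences are likely to contain coordinates) is to pull the candidate sentences from the missed articles and read them directly:
# Sentences from the missed articles containing a degree-like symbol
# (the symbol set here is a guess, narrower than the regexes above)
missed_sentences <- nlp %>%
  filter(`_gddid` %in% missed_match$gddid) %>%
  filter(str_detect(word, '[°◦]'))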
It is possible to sort through these but, unfortunately, this is an iterative process, and as I work on this project the set of missed matches shrinks as the regular expressions improve.