Occurrence data

Overview

Teaching: 10 min
Exercises: 10 min

Objectives

Download occurrence data through an API.

Filter occurrence data.


2.1 API

GBIF.org hosts about 1 billion biodiversity records.

What does an API look like?

Paste this URL into a web browser: http://api.gbif.org/v1/occurrence/search?year=1800,1899

What is an API? (Application Programming Interface)
An API (Application Programming Interface) is a software intermediary that lets two applications talk to each other. Each time you use an app like Facebook, send an instant message, or check the weather on your phone, you’re using an API.
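To make the idea concrete, the sketch below assembles a request URL like the one shown earlier from a base address and a list of query parameters, using only base R. The helper name build_query_url is hypothetical, and the limit parameter is included on the assumption that the endpoint supports it for paging; check the GBIF API documentation before relying on it.

```r
# Hypothetical helper: join a base URL and named query parameters.
build_query_url <- function(base, params) {
  # build "name=value" pairs and separate them with "&"
  query <- paste(names(params), unlist(params), sep = "=", collapse = "&")
  paste0(base, "?", query)
}

url <- build_query_url(
  "http://api.gbif.org/v1/occurrence/search",
  list(year = "1800,1899", limit = 5)  # limit is an assumed paging parameter
)
print(url)

# With an internet connection and the jsonlite package installed, the
# JSON response could be parsed into an R list (not run here):
# res <- jsonlite::fromJSON(url)
# str(res, max.level = 1)
```

The point is that an API request is just a URL with parameters, which is what packages such as dismo build for you behind the scenes.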

2.1 Download occurrence data

gbif() is a function in the dismo package that downloads occurrence records directly through the GBIF API. Here we query the number of records for the nine-banded armadillo without downloading them.

library(dismo)
gbif(genus="Dasypus",species="novemcinctus",download=FALSE)

[1] 7520

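By setting download=TRUE, we can download all records. The result is cached in data/occ_raw.rdata, so later runs load it from disk instead of querying GBIF again.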

dir.create("data")

if(!file.exists("data/occ_raw.rdata")){
  occ_raw <- gbif(genus="Dasypus",species="novemcinctus",download=TRUE) 
  save(occ_raw,file = "data/occ_raw.rdata")
}else{
  load("data/occ_raw.rdata")
}

# view the first few records of the occurrence dataset
head( occ_raw )
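The download-once pattern above can be wrapped in a small reusable function. The following is a minimal sketch in base R, not part of dismo: the names cached and slow_download are hypothetical, and a toy data frame stands in for the real gbif() call so the example runs without a network connection. It uses saveRDS()/readRDS() rather than save()/load(), which is a swapped-in choice that returns the object directly instead of restoring it by name.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise compute it, save it to disk, and return it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    value <- compute()
    saveRDS(value, path)
    value
  }
}

cache_file <- tempfile(fileext = ".rds")

slow_download <- function() {
  message("downloading...")  # stands in for the gbif() call
  data.frame(lat = c(10.5, 19.1), lon = c(-84.6, -104.5))
}

first  <- cached(cache_file, slow_download)  # runs slow_download()
second <- cached(cache_file, slow_download)  # read from disk, no message
all.equal(first, second)
```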

2.2 List of biodiversity databases and their R packages

Table 1. List of biodiversity databases and their R packages.

Database	R package
BIEN	BIEN
BISON	rbison
eBird	rebird
GBIF	rgbif
iNaturalist	rinat
VertNet	rvertnet
iDigBio	ridigbio

Conveniently, you can query many of these databases at once using the spocc package, developed by rOpenSci.

2.3 Occurrence data in Darwin Core

Take a look at the columns of the GBIF occurrences.

names(occ_raw)[1:20]

 [1] "acceptedNameUsage"             "acceptedScientificName"       
 [3] "acceptedTaxonKey"              "accessRights"                 
 [5] "adm1"                          "adm2"                         
 [7] "associatedReferences"          "basisOfRecord"                
 [9] "behavior"                      "bibliographicCitation"        
[11] "catalogNumber"                 "class"                        
[13] "classKey"                      "cloc"                         
[15] "collectionCode"                "collectionID"                 
[17] "continent"                     "coordinatePrecision"          
[19] "coordinateUncertaintyInMeters" "country"                      

The meaning of these columns (terms) is defined by Darwin Core. Refer to the Darwin Core quick reference guide for more information.

A few columns to highlight:

basisOfRecord
- The specific nature of the data record.
- PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence
year
- The four-digit year in which the Event occurred, according to the Common Era Calendar.
lat and lon (decimalLatitude and decimalLongitude in Darwin Core)
- The geographic latitude and longitude, in decimal degrees, of the geographic center of a Location. Positive latitudes are north of the Equator and negative latitudes are south of it; positive longitudes are east of the Greenwich Meridian and negative longitudes are west of it. Legal values lie between -90 and 90 for latitude and between -180 and 180 for longitude, inclusive.
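The legal ranges above translate directly into a validity check. Here is a small sketch on made-up records (the data frame occ and its values are invented for illustration) that flags rows whose coordinates are missing or outside the legal range.

```r
# Toy records; the values are made up for illustration.
occ <- data.frame(
  lat = c(10.5, 95.0, NA, -33.9),
  lon = c(-84.6, 20.0, 30.0, 200.0)
)

# TRUE only when both coordinates are present and inside the legal range.
valid <- !is.na(occ$lat) & !is.na(occ$lon) &
  occ$lat >= -90  & occ$lat <= 90 &
  occ$lon >= -180 & occ$lon <= 180

valid
occ[valid, ]
```

Only the first row passes: the second has a latitude above 90, the third has a missing latitude, and the fourth has a longitude above 180.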

2.4 Clean occurrence data

Some of our records have missing coordinates, so we need to remove them from the dataset. To do this, we create a new dataset named “occ_clean”, a subset of “occ_raw” from which records with a missing latitude and/or longitude are removed.

# remove records where either the latitude or the longitude is missing
occ_clean <- subset(occ_raw,(!is.na(lat))&(!is.na(lon))) 
# "!" negates a logical value, so !is.na(x) is TRUE where x is not missing

# show the number of records removed from the dataset
cat(nrow(occ_raw)-nrow(occ_clean), "records are removed")

2401 records are removed

Remove duplicated records, that is, records that share the same latitude and longitude

dups <- duplicated( occ_clean[c("lat","lon")]  )
occ_unique <- occ_clean[!dups,]
cat(nrow(occ_clean)-nrow(occ_unique), "records are removed")

1472 records are removed
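To see exactly which rows duplicated() flags, here is a small sketch on made-up records (the data frame occ is invented for illustration). Note that the first record at each location is kept, so a later record at the same coordinates is dropped even if its year differs.

```r
# Toy records; rows 1 and 2 share the same coordinates.
occ <- data.frame(
  lat  = c(10.5, 10.5, 19.1, 10.5),
  lon  = c(-84.6, -84.6, -104.5, -90.0),
  year = c(1975, 1990, 1980, 1975)
)

# TRUE for every row whose lat/lon pair already appeared in an earlier row
dups <- duplicated(occ[c("lat", "lon")])
dups

# keep the first record at each location
occ[!dups, ]
```

Row 4 is kept because only its latitude matches an earlier row; both coordinates must match for a record to count as a duplicate.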

Show the frequency table of “basisOfRecord”

table(occ_unique$basisOfRecord)

    FOSSIL_SPECIMEN   HUMAN_OBSERVATION     LIVING_SPECIMEN 
                 13                2444                   1 
MACHINE_OBSERVATION         OBSERVATION  PRESERVED_SPECIMEN 
                 33                  27                 921 
            UNKNOWN 
                208 

Only keep records that are associated with a preserved specimen

occ_unique_specimen <- subset(occ_unique, basisOfRecord=="PRESERVED_SPECIMEN")
cat(nrow(occ_unique_specimen), "out of ", nrow(occ_unique), "records are specimen")

921 out of  3647 records are specimen
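If more than one record type should be kept, the equality test can be replaced with %in%. The sketch below uses made-up records (the data frame occ is invented for illustration); which types to keep depends on your study and is only an assumption here.

```r
# Toy records with several basisOfRecord values.
occ <- data.frame(
  id = 1:5,
  basisOfRecord = c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION",
                    "FOSSIL_SPECIMEN", "PRESERVED_SPECIMEN", "UNKNOWN")
)

# record types to keep (an illustrative choice)
keep <- c("PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN")

occ_specimen <- subset(occ, basisOfRecord %in% keep)
table(occ_specimen$basisOfRecord)
```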

Show the histogram of “year”

hist(occ_unique_specimen$year)


Filter the records by year; in this example we keep 1950 to 2000:

occ_final <- subset(occ_unique_specimen, year>=1950 & year <=2000)

Show a quick summary of the years in the data

summary(occ_final$year)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1950    1965    1976    1977    1989    2000 
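One detail worth knowing about the year filter: subset() silently drops rows where the condition evaluates to NA, so records without a year are removed along with those outside the range. A small sketch on made-up records (the data frame occ is invented for illustration):

```r
# Toy records; the third one has no year.
occ <- data.frame(id = 1:4, year = c(1949, 1975, NA, 2000))

# the comparison yields NA for the record without a year
in_range <- occ$year >= 1950 & occ$year <= 2000
in_range

# subset() keeps only rows where the condition is TRUE
kept <- subset(occ, year >= 1950 & year <= 2000)
kept$id

# number of records dropped because the year is missing
sum(is.na(occ$year))
```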

2.5 Make occurrence data spatial

Make the occurrences spatial by telling R which columns hold the coordinates; a coordinate reference system is assigned to the spatial points afterwards

occ_final_COPY <- occ_final
coordinates(occ_final) <- ~ lon + lat

Note that after we make the data frame spatial, the data frame object is transformed into a spatial object

cat("the previous object is: ", class(occ_final_COPY),"\n")  

the previous object is:  data.frame

cat("the new object is: ",class(occ_final),"\n" )

the new object is:  SpatialPointsDataFrame

We can view the coordinates and the data that are associated with the spatial object

head(occ_final@coords)

       lon      lat
 -84.55206 10.49557
-104.51337 19.13245
-100.51001 31.30495
-103.90280 19.16453
 -90.88333 16.15000
 -94.82222 16.43611

#head(occ_final@data)

Read the CRS of the spatial object; it is NA because it has not been defined yet.

crs(occ_final)

CRS arguments: NA

Now we define a CRS object

# Define the coordinate reference system to use. Here we show several examples:
myCRS1 <- CRS("+init=epsg:4326") # WGS 84
myCRS2 <- CRS("+init=epsg:4269") # NAD 83
myCRS3 <- CRS("+init=epsg:3857") # WGS 84 / Pseudo-Mercator (Web Mercator)
myCRS4 <- CRS("+init=epsg:3413") # WGS 84 / NSIDC Sea Ice Polar Stereographic North

You can find the full reference list on the spatialreference.org website.

Assign the Coordinate Reference System (CRS) to our spatial occurrence object

crs(occ_final) <- myCRS1
crs(occ_final)

CRS arguments:
 +init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84
+towgs84=0,0,0 

After defining the CRS, we can project the points to a different CRS

occ_final_projected <- spTransform(occ_final, myCRS3)
plot(occ_final)


plot(occ_final_projected)

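To get a feel for what a projection does to the numbers, the sketch below applies the spherical Web Mercator (EPSG:3857) forward formulas in base R. This is only an illustration of the arithmetic, not how spTransform() is implemented; the function name to_web_mercator is hypothetical, and the sphere radius of 6378137 m is the value conventionally used for Web Mercator.

```r
# Spherical Web Mercator forward formulas: degrees in, metres out.
to_web_mercator <- function(lon, lat, R = 6378137) {
  x <- R * lon * pi / 180
  y <- R * log(tan(pi / 4 + lat * pi / 360))
  cbind(x = x, y = y)
}

# the origin, the first occurrence shown earlier, and the antimeridian
xy <- to_web_mercator(lon = c(0, -84.55206, 180),
                      lat = c(0, 10.49557, 0))
xy
```

Longitude and latitude in degrees become x and y in metres, which is why the axes of the projected plot show values in the millions rather than between -180 and 180.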

After we transform a data frame into a spatial object, we can still subset it; for example, here we keep only occurrences north of the Equator, using the second coordinate column (latitude)

occ_north <- subset(occ_final, occ_final@coords[,2] >=0)
plot(occ_north)


Or we can subset by a data column such as year

occ_1990 <- subset(occ_final, year ==1990)
plot(occ_1990)


2.6 Read/Write shapefiles

dir.create("temp")
shapefile(occ_final,"temp/occ_final.shp",overwrite=TRUE)
loaded_shapefile <- shapefile("temp/occ_final.shp")

Challenge: Download occurrences from GBIF and filter data

- select your favorite species
- only keep specimen records
- only keep records that were collected between 2000 and 2018
- only keep records that have valid longitude and latitude
- make the occurrences spatial
- assign WGS84 as the CRS of the occurrences
- save the spatial object as “myocc_final.shp” in the folder “temp”

Solution

library(dismo)
library(raster)

# download 
myocc <- gbif(genus="Dasypus",species="novemcinctus",download=TRUE) 
 
# filter 
myocc_final <- subset(myocc,basisOfRecord=="PRESERVED_SPECIMEN" &
                            year >= 2000 & year <= 2018 &
                            !is.na(lat) & !is.na(lon)    )

# show number of records that are removed 
nrow(myocc) - nrow(myocc_final)     

# make it spatial
coordinates(myocc_final) <- ~ lon + lat

# define CRS
myCRS1 <- CRS("+init=epsg:4326") # WGS 84

# assign CRS to your occ
crs(myocc_final) <- myCRS1

# write shapefile (create the folder if needed, overwrite an existing file)
dir.create("temp", showWarnings=FALSE)
shapefile(myocc_final,"temp/myocc_final.shp",overwrite=TRUE)
