NEON EDUCATION bio photo

NEON EDUCATION

Devoted to open data and open source in science and education.

View All Tutorials

Tags

LiDAR (7)
R programming (51)
Remote Sensing (12)
Data Visualization (4)
Hyperspectral Remote Sensing (7)
Hierarchical Data Formats (HDF5) (9)
Spatial Data & GIS (18)
Time Series (15)
Phenology (7)
Raster Data (8)
Vector Data (6)
Metadata (1)
Git & GitHub (6)
(1) (1)

Tutorial by R Package

dplyr (7)
ggplot2 (17)
h5py (1)
lubridate (time series) (6)
maps (1)
maptools (2)
plyr (2)
raster (26)
rasterVis (raster time series) (3)
rgdal (GIS) (22)
rgeos (3)
rhdf5 (11)
sp (5)
scales (4)
gridExtra (4)
ggtheme (0)
grid (2)
reshape2 (3)
plotly (5)

View ALL Tutorial Series




Twitter Youtube Github


Blog.Roll

R Bloggers

Overview

Several factors contributed to the extreme flooding that occurred in Boulder, Colorado in 2013. In this data activity, we explore and visualize the data for stream discharge data collected by the United States Geological Survey (USGS). This tutorial is part of the Data Activities that can be used with the Ecological Disturbance Teaching Module.

Learning Objectives

After completing this tutorial, you will be able to:

Things You’ll Need To Complete This Lesson

Please be sure you have the most current version of R and, preferably, RStudio to write your code.

R Skill Level: Intermediate - To succeed in this tutorial, you will need to have basic knowledge for use of the R software program.

R Libraries to Install:

  • ggplot2: install.packages("ggplot2")
  • plotly: install.packages("plotly")

Data to Download

We include directions on how to directly find and access the data from USGS’s National National Water Information System Database. However, depending on your learning objectives you may prefer to use the provided teaching data subset that can be downloaded from the NEON Data Skills account on FigShare.

To more easily follow along with this lesson, use the same organization for your files and folders as we did. First, create a data directory (folder) within your Documents directory. If you downloaded the compressed data file above, unzip this file and place the distub-events-co13 folder within the data directory you created. If you are planning to access the data directly as described in the lesson, create a new directory called distub-events-co13 wihin your data folder and then within it create another directory called discharge. If you choose to save your files elsewhere in your file structure, you will need to modify the directions in the lesson to set your working directory accordingly.

Research Question

What were the patterns of stream discharge prior to and during the 2013 flooding events in Colorado?

About the Data - USGS Stream Discharge Data

The USGS has a distributed network of aquatic sensors located in streams across the United States. This network monitors a suit of variables that are important to stream morphology and health. One of the metrics that this sensor network monitors is Stream Discharge, a metric which quantifies the volume of water moving down a stream. Discharge is an ideal metric to quantify flow, which increases significantly during a flood event.

As defined by USGS: Discharge is the volume of water moving down a stream or river per unit of time, commonly expressed in cubic feet per second or gallons per day. In general, river discharge is computed by multiplying the area of water in a channel cross section by the average velocity of the water in that cross section.

For more on stream discharge by USGS.

The USGS tracks stream discharge through time at locations across the United States. Note the pattern observed in the plot above. The peak recorded discharge value in 2013 was significantly larger than what was observed in other years. Source: USGS, National Water Information System.

Obtain USGS Stream Gauge Data

This next section explains how to find and locate data through the USGS’s National Water Information System portal. If you want to use the pre-compiled dataset downloaded above, you can skip this section and start again at the Work With Stream Gauge Data header.

Step 1: Search for the data

To search for stream gauge data in a particular area, we can use the interactive map of all USGS stations. By searching for locations around “Boulder, CO”, we can find 3 gauges in the area.

For this lesson, we want data collected by USGS stream gauge 06730200 located on Boulder Creek at North 75th St. This gauge is one of the few the was able to continuously collect data throughout the 2013 Boulder floods.

You can directly access the data for this station through the “Access Data” link on the map icon or searching for this site on the National Water Information System portal .

On the Boulder Creek stream gauge 06730200 page, we can now see summary information about the types of data available for this station. We want to select Daily Data and then the following parameters:

  • Available Parameters = 00060 Discharge (Mean)
  • Output format = Tab-separated
  • Begin Date = 1 October 1986
  • End Date = 31 December 2013

Now click “Go”.

Step 2: Save data to .txt

The output is a plain text page that you must copy into a spreadsheet of choice and save as a .csv. Note, you can also download the teaching data set (above) or access the data through an API (see Additional Resources, below).

Work with Stream Gauge Data

R Packages

We will use ggplot2 to efficiently plot our data and plotly to create interactive plots.

# set your working directory
#setwd("working-dir-path-here")

# load packages
library(ggplot2) # create efficient, professional plots
library(plotly) # create cool interactive plots

Import USGS Stream Discharge Data Into R

Now that we better understand the data that we are working with, let’s import it into R. First, open up the discharge/06730200-discharge_daily_1986-2013.txt file in a text editor.

What do you notice about the structure of the file?

The first 24 lines are descriptive text and not actual data. Also notice that this file is separated by tabs, not commas. We will need to specify the tab delimiter when we import our data.We will use the read.csv() function to import it into an R object.

When we use read.csv(), we need to define several attributes of the file including:

  1. The data are tab delimited. We will this tell R to use the "/t" separator, which defines a tab delimited separation.
  2. The first group of 24 lines in the file are not data; we will tell R to skip those lines when it imports the data using skip=25.
  3. Our data have a header, which is similar to column names in a spreadsheet. We will tell R to set header=TRUE to ensure the headers are imported as column names rather than data values.
  4. Finally we will set stringsAsFactors = FALSE to ensure our data come in a individual values.

Let’s import our data.

(Note: you can use the discharge/06730200-discharge_daily_1986-2013.csv file and read it in directly using read.csv() function and then skip to the View Data Structure section).

#import data
discharge <- read.csv("discharge/06730200-discharge_daily_1986-2013.txt",
                      sep="\t",
                      skip=24,
                      header=TRUE,
                      stringsAsFactors = FALSE)

#view first few lines
head(discharge)

##   agency_cd  site_no   datetime X17663_00060_00003 X17663_00060_00003_cd
## 1        5s      15s        20d                14n                   10s
## 2      USGS 06730200 1986-10-01                 30                     A
## 3      USGS 06730200 1986-10-02                 30                     A
## 4      USGS 06730200 1986-10-03                 30                     A
## 5      USGS 06730200 1986-10-04                 30                     A
## 6      USGS 06730200 1986-10-05                 30                     A

When we import these data, we can see that the first row of data is a second header row rather than actual data. We can remove the second row of header values by selecting all data beginning at row 2 and ending at the total number or rows in the file and re-assigning it to the variable discharge. The nrow function will count the total number of rows in the object.

# nrow: how many rows are in the R object
nrow(discharge)

## [1] 9955

# remove the first line from the data frame (which is a second list of headers)
# the code below selects all rows beginning at row 2 and ending at the total
# number of rows. 
discharge <- discharge[2:nrow(discharge),]

Metadata

We now have an R object that includes only rows containing data values. Each column also has a unique column name. However the column names may not be descriptive enough to be useful - what is X17663_00060_00003?.

Reopen the discharge/06730200-discharge_daily_1986-2013.txt file in a text editor or browser. The text at the top provides useful metadata about our data. On rows 10-12, we see that the values in the 5th column of data are “Discharge, cubic feed per second (Mean)”. Rows 14-16 tell us more about the 6th column of data, the quality flags.

Now that we know what the columns are, let’s rename column 5, which contains the discharge value, as disValue and column 6 as qualFlag so it is more “human readable” as we work with it in R.

#view names
names(discharge)

## [1] "agency_cd"             "site_no"               "datetime"             
## [4] "X17663_00060_00003"    "X17663_00060_00003_cd"

#rename the fifth column to disValue representing discharge value
names(discharge)[4] <- "disValue"
names(discharge)[5] <- "qualCode"

#view names
names(discharge)

## [1] "agency_cd" "site_no"   "datetime"  "disValue"  "qualCode"

View Data Structure

Let’s have a look at the structure of our data.

#view structure of data
str(discharge)

## 'data.frame':	9954 obs. of  5 variables:
##  $ agency_cd: chr  "USGS" "USGS" "USGS" "USGS" ...
##  $ site_no  : chr  "06730200" "06730200" "06730200" "06730200" ...
##  $ datetime : chr  "1986-10-01" "1986-10-02" "1986-10-03" "1986-10-04" ...
##  $ disValue : chr  "30" "30" "30" "30" ...
##  $ qualCode : chr  "A" "A" "A" "A" ...

It appears as if the discharge value is a character (chr) class. This is likely because we had an additional row in our data. Let’s convert the discharge column to a numeric class. In this case, we can reassign that column to be of class: integer given there are no decimal places.

# view class of the disValue column
class(discharge$disValue)

## [1] "character"

# convert column to integer
discharge$disValue <- as.integer(discharge$disValue)

str(discharge)

## 'data.frame':	9954 obs. of  5 variables:
##  $ agency_cd: chr  "USGS" "USGS" "USGS" "USGS" ...
##  $ site_no  : chr  "06730200" "06730200" "06730200" "06730200" ...
##  $ datetime : chr  "1986-10-01" "1986-10-02" "1986-10-03" "1986-10-04" ...
##  $ disValue : int  30 30 30 30 30 30 30 30 30 31 ...
##  $ qualCode : chr  "A" "A" "A" "A" ...

Converting Time Stamps

We have converted our discharge data to an integer class. However, the time stamp field, datetime is still a character class.

To work with and efficiently plot time series data, it is best to convert date and/or time data to a date/time class. As we have both date and time date, we will use the class POSIXct.

To learn more about different date/time classes, see the Dealing With Dates & Times in R - as.Date, POSIXct, POSIXlt tutorial.

#view class
class(discharge$datetime)

## [1] "character"

#convert to date/time class - POSIX
discharge$datetime <- as.POSIXct(discharge$datetime)

#recheck data structure
str(discharge)

## 'data.frame':	9954 obs. of  5 variables:
##  $ agency_cd: chr  "USGS" "USGS" "USGS" "USGS" ...
##  $ site_no  : chr  "06730200" "06730200" "06730200" "06730200" ...
##  $ datetime : POSIXct, format: "1986-10-01" "1986-10-02" ...
##  $ disValue : int  30 30 30 30 30 30 30 30 30 31 ...
##  $ qualCode : chr  "A" "A" "A" "A" ...

No Data Values

Next, let’s query our data to check whether there are no data values in it. The metadata associated with the data doesn’t specify what the values would be, NA or -9999 are common values

# check total number of NA values
sum(is.na(discharge$datetime))

## [1] 0

# check for "strange" values that could be an NA indicator
hist(discharge$disValue)

Excellent! The data contains no NoData values.

Plot The Data

Finally, we are ready to plot our data. We will use ggplot from the ggplot2 package to create our plot.

ggplot(discharge, aes(datetime, disValue)) +
  geom_point() +
  ggtitle("Stream Discharge (CFS) for Boulder Creek") +
  xlab("Date") + ylab("Discharge (Cubic Feet per Second)")

Questions:

  1. What patterns do you see in the data?
  2. Why might there be an increase in discharge during that time of year?

Plot Data Time Subsets With ggplot

We can plot a subset of our data within ggplot() by specifying the start and end times (in a limits object) for the x-axis with scale_x_datetime. Let’s plot data for the months directly around the Boulder flood: August 15 2013 - October 15 2013.

# Define Start and end times for the subset as R objects that are the time class
startTime <- as.POSIXct("2013-08-15 00:00:00")
endTime <- as.POSIXct("2013-10-15 00:00:00")

# create a start and end time R object
start.end <- c(startTime,endTime)
start.end

## [1] "2013-08-15 MDT" "2013-10-15 MDT"

# plot the data - Aug 15-October 15
ggplot(discharge,
      aes(datetime,disValue)) +
      geom_point() +
      scale_x_datetime(limits=start.end) +
      xlab("Date") + ylab("Discharge (Cubic Feet per Second)") +
      ggtitle("Stream Discharge (CFS) for Boulder Creek\nAugust 15 - October 15, 2013")

## Warning: Removed 9892 rows containing missing values (geom_point).

We get a warning message because we are “ignoring” lots of the data in the data set.

Plotly - Interactive (and Online) Plots

We have now successfully created a plot. We can turn that plot into an interactive plot using Plotly. Plotly allows you to create interactive plots that can also be shared online. If you are new to Plotly, view the companion mini-lesson Interactive Data Vizualization with R and Plotly to learn how to set up an account and access your username and API key.

Time subsets in plotly

To plot a subset of the total data we have to manually subset the data as the Plotly package doesn’t (yet?) recognize the limits method of subsetting.

Here we create a new R object with entries corresponding to just the dates we want and then plot that data.

# subset out some of the data - Aug 15 - October 15
discharge.aug.oct2013 <- subset(discharge, 
                        datetime >= as.POSIXct('2013-08-15 00:00',
                                              tz = "America/Denver") & 
                        datetime <= as.POSIXct('2013-10-15 23:59', 
                                              tz = "America/Denver"))

# plot the data
disPlot.plotly <- ggplot(data=discharge.aug.oct2013,
        aes(datetime,disValue)) +
        geom_point(size=3)     # makes the points larger than default

disPlot.plotly

# add title and labels
disPlot.plotly <- disPlot.plotly + 
	theme(axis.title.x = element_blank()) +
	xlab("Time") + ylab("Stream Discharge (CFS)") +
	ggtitle("Stream Discharge - Boulder Creek 2013")

disPlot.plotly

# view plotly plot in R
ggplotly(disPlot.plotly)

If you are satisfied with your plot you can now publish it to your Plotly account.

# set username
Sys.setenv("plotly_username"="yourUserNameHere")
# set user key
Sys.setenv("plotly_api_key"="yourUserKeyHere")

# publish plotly plot to your plotly online account if you want. 
plotly_POST(disPlot.plotly)

Additional Resources

Additional information on USGS streamflow measurements and data:

API Data Access

USGS data can be downloaded via an API using a command line interface. This is particularly useful if you want to request data from multiple sites or build the data request into a script. Read more here about API downloads of USGS data.


Return to the Ecological Disturbance Teaching Module by clicking here.


Get Lesson Code

(some browsers may require you to right click.)