## Objective

Download, process, and visualize the discrimination training data from your group’s starling using an HTTP API and the R statistical environment.

## Introduction

Last week, your group started a starling on a song discrimination task. You designated 5 stimuli as S+, which means that pecks to a response key after any of those stimuli were reinforced. Another 10 stimuli were designated as S-, which means that pecks were not reinforced and were in fact mildly punished by turning off the house lights for 10 seconds. A small embedded BeagleBone computer running an operant training program called decide managed this training for you over the course of the week. On each trial, it recorded which stimulus was presented, how the bird responded, and whether the response was reinforced or punished. These data were stored in a database for later analysis.

There’s an increasing emphasis in science on storing and providing data in open, publicly-accessible formats. A big part of this effort is in creating Application Programming Interfaces (APIs) that allow data analysis software to query databases and unambiguously look up and access specific data units. For an example of a real-life API, check out the Allen Brain Institute’s Cell Type Database, which is a massive collection of morphological and electrophysiological data from neurons from many different regions of the brain. The API for your starling data is more modest, but you’ll learn how to use it to access your data for analysis on your own computer.

### Getting the software

Complete this section before coming to class. We’ll be using R, a free software package for statistical analysis, to analyze and plot the training data. These instructions for installing R are from Stephen Turner’s bioconnector workshop. R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier. You need R installed before you install RStudio. You can skip the first two steps if you’ve already got a copy installed.

1. Install R. You’ll need R version 3.0.0 or higher. Download and install R for Windows or Mac OS X (download the latest R-3.x.x.pkg file for your appropriate version of OS X).
2. Install RStudio. Download and install the latest stable version of RStudio Desktop.
3. Install R packages. Launch RStudio (not R itself). Ensure that you have internet access, then enter the following commands into the Console panel (the lower-left panel by default). Note that these commands are case-sensitive. At any point (especially if you’ve used R in the past), R may ask whether you want to update old packages with the prompt Update all/some/none? [a/s/n]:. If you see this, type a at the prompt and hit Enter to update any old packages. If you’re using a Windows machine you might get some errors about not having permission to modify the existing libraries – don’t worry about this message. You can avoid this error altogether by running RStudio as an administrator.
# Install packages from CRAN
install.packages("dplyr")
install.packages("ggplot2")
install.packages("jsonlite")

You can check that you’ve installed everything correctly by closing and reopening RStudio and entering the following commands at the console window:

library(dplyr)
library(ggplot2)
library(jsonlite)

These commands may produce some notes or other output, but as long as they work without an error message, you’re good to go. If you get a message that says something like: Error in library(packageName) : there is no package called packageName, then the required packages did not install correctly.

### Getting to know R

If you’ve never programmed before, the R language is a pretty good place to start. If you have some programming experience in another language, you should be able to apply concepts you know fairly easily. If you’ve already used R, you can probably skip this section. This primer is adapted from the UVA Bioconnector workshop courseware, and you can follow the link to get more introductory lessons. Also check out the R Software Carpentry site for more information.

#### Basic operations

At base, computers are glorified adding machines, and one of the simplest things you can use a programming language for is simple calculations. Try typing these statements directly into the console, pressing Enter after each line.

97+3
101*92
5^2

Each line in the example above is a statement in the R language. When you enter a statement into the console, R will evaluate the statement and print the output, if there is any.

The true power of the language comes from two features. The first is the ability to assign values to variables. A variable is indicated by a symbolic name that you can use in place of a number or other data structure. To assign values in R, you use the assignment operator, <-. Here’s an example:

n_trials <- 23

In RStudio, the keyboard shortcut for <- is Alt-dash. Variable names can be almost anything you want them to be, although there are a few rules: they can’t start with a number, and they can only contain alphanumeric characters and a few symbols like . and _. Other character symbols are reserved by the language for other functions, like adding or subtracting. There are also some reserved names that you can’t use, either.

Note that the assignment statement above doesn’t generate any output. You can print the current value of a variable by typing it on a line by itself. You can also use the value of the variable in other statements and expressions:

n_trials
n_trials * 2
halfway_done <- n_trials / 2

In the last statement, we’ve used n_trials to generate a new value that is then assigned to a second variable. Note that variables only change value when you explicitly assign to them, so if you subsequently typed n_trials <- 1000, halfway_done would still be equal to 11.5.
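To convince yourself of this, try reassigning n_trials and checking halfway_done again (a short sketch continuing the example above):

```r
n_trials <- 23
halfway_done <- n_trials / 2
halfway_done        # prints 11.5

# reassigning n_trials does NOT change halfway_done
n_trials <- 1000
halfway_done        # still 11.5
```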

#### Functions

The second powerful feature of programming languages is the ability to define functions, which take some number of inputs and generate an output. Some functions are built into R, like sqrt, log, and abs.

# Notice that this is a comment.
# Anything after a # is "commented out" and is not run.
sqrt(144)
log(1000)

# You can pass the output of one function to the input of another
sqrt(log(1000))

Most functions have documentation that you can access with help(functionname) or ?functionname. Look at the help for log. Notice that the function can take an additional input, or argument, that specifies the base of the log. Arguments can be specified positionally or by explicitly naming them. The following two statements are equivalent:

log(1000, base=10)
log(1000, 10)

### Getting the data

The transport mechanism of the API is HTTP, the protocol you use to access web pages and a whole host of other internet services. HTTP consists of a request that you send to a server, which parses your request and replies with the data you requested or an appropriate error message. The request is encoded in a Uniform Resource Locator (URL). One huge advantage of using an HTTP API is that you can look at the data in a web browser. Paste the following URL into your browser of choice. You may need to enter a username and password, which you can get from a TA or the instructor.

http://aplonis.psyc.virginia.edu/decide/api/subjects/active

Your browser should display a text document with a series of records in a format called JavaScript Object Notation (JSON). JSON is used extensively for data exchange on the internet, because it’s simple and fast for both humans and computers to read and write. Each record describes one subject and consists of a series of key-value pairs, with the key separated from the value by a colon and the pairs (or fields) separated by commas. An example record is shown below (with added whitespace to aid reading; you can “prettify” individual JSON records with this web page).

{
"_id" : "e5a95e45-dc0d-4902-9b03-a5596e615ebe",
"procedure" : "gng",
"controller" : "beagle-6",
"user" : "cdm8j@virginia.edu",
"start-time" : "2016-09-08T08:53:32.169-04:00",
"last-fed" : "2016-09-13T14:06:25.733-04:00",
"experiment" : "group3-train",
"stop-time" : "2016-09-08T08:53:25.550-04:00",
"last-trial" : "2016-09-13T14:17:56.762-04:00"
}

Each field indicates a property of the subject: _id is a unique identifier, procedure is the name of the procedure being run, controller is the name of the computer running the experiment, and experiment is the name of the training protocol. Find the record for your group’s bird (hint: it’s the one that’s running the experiment named after your group) and make a note of the _id, controller, and experiment values.
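Once you’re in R, the jsonlite package (which you installed earlier) can parse records like these. Here’s a sketch using a shortened version of the example record above:

```r
library(jsonlite)

# parse a (shortened) JSON record into a named R list
rec <- fromJSON('{"_id": "e5a95e45-dc0d-4902-9b03-a5596e615ebe",
                  "procedure": "gng",
                  "controller": "beagle-6"}')

rec$controller   # "beagle-6"
rec$procedure    # "gng"
```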

Now let’s access the trial data for your group’s bird. Enter the following URL into your browser, substituting <id> and <experiment> with the values you noted above:

http://aplonis.psyc.virginia.edu/decide/api/subjects/<id>/trials?experiment=<experiment>

Notice how you are using the URL to inform the server what subject and experiment you want to access. The subject is specified as part of the base URL, whereas the experiment is specified as a query string. The query string can be used to further restrict what data you want to access. For example, if you add &result=punish to the end of the URL, you’ll only get trials where the bird got the answer wrong.

As before, the data is returned as a list of records in JSON format, but now each record corresponds to a trial. A prettified example is below. What do each of the fields mean?

{
"response": "timeout",
"category": "S+",
"usec": 464,
"name": "gng",
"time": "2016-09-08T08:53:56.986-04:00",
"correction": 0,
"experiment": "group3-train",
"result": "none",
"trial": 1,
"stimulus": "st399_27",
"subject": "e5a95e45-dc0d-4902-9b03-a5596e615ebe",
"correct": false
}

Being able to look at your data in a browser is nice, but it’s not very useful for tracking how your bird is learning over the course of training. The next step is to get the data into R and load it into a table format. Enter the following lines in R, replacing the URL below with the one you used in the browser above.

# load 3rd-party libraries
library(dplyr)
library(ggplot2)
library(jsonlite)

# load the data from the API
trials <- stream_in(url("http://aplonis.psyc.virginia.edu/decide/api/subjects/<id>/trials?experiment=<experiment>"))

# See what kind of data it is, and print the object to the screen
class(trials)
trials

# Set an option to only show a few elements
options(max.print = 80)
trials

### Working with data frames

A data.frame is a structured arrangement of data records, which in this case correspond to individual trials. Each record has the same set of fields. Each row in the data.frame corresponds to a record, and each column corresponds to a field. A data.frame is similar in some respects to the JSON format, but with the added restriction that every record has the same number of fields, and all the values in a given column are the same type (i.e., numeric or text). There are several built-in functions that are useful for working with data frames.

- Content:
  - head(): shows the first 6 rows
  - tail(): shows the last 6 rows
- Size:
  - dim(): returns a 2-element vector with the number of rows as the first element and the number of columns as the second element (the dimensions of the object)
  - nrow(): returns the number of rows
  - ncol(): returns the number of columns
- Summary:
  - colnames() (or just names()): returns the column names
  - str(): structure of the object and information about the class, length, and content of each column
  - summary(): works differently depending on what kind of object you pass to it. Passing a data frame to the summary() function prints out some summary statistics about each column (min, max, median, mean, etc.)
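Here’s what these functions look like on a small, made-up data frame (the trials data frame behaves the same way, just with more rows and columns):

```r
# a toy data frame standing in for the real trials data
toy <- data.frame(trial = 1:6,
                  stimulus = c("a", "b", "a", "c", "b", "a"),
                  correct = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))

head(toy)       # the first 6 rows (all of them, in this case)
dim(toy)        # 6 3
nrow(toy)       # 6
ncol(toy)       # 3
colnames(toy)   # "trial" "stimulus" "correct"
str(toy)        # type and a preview of each column
summary(toy)    # summary statistics for each column
```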

Exercise: Try each of the commands with trials and see what they print out. How many trials has your bird run over the last week?

#### Accessing variables & subsetting data frames

We can access individual variables within a data frame using the $ operator, e.g., mydataframe$specificVariable. Let’s print out the stimuli for each trial, then calculate the average response time (using the built-in mean() function).

# display all stimuli
trials$stimulus # we could also just look at the first few: head(trials$stimulus)

# mean response time
mean(trials$rtime)

Notice that the output of the last statement is NA. This is a special value in R that indicates data are missing or not applicable. mean() will return NA if any of the values are equal to NA. Look at the data: which trials have rtime equal to NA?

Not every variable in the data frame is really useful. For this analysis, you know that all the data came from the same bird, controller, and experiment. We can drop some of the unused columns using the select() function, which is part of the dplyr package that you loaded above.

trials <- select(trials, trial, time, stimulus, category, correction, response, rtime, correct, result)

Note that we reassigned the output of the select() function to the trials variable, which overwrites its previous value.

If you want to look at specific rows in a data.frame, you can use the filter() function, which is also part of dplyr. The filter() function takes two or more arguments:

1. The first argument is the data frame you want to filter, e.g. filter(trials, ....
2. Each subsequent argument is a condition that rows must satisfy, e.g. filter(trials, correction == 0).

If you want to satisfy all of multiple conditions, you can use the “and” operator, &. The “or” operator | (the pipe character, usually shift-backslash) will return a subset that meets any of the conditions. Each condition is specified with a boolean operator:

- ==: equal to
- !=: not equal to
- >, >=: greater than, greater than or equal to
- <, <=: less than, less than or equal to

Let’s try it out.

## Show only trials for the stimulus "st378_11".
## (pick a different stimulus if this one was never presented)
filter(trials, stimulus == "st378_11")

## Exclude correction trials
filter(trials, correction == 0)

# Show only correct responses to "st378_11"
filter(trials, stimulus == "st378_11", correct == TRUE)

Finally, take a look at the class of what’s returned by a call to filter(). The filter() function takes a data.frame and returns a data.frame. You can operate on this new data.frame just as you would any other data.frame using the $ operator.
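For example, here is a sketch using a toy stand-in for the real trials data frame, so you can see the principle without needing the API:

```r
library(dplyr)

# a toy stand-in for the real trials data frame
toy_trials <- data.frame(trial = 1:4,
                         stimulus = c("st378_11", "st399_27", "st378_11", "st399_27"),
                         correct = c(TRUE, FALSE, FALSE, TRUE))

# filter() returns another data.frame...
sub <- filter(toy_trials, stimulus == "st378_11")
class(sub)          # "data.frame"

# ...so you can use $ on it like any other
sub$correct         # TRUE FALSE
mean(sub$correct)   # 0.5: the proportion of correct trials for this stimulus
```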

#### Exercises

1. Calculate the average reaction time by first excluding trials where there was no response.
2. Calculate the total number of correction trials.
3. Calculate the total number of S+ trials.

### Tabulating data

Let’s calculate the accuracy for each stimulus. This introduces one of the most powerful functions in base R, xtabs(), which cross-tabulates data by counting the number of records that match a combination of conditions. Run the following command:

xtabs(~ stimulus + correct, trials)

The first argument to the function is a special R form called a formula. The ~ (“twiddle”) acts a bit like an equals sign; on the right hand side is a list of factors you want to cross-tabulate, separated by +. The dimensions of the output table always correspond to the number of factors. The number of rows or columns in the table correspond to the number of levels (unique values) in the factor.
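A toy example may make the formula notation more concrete (made-up data with the same shape as the trials):

```r
toy <- data.frame(stimulus = c("a", "a", "b", "b", "b"),
                  correct  = c(TRUE, FALSE, TRUE, TRUE, FALSE))

# two factors -> a 2-dimensional table; the rows are the levels of
# stimulus and the columns are the levels of correct
tab <- xtabs(~ stimulus + correct, toy)
tab
```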

#### Exercises

1. Use prop.table() to convert the counts into proportions. You’ll need to set the margin argument correctly to get results that make sense, so be sure to check out the documentation.
2. Calculate the total number of trials for each stimulus.
3. Does it make sense to include correction trials in your analysis? Why or why not? How could you exclude them from the tabulation?

### Plotting binary response data

The xtabs() function we used above was applied to the entire dataset, so it effectively averages over the entire experiment. But in this experiment, the animal was learning to discriminate between the stimuli, so we’d like to see how its responses change over time.

One impediment to visualizing accuracy directly from the data is that each trial is essentially a binary measure (correct or incorrect). Over time, the proportion of correct responses may increase, but any individual trial is either going to be correct or incorrect. Formally, this kind of data is known as point-process data.

A simple way of dealing with trial data is to divide it up into blocks and calculate the proportion of correct trials in each block. Here’s a simple way of blocking up the data:

blocksize <- 50
blocked_trials <- mutate(filter(trials, correction == 0),
                         block = cut(1:n(),
                                     breaks = seq(0, n(), by = blocksize),
                                     labels = FALSE))

There are several important new functions here: cut(), mutate(), and seq(). Use the help() function to see what each one does, and how the two statements above generate a new data frame with a block column.
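It may help to try seq() and cut() on a small vector first (toy numbers, unrelated to the trial data):

```r
# seq() generates the block boundaries
seq(0, 10, by = 5)
# [1]  0  5 10

# cut() assigns each value to the interval it falls into;
# labels=FALSE returns block numbers instead of interval labels
blocks <- cut(1:10, breaks = seq(0, 10, by = 5), labels = FALSE)
blocks
# [1] 1 1 1 1 1 2 2 2 2 2
```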

Now we can use the block column along with xtabs() to calculate accuracy in each block:

block_stats <- prop.table(xtabs(~ block + correct, blocked_trials), 1)

Look at block_stats and see how the proportion of correct responses changes with block number. You should be able to see a trend. A graph would be more useful, and R has some really great graphing capabilities. The most basic plotting command is plot():

plot(block_stats[,2])

A much more full-featured plotting package is ggplot2. It’s useful both for exploratory analysis and for preparing beautiful figures for posters and publications.

library(ggplot2)
ggplot(as.data.frame(block_stats), aes(as.numeric(block), Freq, color=correct)) + geom_point()

#### Exercises

1. Use xtabs() and prop.table() to calculate accuracy in the S+ and S- categories for each block. Which category is learned first?
2. Change the block size to 100 trials and replot the data using the first plot command. Now try a block size of 10 trials. How do these changes affect the appearance of the plot?
3. The graph we produced with ggplot() has some redundant information, and it’d be nice to see each category separately. Use xtabs(), filter(), and ggplot() to produce this plot.

### Summary plots and statistical tests

Let’s look at how the bird responded on each of the stimuli at the end of the training. We’ll use filter() and tail() to extract the trials we want:

# select the last 200 trials
last_trials <- tail(filter(trials, correction==0), 200)

We now tabulate the responses to each stimulus:

tab <- xtabs(~ stimulus + response, last_trials)

A nice way of visualizing the results is as a bar plot. We’re going to first use prop.table() to get the proportion of correct responses.

pcorrect <- prop.table(tab, margin=1)[,"peck_right"]
barplot(pcorrect)

You’ll probably see some variation in average response rate for the different stimuli. Because we’re plotting responses rather than accuracy, you should see values close to 1.0 for S+ stimuli and values close to 0 for S- stimuli. Some stimuli may be rather close to 0.5, which corresponds to chance performance (i.e., if the bird is just guessing on every trial, it’s not going to get more than 50% of the trials correct). But because we’ve only collected a finite number of trials, there’s some probability that we would get a result larger than 0.5 by chance. To exclude this possibility, we’ll need to do some statistics.

Statistical analysis is designed to address the important question, “How do you know you didn’t get that result by chance?”. In other words, if you flip a coin 4 times, there’s a pretty good chance (1 in 16, or 6.25%) that you’ll get all heads, but if you flip a coin 100 times, the chances of getting all heads are pretty slim, unless it’s a loaded coin.
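You can check the coin-flip arithmetic in R with dbinom(), which gives binomial probabilities:

```r
# probability of exactly 4 heads in 4 fair coin flips
dbinom(4, size = 4, prob = 0.5)    # 0.0625, i.e. 1 in 16

# probability of exactly 100 heads in 100 flips: vanishingly small
dbinom(100, size = 100, prob = 0.5)
```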

You can use a χ-squared test (the χ is the Greek letter “chi”) to test whether the proportion of responses near the end of training is reliably different from chance (which is 50% in this case). Here’s an example for one of the stimuli:

# run the chi-squared test for st15_2
chisq.test(tab["st15_2",])

What is the p-value reported in the output of the last statement? If it is less than 0.05, the probability of seeing a result this extreme by chance alone is less than 1 in 20, which is good evidence that your starling learned to discriminate that stimulus.
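As a sanity check on the test itself, you can run chisq.test() on made-up counts (not your bird’s data):

```r
# 45 pecks vs. 5 timeouts out of 50 trials: clearly different from 50/50,
# so the p-value should be very small
chisq.test(c(45, 5))

# 27 vs. 23 is close to an even split, so the p-value should be large
chisq.test(c(27, 23))
```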

#### Exercises

1. Calculate p values for each of the stimuli in your training set using χ-squared tests. Which stimuli can you confidently state were learned by your starling?
2. What is the effect of increasing sample size on statistical tests? To answer this question, increase the total number of trials in your analysis to 400 and recalculate the p values. Do any of the stimuli now become significantly greater than chance (p < 0.05)?
3. Make a barplot and calculate p values for the stimulus categories instead of individual stimuli.
4. Why are there no error bars on the plots we generated? Does the concept of standard error make sense with binary data?