In this post, we’re going to show how a quant trader can manipulate stock price data using the dplyr
R package.
Getting set up and loading data
Load the dplyr
package via the tidyverse
package.
if (!require('tidyverse')) install.packages('tidyverse') library(tidyverse)
First, load some price data.
energystockprices.RDS
contains a data frame of daily price observations for 3 energy stocks.
prices <- readRDS('energystockprices.RDS') prices
We’ve organised our data so that
- Every column is variable.
- Every row is an observation.
In this data set:
- We have 13,314 rows in our data frame.
- Each row represents a daily price observation for a given stock.
- For each observation measure the open, high, low and close prices, and the volume traded.
This is a very helpful way to structure your price data. We’ll see how we can use the dplyr
package to manipulate price data for quant analysis.
The main dplyr verbs
There are 6 main functions to master in dplyr.
filter()
picks outs observations (rows) by some filter criteria
arrange()
reorders the observations (rows)
select()
picks out the variables (columns)
mutate()
creates new variables (columns) by applying transformations to existing variables
summarise()
allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.
Finally, the group_by()
causes the verbs above to act on a group at a time, rather than the whole dataset.
We’ll go through them one by one.
The filter()
dplyr verb
Filter returns only those observations (rows) which satisfy certain criteria.
Filtering rows for a single stock with filter()
and ==
Here we use filter
to just pull out prices for XOM.
A common mistake is to use =
instead of ==
prices %>% filter(ticker == 'XOM')
Filtering rows for multiple stocks with filter()
and %in%
Here we use %in% to match to multiple criteria. We pick out records for XOM and CVX.
prices %>% filter(ticker %in% c('XOM','CVX'))
The arrange()
dplyr verb
Arrange reorders the rows in your data by one or more criteria.
Reordering rows in order of volume traded ascending with arrange()
prices %>% arrange(desc(volume))
Reordering rows in order of date descending, then volume ascending with arrange()
We can sort by multiple variables using arrange()
. Here we sort by date descending then volume ascending.
prices %>% arrange(desc(date), volume)
The select()
dplyr verb
select()
is used to pull out certain columns from your data.
Selecting columns with select()
Here we use select()
to pull out the ticker
, date
and close
columns.
prices %>% select(ticker, date, close)
Chaining Transformations Together: Filtering, selecting and arranging
We can chain transformations together using the pipe operator %>%.
If you’re unsure how the pipe operator works, watch this…
We’re going to chain together a transformation which:
- filters out a single stock
- selects the date and close price
- re-arranges by date descending.
prices %>% filter(ticker == 'XOM') %>% select(date, close) %>% arrange(desc(date))
When debugging a dplyr pipeline, it can be helpful to select each transformation one by one, and press CTRL-Enter in Rstudio to just run the selected transformation.
The mutate()
dplyr verb
So far we’ve just filtered selected and re-arranged existing data. Now we’re going to look at creating new data.
Mutate lets us add variables (columns) to our data. These are usually derived from existing columns.
Add new variable (column) for log volume with mutate()
Here we use mutate to add a column called logvolume
to our data. It is calculated as the log of the volume
variable.
prices %>% mutate(logvolume = log(volume))
Add new variable (column) for intraday returns with mutate()
In this example, we add a column for the intraday (open to close) simple returns of the stock each day.
prices %>% mutate(intradayreturns = close / open - 1)
The summarise()
dplyr verb
Summarise()
summarises all our observations to a single row.
Calculate the mean volume over all observations using summarise()
Here we use summarise()
to calculate the mean volume traded over our complete data set. We also count the number of records using n()
prices %>% summarise(meanvolume = mean(volume), count = n())
The group_by()
dplyr verb
Summarising the entire data set isn’t always very useful.
Usually, we want to group by a variable, and then summarise that grouped data.
The group_by()
function tells the dplyr
verbs to operate on each group one at a time.
Use summarise()
with group_by()
to calculate mean traded volume for each stock
If we group by ticker, then call summarise, then dplyr
will preform the summary calculations separately for each ticker. We will get a row for each ticker.
prices %>% group_by(ticker) %>% summarise(meanvolume = mean(volume))
Use summarise()
with multiple group_by
variables, to calculate the mean traded volume for each stock for each year
In this example we:
- calculate a new variable
year
usingmutate()
- group by ticker and year
- summarise.
library(lubridate) prices %>% mutate(year = year(date)) %>% group_by(ticker, year) %>% summarise(meanvolume = mean(volume), obscount = n())
Using group_by()
with mutate()
to do grouped row-level transformations
We can also use group_by
with mutate()
to calculate new variables which are calculated separately for a given variable (or set of variables)
You’ll use this nearly every time you do any quant analysis to calculate periodic returns.
Using group_by
with mutate()
and lag()
to calculate daily close-to-close returns
prices %>% group_by(ticker) %>% arrange(date) %>% mutate(c2creturns = close / lag(close) - 1)
Summary
Arrange your data so:
- Every column is variable
- Every row is an observation
You can then easily use dplyr
to manipulate that data very efficiently.
There are 6 main functions to master in dplyr
.
filter()
picks outs observations (rows) by some filter criteriaarrange()
reorders the observations (rows)select()
picks out the variables (columns)mutate()
creates new variables (columns) by applying transformations to existing variablessummarise()
allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.
The group_by()
causes the verbs above to act on a group at a time, rather than the whole dataset.
Want the Code?
The code for this post is available in our github repository.
4 thoughts on “Financial Data Manipulation in dplyr for Quant Traders”