In this post, we’re going to show how a quant trader can manipulate stock price data using the
dplyr R package.
Getting set up and loading data
Load the
dplyr package via the
tidyverse package.
if (!require('tidyverse')) install.packages('tidyverse')
library(tidyverse)
First, load some price data.
energystockprices.RDS contains a data frame of daily price observations for 3 energy stocks.
prices <- readRDS('energystockprices.RDS')
prices

We’ve organised our data so that
- Every column is variable.
- Every row is an observation.
In this data set:
- We have 13,314 rows in our data frame.
- Each row represents a daily price observation for a given stock.
- For each observation measure the open, high, low and close prices, and the volume traded.
This is a very helpful way to structure your price data. We’ll see how we can use the
dplyr package to manipulate price data for quant analysis.
The main dplyr verbs
There are 6 main functions to master in dplyr.
filter() picks outs observations (rows) by some filter criteria
arrange() reorders the observations (rows)
select() picks out the variables (columns)
mutate() creates new variables (columns) by applying transformations to existing variables
summarise() allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.
Finally, the
group_by() causes the verbs above to act on a group at a time, rather than the whole dataset.
We’ll go through them one by one.
The filter() dplyr verb

Filter returns only those observations (rows) which satisfy certain criteria.
Filtering rows for a single stock with filter() and ==
Here we use
filter to just pull out prices for XOM.
A common mistake is to use
= instead of
==
prices %>%
filter(ticker == 'XOM')
Filtering rows for multiple stocks with filter() and %in%
Here we use %in% to match to multiple criteria. We pick out records for XOM and CVX.
prices %>%
filter(ticker %in% c('XOM','CVX'))
The arrange() dplyr verb

Arrange reorders the rows in your data by one or more criteria.
Reordering rows in order of volume traded ascending with arrange()
prices %>%
arrange(desc(volume))
Reordering rows in order of date descending, then volume ascending with arrange()
We can sort by multiple variables using
arrange(). Here we sort by date descending then volume ascending.
prices %>%
arrange(desc(date), volume)
The select() dplyr verb
select() is used to pull out certain columns from your data.
Selecting columns with select()
Here we use
select()to pull out the
ticker,
date and
close columns.
prices %>%
select(ticker, date, close)
Chaining Transformations Together: Filtering, selecting and arranging

We can chain transformations together using the pipe operator %>%.
If you’re unsure how the pipe operator works, watch this…
We’re going to chain together a transformation which:
- filters out a single stock
- selects the date and close price
- re-arranges by date descending.
prices %>%
filter(ticker == 'XOM') %>%
select(date, close) %>%
arrange(desc(date))
When debugging a dplyr pipeline, it can be helpful to select each transformation one by one, and press CTRL-Enter in Rstudio to just run the selected transformation.
The mutate() dplyr verb

So far we’ve just filtered selected and re-arranged existing data. Now we’re going to look at creating new data.
Mutate lets us add variables (columns) to our data. These are usually derived from existing columns.
Add new variable (column) for log volume with mutate()
Here we use mutate to add a column called
logvolume to our data. It is calculated as the log of the
volume variable.
prices %>%
mutate(logvolume = log(volume))
Add new variable (column) for intraday returns with mutate()
In this example, we add a column for the intraday (open to close) simple returns of the stock each day.
prices %>%
mutate(intradayreturns = close / open - 1)
The summarise() dplyr verb
Summarise() summarises all our observations to a single row.
Calculate the mean volume over all observations using summarise()
Here we use
summarise() to calculate the mean volume traded over our complete data set. We also count the number of records using
n()
prices %>%
summarise(meanvolume = mean(volume),
count = n())
The group_by() dplyr verb

Summarising the entire data set isn’t always very useful.
Usually, we want to group by a variable, and then summarise that grouped data.
The
group_by() function tells the
dplyr verbs to operate on each group one at a time.
Use summarise() with group_by() to calculate mean traded volume for each stock
If we group by ticker, then call summarise, then
dplyr will preform the summary calculations separately for each ticker. We will get a row for each ticker.
prices %>%
group_by(ticker) %>%
summarise(meanvolume = mean(volume))
Use summarise() with multiple group_by variables, to calculate the mean traded volume for each stock for each year
In this example we:
- calculate a new variable
yearusing mutate()
- group by ticker and year
- summarise.
library(lubridate)
prices %>%
mutate(year = year(date)) %>%
group_by(ticker, year) %>%
summarise(meanvolume = mean(volume),
obscount = n())
Using group_by() with mutate() to do grouped row-level transformations
We can also use
group_by with
mutate() to calculate new variables which are calculated separately for a given variable (or set of variables)
You’ll use this nearly every time you do any quant analysis to calculate periodic returns.
Using group_by with mutate() and lag() to calculate daily close-to-close returns
prices %>%
group_by(ticker) %>%
arrange(date) %>%
mutate(c2creturns = close / lag(close) - 1)
Summary
Arrange your data so:
- Every column is variable
- Every row is an observation
You can then easily use
dplyr to manipulate that data very efficiently.
There are 6 main functions to master in
dplyr.
filter()picks outs observations (rows) by some filter criteria
arrange() reorders the observations (rows)
select() picks out the variables (columns)
mutate() creates new variables (columns) by applying transformations to existing variables
summarise() allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.
The
group_by() causes the verbs above to act on a group at a time, rather than the whole dataset.
Want the Code?
The code for this post is available in our
github repository.
7 thoughts on “Financial Data Manipulation in dplyr for Quant Traders”