Financial Data Manipulation in dplyr for Quant Traders

In this post, we’re going to show how a quant trader can manipulate stock price data using the dplyr R package.

Getting set up and loading data

Load the dplyr package via the tidyverse package.

if (!require('tidyverse')) install.packages('tidyverse')
library(tidyverse)

First, load some price data.

energystockprices.RDS contains a data frame of daily price observations for 3 energy stocks.

prices <- readRDS('energystockprices.RDS')
prices

We’ve organised our data so that

  • Every column is variable.
  • Every row is an observation.

In this data set:

  • We have 13,314 rows in our data frame.
  • Each row represents a daily price observation for a given stock.
  • For each observation measure the open, high, low and close prices, and the volume traded.

This is a very helpful way to structure your price data. We’ll see how we can use the dplyr package to manipulate price data for quant analysis.

The main dplyr verbs

There are 6 main functions to master in dplyr.

filter() picks outs observations (rows) by some filter criteria
arrange() reorders the observations (rows)
select() picks out the variables (columns)
mutate() creates new variables (columns) by applying transformations to existing variables
summarise() allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.

Finally, the group_by() causes the verbs above to act on a group at a time, rather than the whole dataset.

We’ll go through them one by one.

The filter() dplyr verb

Filter returns only those observations (rows) which satisfy certain criteria.

Filtering rows for a single stock with filter() and ==

Here we use filter to just pull out prices for XOM.

A common mistake is to use = instead of ==

prices %>%
  filter(ticker == 'XOM')

Filtering rows for multiple stocks with filter() and %in%

Here we use %in% to match to multiple criteria. We pick out records for XOM and CVX.

prices %>%
  filter(ticker %in% c('XOM','CVX'))

The arrange() dplyr verb

Arrange reorders the rows in your data by one or more criteria.

Reordering rows in order of volume traded ascending with arrange()

prices %>%
  arrange(desc(volume))

Reordering rows in order of date descending, then volume ascending with arrange()

We can sort by multiple variables using arrange(). Here we sort by date descending then volume ascending.

prices %>%
  arrange(desc(date), volume)

The select() dplyr verb

select() is used to pull out certain columns from your data.

Selecting columns with select()

Here we use select()to pull out the ticker, date and close columns.

prices %>%
  select(ticker, date, close)

Chaining Transformations Together: Filtering, selecting and arranging

We can chain transformations together using the pipe operator %>%.

If you’re unsure how the pipe operator works, watch this…

We’re going to chain together a transformation which:

  • filters out a single stock
  • selects the date and close price
  • re-arranges by date descending.
prices %>%
  filter(ticker == 'XOM') %>%
  select(date, close) %>%
  arrange(desc(date))

 

When debugging a dplyr pipeline, it can be helpful to select each transformation one by one, and press CTRL-Enter in Rstudio to just run the selected transformation.

The mutate() dplyr verb

So far we’ve just filtered selected and re-arranged existing data. Now we’re going to look at creating new data.

Mutate lets us add variables (columns) to our data. These are usually derived from existing columns.

Add new variable (column) for log volume with mutate()

Here we use mutate to add a column called logvolume to our data. It is calculated as the log of the volume variable.

prices %>%
  mutate(logvolume = log(volume))

Add new variable (column) for intraday returns with mutate()

In this example, we add a column for the intraday (open to close) simple returns of the stock each day.

prices %>%
  mutate(intradayreturns = close / open - 1)

The summarise() dplyr verb

Summarise() summarises all our observations to a single row.

Calculate the mean volume over all observations using summarise()

Here we use summarise() to calculate the mean volume traded over our complete data set. We also count the number of records using n()

prices %>%
  summarise(meanvolume = mean(volume),
            count = n())

The group_by() dplyr verb

Summarising the entire data set isn’t always very useful.

Usually, we want to group by a variable, and then summarise that grouped data.

The group_by() function tells the dplyr verbs to operate on each group one at a time.

Use summarise() with group_by() to calculate mean traded volume for each stock

If we group by ticker, then call summarise, then dplyr will preform the summary calculations separately for each ticker. We will get a row for each ticker.

prices %>% 
  group_by(ticker) %>%
  summarise(meanvolume = mean(volume))

Use summarise() with multiple group_by variables, to calculate the mean traded volume for each stock for each year

In this example we:

  • calculate a new variable yearusing mutate()
  • group by ticker and year
  • summarise.
library(lubridate)
prices %>%
  mutate(year = year(date)) %>%
  group_by(ticker, year) %>%
  summarise(meanvolume = mean(volume),
            obscount = n())

Using group_by() with mutate() to do grouped row-level transformations

We can also use group_by with mutate() to calculate new variables which are calculated separately for a given variable (or set of variables)

You’ll use this nearly every time you do any quant analysis to calculate periodic returns.

Using group_by with mutate() and lag() to calculate daily close-to-close returns

prices %>%
  group_by(ticker) %>%
  arrange(date) %>%
  mutate(c2creturns = close / lag(close) - 1)

Summary

Arrange your data so:

  • Every column is variable
  • Every row is an observation

You can then easily use dplyr to manipulate that data very efficiently.

There are 6 main functions to master in dplyr.

  • filter()picks outs observations (rows) by some filter criteria
  • arrange() reorders the observations (rows)
  • select() picks out the variables (columns)
  • mutate() creates new variables (columns) by applying transformations to existing variables
  • summarise() allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.

The group_by() causes the verbs above to act on a group at a time, rather than the whole dataset.

Want the Code?

The code for this post is available in our github repository.

4 thoughts on “Financial Data Manipulation in dplyr for Quant Traders”

Leave a Comment