Financial Data Wrangling with dplyr

In this post, we’re going to show how a quant trader can manipulate stock price data using the dplyr R package.

Getting set up and loading data

Load the dplyr package via the tidyverse package.

if (!require('tidyverse')) install.packages('tidyverse')
library(tidyverse)

First, load some price data.

energystockprices.RDS contains a data frame of daily price observations for 3 energy stocks.

prices <- readRDS('energystockprices.RDS')
prices

We’ve organised our data so that

Every column is variable.
Every row is an observation.

In this data set:

We have 13,314 rows in our data frame.
Each row represents a daily price observation for a given stock.
For each observation measure the open, high, low and close prices, and the volume traded.

This is a very helpful way to structure your price data. We’ll see how we can use the dplyr package to manipulate price data for quant analysis.

The main dplyr verbs

There are 6 main functions to master in dplyr.

filter() picks outs observations (rows) by some filter criteria
arrange() reorders the observations (rows)
select() picks out the variables (columns)
mutate() creates new variables (columns) by applying transformations to existing variables
summarise() allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.

Finally, the group_by() causes the verbs above to act on a group at a time, rather than the whole dataset.

We’ll go through them one by one.

The `filter()` dplyr verb

Filter returns only those observations (rows) which satisfy certain criteria.

Filtering rows for a single stock with `filter()` and `==`

Here we use filter to just pull out prices for XOM.

A common mistake is to use = instead of ==

prices %>%
  filter(ticker == 'XOM')

Filtering rows for multiple stocks with `filter()` and `%in%`

Here we use %in% to match to multiple criteria. We pick out records for XOM and CVX.

prices %>%
  filter(ticker %in% c('XOM','CVX'))

The `arrange()` dplyr verb

Arrange reorders the rows in your data by one or more criteria.

Reordering rows in order of volume traded ascending with `arrange()`

prices %>%
  arrange(desc(volume))

Reordering rows in order of date descending, then volume ascending with `arrange()`

We can sort by multiple variables using arrange(). Here we sort by date descending then volume ascending.

prices %>%
  arrange(desc(date), volume)

The `select()` dplyr verb

select() is used to pull out certain columns from your data.

Selecting columns with `select()`

Here we use select()to pull out the ticker, date and close columns.

prices %>%
  select(ticker, date, close)

Chaining Transformations Together: Filtering, selecting and arranging

We can chain transformations together using the pipe operator %>%.

If you’re unsure how the pipe operator works, watch this…

We’re going to chain together a transformation which:

filters out a single stock
selects the date and close price
re-arranges by date descending.

prices %>%
  filter(ticker == 'XOM') %>%
  select(date, close) %>%
  arrange(desc(date))

When debugging a dplyr pipeline, it can be helpful to select each transformation one by one, and press CTRL-Enter in Rstudio to just run the selected transformation.

The `mutate()` dplyr verb

So far we’ve just filtered selected and re-arranged existing data. Now we’re going to look at creating new data.

Mutate lets us add variables (columns) to our data. These are usually derived from existing columns.

Add new variable (column) for log volume with `mutate()`

Here we use mutate to add a column called logvolume to our data. It is calculated as the log of the volume variable.

prices %>%
  mutate(logvolume = log(volume))

Add new variable (column) for intraday returns with `mutate()`

In this example, we add a column for the intraday (open to close) simple returns of the stock each day.

prices %>%
  mutate(intradayreturns = close / open - 1)

The `summarise()` dplyr verb

Summarise() summarises all our observations to a single row.

Calculate the mean volume over all observations using `summarise()`

Here we use summarise() to calculate the mean volume traded over our complete data set. We also count the number of records using n()

prices %>%
  summarise(meanvolume = mean(volume),
            count = n())

The `group_by()` dplyr verb

Summarising the entire data set isn’t always very useful.

Usually, we want to group by a variable, and then summarise that grouped data.

The group_by() function tells the dplyr verbs to operate on each group one at a time.

Use `summarise()` with `group_by()` to calculate mean traded volume for each stock

If we group by ticker, then call summarise, then dplyr will preform the summary calculations separately for each ticker. We will get a row for each ticker.

prices %>% 
  group_by(ticker) %>%
  summarise(meanvolume = mean(volume))

Use `summarise()` with multiple `group_by` variables, to calculate the mean traded volume for each stock for each year

In this example we:

calculate a new variable yearusing mutate()
group by ticker and year
summarise.

library(lubridate)
prices %>%
  mutate(year = year(date)) %>%
  group_by(ticker, year) %>%
  summarise(meanvolume = mean(volume),
            obscount = n())

Using `group_by()` with `mutate()` to do grouped row-level transformations

We can also use group_by with mutate() to calculate new variables which are calculated separately for a given variable (or set of variables)

You’ll use this nearly every time you do any quant analysis to calculate periodic returns.

Using `group_by` with `mutate()` and `lag()` to calculate daily close-to-close returns

prices %>%
  group_by(ticker) %>%
  arrange(date) %>%
  mutate(c2creturns = close / lag(close) - 1)

Summary

Arrange your data so:

Every column is variable
Every row is an observation

You can then easily use dplyr to manipulate that data very efficiently.

There are 6 main functions to master in dplyr.

filter()picks outs observations (rows) by some filter criteria
arrange() reorders the observations (rows)
select() picks out the variables (columns)
mutate() creates new variables (columns) by applying transformations to existing variables
summarise() allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.

The group_by() causes the verbs above to act on a group at a time, rather than the whole dataset.

Want the Code?

The code for this post is available in our github repository.

Financial Data Manipulation in dplyr for Quant Traders

Getting set up and loading data

The main dplyr verbs

The `filter()` dplyr verb

Filtering rows for a single stock with `filter()` and `==`

Filtering rows for multiple stocks with `filter()` and `%in%`

The `arrange()` dplyr verb

Reordering rows in order of volume traded ascending with `arrange()`

Reordering rows in order of date descending, then volume ascending with `arrange()`

The `select()` dplyr verb

Selecting columns with `select()`

Chaining Transformations Together: Filtering, selecting and arranging

The `mutate()` dplyr verb

Add new variable (column) for log volume with `mutate()`

Add new variable (column) for intraday returns with `mutate()`

The `summarise()` dplyr verb

Calculate the mean volume over all observations using `summarise()`

The `group_by()` dplyr verb

Use `summarise()` with `group_by()` to calculate mean traded volume for each stock

Use `summarise()` with multiple `group_by` variables, to calculate the mean traded volume for each stock for each year

Using `group_by()` with `mutate()` to do grouped row-level transformations

Using `group_by` with `mutate()` and `lag()` to calculate daily close-to-close returns

Summary

Want the Code?

7 thoughts on “Financial Data Manipulation in dplyr for Quant Traders”

Leave a Comment Cancel reply

© 2024 Robot Wealth Pty LtD

Privacy Policy | Terms Of Use | FAQ | [email protected]

Getting set up and loading data

The main dplyr verbs

The filter() dplyr verb

Filtering rows for a single stock with filter() and ==

Filtering rows for multiple stocks with filter() and %in%

The arrange() dplyr verb

Reordering rows in order of volume traded ascending with arrange()

Reordering rows in order of date descending, then volume ascending with arrange()

The select() dplyr verb

Selecting columns with select()

Chaining Transformations Together: Filtering, selecting and arranging

The mutate() dplyr verb

Add new variable (column) for log volume with mutate()

Add new variable (column) for intraday returns with mutate()

The summarise() dplyr verb

Calculate the mean volume over all observations using summarise()

The group_by() dplyr verb

Use summarise() with group_by() to calculate mean traded volume for each stock

Use summarise() with multiple group_by variables, to calculate the mean traded volume for each stock for each year

Using group_by() with mutate() to do grouped row-level transformations

Using group_by with mutate() and lag() to calculate daily close-to-close returns

Summary

Want the Code?

7 thoughts on “Financial Data Manipulation in dplyr for Quant Traders”

Leave a Comment Cancel reply

© 2024 Robot Wealth Pty LtD

Privacy Policy | Terms Of Use | FAQ | [email protected]

The `filter()` dplyr verb

Filtering rows for a single stock with `filter()` and `==`

Filtering rows for multiple stocks with `filter()` and `%in%`

The `arrange()` dplyr verb

Reordering rows in order of volume traded ascending with `arrange()`

Reordering rows in order of date descending, then volume ascending with `arrange()`

The `select()` dplyr verb

Selecting columns with `select()`

The `mutate()` dplyr verb

Add new variable (column) for log volume with `mutate()`

Add new variable (column) for intraday returns with `mutate()`

The `summarise()` dplyr verb

Calculate the mean volume over all observations using `summarise()`

The `group_by()` dplyr verb

Use `summarise()` with `group_by()` to calculate mean traded volume for each stock

Use `summarise()` with multiple `group_by` variables, to calculate the mean traded volume for each stock for each year

Using `group_by()` with `mutate()` to do grouped row-level transformations

Using `group_by` with `mutate()` and `lag()` to calculate daily close-to-close returns