In this post, we’re going to show how a quant trader can manipulate stock price data using the `dplyr`

R package.

## Getting set up and loading data

Load the `dplyr`

package via the `tidyverse`

package.

if (!require('tidyverse')) install.packages('tidyverse') library(tidyverse)

First, load some price data.

`energystockprices.RDS`

contains a data frame of daily price observations for 3 energy stocks.

prices <- readRDS('energystockprices.RDS') prices

We’ve organised our data so that

- Every column is variable.
- Every row is an observation.

In this data set:

- We have 13,314 rows in our data frame.
- Each row represents a daily price
*observation*for a given stock. - For each observation measure the open, high, low and close prices, and the volume traded.

This is a very helpful way to structure your price data. We’ll see how we can use the `dplyr`

package to manipulate price data for quant analysis.

## The main dplyr verbs

There are 6 main functions to master in dplyr.

`filter()`

picks outs observations (rows) by some filter criteria

`arrange()`

reorders the observations (rows)

`select()`

picks out the variables (columns)

`mutate()`

creates new variables (columns) by applying transformations to existing variables

`summarise()`

allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.

Finally, the `group_by()`

causes the verbs above to act on a group at a time, rather than the whole dataset.

We’ll go through them one by one.

### The `filter()`

dplyr verb

Filter returns only those observations (rows) which satisfy certain criteria.

#### Filtering rows for a single stock with `filter()`

and `==`

Here we use `filter`

to just pull out prices for XOM.

A common mistake is to use `=`

instead of `==`

prices %>% filter(ticker == 'XOM')

#### Filtering rows for multiple stocks with `filter()`

and `%in%`

Here we use %in% to match to multiple criteria. We pick out records for XOM and CVX.

prices %>% filter(ticker %in% c('XOM','CVX'))

### The `arrange()`

dplyr verb

Arrange reorders the rows in your data by one or more criteria.

#### Reordering rows in order of volume traded ascending with `arrange()`

prices %>% arrange(desc(volume))

#### Reordering rows in order of date descending, then volume ascending with `arrange()`

We can sort by multiple variables using `arrange()`

. Here we sort by date descending then volume ascending.

prices %>% arrange(desc(date), volume)

### The `select()`

dplyr verb

`select()`

is used to pull out certain columns from your data.

#### Selecting columns with `select()`

Here we use `select()`

to pull out the `ticker`

, `date`

and `close`

columns.

prices %>% select(ticker, date, close)

#### Chaining Transformations Together: Filtering, selecting and arranging

We can chain transformations together using the pipe operator %>%.

If you’re unsure how the pipe operator works, watch this…

We’re going to chain together a transformation which:

- filters out a single stock
- selects the date and close price
- re-arranges by date descending.

prices %>% filter(ticker == 'XOM') %>% select(date, close) %>% arrange(desc(date))

When debugging a dplyr pipeline, it can be helpful to select each transformation one by one, and press CTRL-Enter in Rstudio to just run the selected transformation.

### The `mutate()`

dplyr verb

So far we’ve just filtered selected and re-arranged existing data. Now we’re going to look at creating new data.

Mutate lets us add variables (columns) to our data. These are usually derived from existing columns.

#### Add new variable (column) for log volume with `mutate()`

Here we use mutate to add a column called `logvolume`

to our data. It is calculated as the log of the `volume`

variable.

prices %>% mutate(logvolume = log(volume))

#### Add new variable (column) for intraday returns with `mutate()`

In this example, we add a column for the intraday (open to close) simple returns of the stock each day.

prices %>% mutate(intradayreturns = close / open - 1)

### The `summarise()`

dplyr verb

`Summarise()`

summarises all our observations to a single row.

#### Calculate the mean volume over all observations using `summarise()`

Here we use `summarise()`

to calculate the mean volume traded over our complete data set. We also count the number of records using `n()`

prices %>% summarise(meanvolume = mean(volume), count = n())

### The `group_by()`

dplyr verb

Summarising the entire data set isn’t always very useful.

Usually, we want to group by a variable, and then summarise that grouped data.

The `group_by()`

function tells the `dplyr`

verbs to operate on each group one at a time.

#### Use `summarise()`

with `group_by()`

to calculate mean traded volume for each stock

If we group by ticker, then call summarise, then `dplyr`

will preform the summary calculations separately for each ticker. We will get a row for each ticker.

prices %>% group_by(ticker) %>% summarise(meanvolume = mean(volume))

#### Use `summarise()`

with multiple `group_by`

variables, to calculate the mean traded volume for each stock for each year

In this example we:

- calculate a new variable
`year`

using`mutate()`

- group by ticker and year
- summarise.

library(lubridate) prices %>% mutate(year = year(date)) %>% group_by(ticker, year) %>% summarise(meanvolume = mean(volume), obscount = n())

### Using `group_by()`

with `mutate()`

to do grouped row-level transformations

We can also use `group_by`

with `mutate()`

to calculate new variables which are calculated separately for a given variable (or set of variables)

You’ll use this nearly every time you do any quant analysis to calculate periodic returns.

#### Using `group_by`

with `mutate()`

and `lag()`

to calculate daily close-to-close returns

prices %>% group_by(ticker) %>% arrange(date) %>% mutate(c2creturns = close / lag(close) - 1)

## Summary

Arrange your data so:

- Every column is variable
- Every row is an observation

You can then easily use `dplyr`

to manipulate that data very efficiently.

There are 6 main functions to master in `dplyr`

.

`filter()`

picks outs observations (rows) by some filter criteria`arrange()`

reorders the observations (rows)`select()`

picks out the variables (columns)`mutate()`

creates new variables (columns) by applying transformations to existing variables`summarise()`

allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.

The `group_by()`

causes the verbs above to act on a group at a time, rather than the whole dataset.

## Want the Code?

The code for this post is available in our github repository.

## 4 thoughts on “Financial Data Manipulation in dplyr for Quant Traders”