# Financial Data Manipulation in dplyr for Quant Traders

In this post, we’re going to show how a quant trader can manipulate stock price data using the `dplyr`

R package.

## Getting set up and loading data

Load the `dplyr`

package via the `tidyverse`

package.

if (!require('tidyverse')) install.packages('tidyverse') library(tidyverse)

First, load some price data.

`energystockprices.RDS`

contains a data frame of daily price observations for 3 energy stocks.

prices <- readRDS('energystockprices.RDS') prices

We’ve organised our data so that

- Every column is variable.
- Every row is an observation.

In this data set:

- We have 13,314 rows in our data frame.
- Each row represents a daily price
*observation*for a given stock. - For each observation measure the open, high, low and close prices, and the volume traded.

This is a very helpful way to structure your price data. We’ll see how we can use the `dplyr`

package to manipulate price data for quant analysis.

## The main dplyr verbs

There are 6 main functions to master in dplyr.

`filter()`

picks outs observations (rows) by some filter criteria

`arrange()`

reorders the observations (rows)

`select()`

picks out the variables (columns)

`mutate()`

creates new variables (columns) by applying transformations to existing variables

`summarise()`

allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.

Finally, the `group_by()`

causes the verbs above to act on a group at a time, rather than the whole dataset.

We’ll go through them one by one.

### The `filter()`

dplyr verb

Filter returns only those observations (rows) which satisfy certain criteria.

#### Filtering rows for a single stock with `filter()`

and `==`

Here we use `filter`

to just pull out prices for XOM.

A common mistake is to use `=`

instead of `==`

prices %>% filter(ticker == 'XOM')

#### Filtering rows for multiple stocks with `filter()`

and `%in%`

Here we use %in% to match to multiple criteria. We pick out records for XOM and CVX.

prices %>% filter(ticker %in% c('XOM','CVX'))

### The `arrange()`

dplyr verb

Arrange reorders the rows in your data by one or more criteria.

#### Reordering rows in order of volume traded ascending with `arrange()`

prices %>% arrange(desc(volume))

#### Reordering rows in order of date descending, then volume ascending with `arrange()`

We can sort by multiple variables using `arrange()`

. Here we sort by date descending then volume ascending.

prices %>% arrange(desc(date), volume)

### The `select()`

dplyr verb

`select()`

is used to pull out certain columns from your data.

#### Selecting columns with `select()`

Here we use `select()`

to pull out the `ticker`

, `date`

and `close`

columns.

prices %>% select(ticker, date, close)

#### Chaining Transformations Together: Filtering, selecting and arranging

We can chain transformations together using the pipe operator %>%.

If you’re unsure how the pipe operator works, watch this…

We’re going to chain together a transformation which:

- filters out a single stock
- selects the date and close price
- re-arranges by date descending.

prices %>% filter(ticker == 'XOM') %>% select(date, close) %>% arrange(desc(date))

When debugging a dplyr pipeline, it can be helpful to select each transformation one by one, and press CTRL-Enter in Rstudio to just run the selected transformation.

### The `mutate()`

dplyr verb

So far we’ve just filtered selected and re-arranged existing data. Now we’re going to look at creating new data.

Mutate lets us add variables (columns) to our data. These are usually derived from existing columns.

#### Add new variable (column) for log volume with `mutate()`

Here we use mutate to add a column called `logvolume`

to our data. It is calculated as the log of the `volume`

variable.

prices %>% mutate(logvolume = log(volume))

#### Add new variable (column) for intraday returns with `mutate()`

In this example, we add a column for the intraday (open to close) simple returns of the stock each day.

prices %>% mutate(intradayreturns = close / open - 1)

### The `summarise()`

dplyr verb

`Summarise()`

summarises all our observations to a single row.

#### Calculate the mean volume over all observations using `summarise()`

Here we use `summarise()`

to calculate the mean volume traded over our complete data set. We also count the number of records using `n()`

prices %>% summarise(meanvolume = mean(volume), count = n())

### The `group_by()`

dplyr verb

Summarising the entire data set isn’t always very useful.

Usually, we want to group by a variable, and then summarise that grouped data.

The `group_by()`

function tells the `dplyr`

verbs to operate on each group one at a time.

#### Use `summarise()`

with `group_by()`

to calculate mean traded volume for each stock

If we group by ticker, then call summarise, then `dplyr`

will preform the summary calculations separately for each ticker. We will get a row for each ticker.

prices %>% group_by(ticker) %>% summarise(meanvolume = mean(volume))

#### Use `summarise()`

with multiple `group_by`

variables, to calculate the mean traded volume for each stock for each year

In this example we:

- calculate a new variable
`year`

using`mutate()`

- group by ticker and year
- summarise.

library(lubridate) prices %>% mutate(year = year(date)) %>% group_by(ticker, year) %>% summarise(meanvolume = mean(volume), obscount = n())

### Using `group_by()`

with `mutate()`

to do grouped row-level transformations

We can also use `group_by`

with `mutate()`

to calculate new variables which are calculated separately for a given variable (or set of variables)

You’ll use this nearly every time you do any quant analysis to calculate periodic returns.

#### Using `group_by`

with `mutate()`

and `lag()`

to calculate daily close-to-close returns

prices %>% group_by(ticker) %>% arrange(date) %>% mutate(c2creturns = close / lag(close) - 1)

## Summary

Arrange your data so:

- Every column is variable
- Every row is an observation

You can then easily use `dplyr`

to manipulate that data very efficiently.

There are 6 main functions to master in `dplyr`

.

`filter()`

picks outs observations (rows) by some filter criteria`arrange()`

reorders the observations (rows)`select()`

picks out the variables (columns)`mutate()`

creates new variables (columns) by applying transformations to existing variables`summarise()`

allows you to group and summarise data – reducing the data into a grouped summary with fewer rows.

The `group_by()`

causes the verbs above to act on a group at a time, rather than the whole dataset.

## Want the Code?

The code for this post is available in our github repository.

## (3) Comments

[…] like scraping web data or interacting with an API. Tools for doing this sort of thing in R’s tidyverse are really maturing, so I’m doing more and more of this without leaving […]

[…] Financial Data Manipulation in dplyr for Quant Traders […]

[…] usually work with long or “tidy” data – where each observation is an observation for a stock for a given […]