Holding data in a tidy format works wonders for one’s productivity.
Here we will explore the tidyr
package, which is all about creating tidy data.
In particular, let’s develop an understanding of the tidyr::pivot_longer
and tidyr::pivot_wider
functions for switching between different formats of tidy data.
In this video, you’ll learn:
- What tidy data looks like
- Why it’s a sensible approach
- The difference between long and wide tidy data
- How to efficiently switch between the two format
- When and why you’d use each of the two formats
What’s tidy data?
Tidy data is data where:
- Every column is variable.
- Every row is an observation.
- Every cell is a single value.
Why do we care?
It turns out there are huge benefits to thinking about the “shape” of your data and the best way to structure and manipulate it for your problem.
Tidy data is a standard way of shaping data that facilitates analysis. In particular, tidy data works very well with the tidyverse tools. Which means less time spent transforming and cleaning data and more time spent solving problems. In short, structuring data to facilitate analysis is an enormous productivity hack.
Thinking in these terms has had a MASSIVE impact on the effectiveness and speed of our research. We’re going to cover this in some detail in the Armageddon bootcamp, along with some specific patterns for doing trading analysis.
Wide vs Long Data
Let’s take a look at some long-format data. This code loads a long dataframe of daily returns for various indexes and prints the observations from the beginning of March 2020:
if (!require("pacman")) install.packages("pacman") pacman::p_load(tidyverse, here, knitr, kableExtra) load(here('data', 'indexreturns.RData')) dailyindex_df <- dailyindex_df %>% filter(date >= '2020-03-01') dailyindex_df %>% kable() %>% kable_styling(full_width = FALSE, position = 'center') %>% scroll_box(width = '800px', height = '300px')
date | ticker | returns |
---|---|---|
2020-03-02 | EQ_US | 0.0460048 |
2020-03-02 | EQ_NONUS_DEV | 0.0111621 |
2020-03-02 | EQ_EMER | 0.0114468 |
2020-03-02 | TN_US | -0.0008475 |
2020-03-02 | TB_US | -0.0073579 |
2020-03-02 | BOND_EMER | 0.0051502 |
2020-03-02 | GOLD | 0.0179376 |
2020-03-03 | EQ_US | -0.0280740 |
2020-03-03 | EQ_NONUS_DEV | 0.0097820 |
2020-03-03 | EQ_EMER | 0.0106093 |
2020-03-03 | TN_US | 0.0093299 |
2020-03-03 | TB_US | 0.0154987 |
2020-03-03 | BOND_EMER | 0.0093937 |
2020-03-03 | GOLD | 0.0311450 |
2020-03-04 | EQ_US | 0.0422302 |
2020-03-04 | EQ_NONUS_DEV | 0.0062777 |
2020-03-04 | EQ_EMER | 0.0097196 |
2020-03-04 | TN_US | -0.0016807 |
2020-03-04 | TB_US | -0.0099536 |
2020-03-04 | BOND_EMER | 0.0059222 |
2020-03-04 | GOLD | -0.0008743 |
2020-03-05 | EQ_US | -0.0336854 |
2020-03-05 | EQ_NONUS_DEV | -0.0014521 |
2020-03-05 | EQ_EMER | 0.0014743 |
2020-03-05 | TN_US | 0.0050505 |
2020-03-05 | TB_US | 0.0227882 |
2020-03-05 | BOND_EMER | -0.0067283 |
2020-03-05 | GOLD | 0.0151949 |
2020-03-06 | EQ_US | -0.0170369 |
2020-03-06 | EQ_NONUS_DEV | -0.0232132 |
2020-03-06 | EQ_EMER | -0.0262282 |
2020-03-06 | TN_US | 0.0041876 |
2020-03-06 | TB_US | 0.0498034 |
2020-03-06 | BOND_EMER | -0.0076207 |
2020-03-06 | GOLD | 0.0026644 |
2020-03-09 | EQ_US | -0.0761665 |
2020-03-09 | EQ_NONUS_DEV | 0.0000000 |
2020-03-09 | EQ_EMER | 0.0000000 |
2020-03-09 | TN_US | 0.0041701 |
2020-03-09 | TB_US | 0.0262172 |
2020-03-09 | BOND_EMER | 0.0000000 |
2020-03-09 | GOLD | 0.0018757 |
2020-03-10 | EQ_US | 0.0494037 |
2020-03-10 | EQ_NONUS_DEV | 0.0000000 |
2020-03-10 | EQ_EMER | 0.0000000 |
2020-03-10 | TN_US | -0.0091362 |
2020-03-10 | TB_US | -0.0486618 |
2020-03-10 | BOND_EMER | -0.0477816 |
2020-03-10 | GOLD | -0.0091271 |
2020-03-11 | EQ_US | -0.0487713 |
2020-03-11 | EQ_NONUS_DEV | 0.0000000 |
2020-03-11 | EQ_EMER | 0.0000000 |
2020-03-11 | TN_US | -0.0025147 |
2020-03-11 | TB_US | -0.0108696 |
2020-03-11 | BOND_EMER | -0.0206093 |
2020-03-11 | GOLD | -0.0107857 |
2020-03-12 | EQ_US | -0.0949084 |
2020-03-12 | EQ_NONUS_DEV | 0.0000000 |
2020-03-12 | EQ_EMER | 0.0000000 |
2020-03-12 | TN_US | 0.0008403 |
2020-03-12 | TB_US | -0.0155139 |
2020-03-12 | BOND_EMER | 0.0000000 |
2020-03-12 | GOLD | -0.0317549 |
2020-03-13 | EQ_US | 0.0931900 |
2020-03-13 | EQ_NONUS_DEV | 0.0000000 |
2020-03-13 | EQ_EMER | 0.0000000 |
2020-03-13 | TN_US | -0.0058774 |
2020-03-13 | TB_US | -0.0216678 |
2020-03-13 | BOND_EMER | -0.0439158 |
2020-03-13 | GOLD | -0.0462765 |
2020-03-16 | EQ_US | -0.1197921 |
2020-03-16 | EQ_NONUS_DEV | 0.0000000 |
2020-03-16 | EQ_EMER | 0.0000000 |
2020-03-16 | TN_US | 0.0126689 |
2020-03-16 | TB_US | 0.0510067 |
2020-03-16 | BOND_EMER | -0.0325359 |
2020-03-16 | GOLD | -0.0203396 |
2020-03-17 | EQ_US | 0.0597801 |
2020-03-17 | EQ_NONUS_DEV | 0.0000000 |
2020-03-17 | EQ_EMER | 0.0000000 |
2020-03-17 | TN_US | -0.0133445 |
2020-03-17 | TB_US | -0.0587484 |
2020-03-17 | BOND_EMER | -0.0197824 |
2020-03-17 | GOLD | 0.0265681 |
2020-03-18 | EQ_US | -0.0517360 |
2020-03-18 | EQ_NONUS_DEV | 0.0000000 |
2020-03-18 | EQ_EMER | 0.0000000 |
2020-03-18 | TN_US | -0.0067625 |
2020-03-18 | TB_US | -0.0434193 |
2020-03-18 | BOND_EMER | -0.0544904 |
2020-03-18 | GOLD | -0.0311081 |
2020-03-19 | EQ_US | 0.0047010 |
2020-03-19 | EQ_NONUS_DEV | 0.0000000 |
2020-03-19 | EQ_EMER | 0.0000000 |
2020-03-19 | TN_US | 0.0017021 |
2020-03-19 | TB_US | 0.0007092 |
2020-03-19 | BOND_EMER | -0.0234792 |
2020-03-19 | GOLD | 0.0011498 |
2020-03-20 | EQ_US | -0.0431907 |
2020-03-20 | EQ_NONUS_DEV | 0.0000000 |
2020-03-20 | EQ_EMER | 0.0000000 |
2020-03-20 | TN_US | 0.0144435 |
2020-03-20 | TB_US | 0.0652020 |
2020-03-20 | BOND_EMER | 0.0142077 |
2020-03-20 | GOLD | 0.0039756 |
2020-03-23 | EQ_US | -0.0292942 |
2020-03-23 | EQ_NONUS_DEV | -0.2532532 |
2020-03-23 | EQ_EMER | 0.0000000 |
2020-03-23 | TN_US | 0.0083752 |
2020-03-23 | TB_US | 0.0419162 |
2020-03-23 | BOND_EMER | -0.0215517 |
2020-03-23 | GOLD | 0.0568462 |
2020-03-24 | EQ_US | 0.0939740 |
2020-03-24 | EQ_NONUS_DEV | 0.0000000 |
2020-03-24 | EQ_EMER | 0.0000000 |
2020-03-24 | TN_US | -0.0041528 |
2020-03-24 | TB_US | -0.0051086 |
2020-03-24 | BOND_EMER | 0.0121145 |
2020-03-24 | GOLD | 0.0575354 |
2020-03-25 | EQ_US | 0.0115569 |
2020-03-25 | EQ_NONUS_DEV | 0.0000000 |
2020-03-25 | EQ_EMER | 0.0000000 |
2020-03-25 | TN_US | 0.0008340 |
2020-03-25 | TB_US | -0.0064185 |
2020-03-25 | BOND_EMER | 0.0315560 |
2020-03-25 | GOLD | -0.0174002 |
2020-03-26 | EQ_US | 0.0624207 |
2020-03-26 | EQ_NONUS_DEV | 0.0000000 |
2020-03-26 | EQ_EMER | 0.0000000 |
2020-03-26 | TN_US | 0.0016667 |
2020-03-26 | TB_US | 0.0064599 |
2020-03-26 | BOND_EMER | 0.0295359 |
2020-03-26 | GOLD | 0.0159455 |
2020-03-27 | EQ_US | -0.0336616 |
2020-03-27 | EQ_NONUS_DEV | 0.0000000 |
2020-03-27 | EQ_EMER | 0.0000000 |
2020-03-27 | TN_US | 0.0049917 |
2020-03-27 | TB_US | 0.0237484 |
2020-03-27 | BOND_EMER | -0.0081967 |
2020-03-27 | GOLD | -0.0037858 |
2020-03-30 | EQ_US | 0.0336403 |
2020-03-30 | EQ_NONUS_DEV | 0.0000000 |
2020-03-30 | EQ_EMER | 0.0000000 |
2020-03-30 | TN_US | -0.0008278 |
2020-03-30 | TB_US | -0.0050157 |
2020-03-30 | BOND_EMER | -0.0144628 |
2020-03-30 | GOLD | -0.0065711 |
2020-03-31 | EQ_US | -0.0159221 |
2020-03-31 | EQ_NONUS_DEV | 0.0000000 |
2020-03-31 | EQ_EMER | 0.0000000 |
2020-03-31 | TN_US | -0.0016570 |
2020-03-31 | TB_US | -0.0144928 |
2020-03-31 | BOND_EMER | 0.0115304 |
2020-03-31 | GOLD | -0.0283711 |
2020-04-01 | EQ_US | -0.0441380 |
2020-04-01 | EQ_NONUS_DEV | 0.0000000 |
2020-04-01 | EQ_EMER | 0.0000000 |
2020-04-01 | TN_US | 0.0008299 |
2020-04-01 | TB_US | 0.0147059 |
2020-04-01 | BOND_EMER | -0.0124352 |
2020-04-01 | GOLD | -0.0032808 |
2020-04-02 | EQ_US | 0.0230223 |
2020-04-02 | EQ_NONUS_DEV | 0.0000000 |
2020-04-02 | EQ_EMER | 0.0000000 |
2020-04-02 | TN_US | 0.0008292 |
2020-04-02 | TB_US | 0.0056711 |
2020-04-02 | BOND_EMER | 0.0000000 |
2020-04-02 | GOLD | 0.0291310 |
Long data is presented with one or more columns containing a key and another containing all the values.
In this example, the key, or so-called “unit of analysis: is date-ticker. That is, each value (in the returns column) is uniquely associated with a date-ticker joint key.
The joint key date-ticker would be the starting point of any analysis we’d want to do on this data set.
This is often easier to manage and process. However, if you’re used to looking at spreadsheets, it can be harder to understand intuitively. (I think that this difficulty evaporates fairly quickly once you start using the tools).
While structuring data as key-value pairs might seem odd if you’re not used to it, it does actually facilitate your conceptual clarity of the problem at hand.
For example, in the example above, it is clear that the unique identifier of each return is the date-ticker joint key.
With that clarity, it becomes much simpler to imagine the steps in an analysis workflow. You get quite productive and effective at this with a little practice.
Let’s compare this with the same data in wide format:
dailyindex_df %>% pivot_wider(names_from = ticker, values_from = returns) %>% kable() %>% kable_styling(position = 'center') %>% scroll_box(width = '800px', height = '300px')
date | EQ_US | EQ_NONUS_DEV | EQ_EMER | TN_US | TB_US | BOND_EMER | GOLD |
---|---|---|---|---|---|---|---|
2020-03-02 | 0.0460048 | 0.0111621 | 0.0114468 | -0.0008475 | -0.0073579 | 0.0051502 | 0.0179376 |
2020-03-03 | -0.0280740 | 0.0097820 | 0.0106093 | 0.0093299 | 0.0154987 | 0.0093937 | 0.0311450 |
2020-03-04 | 0.0422302 | 0.0062777 | 0.0097196 | -0.0016807 | -0.0099536 | 0.0059222 | -0.0008743 |
2020-03-05 | -0.0336854 | -0.0014521 | 0.0014743 | 0.0050505 | 0.0227882 | -0.0067283 | 0.0151949 |
2020-03-06 | -0.0170369 | -0.0232132 | -0.0262282 | 0.0041876 | 0.0498034 | -0.0076207 | 0.0026644 |
2020-03-09 | -0.0761665 | 0.0000000 | 0.0000000 | 0.0041701 | 0.0262172 | 0.0000000 | 0.0018757 |
2020-03-10 | 0.0494037 | 0.0000000 | 0.0000000 | -0.0091362 | -0.0486618 | -0.0477816 | -0.0091271 |
2020-03-11 | -0.0487713 | 0.0000000 | 0.0000000 | -0.0025147 | -0.0108696 | -0.0206093 | -0.0107857 |
2020-03-12 | -0.0949084 | 0.0000000 | 0.0000000 | 0.0008403 | -0.0155139 | 0.0000000 | -0.0317549 |
2020-03-13 | 0.0931900 | 0.0000000 | 0.0000000 | -0.0058774 | -0.0216678 | -0.0439158 | -0.0462765 |
2020-03-16 | -0.1197921 | 0.0000000 | 0.0000000 | 0.0126689 | 0.0510067 | -0.0325359 | -0.0203396 |
2020-03-17 | 0.0597801 | 0.0000000 | 0.0000000 | -0.0133445 | -0.0587484 | -0.0197824 | 0.0265681 |
2020-03-18 | -0.0517360 | 0.0000000 | 0.0000000 | -0.0067625 | -0.0434193 | -0.0544904 | -0.0311081 |
2020-03-19 | 0.0047010 | 0.0000000 | 0.0000000 | 0.0017021 | 0.0007092 | -0.0234792 | 0.0011498 |
2020-03-20 | -0.0431907 | 0.0000000 | 0.0000000 | 0.0144435 | 0.0652020 | 0.0142077 | 0.0039756 |
2020-03-23 | -0.0292942 | -0.2532532 | 0.0000000 | 0.0083752 | 0.0419162 | -0.0215517 | 0.0568462 |
2020-03-24 | 0.0939740 | 0.0000000 | 0.0000000 | -0.0041528 | -0.0051086 | 0.0121145 | 0.0575354 |
2020-03-25 | 0.0115569 | 0.0000000 | 0.0000000 | 0.0008340 | -0.0064185 | 0.0315560 | -0.0174002 |
2020-03-26 | 0.0624207 | 0.0000000 | 0.0000000 | 0.0016667 | 0.0064599 | 0.0295359 | 0.0159455 |
2020-03-27 | -0.0336616 | 0.0000000 | 0.0000000 | 0.0049917 | 0.0237484 | -0.0081967 | -0.0037858 |
2020-03-30 | 0.0336403 | 0.0000000 | 0.0000000 | -0.0008278 | -0.0050157 | -0.0144628 | -0.0065711 |
2020-03-31 | -0.0159221 | 0.0000000 | 0.0000000 | -0.0016570 | -0.0144928 | 0.0115304 | -0.0283711 |
2020-04-01 | -0.0441380 | 0.0000000 | 0.0000000 | 0.0008299 | 0.0147059 | -0.0124352 | -0.0032808 |
2020-04-02 | 0.0230223 | 0.0000000 | 0.0000000 | 0.0008292 | 0.0056711 | 0.0000000 | 0.0291310 |
This might look more familiar. Here we have a row for each date and a column for the return corresponding to each index. The unique values in the ticker column are actual columns in this wide format.
Data in this format is probably more amenable human consumption.
So which is better – wide or long format?
It depends!
You’ll find that storing your data in long format facilitates exploration and analysis, particularly if you use the tidyverse
tools. We highly recommend that you do all your tidy analysis in long format unless you have a good reason not to.
Long format data is also easy to maintain – adding a new variable (a new ticker, say) is as simple as appending rows to the bottom of the existing data frame (and maybe sorting it by date, if you wanted to).
One use case that you see all the time is using ggplot to visualise a variable for more than one member of some organising category, for example, a time series plot of a bunch of different price curves where the organising category is ticker.
On the other hand, wide-format data might be a better choice if you intend for a human to consume the data.
You will also find certain functions and algorithms that expect data in this format, for example stats::cor
.
The practical reality
A good rule of thumb (and one that we follow) is to keep your data in long format whenever you’re doing any data manipulation or processing and save wide format for displaying it.
Of course, there are exceptions and sometimes you have a reason not to do your processing in long format, for instance when a function requires a wide data frame.
That means that in reality, you’ll often find yourself wanting to switch between long and wide format. Fortunately, Fortunately, using the tidyr
package, it is very simple to convert from long to wide format and back again.
Pivoting from long to wide
We’ve already seen an example of pivoting from long to wide format. Let’s explore that in a little more detail.
We use tidyr::pivot_wider
to go from long to wide.
The most important arguments to the function are id_cols
, names_from
and values_from
, and they each specify a column in our long dataframe.
- The
id_cols
column specifies the unique identifier of each observation in our wide data frame. - The unique values in the
names_from
column become the column names in the wide data frame. - The values in the
values_from
column gets populated into the cells of the wide data frame.
In our example:
- We want to index our wide dataframe by date, so we specify
id_cols = date
- We want the tickers to form columns in the wide dataframe, so we specify
names_from = ticker
- We want to populate our wide dataframe with returns values, so we specify
values_from = returns
Here’s what that looks like:
dailyindex_df %>% pivot_wider(id_cols = date, names_from = ticker, values_from = returns) %>% kable() %>% kable_styling(position = 'center') %>% scroll_box(width = '800px', height = '300px')
date | EQ_US | EQ_NONUS_DEV | EQ_EMER | TN_US | TB_US | BOND_EMER | GOLD |
---|---|---|---|---|---|---|---|
2020-03-02 | 0.0460048 | 0.0111621 | 0.0114468 | -0.0008475 | -0.0073579 | 0.0051502 | 0.0179376 |
2020-03-03 | -0.0280740 | 0.0097820 | 0.0106093 | 0.0093299 | 0.0154987 | 0.0093937 | 0.0311450 |
2020-03-04 | 0.0422302 | 0.0062777 | 0.0097196 | -0.0016807 | -0.0099536 | 0.0059222 | -0.0008743 |
2020-03-05 | -0.0336854 | -0.0014521 | 0.0014743 | 0.0050505 | 0.0227882 | -0.0067283 | 0.0151949 |
2020-03-06 | -0.0170369 | -0.0232132 | -0.0262282 | 0.0041876 | 0.0498034 | -0.0076207 | 0.0026644 |
2020-03-09 | -0.0761665 | 0.0000000 | 0.0000000 | 0.0041701 | 0.0262172 | 0.0000000 | 0.0018757 |
2020-03-10 | 0.0494037 | 0.0000000 | 0.0000000 | -0.0091362 | -0.0486618 | -0.0477816 | -0.0091271 |
2020-03-11 | -0.0487713 | 0.0000000 | 0.0000000 | -0.0025147 | -0.0108696 | -0.0206093 | -0.0107857 |
2020-03-12 | -0.0949084 | 0.0000000 | 0.0000000 | 0.0008403 | -0.0155139 | 0.0000000 | -0.0317549 |
2020-03-13 | 0.0931900 | 0.0000000 | 0.0000000 | -0.0058774 | -0.0216678 | -0.0439158 | -0.0462765 |
2020-03-16 | -0.1197921 | 0.0000000 | 0.0000000 | 0.0126689 | 0.0510067 | -0.0325359 | -0.0203396 |
2020-03-17 | 0.0597801 | 0.0000000 | 0.0000000 | -0.0133445 | -0.0587484 | -0.0197824 | 0.0265681 |
2020-03-18 | -0.0517360 | 0.0000000 | 0.0000000 | -0.0067625 | -0.0434193 | -0.0544904 | -0.0311081 |
2020-03-19 | 0.0047010 | 0.0000000 | 0.0000000 | 0.0017021 | 0.0007092 | -0.0234792 | 0.0011498 |
2020-03-20 | -0.0431907 | 0.0000000 | 0.0000000 | 0.0144435 | 0.0652020 | 0.0142077 | 0.0039756 |
2020-03-23 | -0.0292942 | -0.2532532 | 0.0000000 | 0.0083752 | 0.0419162 | -0.0215517 | 0.0568462 |
2020-03-24 | 0.0939740 | 0.0000000 | 0.0000000 | -0.0041528 | -0.0051086 | 0.0121145 | 0.0575354 |
2020-03-25 | 0.0115569 | 0.0000000 | 0.0000000 | 0.0008340 | -0.0064185 | 0.0315560 | -0.0174002 |
2020-03-26 | 0.0624207 | 0.0000000 | 0.0000000 | 0.0016667 | 0.0064599 | 0.0295359 | 0.0159455 |
2020-03-27 | -0.0336616 | 0.0000000 | 0.0000000 | 0.0049917 | 0.0237484 | -0.0081967 | -0.0037858 |
2020-03-30 | 0.0336403 | 0.0000000 | 0.0000000 | -0.0008278 | -0.0050157 | -0.0144628 | -0.0065711 |
2020-03-31 | -0.0159221 | 0.0000000 | 0.0000000 | -0.0016570 | -0.0144928 | 0.0115304 | -0.0283711 |
2020-04-01 | -0.0441380 | 0.0000000 | 0.0000000 | 0.0008299 | 0.0147059 | -0.0124352 | -0.0032808 |
2020-04-02 | 0.0230223 | 0.0000000 | 0.0000000 | 0.0008292 | 0.0056711 | 0.0000000 | 0.0291310 |
Could that be any easier?
Actually, yes!
id_cols
defaults to any column or columns that aren’t specified by the names_from
and values_from
arguments. So in our case, we could actually not even bother with the id_cols
argument:
dailyindex_df_wide <- dailyindex_df %>% pivot_wider(names_from = ticker, values_from = returns) dailyindex_df_wide %>% kable() %>% kable_styling(position = 'center') %>% scroll_box(width = '800px', height = '300px')
date | EQ_US | EQ_NONUS_DEV | EQ_EMER | TN_US | TB_US | BOND_EMER | GOLD |
---|---|---|---|---|---|---|---|
2020-03-02 | 0.0460048 | 0.0111621 | 0.0114468 | -0.0008475 | -0.0073579 | 0.0051502 | 0.0179376 |
2020-03-03 | -0.0280740 | 0.0097820 | 0.0106093 | 0.0093299 | 0.0154987 | 0.0093937 | 0.0311450 |
2020-03-04 | 0.0422302 | 0.0062777 | 0.0097196 | -0.0016807 | -0.0099536 | 0.0059222 | -0.0008743 |
2020-03-05 | -0.0336854 | -0.0014521 | 0.0014743 | 0.0050505 | 0.0227882 | -0.0067283 | 0.0151949 |
2020-03-06 | -0.0170369 | -0.0232132 | -0.0262282 | 0.0041876 | 0.0498034 | -0.0076207 | 0.0026644 |
2020-03-09 | -0.0761665 | 0.0000000 | 0.0000000 | 0.0041701 | 0.0262172 | 0.0000000 | 0.0018757 |
2020-03-10 | 0.0494037 | 0.0000000 | 0.0000000 | -0.0091362 | -0.0486618 | -0.0477816 | -0.0091271 |
2020-03-11 | -0.0487713 | 0.0000000 | 0.0000000 | -0.0025147 | -0.0108696 | -0.0206093 | -0.0107857 |
2020-03-12 | -0.0949084 | 0.0000000 | 0.0000000 | 0.0008403 | -0.0155139 | 0.0000000 | -0.0317549 |
2020-03-13 | 0.0931900 | 0.0000000 | 0.0000000 | -0.0058774 | -0.0216678 | -0.0439158 | -0.0462765 |
2020-03-16 | -0.1197921 | 0.0000000 | 0.0000000 | 0.0126689 | 0.0510067 | -0.0325359 | -0.0203396 |
2020-03-17 | 0.0597801 | 0.0000000 | 0.0000000 | -0.0133445 | -0.0587484 | -0.0197824 | 0.0265681 |
2020-03-18 | -0.0517360 | 0.0000000 | 0.0000000 | -0.0067625 | -0.0434193 | -0.0544904 | -0.0311081 |
2020-03-19 | 0.0047010 | 0.0000000 | 0.0000000 | 0.0017021 | 0.0007092 | -0.0234792 | 0.0011498 |
2020-03-20 | -0.0431907 | 0.0000000 | 0.0000000 | 0.0144435 | 0.0652020 | 0.0142077 | 0.0039756 |
2020-03-23 | -0.0292942 | -0.2532532 | 0.0000000 | 0.0083752 | 0.0419162 | -0.0215517 | 0.0568462 |
2020-03-24 | 0.0939740 | 0.0000000 | 0.0000000 | -0.0041528 | -0.0051086 | 0.0121145 | 0.0575354 |
2020-03-25 | 0.0115569 | 0.0000000 | 0.0000000 | 0.0008340 | -0.0064185 | 0.0315560 | -0.0174002 |
2020-03-26 | 0.0624207 | 0.0000000 | 0.0000000 | 0.0016667 | 0.0064599 | 0.0295359 | 0.0159455 |
2020-03-27 | -0.0336616 | 0.0000000 | 0.0000000 | 0.0049917 | 0.0237484 | -0.0081967 | -0.0037858 |
2020-03-30 | 0.0336403 | 0.0000000 | 0.0000000 | -0.0008278 | -0.0050157 | -0.0144628 | -0.0065711 |
2020-03-31 | -0.0159221 | 0.0000000 | 0.0000000 | -0.0016570 | -0.0144928 | 0.0115304 | -0.0283711 |
2020-04-01 | -0.0441380 | 0.0000000 | 0.0000000 | 0.0008299 | 0.0147059 | -0.0124352 | -0.0032808 |
2020-04-02 | 0.0230223 | 0.0000000 | 0.0000000 | 0.0008292 | 0.0056711 | 0.0000000 | 0.0291310 |
Same result as above. Brilliant.
Pivoting from wide to long
For pivoting from wide to long, we use tidry::pivot_longer
.
The most important arguments to the function are cols
, names_to
and values_to
. You can probably guess at their relationship to the arguments to pivot_wider
.
cols
specifies the columns that we want to take from wide to long.names_to
specifies a name for the column in our long data frame that will hold the column names from the wide data frame.values_to
specifies a name for the column in our long data frame that will hold the values in the cells of the wide data frame.
In our example:
- We want to take the columns holding the returns for each ticker from wide to long, so we want
cols
to take all the columns except date. We can do that by specifyingcols = -date
- We want the names of the
cols
to be held in a long variable calledtickers
, so we specifynames_to = "ticker"
. Note that"ticker"
here is a string variable. - We want to hold the values from our wide columns in a long column called
"returns"
so we specifyvalues_to = "returns"
. Again note the string variable.
Here’s what that looks like:
dailyindex_df_wide %>% pivot_longer(cols = -date, names_to = 'ticker', values_to= 'returns') %>% kable() %>% kable_styling(full_width = FALSE, position = 'center') %>% scroll_box(width = '800px', height = '300px')
date | ticker | returns |
---|---|---|
2020-03-02 | EQ_US | 0.0460048 |
2020-03-02 | EQ_NONUS_DEV | 0.0111621 |
2020-03-02 | EQ_EMER | 0.0114468 |
2020-03-02 | TN_US | -0.0008475 |
2020-03-02 | TB_US | -0.0073579 |
2020-03-02 | BOND_EMER | 0.0051502 |
2020-03-02 | GOLD | 0.0179376 |
2020-03-03 | EQ_US | -0.0280740 |
2020-03-03 | EQ_NONUS_DEV | 0.0097820 |
2020-03-03 | EQ_EMER | 0.0106093 |
2020-03-03 | TN_US | 0.0093299 |
2020-03-03 | TB_US | 0.0154987 |
2020-03-03 | BOND_EMER | 0.0093937 |
2020-03-03 | GOLD | 0.0311450 |
2020-03-04 | EQ_US | 0.0422302 |
2020-03-04 | EQ_NONUS_DEV | 0.0062777 |
2020-03-04 | EQ_EMER | 0.0097196 |
2020-03-04 | TN_US | -0.0016807 |
2020-03-04 | TB_US | -0.0099536 |
2020-03-04 | BOND_EMER | 0.0059222 |
2020-03-04 | GOLD | -0.0008743 |
2020-03-05 | EQ_US | -0.0336854 |
2020-03-05 | EQ_NONUS_DEV | -0.0014521 |
2020-03-05 | EQ_EMER | 0.0014743 |
2020-03-05 | TN_US | 0.0050505 |
2020-03-05 | TB_US | 0.0227882 |
2020-03-05 | BOND_EMER | -0.0067283 |
2020-03-05 | GOLD | 0.0151949 |
2020-03-06 | EQ_US | -0.0170369 |
2020-03-06 | EQ_NONUS_DEV | -0.0232132 |
2020-03-06 | EQ_EMER | -0.0262282 |
2020-03-06 | TN_US | 0.0041876 |
2020-03-06 | TB_US | 0.0498034 |
2020-03-06 | BOND_EMER | -0.0076207 |
2020-03-06 | GOLD | 0.0026644 |
2020-03-09 | EQ_US | -0.0761665 |
2020-03-09 | EQ_NONUS_DEV | 0.0000000 |
2020-03-09 | EQ_EMER | 0.0000000 |
2020-03-09 | TN_US | 0.0041701 |
2020-03-09 | TB_US | 0.0262172 |
2020-03-09 | BOND_EMER | 0.0000000 |
2020-03-09 | GOLD | 0.0018757 |
2020-03-10 | EQ_US | 0.0494037 |
2020-03-10 | EQ_NONUS_DEV | 0.0000000 |
2020-03-10 | EQ_EMER | 0.0000000 |
2020-03-10 | TN_US | -0.0091362 |
2020-03-10 | TB_US | -0.0486618 |
2020-03-10 | BOND_EMER | -0.0477816 |
2020-03-10 | GOLD | -0.0091271 |
2020-03-11 | EQ_US | -0.0487713 |
2020-03-11 | EQ_NONUS_DEV | 0.0000000 |
2020-03-11 | EQ_EMER | 0.0000000 |
2020-03-11 | TN_US | -0.0025147 |
2020-03-11 | TB_US | -0.0108696 |
2020-03-11 | BOND_EMER | -0.0206093 |
2020-03-11 | GOLD | -0.0107857 |
2020-03-12 | EQ_US | -0.0949084 |
2020-03-12 | EQ_NONUS_DEV | 0.0000000 |
2020-03-12 | EQ_EMER | 0.0000000 |
2020-03-12 | TN_US | 0.0008403 |
2020-03-12 | TB_US | -0.0155139 |
2020-03-12 | BOND_EMER | 0.0000000 |
2020-03-12 | GOLD | -0.0317549 |
2020-03-13 | EQ_US | 0.0931900 |
2020-03-13 | EQ_NONUS_DEV | 0.0000000 |
2020-03-13 | EQ_EMER | 0.0000000 |
2020-03-13 | TN_US | -0.0058774 |
2020-03-13 | TB_US | -0.0216678 |
2020-03-13 | BOND_EMER | -0.0439158 |
2020-03-13 | GOLD | -0.0462765 |
2020-03-16 | EQ_US | -0.1197921 |
2020-03-16 | EQ_NONUS_DEV | 0.0000000 |
2020-03-16 | EQ_EMER | 0.0000000 |
2020-03-16 | TN_US | 0.0126689 |
2020-03-16 | TB_US | 0.0510067 |
2020-03-16 | BOND_EMER | -0.0325359 |
2020-03-16 | GOLD | -0.0203396 |
2020-03-17 | EQ_US | 0.0597801 |
2020-03-17 | EQ_NONUS_DEV | 0.0000000 |
2020-03-17 | EQ_EMER | 0.0000000 |
2020-03-17 | TN_US | -0.0133445 |
2020-03-17 | TB_US | -0.0587484 |
2020-03-17 | BOND_EMER | -0.0197824 |
2020-03-17 | GOLD | 0.0265681 |
2020-03-18 | EQ_US | -0.0517360 |
2020-03-18 | EQ_NONUS_DEV | 0.0000000 |
2020-03-18 | EQ_EMER | 0.0000000 |
2020-03-18 | TN_US | -0.0067625 |
2020-03-18 | TB_US | -0.0434193 |
2020-03-18 | BOND_EMER | -0.0544904 |
2020-03-18 | GOLD | -0.0311081 |
2020-03-19 | EQ_US | 0.0047010 |
2020-03-19 | EQ_NONUS_DEV | 0.0000000 |
2020-03-19 | EQ_EMER | 0.0000000 |
2020-03-19 | TN_US | 0.0017021 |
2020-03-19 | TB_US | 0.0007092 |
2020-03-19 | BOND_EMER | -0.0234792 |
2020-03-19 | GOLD | 0.0011498 |
2020-03-20 | EQ_US | -0.0431907 |
2020-03-20 | EQ_NONUS_DEV | 0.0000000 |
2020-03-20 | EQ_EMER | 0.0000000 |
2020-03-20 | TN_US | 0.0144435 |
2020-03-20 | TB_US | 0.0652020 |
2020-03-20 | BOND_EMER | 0.0142077 |
2020-03-20 | GOLD | 0.0039756 |
2020-03-23 | EQ_US | -0.0292942 |
2020-03-23 | EQ_NONUS_DEV | -0.2532532 |
2020-03-23 | EQ_EMER | 0.0000000 |
2020-03-23 | TN_US | 0.0083752 |
2020-03-23 | TB_US | 0.0419162 |
2020-03-23 | BOND_EMER | -0.0215517 |
2020-03-23 | GOLD | 0.0568462 |
2020-03-24 | EQ_US | 0.0939740 |
2020-03-24 | EQ_NONUS_DEV | 0.0000000 |
2020-03-24 | EQ_EMER | 0.0000000 |
2020-03-24 | TN_US | -0.0041528 |
2020-03-24 | TB_US | -0.0051086 |
2020-03-24 | BOND_EMER | 0.0121145 |
2020-03-24 | GOLD | 0.0575354 |
2020-03-25 | EQ_US | 0.0115569 |
2020-03-25 | EQ_NONUS_DEV | 0.0000000 |
2020-03-25 | EQ_EMER | 0.0000000 |
2020-03-25 | TN_US | 0.0008340 |
2020-03-25 | TB_US | -0.0064185 |
2020-03-25 | BOND_EMER | 0.0315560 |
2020-03-25 | GOLD | -0.0174002 |
2020-03-26 | EQ_US | 0.0624207 |
2020-03-26 | EQ_NONUS_DEV | 0.0000000 |
2020-03-26 | EQ_EMER | 0.0000000 |
2020-03-26 | TN_US | 0.0016667 |
2020-03-26 | TB_US | 0.0064599 |
2020-03-26 | BOND_EMER | 0.0295359 |
2020-03-26 | GOLD | 0.0159455 |
2020-03-27 | EQ_US | -0.0336616 |
2020-03-27 | EQ_NONUS_DEV | 0.0000000 |
2020-03-27 | EQ_EMER | 0.0000000 |
2020-03-27 | TN_US | 0.0049917 |
2020-03-27 | TB_US | 0.0237484 |
2020-03-27 | BOND_EMER | -0.0081967 |
2020-03-27 | GOLD | -0.0037858 |
2020-03-30 | EQ_US | 0.0336403 |
2020-03-30 | EQ_NONUS_DEV | 0.0000000 |
2020-03-30 | EQ_EMER | 0.0000000 |
2020-03-30 | TN_US | -0.0008278 |
2020-03-30 | TB_US | -0.0050157 |
2020-03-30 | BOND_EMER | -0.0144628 |
2020-03-30 | GOLD | -0.0065711 |
2020-03-31 | EQ_US | -0.0159221 |
2020-03-31 | EQ_NONUS_DEV | 0.0000000 |
2020-03-31 | EQ_EMER | 0.0000000 |
2020-03-31 | TN_US | -0.0016570 |
2020-03-31 | TB_US | -0.0144928 |
2020-03-31 | BOND_EMER | 0.0115304 |
2020-03-31 | GOLD | -0.0283711 |
2020-04-01 | EQ_US | -0.0441380 |
2020-04-01 | EQ_NONUS_DEV | 0.0000000 |
2020-04-01 | EQ_EMER | 0.0000000 |
2020-04-01 | TN_US | 0.0008299 |
2020-04-01 | TB_US | 0.0147059 |
2020-04-01 | BOND_EMER | -0.0124352 |
2020-04-01 | GOLD | -0.0032808 |
2020-04-02 | EQ_US | 0.0230223 |
2020-04-02 | EQ_NONUS_DEV | 0.0000000 |
2020-04-02 | EQ_EMER | 0.0000000 |
2020-04-02 | TN_US | 0.0008292 |
2020-04-02 | TB_US | 0.0056711 |
2020-04-02 | BOND_EMER | 0.0000000 |
2020-04-02 | GOLD | 0.0291310 |
And you can see that we’ve recovered our original long form dataframe.
An example
One example where you’d be forced to pivot you long returns data frame to wide would be to calculate a correlation matrix:
dailyindex_df %>% pivot_wider(names_from = ticker, values_from = returns) %>% select(-date) %>% cor(use = "pairwise.complete.obs", method='pearson') %>% kable() %>% kable_styling(position = 'center') %>% scroll_box(width = '800px', height = '300px')
EQ_US | EQ_NONUS_DEV | EQ_EMER | TN_US | TB_US | BOND_EMER | GOLD | |
---|---|---|---|---|---|---|---|
EQ_US | 1.0000000 | 0.1045511 | 0.1219745 | -0.5935488 | -0.4827843 | 0.1064579 | 0.2741258 |
EQ_NONUS_DEV | 0.1045511 | 1.0000000 | 0.1216288 | -0.2460011 | -0.2961201 | 0.1418452 | -0.4301298 |
EQ_EMER | 0.1219745 | 0.1216288 | 1.0000000 | -0.0362846 | -0.2751978 | 0.1345682 | 0.1158572 |
TN_US | -0.5935488 | -0.2460011 | -0.0362846 | 1.0000000 | 0.9324163 | 0.3326693 | 0.1643955 |
TB_US | -0.4827843 | -0.2961201 | -0.2751978 | 0.9324163 | 1.0000000 | 0.3107047 | 0.2451740 |
BOND_EMER | 0.1064579 | 0.1418452 | 0.1345682 | 0.3326693 | 0.3107047 | 1.0000000 | 0.3305851 |
GOLD | 0.2741258 | -0.4301298 | 0.1158572 | 0.1643955 | 0.2451740 | 0.3305851 | 1.0000000 |
You can see that we’ve also used the select
function from dplyr
to drop the date column before passing the wide data frame of returns to the cor
function for calculating the correlation matrix.
Plotting long format data
When you want to plot more than one variable on a single chart, long data is most definitely your friend:
dailyindex_df %>% ggplot(aes(x = date, y = returns, colour = ticker)) + geom_line()
Plotting each returns series in a grid is equally simple:
dailyindex_df %>% ggplot(aes(x = date, y = returns)) + geom_line() + facet_wrap(~ticker)
Using wide-format data to make a similar plot would require repeated calls to geom_line
for each variable, which is quite painstaking and brittle.
For example, if something changes upstream, such as the addition of a new ticker to the data set, your code will also need to change in order to plot it. That’s not the case if we use long data with a column holding the ticker variable.
If you liked this you’ll probably like these too…
How to Calculate Rolling Pairwise Correlations in the Tidyverse
How to Fill Gaps in Large Stock Data Universes Using tidyr and dplyr