I’m a bit late to the party with this one, but I was recently introduced to the feather format for working with tabular data. And let me tell you, as far as reading and writing data goes, it’s fast. Really fast. Not only has it provided a decent productivity boost, but the motivation for its development really resonates with me, so I figured I’d briefly share my experiences for any other latecomers to the feather party.
What is feather?
It’s a binary file format for storing data frames – the near-universal data container of choice for data science.
Why should you care?
Have I already mentioned that reading and writing feather files is fast?
Check this out. Here I’ve created a pandas data frame with one million rows and ten columns. Here’s how long it took to write that data frame to disk using both feather and gzip:
Yes, you read that correctly: 94 milliseconds for feather versus 33 seconds for gzip!
Here’s the read time for each format:
The other thing I like about feather is that it is agnostic to your choice of the two main weapons of data science, namely Python pandas and R. In fact, the format was born out of a collaboration between two of the giants of data science – Wes McKinney, originator of the pandas project, and Hadley Wickham, to whom the R community owes a debt of gratitude for the tidyverse suite of tools. Apparently, these guys got together, lamented the lack of interoperability between Python and R for data science, and did something about it.
Noting that pandas and R data frames share many similarities, the feather format was developed in order to provide a common storage format for both. In practical terms, that means that you can store your data on disk in a format that is easy and fast to read into whatever platform you happen to be using.
This cross-platform approach really resonated with me. I often hear from readers who are agonising over the decision of which tool to use for analysing the markets – usually it’s a decision between Python and R, and you’d be amazed at how often I hear from people who get hung up on this decision. But it is completely the wrong thing to agonise over. Python and R are not mutually exclusive. Starting with one does not mean that you’re forever handcuffed to it and forbidden from using the other. They’re both wonderful tools in their own right, so why not skill up in both?
Of course, language agnosticism is nothing new. The Jupyter notebook has for a while supported both Python and R code – even in the same notebook. And now with feather you can store your data in a format that can easily be retrieved regardless of which language you happen to use. A highly personalised pick-and-choose approach to data science is the future, where you can use the tools that you like best for particular tasks, regardless of the language they were developed for or the preferences of your colleagues.
Using these sorts of tools, you have the power to implement whatever workflow is best for you or for your particular project, and even collaborate with people who like to work differently. For example, James loves the tidyverse suite of tools in R (quote: “it’s basically an in-memory SQL implementation, but with nicer grammar”), while I tend to do things faster in pandas. And we collaborate just fine doing things our own way. The point is, your choice of tool is much less important than you probably think it is. We have smart people like Wes and Hadley to thank for that luxury. Make the most of it.
What’s the catch?
There are two:
First, a feather file takes up more disk space than the same data stored as a gzip-compressed CSV. Here’s a size comparison:
    $ ls -lah
    total 92M
    drwxr-xr-x 1 Kris 197612   0 Jun  8 17:27 ./
    drwxr-xr-x 1 Kris 197612   0 Jun  8 17:27 ../
    -rw-r--r-- 1 Kris 197612 77M Jun  8 17:09 test_df.feather
    -rw-r--r-- 1 Kris 197612 15M Jun  8 17:09 test_df.gzip.csv
You can see that the feather file takes up about five times the space of the gzip file. So you’re probably not going to choose feather for long term data storage. Its primary use cases are fast input/output and cross-platform interoperability.
Secondly, feather won’t work with non-standard indexes, like date-time objects – pandas will throw an error, as in the notebook below. Thankfully the solution is simple: before feathering, reset the index to the default integer sequence and save the actual index as a column. Then, when reading back in, you simply set that column as the index:
Installation and setup
On Python, you can install using conda from the conda-forge channel with this command: conda install feather-format -c conda-forge.
You can also install using pip: pip install -U feather-format.
Also, make sure you’re using the latest pandas (0.24.x at the time of writing). There’s also a library called feather for working with this format in Python if you don’t want to use the pandas wrappers.
On R, simply install the feather library, then call library(feather).
The syntax to use feather is similar on Python pandas and R. First, pandas:
    import pandas as pd
    df = pd.read_feather('myfile.feather')

And R:

    library(feather)
    df <- read_feather('myfile.feather')
I read somewhere that R’s native RData format would read and write quicker than feather format. So if you’re on R and have no reason to save your data frames to a format that’s compatible with Python, you may find that RData is the better choice. However, I tested this quickly in a Jupyter notebook running R and didn’t find this to be the case at all:
I’m not sure why my results were so different to what I read online. Perhaps the efficiency of RData and feather scales differently depending on the size of the data frame. Perhaps feather has improved since the time of the article that I read. Perhaps a single observation isn’t enough to draw any conclusions.
In any event, knowing about the feather format has given me a not-insignificant productivity boost. I hope it is of some use to you too.