Modern data science is fundamentally multi-lingual.
At a minimum, most data scientists are comfortable working in R, Python and SQL; many add Java and/or Scala to their toolkit, and it’s not uncommon to also know one’s way around JavaScript.
Personally, I prefer to use R for data analysis. But, until recently, I’d tend to reach for Python for anything more general, like scraping web data or interacting with an API. Tools for doing this sort of thing in R’s tidyverse are really maturing, so I’m doing more and more of this without leaving R.
But I also have a pile of Python scripts that I used to lean on, and it would be nice to be able to continue to leverage that past work. Other data scientists who work in bigger teams would likely have even more of a need to switch contexts regularly.
Reticulate to the rescue
Thanks to the reticulate package (install.packages('reticulate')) and its integration with RStudio, we can run our Python code without ever leaving the comfort of home.
Some useful features of reticulate include:
- Ability to call Python flexibly from within R:
  - sourcing Python scripts
  - importing Python modules
  - using Python interactively in an R session
  - embedding Python code in an R Markdown document
- Direct object translation (e.g. pandas.DataFrame to data.frame, numpy.array to matrix, etc.)
- Ability to bind to different Python environments
For me, the main benefit of reticulate is streamlining my workflow. In this post, I’ll share an example. It’s trivial and we could replace this Python script with R code in no time at all, but I’m sure you have more complex Python scripts that you don’t feel like re-writing in R…
Scraping ETF Constituents with Python from RStudio
I have a Python script, download_spdr_holdings.py, for scraping ETF constituents from the SPDR website:
"""Download ETF holdings to CSV files."""
import pandas as pd


def get_holdings(spdr_ticker):
    """Download the holdings of one SPDR ETF, save them to CSV and return them."""
    url = f'http://www.sectorspdr.com/sectorspdr/IDCO.Client.Spdrs.Holdings/Export/ExportCsv?symbol={spdr_ticker}'
    # to_csv() returns None when given a path, so keep the DataFrame separately
    df = pd.read_csv(url, skiprows=1)
    df.to_csv(f'{spdr_ticker}_holdings.csv', index=False)
    return df


if __name__ == "__main__":
    tickers = ['XLB', 'XLE', 'XLF', 'XLI', 'XLK', 'XLP', 'XLU', 'XLV', 'XLY']
    for t in tickers:
        get_holdings(t)
This simple script contains a function for saving the current constituents of a SPDR ETF to a CSV file. When called as a module (python -m download_spdr_holdings), the script loops through a bunch of ETF tickers and saves their constituents to individual CSV files.
The intent is that these CSV files then get read into an R session where any actual analysis takes place.
With reticulate, I can remove the disk I/O operations and read my data directly into my R session, using my existing Python script.
Connect reticulate to Python
First, I need to tell reticulate about the Python environment I want it to use. reticulate is smart enough to use the version of Python found on your PATH by default, but I have a Conda environment running Python 3.7 named “py37” that I’d like to use. Hooking reticulate into that environment is as easy as doing:
library(reticulate)
reticulate::use_condaenv("py37")
reticulate is flexible in its ability to hook into your various Python environments. In addition to use_condaenv() for Conda environments, there’s use_virtualenv() for virtual environments and use_python() to specify a Python version that isn’t on your PATH.
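To double-check which environment ended up bound, one option is to ask Python itself from a {python} chunk. The following is only a minimal sketch: the helper name describe_interpreter is hypothetical (it is not part of reticulate), and in an embedded session the reported values may differ slightly from what a standalone interpreter would show.

```python
import sys


def describe_interpreter():
    """Return the environment root and version of the running Python."""
    # sys.prefix usually points at the root of the active environment
    version = ".".join(str(part) for part in sys.version_info[:3])
    return {"prefix": sys.prefix, "version": version}


info = describe_interpreter()
print(info["prefix"])
print(info["version"])
```

On the R side, reticulate::py_config() reports similar information.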
Bring Python code to R
To use my Python script as is directly in RStudio, I could source it by doing reticulate::source_python("download_spdr_holdings.py").
This will cause the Python script to run as if it were called from the command line as a module and will loop through all the tickers and save their constituents to CSV files as before. It will also add the function get_holdings to my R session, and I can call it as I would any R function.
For instance, get_holdings('XLF') will scrape the constituents of the XLF ETF and save them to disk.
Pretty cool, no?
However, the point of this exercise was to skip the disk I/O operations and read the ETF constituents directly into my R session. So I would need to modify my Python def and call source_python() again. I could also just copy the modified def directly into an R Markdown notebook (I just need to specify my chunk as {python} rather than {r}):
import pandas as pd


def get_holdings(spdr_ticker):
    """Read in ETF holdings."""
    url = f"http://www.sectorspdr.com/sectorspdr/IDCO.Client.Spdrs.Holdings/Export/ExportCsv?symbol={spdr_ticker}"
    # keep only the first three columns of the export
    df = pd.read_csv(url, skiprows=1, usecols=list(range(3)))
    return df
I now have the get_holdings function in my R session, and can call it as if it were an R function attached to the py object that reticulate creates to hold the Python session:
library(tidyverse)

xlf <- py$get_holdings('XLF')
xlf %>%
  arrange(desc(`Index Weight`)) %>%
  head(10)
##    Symbol                  Company Name Index Weight
## 1     BAC          Bank of America Corp        7.39%
## 2     WFC              Wells Fargo & Co        3.90%
## 3       C                 Citigroup Inc        3.86%
## 4    SPGI                S&P Global Inc        3.09%
## 5     CME               CME Group Inc A        2.72%
## 6     BLK                 BlackRock Inc        2.47%
## 7     AXP           American Express Co        2.37%
## 8      GS       Goldman Sachs Group Inc        2.34%
## 9     MMC    Marsh & McLennan Companies        2.25%
## 10    ICE Intercontinental Exchange Inc        2.18%
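One thing to watch in the output above: the Index Weight values carry a percent sign, so they most likely arrive as text, and text sorting would put "7.39%" ahead of a larger weight such as "11.50%". Here is a minimal sketch of converting the weights to numbers on the Python side before sorting. The helper parse_weight is hypothetical, and the third sample row (ticker "XXX") is made up purely to show the difference.

```python
def parse_weight(text):
    """Convert a percentage string such as '7.39%' to the float 7.39."""
    return float(text.strip().rstrip("%"))


# Two rows taken from the output above plus one made-up row ("XXX")
# to contrast text ordering with numeric ordering.
rows = [
    {"Symbol": "BAC", "Index Weight": "7.39%"},
    {"Symbol": "WFC", "Index Weight": "3.90%"},
    {"Symbol": "XXX", "Index Weight": "11.50%"},  # hypothetical row
]

# Sorting the raw strings compares them character by character
by_text = sorted(rows, key=lambda r: r["Index Weight"], reverse=True)
# Sorting the parsed values compares them as numbers
by_number = sorted(rows, key=lambda r: parse_weight(r["Index Weight"]), reverse=True)

print([r["Symbol"] for r in by_text])
print([r["Symbol"] for r in by_number])
```

Inside get_holdings, the same conversion could be applied with df['Index Weight'].map(parse_weight) before returning the data frame, assuming the column really is read as strings.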
Notice that to use the def from the Python session embedded in my R session, I had to ask for it using py$object_name. This is different from sourcing a Python file directly, in which case the Python function becomes available directly in the R session (i.e. I don’t need py$).
Importing Python modules
Importing Python modules with reticulate::import() produces the same behaviour:
np <- import("numpy")
np$array(list(c(1, 2, 3), c(4, 5, 6)))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
Notice that my numpy array is created using R list objects in a manner analogous to Python lists: np.array([[1, 2, 3], [4, 5, 6]]).
REPL-ing Python in RStudio
You can also open an interactive Python session within R by calling reticulate::repl_python(). Any objects created within the Python session are available in the R session via the py object.
Conclusion
With reticulate you can run your Python scripts in RStudio. It embeds a Python session within an R session, and allows you to pass objects between the two sessions.
Thanks for all the great stuff from Robot Wealth. If you used Python rather than R in general, then Robot Wealth would be my home page.
Is there any discussion on Robot Wealth about when R would be more useful, and when would Python? With limited time it is difficult to decide whether to commit to R when you are already competent in Python and have so many other demands on learning time. Do you think R will still have any advantages over Python in some contexts in 5 years time?
Our view on that is:
R is more productive for data analysis and has better libraries (especially for finance, derivative pricing and time series analysis).
Python is a better all-purpose programming language.
So we use R for all interactive data analysis (where possible) and Python for most plumbing tasks.
Most of our data processing pipeline is written in Python and SQL. Most of our execution code is in C or Java. Most of our research is in R, and some is in Python.
Thanks James. Would you mind expanding on when that research (mostly in R, some in Python) might be in Python and when in R?
We like to use the best tool for the job. For data analysis, that’s nearly always R. I love Python too and we use it extensively, just not in the things that we usually show on the blog (as those things are generally related to data analysis).
My personal view is that even if you’re an experienced Python coder, learning R for data analysis pays immense dividends in terms of productivity.
Thanks Kris. You and James taking the time to answer is really appreciated. I understand that R’s relative strengths lie in data analysis, research and statistics, and I’ve heard good things about the tidyverse and RStudio, but I was really wondering about specifics: what can R do that Python cannot do as well or as easily? Is pandas really behind R’s equivalent when it comes to time series, for example? If R is still ahead in some specifics, do you think that there are Python packages that are catching up? Thanks again and all the best, Jon.
Having just finished a data science bootcamp, I’d say Python is probably what you want to use for more general ML algos (your random forests, XGBoost, etc.), since it’s very easy to get a model set up, and it’s probably easier to work with the deep learning stuff (keras, etc.).
But for quantitative finance, R blows Python out of the water. There are just so many more libraries devoted to quantitative finance, like xts, zoo, quantmod, PerformanceAnalytics, PortfolioAnalytics, blotter/quantstrat, etc.
These aren’t libraries that some student can just port over in his free time, since they’re libraries written by very high-level practitioners in industry over many years. And if you need those specific tools, Python is completely outclassed. But even the basic portfolio management stuff is just much easier in R than Python.
No problem Jon. I wouldn’t say it’s so much about pandas being behind the tidyverse tools. It’s just different. In my experience, the biggest benefit of choosing R for data analysis is that you can be incredibly productive in a relatively short amount of time. It leverages functional programming concepts, which are a really nice fit for data analysis problems generally, and allows you to structure an analysis workflow that matches the way you’d intuitively think about a problem. There are a bunch of specific examples of tidyverse workflows on the blog. If you’re interested, it’s worth your time to look at them and think about how you’d solve the same problem in pandas.
But if I were you I’d just bite the bullet and learn R!! After all, R and Python don’t represent an all-or-nothing choice. Being fluent in both is a superpower.
Ilya makes some very good points about the R packages for quant finance in one of the other comments too. That’s extremely relevant.
Thanks loads Kris and Ilya. Those answers definitely take me a step forward and that is much appreciated.
Hi
Thanks for your descriptions.
I want to run a command in the terminal from an R script.
In the past, I used a Python script and ran the following commands:
os.chdir('../Routing/SourceCode')
os.system('./rout ../../RoutingSetup/Hableh.txt')
Is there a way to run these commands in R?
Amazing!
Thanks a lot!!!
Very worthwhile comments, thank you all, especially Kris and Ilya. I’m an R practitioner who is venturing into Python and needs to build criteria for how and when to use each tool and bring out the best of both.
Another good article that sets out the framework for R and Python in RStudio comes from Posit: https://support.posit.co/hc/en-us/articles/1500007929061-Using-Python-with-the-RStudio-IDE