How to Run Python from R Studio

Posted on May 15, 2020 by Kris Longmore

Modern data science is fundamentally multi-lingual.

At a minimum, most data scientists are comfortable working in R, Python and SQL; many add Java and/or Scala to their toolkit, and it’s not uncommon to also know one’s way around JavaScript.

Personally, I prefer to use R for data analysis. But, until recently, I’d tend to reach for Python for anything more general, like scraping web data or interacting with an API. Tools for doing this sort of thing in R’s tidyverse are really maturing, so I’m doing more and more of this without leaving R.

But I also have a pile of Python scripts that I used to lean on, and it would be nice to be able to continue to leverage that past work. Other data scientists who work in bigger teams would likely have even more of a need to switch contexts regularly.

Reticulate to the rescue

Thanks to the reticulate package (install.packages('reticulate')) and its integration with R Studio, we can run our Python code without ever leaving the comfort of home.

Some useful features of reticulate include:

  • Ability to call Python flexibly from within R:
    • sourcing Python scripts
    • importing Python modules
    • using Python interactively in an R session
    • embedding Python code in an R Markdown document
  • Direct object translation (eg pandas.DataFrame ↔ data.frame, numpy.array ↔ matrix etc)
  • Ability to bind to different Python environments

For me, the main benefit of reticulate is streamlining my workflow. In this post, I’ll share an example. It’s trivial and we could replace this Python script with R code in no time at all, but I’m sure you have more complex Python scripts that you don’t feel like re-writing in R…

Scraping ETF Constituents with Python from R Studio

I have a Python script, download_spdr_holdings.py for scraping ETF constituents from the SPDR website:

"""download ETF holdings to csv file"""
import pandas as pd

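def get_holdings(spdr_ticker):

    url = f'http://www.sectorspdr.com/sectorspdr/IDCO.Client.Spdrs.Holdings/Export/ExportCsv?symbol={spdr_ticker}'
    df = pd.read_csv(url, skiprows=1)
    # write the file as a separate step: to_csv() returns None when given a path,
    # so chaining it onto read_csv() would leave df empty
    df.to_csv(f'{spdr_ticker}_holdings.csv', index=False)

    return df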


if __name__ == "__main__":

    tickers = ['XLB', 'XLE', 'XLF', 'XLI', 'XLK', 'XLP', 'XLU', 'XLV', 'XLY']

    for t in tickers:
        get_holdings(t)

This simple script contains a function for saving the current constituents of a SPDR ETF to a CSV file. When run as a module (python -m download_spdr_holdings), the script loops through a list of ETF tickers and saves their constituents to individual CSV files.

The intent is that these CSV files then get read into an R session where any actual analysis takes place.

With reticulate, I can remove the disk I/O operations and read my data directly into my R session, using my existing Python script.

Connect reticulate to Python

First, I need to tell reticulate about the Python environment I want it to use. reticulate is smart enough to use the version of Python found on your PATH by default, but I have a Conda environment running Python 3.7 named “py37” that I’d like to use. Hooking reticulate into that environment is as easy as doing:

library(reticulate)
reticulate::use_condaenv("py37")

reticulate is flexible in its ability to hook into your various Python environments. In addition to use_condaenv() for Conda environments, there’s use_virtualenv() for virtual environments and use_python() to specify a Python version that isn’t on your PATH.
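One way to confirm that the binding picked up the environment you expected is to ask Python itself, for example from a {python} chunk or the Python REPL. This is only a minimal sketch using the standard library; the variable names are illustrative, and reticulate's own py_config() reports similar information from the R side.

```python
import platform
import sys

# Root directory of the active Python environment
# (for a Conda environment this is normally the environment's own folder)
env_root = sys.prefix

# Version of the interpreter that is actually running, e.g. "3.7.x"
version = platform.python_version()

print(f"environment root: {env_root}")
print(f"python version:   {version}")
```

If the printed root is not the "py37" environment, the use_condaenv() call most likely ran after Python had already been initialised in the session.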

Bring Python code to R

To use my Python script as is directly in R Studio, I could source it by doing reticulate::source_python("download_spdr_holdings.py").

This will cause the Python script to run as if it were called from the command line as a module and will loop through all the tickers and save their constituents to CSV files as before. It will also add the function get_holdings to my R session, and I can call it as I would any R function.

For instance, get_holdings('XLF') will scrape the constituents of the XLF ETF and save them to disk.

Pretty cool, no?
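The ticker loop runs on sourcing because of the `if __name__ == "__main__"` guard: the file is evaluated as Python's main module. The sketch below illustrates that mechanism in plain Python with the standard library's runpy, using a hypothetical stand-in for get_holdings that records tickers instead of downloading anything. It is not how reticulate is implemented, just a way to see the guard switch on and off.

```python
import runpy
import tempfile
from pathlib import Path

# Hypothetical miniature of the script: get_holdings only records the ticker.
SCRIPT = '''
calls = []

def get_holdings(ticker):
    calls.append(ticker)  # stand-in for the real download

if __name__ == "__main__":
    for t in ["XLB", "XLE"]:
        get_holdings(t)
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_script.py"
    path.write_text(SCRIPT)

    # Evaluated as the main module: the guarded loop executes.
    as_main = runpy.run_path(str(path), run_name="__main__")

    # Evaluated under another module name: only the definitions run.
    as_module = runpy.run_path(str(path), run_name="demo_script")

print(as_main["calls"])    # ['XLB', 'XLE']
print(as_module["calls"])  # []
```

If you only want the function definitions and not the downloads, one option is to move the loop into a separate script and keep the functions in a file you source.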

However, the point of this exercise was to skip the disk I/O operations and read the ETF constituents directly into my R session. So I would need to modify my Python def and call source_python() again. I could also just copy the modified def directly into an R Markdown notebook (I just need to specify my chunk as {python} rather than {r}):

import pandas as pd

def get_holdings(spdr_ticker):

    """read in ETF holdings"""

    url = f"http://www.sectorspdr.com/sectorspdr/IDCO.Client.Spdrs.Holdings/Export/ExportCsv?symbol={spdr_ticker}"
    # keep only the first three columns
    df = pd.read_csv(url, skiprows=1, usecols=list(range(3)))

    return df

I now have the get_holdings function in my R session, and can call it as if it were an R function attached to the py object that reticulate creates to hold the Python session:

library(tidyverse)

xlf <- py$get_holdings('XLF')
xlf %>%
  arrange(desc(`Index Weight`)) %>%
  head(10)
##    Symbol                  Company Name Index Weight
## 1     BAC          Bank of America Corp        7.39%
## 2     WFC              Wells Fargo & Co        3.90%
## 3       C                 Citigroup Inc        3.86%
## 4    SPGI                S&P Global Inc        3.09%
## 5     CME               CME Group Inc A        2.72%
## 6     BLK                 BlackRock Inc        2.47%
## 7     AXP           American Express Co        2.37%
## 8      GS       Goldman Sachs Group Inc        2.34%
## 9     MMC    Marsh & McLennan Companies        2.25%
## 10    ICE Intercontinental Exchange Inc        2.18%

Notice that to use the def from the Python session embedded in my R session, I had to ask for it using py$object_name – this is different than if I sourced a Python file directly, in which case the Python function becomes available directly in the R session (ie I don’t need py$).
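The read_csv arguments in the modified get_holdings can be tried offline. The sketch below feeds pandas an invented in-memory sample that mimics the layout those arguments imply (one title line, then a header row with more than three columns); the title line and the extra columns are assumptions, not the real SPDR file. It also shows a hypothetical extra step: Index Weight arrives as text such as "7.39%", so sorting on it compares strings, and converting it to a number makes the ordering numeric.

```python
import io

import pandas as pd

# Invented sample data: a title line, a header row and two holdings.
SAMPLE = """Hypothetical title line that skiprows=1 discards
Symbol,Company Name,Index Weight,Extra A,Extra B
BAC,Bank of America Corp,7.39%,x,1
WFC,Wells Fargo & Co,3.90%,y,2
"""

# skiprows=1 drops the title line; usecols keeps the first three columns
df = pd.read_csv(io.StringIO(SAMPLE), skiprows=1, usecols=list(range(3)))
print(list(df.columns))  # ['Symbol', 'Company Name', 'Index Weight']

# Hypothetical clean-up: strip the percent sign and convert to float
df["Index Weight"] = df["Index Weight"].str.rstrip("%").astype(float)
print(df["Index Weight"].tolist())  # [7.39, 3.9]
```

Doing the conversion inside the Python function would mean the column reaches R as a numeric vector rather than as character strings.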

Importing Python modules

Modules imported with reticulate::import() behave in a similar way: the module comes back as an R object, and I call its functions with $:

np <- import("numpy")

np$array(list(c(1, 2, 3), c(4, 5, 6)))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Notice that my numpy array is created using R list objects in a manner analogous to Python lists: np.array([[1, 2, 3], [4, 5, 6]]).
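For comparison, here is a minimal Python-side sketch of the same array. It assumes the R vectors c(1, 2, 3) and c(4, 5, 6) arrive as floating-point values, since R's default numeric type is double, which is why floats are used below.

```python
import numpy as np

# Nested Python lists play the role of the R list of vectors
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(arr.shape)  # (2, 3): two rows, three columns, like the R matrix
print(arr[1, 0])  # 4.0: Python indexes from 0, so this is [2, 1] in R
```

The indexing difference is worth keeping in mind when moving between the two languages: the element R shows at [2, 1] is arr[1, 0] in Python.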

REPL-ing Python in R Studio

You can also open an interactive Python session within R by calling reticulate::repl_python(). Any objects created within the Python session are available in the R session via the py object.

Conclusion

With reticulate you can run your Python scripts in RStudio. It embeds a Python session within an R session, and allows you to pass objects between the two sessions.

(11) Comments

JonB451
May 21, 2020 at 8:46 pm

Thanks for all the great stuff from Robot Wealth. If you used Python rather than R in general, then Robot Wealth would be my home page.

JonB451
May 22, 2020 at 12:12 am

Is there any discussion on Robot Wealth about when R would be more useful, and when Python would? With limited time it is difficult to decide whether to commit to R when you are already competent in Python and have so many other demands on learning time. Do you think R will still have any advantages over Python in some contexts in 5 years’ time?

May 22, 2020 at 3:27 pm

Our view on that is:

R is more productive for data analysis and has better libraries (especially for finance, derivative pricing and time series analysis).

Python is a better all-purpose programming language.

So we use R for all interactive data analysis (where possible) and Python for most plumbing tasks.

Most of our data processing pipeline is written in python and SQL. Most of our execution code is in C or Java. Most of our research is in R, and some is in python.

JonB451
May 25, 2020 at 12:30 am

Thanks James. Would you mind expanding on when that research (mostly in R, some in Python) might be in Python and when in R?

May 25, 2020 at 3:06 pm

We like to use the best tool for the job. For data analysis, that’s nearly always R. I love Python too and we use it extensively, just not in the things that we usually show on the blog (as those things are generally related to data analysis).

My personal view is that even if you’re an experienced Python coder, learning R for data analysis pays immense dividends in terms of productivity.

Jon
May 26, 2020 at 10:50 pm

Thanks Kris. You and James taking the time to answer is really appreciated. I understand that R’s relative strengths lie in data analysis, research and statistics, and I’ve heard good things about Tidyverse and R Studio, but I was really wondering about the specifics of what R can do that Python cannot do as well or as easily. Is Pandas really behind R’s equivalent when it comes to time series, for example? If R is still ahead in some specifics, do you think that there are Python packages that are catching up? Thanks again and all the best, Jon.

May 28, 2020 at 7:51 am

Python, from having just finished a data science bootcamp, is probably what you want to use for things like more general ML algos (your random forests, XG boosts, etc.), since it’s very easy to get a model set up, and probably easier to work with the deep learning stuff (keras, etc.)

But for quantitative finance, R blows Python out of the water. There are just so many more libraries devoted to quantitative finance, like xts, zoo, quantmod, PerformanceAnalytics, PortfolioAnalytics, blotter/quantstrat, etc.

These aren’t libraries that some student can just port over in his free time, since they’re libraries written by very high-level practitioners in industry over many years. And if you need those specific tools, Python is completely outclassed. But even the basic portfolio management stuff is just much easier in R than Python.

May 28, 2020 at 8:29 am

No problem Jon. I wouldn’t say it’s so much about pandas being behind the tidyverse tools – it’s just different. In my experience, the biggest benefit of choosing R for data analysis is that you can be incredibly productive in a relatively short amount of time. It leverages functional programming concepts, which are a really nice fit for data analysis problems generally, and allows you to structure an analysis workflow that matches the way you’d intuitively think about a problem. There are a bunch of specific examples of tidyverse workflows on the blog – if you’re interested it’s worth your time to look at them and think about how you’d solve the same problem in pandas.

But if I were you I’d just bite the bullet and learn R!! After all, R and python don’t represent an all or nothing choice. Being fluent in both is a superpower.

Illya makes some very good points about the R packages for quant finance in one of the other comments too. That’s extremely relevant.

Jon
May 31, 2020 at 8:58 pm

Thanks loads Kris and Ilya. Those answers definitely take me a step forward and that is much appreciated.

Hossein
November 4, 2020 at 10:37 am

Hi
Thanks for your descriptions.
I want to run a command in the terminal from an R script.
In the past, I used a Python script and ran the following commands:

os.chdir('../Routing/SourceCode')
os.system('./rout ../../RoutingSetup/Hableh.txt')

Is there a way to run these commands in R?
