How To Get Historical S&P 500 Constituents Data For Free

In this post, we are going to construct snapshots of historical S&P 500 index constituents from data that is freely available on the internet.

Why?

Well, one of the biggest challenges in looking for opportunities amongst a broad universe of stocks is choosing what stock “universe” to look at.

One approach to dealing with this is to pick the stocks that are currently in the S&P 500 index.

Unfortunately, the stocks that are currently in the S&P 500 index weren’t all there last year. A third of them weren’t there ten years ago…

If we create a historical data set by picking current S&P 500 index constituents, then we will be including historical data for smaller stocks that weren’t in the index at that time.

These are all going to be stocks that did very well, historically, or else they wouldn’t have gotten in the index!

So this universe selection technique biases our stock returns higher.

The average past returns of current SPX constituents are higher than the average past returns of historic SPX constituents, due to this upward bias.
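To make the effect concrete, here is a minimal simulation sketch. It is not based on real data: the universe size, the lookback and the return distribution are all made-up assumptions, and every simulated stock has exactly the same expected return. The only thing that differs between the two averages is how the universe was selected.

```r
# Toy simulation of universe selection bias (all numbers are made up).
set.seed(42)

n_stocks <- 2000   # hypothetical universe size
n_years  <- 10     # hypothetical lookback
index_n  <- 500    # number of stocks in the toy "index"

# Every stock has the same true expected annual log return of 5%;
# differences between stocks are pure noise.
log_returns <- matrix(rnorm(n_stocks * n_years, mean = 0.05, sd = 0.30),
                      nrow = n_stocks, ncol = n_years)

# All stocks start at the same size, so final size is driven by cumulative return.
final_size <- rowSums(log_returns)

# "Current constituents": the index_n largest stocks at the end of the sample.
current_members <- order(final_size, decreasing = TRUE)[seq_len(index_n)]

mean_all     <- mean(log_returns)
mean_current <- mean(log_returns[current_members, ])

c(all_stocks = mean_all, current_constituents = mean_current)
```

Even though no stock is genuinely better than any other here, the stocks that ended up largest show a higher average past return, purely because that is how they were picked.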

It’s easy to see how this may cause us to overstate the returns for any analysis that is net long stocks if we’re not careful.

It’s perhaps less obvious that this will significantly bias any analysis we do on that data…

Market inefficiencies are very small and noisy at the best of times. The inefficiency we’ve artificially introduced by our universe selection will be the largest effect in our data set by far.

Of course, the careful researcher will find ways to control for these effects – but it’s nice to minimise them to start with if we can…

Historical S&P 500 Constituents Data

A better starting point for our analysis would be to look at stocks that were actually in the index at the time. For that, we need to know what the historical SPX constituents actually were.

There are several companies that will sell this data to you – but let’s try to construct it for free from data that is freely available on the internet.

Getting Current S&P 500 Constituents for Free

Wikipedia publishes the current S&P 500 component stocks on its “List of S&P 500 companies” page.

I checked this against the master data set we use in our trading at Robot Wealth (which we pay for) – and it all matches.

If we use the Chrome inspector, we can see that the S&P 500 stock constituents are in an HTML table with id #constituents.

So let’s use the rvest R package to scrape that data into a data frame.

# Load dependencies
if (!require("pacman")) install.packages("pacman")
# lubridate supplies year() and month(), which are used further down
pacman::p_load(tidyverse, rvest, lubridate)

wikispx <- read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
currentconstituents <- wikispx %>%
  html_node('#constituents') %>%
  html_table(header = TRUE)

currentconstituents
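If you want to see how the id selector behaves without hitting Wikipedia, here is a small offline sketch. The HTML below is a made-up stand-in with invented tickers, not the real page, and it assumes the rvest package is installed.

```r
library(rvest)

# A made-up, two-row stand-in for the Wikipedia page (tickers are invented).
toy_page <- read_html('
  <table id="constituents">
    <tr><th>Symbol</th><th>Security</th></tr>
    <tr><td>AAA</td><td>Alpha Corp</td></tr>
    <tr><td>BBB</td><td>Beta Inc</td></tr>
  </table>
  <table id="changes">
    <tr><th>Date</th><th>Ticker</th></tr>
    <tr><td>June 22, 2020</td><td>CCC</td></tr>
  </table>')

# '#constituents' picks out the table with that id and ignores the other one.
toy_constituents <- toy_page %>%
  html_node('#constituents') %>%
  html_table(header = TRUE)

toy_constituents
```

The same pattern applies to the real page: the selector decides which table you get, and header = TRUE turns the first row into column names.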

Getting S&P 500 Changes for Free

Wikipedia also publishes “Selected Changes to the list of S&P 500 components” on the same page.

This lists stocks that have been added or removed from the index as a result of acquisitions, or as the companies grow and shrink in market capitalisation.

I’ve checked this against our data set and it’s relatively accurate and complete back to about the year 2000. It gets less complete and accurate before then.

But we don’t need perfection here… so let’s scrape these changes.

The Chrome Inspector shows us they live in a table with id #changes.

spxchanges <- wikispx %>%
  html_node('#changes') %>%
  html_table(header = FALSE, fill = TRUE) %>%
  filter(row_number() > 2) %>% # First two rows are headers
  `colnames<-`(c('Date','AddTicker','AddName','RemovedTicker','RemovedName','Reason')) %>%
  mutate(Date = as.Date(Date, format = '%B %d, %Y'),
         year = year(Date),
         month = month(Date))

spxchanges
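The date parsing is the fragile part of that step, so it is worth checking on its own. Here is a small sketch using made-up date strings in the format the code above expects. It assumes an English (or C) locale, because %B matches month names in the current locale.

```r
# Made-up examples in the "Month day, Year" format, plus one bad value.
raw_dates <- c('June 22, 2020', 'October 7, 2019', 'not a date')

parsed <- as.Date(raw_dates, format = '%B %d, %Y')

data.frame(
  raw   = raw_dates,
  Date  = parsed,
  year  = as.integer(format(parsed, '%Y')),
  month = as.integer(format(parsed, '%m'))
)

# Rows that fail to parse become NA and would silently never match a month
# in a later filter, so it is worth counting them.
sum(is.na(parsed))
```

If the count of unparsed dates on the real table is more than a handful, the table layout has probably changed and the column names need checking.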

Create Monthly Snapshot of S&P 500 Index Constituents

Now we’re going to use this data to create monthly snapshots of what the SPX index used to look like.

To do this we:

- start at the current S&P 500 index constituents
- iterate backwards a month at a time and:
  - add back the stocks that were removed
  - remove the stocks that were added

If that sounds back to front, it’s because we are working backwards in time through the data!
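Here is a single backwards step sketched on a toy example. The tickers and the change record are made up, and the column names simply mirror the ones used in this post. It assumes dplyr is installed.

```r
library(dplyr)

# Made-up current membership and a single made-up change record for the month.
current <- tibble(Ticker = c('AAA', 'BBB', 'CCC'))
changes <- tibble(AddTicker = 'CCC', RemovedTicker = 'DDD')

# Undo the addition: drop anything that was added during the month.
kept <- current %>% anti_join(changes, by = c('Ticker' = 'AddTicker'))

# Undo the removal: put back anything that was removed during the month.
restored <- changes %>%
  filter(RemovedTicker != '') %>%
  transmute(Ticker = RemovedTicker)

previous <- bind_rows(kept, restored)
previous
```

CCC joined and DDD left during the month, so the membership before those changes is AAA, BBB and DDD.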

# Start at the current constituents...
currentmonth <- as.Date(format(Sys.Date(), '%Y-%m-01'))
monthseq <- seq.Date(as.Date('1990-01-01'), currentmonth, by = 'month') %>% rev()

spxstocks <- currentconstituents %>% mutate(Date = currentmonth) %>% select(Date, Ticker = Symbol, Name = Security)
lastrunstocks <- spxstocks

# Iterate through months, working backwards
for (i in 2:length(monthseq)) {
  d <- monthseq[i]
  y <- year(d)
  m <- month(d)
  changes <- spxchanges %>% 
    filter(year == y, month == m) 

  # Remove added tickers (we're working backwards in time, remember)
  tickerstokeep <- lastrunstocks %>% 
    anti_join(changes, by = c('Ticker' = 'AddTicker')) %>%
    mutate(Date = d)
  
  # Add back the removed tickers...
  tickerstoadd <- changes %>%
    filter(RemovedTicker != '') %>%
    transmute(Date = d,
              Ticker = RemovedTicker,
              Name = RemovedName)
  
  thismonth <- tickerstokeep %>% bind_rows(tickerstoadd)
  spxstocks <- spxstocks %>% bind_rows(thismonth)  
  
  lastrunstocks <- thismonth
}
spxstocks

We’ve done it!

We have a free data set of historical SPX constituents going back to 1990.

It’s not going to be perfect, because it’s from Wikipedia, but it’s a much better starting point for a universe from which to investigate cross-sectional effects in large-cap equities.

Let’s sense check some things by plotting the number of stocks in the index by date:

spxstocks %>%
  group_by(Date) %>%
  summarise(count = n()) %>%
  ggplot(aes(x=Date, y=count)) +
    geom_line() +
    ggtitle('Count of historic SPX constituents by Date')

It looks reasonable.
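A plot is a good first look, but a couple of numeric checks are cheap too. The sketch below uses made-up snapshots (not the scraped data) to show one check I’d consider: comparing the row count with the number of distinct tickers in each snapshot. Duplicates are one way that incomplete change records could show up. It assumes dplyr is installed.

```r
library(dplyr)

# Made-up snapshots: two months, with one ticker accidentally duplicated in May.
toy_snapshots <- tibble(
  Date   = as.Date(c('2020-04-01', '2020-04-01',
                     '2020-05-01', '2020-05-01', '2020-05-01')),
  Ticker = c('AAA', 'BBB', 'AAA', 'BBB', 'BBB')
)

# Number of rows and number of distinct tickers per snapshot.
snapshot_counts <- toy_snapshots %>%
  group_by(Date) %>%
  summarise(count = n(), distinct_tickers = n_distinct(Ticker))

# Snapshots where the two differ contain duplicated tickers.
suspect <- snapshot_counts %>% filter(count != distinct_tickers)
suspect
```

On the real data you would run the same summary on spxstocks and also look at how far each month’s count strays from 500.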

We should probably be increasingly wary about its accuracy the further back we go in time. But, that’s fine, we often have to make do.

We don’t need perfection – but we do need to be acutely aware of the various ways we might be biasing our results.

Using the Data to Quantify Universe Selection Bias

Now let me show you how I might use this for some analysis.

I want to illustrate the extent to which universe selection biases the returns from the universe.

I have a dataframe of price data called prices from our research data set which has a ton of daily price observations for listed and delisted stocks.

I’m going to join these prices to the historical list of SPX constituents, and create a column called inSPX which indicates whether that stock was in the SPX index that month.

distinct_tickers <- unique(spxstocks$Ticker) 
# Get the stock prices
prices_df <- prices %>% filter(date >= '1990-01-01', ticker %in% distinct_tickers)

spxstocks <- spxstocks %>%
  mutate(month = month(Date),
         year = year(Date),
         ticker = Ticker)

prices_df <- prices_df %>%
  mutate(month = month(date),
         year = year(date)) %>%
  left_join(spxstocks, by = c('month','year','ticker')) %>%
  mutate(inSPX = !is.na(Ticker)) %>%
  select(ticker, date, open, high, low, close, volume, dividends, closeunadj, inSPX)

The data looks like this:

This is a good way of organising things because:

- I keep the history of all prices for all tickers that ever appear in my dataset. This means I can calculate features on tickers over lookback windows when they weren’t in the index.
- I can filter on inSPX == TRUE to get the state of the index at any point. I’d only do returns analysis on stuff that was actually in the index.
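The join is easier to follow on a tiny example. The prices and the snapshot below are made up, and the sketch uses format() rather than lubridate to pull out year and month so that it only depends on dplyr.

```r
library(dplyr)

# Made-up daily prices for one ticker across two months.
toy_prices <- tibble(
  ticker = 'AAA',
  date   = as.Date(c('2020-04-15', '2020-04-16', '2020-05-15')),
  close  = c(10, 11, 12)
)

# Made-up monthly snapshots: AAA is only in the index in May 2020.
toy_snapshots <- tibble(
  Date   = as.Date('2020-05-01'),
  Ticker = 'AAA'
) %>%
  mutate(year   = as.integer(format(Date, '%Y')),
         month  = as.integer(format(Date, '%m')),
         ticker = Ticker)

# Left join keeps every price row; rows with no matching snapshot get NA in
# Ticker, which is what the inSPX flag is built from.
toy_flagged <- toy_prices %>%
  mutate(year  = as.integer(format(date, '%Y')),
         month = as.integer(format(date, '%m'))) %>%
  left_join(toy_snapshots, by = c('month', 'year', 'ticker')) %>%
  mutate(inSPX = !is.na(Ticker)) %>%
  select(ticker, date, close, inSPX)

toy_flagged
```

The two April rows are kept with inSPX set to FALSE and the May row is flagged TRUE, so the price history survives even for months outside the index.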

Now I’m going to look at the difference in past mean returns in current constituents when they were in the index, vs when they were not.

Here we plot the mean daily returns since 2000 of stocks that are currently in the index, split into periods when they were not in the index (red) and periods when they were (green).

returns <- prices_df %>% 
  filter(date >= '2000-01-01') %>%
  group_by(ticker) %>%
  arrange(date) %>%
  mutate(totalreturns = (close / lag(close) - 1)) %>%
  filter(!is.na(totalreturns)) # drop only rows with no return (first row per ticker)

# The most recent snapshot holds the current constituents
current_tickers <- spxstocks %>% filter(Date == max(Date)) %>% pull(Ticker)

returns %>%
  filter(ticker %in% current_tickers) %>%
  group_by(inSPX) %>%
  summarise(meanreturnpct = mean(totalreturns) * 100) %>%
  ggplot(aes(x=inSPX, y = meanreturnpct, fill = inSPX)) + 
    geom_bar(stat='identity') + 
    ggtitle('Mean Daily Returns of current SPX constituents') +
    theme_bw()

You can see that including past returns data for current constituents that weren’t in the index at the time will significantly bias the mean returns of the data set.

The mean daily return of the current constituents when they weren’t in the index was almost twice as high as it has been since they joined it.
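To show the mechanics of that comparison without our price data, here is a sketch on made-up numbers. The tickers, prices and inSPX flags are invented, so the output says nothing about real stocks. It assumes dplyr is installed.

```r
library(dplyr)

# Made-up prices: AAA rallies before it joins the index, BBB is a member throughout.
toy <- tibble(
  ticker = rep(c('AAA', 'BBB'), each = 4),
  date   = rep(as.Date('2020-01-01') + 0:3, times = 2),
  close  = c(100, 110, 121, 121,   50, 50, 55, 55),
  inSPX  = c(FALSE, FALSE, FALSE, TRUE,   TRUE, TRUE, TRUE, TRUE)
)

# Simple daily return per ticker: today's close over yesterday's close, minus 1.
toy_returns <- toy %>%
  group_by(ticker) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(ret = close / lag(close) - 1) %>%
  ungroup() %>%
  filter(!is.na(ret))   # the first day of each ticker has no prior close

# Mean daily return (in percent) for days outside vs inside the index.
toy_summary <- toy_returns %>%
  group_by(inSPX) %>%
  summarise(meanreturnpct = mean(ret) * 100)

toy_summary
```

In this toy set the pre-inclusion days average a much higher return than the in-index days, which is the same pattern the bar chart shows on the real data.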

Now that you know, you can control for it in your analysis. And you now have the code to build yourself a less biased data set…

Want all the Code?

You can get the code and data from our GitHub repository here.

6 thoughts on “How To Get Historical S&P 500 Constituents Data For Free”

fja0568
March 3, 2021 at 6:10 am

I have a similar project that I keep current on github, “Historical Lists of S&P 500 components since 1996”. https://github.com/fja05680/sp500

Keith Bines
November 1, 2020 at 12:35 am

Hi and thanks for the post.
Have you extended this to calculate the historical constituents weights?
Any recommendations on how to go about doing this?

keith

Jon
March 24, 2021 at 12:16 am

Thanks for yet another interesting and valuable post. Do you know please where we can find free historical data for non-US government bonds online?