The changing face of market data providers
Over the last few years, a number of new market data providers have come online. They tend to have modern websites, broad coverage, and well-documented RESTful APIs. Their services are often priced very competitively – especially for personal use – and usually have generous free tiers.
One such newcomer is finnhub.io. Its offering includes stock, bond, crypto, and FX historical price data, plus real-time trades and quotes. Their fundamental data offering is noticeably broad and includes current values and point-in-time snapshots for numerous metrics. There are also some interesting alternative data sets, including measures of social media sentiment, insider transactions and insider sentiment, senate lobbying, government spending, and others. In addition, there’s a real-time newsfeed delivered over websockets.
The free tier is quite generous and offers more than enough for doing proof of concept work and testing ideas. While you only get a year’s worth of historical data per API call on the free tier (more if you specify a lower resolution, like monthly), you can make up to 50 calls per minute.
In this post, we’ll explore the finnhub.io free tier via its REST API.
Get started
A nice Python library for working with the finnhub.io API is available: pip install finnhub-python
To access the API, you’ll need an API key. Get one here.
To get started, import the libraries we need and set up a finnhub.Client with your API key:
import finnhub
import os
import time
import datetime
from zoneinfo import ZoneInfo
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Set up client
# Note the FINNHUB_KEY environment variable stores my API key
finnhub_client = finnhub.Client(api_key=os.environ['FINNHUB_KEY'])
OHLCV stock prices
The stock prices endpoint requires a symbol, a resolution (‘D’ for daily data), and a date range consisting of ‘from’ and ‘to’ values as UNIX timestamps.
We’ll get a week’s worth of AAPL data from February 2023.
# helper function for making UNIX timestamps
def unix_timestamp_from_date(date, format='%Y-%m-%d'):
    '''Transform human readable date string to UNIX timestamp'''
    return int(
        datetime.datetime.strptime(date, format)
        .replace(tzinfo=ZoneInfo('US/Eastern'))
        .timestamp()
    )
# api query parameters
symbol = 'AAPL'
resolution = 'D'
from_date = unix_timestamp_from_date('2023-02-06')
to_date = unix_timestamp_from_date('2023-02-10')
# make request and print
res = finnhub_client.stock_candles(
    symbol,
    resolution,
    from_date,
    to_date
)
display(res)
{'c': [151.73, 154.65, 151.92, 150.87, 151.01], 'h': [153.1, 155.23, 154.58, 154.33, 151.3401], 'l': [150.78, 150.64, 151.168, 150.42, 149.22], 'o': [152.575, 150.64, 153.88, 153.775, 149.46], 's': 'ok', 't': [1675641600, 1675728000, 1675814400, 1675900800, 1675987200], 'v': [69858306, 83322551, 64120079, 56007143, 57450708]}
The data comes down as a dictionary of lists. The docs state that prices are adjusted for splits. Spot checking some data against other sources suggests it’s also adjusted for dividends.
The dictionary keys consist of letters that represent:

- c: close prices
- h: high prices
- l: low prices
- o: open prices
- s: status
- t: timestamps
- v: volume traded
We’ll want to transform that response data into a pandas DataFrame:
# Ditch the status code
try:
    res.pop('s')
except KeyError as e:
    print("Already ditched status code")

# Create dataframe from remaining records
df = pd.DataFrame(res) \
    .rename(columns={
        'c': 'close',
        'h': 'high',
        'l': 'low',
        'o': 'open',
        't': 'timestamp',
        'v': 'volume'
    }) \
    .set_index(keys='timestamp')

# I like things in a certain order
df = df[['open', 'high', 'low', 'close', 'volume']]

# Convert index to human-readable date format
df.index = pd.to_datetime(df.index, unit='s')
display(df)
| timestamp | open | high | low | close | volume |
|---|---|---|---|---|---|
2023-02-06 | 152.575 | 153.1000 | 150.780 | 151.73 | 69858306 |
2023-02-07 | 150.640 | 155.2300 | 150.640 | 154.65 | 83322551 |
2023-02-08 | 153.880 | 154.5800 | 151.168 | 151.92 | 64120079 |
2023-02-09 | 153.775 | 154.3300 | 150.420 | 150.87 | 56007143 |
2023-02-10 | 149.460 | 151.3401 | 149.220 | 151.01 | 57450708 |
Looks good!
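Since the free tier caps each candle request at roughly a year of daily data, one workaround is to stitch several one-year requests together. Here’s a minimal sketch of that idea – fetch_daily_candles is my own hypothetical helper, not part of the client library – reusing the unix_timestamp_from_date helper from above:

# Hedged sketch: work around the free tier's one-year-per-call limit on daily
# candles by looping over (start, end) date pairs and concatenating the results
def fetch_daily_candles(symbol, year_ranges):
    '''year_ranges: list of (from_date, to_date) date-string pairs'''
    frames = []
    for start, end in year_ranges:
        res = finnhub_client.stock_candles(
            symbol,
            'D',
            unix_timestamp_from_date(start),
            unix_timestamp_from_date(end)
        )
        res.pop('s', None)  # drop the status code
        frames.append(pd.DataFrame(res))
        time.sleep(1.5)  # stay well inside the 50-calls-per-minute cap
    return pd.concat(frames, ignore_index=True)

# e.g. fetch_daily_candles('AAPL', [('2021-01-01', '2021-12-31'),
#                                   ('2022-01-01', '2022-12-31')])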
Fundamental data
The basic company financials endpoint serves quite a lot of data. The response object is a dictionary of dictionaries that includes current values for various metrics (under the metric outer key) as well as historical point-in-time snapshots (under the series outer key).
# Get basic company financials
financials = finnhub_client.company_basic_financials(symbol, 'all')
# Outer keys of response object
display(financials.keys())
dict_keys(['metric', 'metricType', 'series', 'symbol'])
Current metrics
Let’s check out how many current metrics we have and display the first ten:
# current metrics
# how many?
print(f"Response contains {len(financials['metric'].keys())} \
current fundamental data points\n")
# display the first ten
display(list(financials['metric'].items())[:10])
Response contains 117 current fundamental data points
[('10DayAverageTradingVolume', 64.76467), ('13WeekPriceReturnDaily', 0.83284), ('26WeekPriceReturnDaily', -11.05993), ('3MonthAverageTradingVolume', 1528.63578), ('52WeekHigh', 179.61), ('52WeekHighDate', '2022-03-30'), ('52WeekLow', 124.18), ('52WeekLowDate', '2023-01-03'), ('52WeekPriceReturnDaily', -9.66959), ('5DayPriceReturnDaily', 1.0198)]
That looks quite comprehensive! For this ticker, we have values for 117 fundamental metrics.
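If you prefer pandas here too, the flat metric dictionary drops straight into a Series, which makes filtering and sorting easy (a small convenience sketch, not something the endpoint does for you):

# Current metrics as a pandas Series for easy filtering and sorting
current_metrics = pd.Series(financials['metric'])

# e.g. pull out all the 52-week fields
display(current_metrics.filter(like='52Week'))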
Historical point-in-time snapshots
The historical point-in-time data is under the series outer key. The data is further organised into annual and quarterly metrics, like this:
{
    'annual': {
        'annual_metric1': [
            {'period': date_string, 'v': value},
            {'period': date_string, 'v': value},
            ...
        ],
        'annual_metric2': [
            {'period': date_string, 'v': value},
            {'period': date_string, 'v': value},
            ...
        ],
        ...
    },
    'quarterly': {
        'quarterly_metric1': [
            {'period': date_string, 'v': value},
            {'period': date_string, 'v': value},
            ...
        ],
        'quarterly_metric2': [
            {'period': date_string, 'v': value},
            {'period': date_string, 'v': value},
            ...
        ],
        ...
    }
}
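Given that shape, a generic helper can flatten either frequency into a tidy long-form table before pivoting it wide. This is just a sketch of the idea (series_to_long is my own name, not part of the library) – below we’ll build the wide dataframe a slightly different way:

# Hedged sketch: flatten one frequency of the 'series' payload into
# long-form (metric, period, value) rows, then pivot wide if needed
def series_to_long(series, freq='annual'):
    rows = [
        {'metric': metric, 'period': obs['period'], 'value': obs['v']}
        for metric, observations in series[freq].items()
        for obs in observations
    ]
    return pd.DataFrame(rows)

# long_df = series_to_long(financials['series'], freq='annual')
# wide_df = long_df.pivot(index='period', columns='metric', values='value')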
Let’s first look at what annual series we get:
# point-in-time values
display(financials['series']['annual'].keys())
dict_keys(['bookValue', 'cashRatio', 'currentRatio', 'ebitPerShare', 'eps', 'ev', 'fcfMargin', 'grossMargin', 'longtermDebtTotalAsset', 'longtermDebtTotalCapital', 'longtermDebtTotalEquity', 'netDebtToTotalCapital', 'netDebtToTotalEquity', 'netMargin', 'operatingMargin', 'pb', 'pe', 'pfcf', 'pretaxMargin', 'ps', 'quickRatio', 'roa', 'roe', 'roic', 'rotc', 'salesPerShare', 'sgaToSale', 'totalDebtToEquity', 'totalDebtToTotalAsset', 'totalDebtToTotalCapital', 'totalRatio'])
Again, pretty comprehensive. Let’s put that data into a dataframe:
# Make a list of the available annual metrics
annual_metric_names = list(financials['series']['annual'].keys())
# Make a list of dataframes of each metric indexed by period and named appropriately
dfs = [
    pd.DataFrame(financials['series']['annual'][metric]) \
        .rename(columns={'v': metric}) \
        .set_index(keys='period') \
    for metric in annual_metric_names
]
# Join dataframes on index
annual_metrics = dfs[0].join(dfs[1:], how='outer')
annual_metrics
| period | bookValue | cashRatio | currentRatio | ebitPerShare | eps | ev | fcfMargin | grossMargin | longtermDebtTotalAsset | longtermDebtTotalCapital | … | roa | roe | roic | rotc | salesPerShare | sgaToSale | totalDebtToEquity | totalDebtToTotalAsset | totalDebtToTotalCapital | totalRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2022-09-24 | 50672 | 0.153563 | 0.8794 | 7.3158 | 6.1132 | 2514884.00 | 0.3369 | 0.4331 | 0.2805 | 0.5764 | … | 0.2829 | 1.9696 | 0.5813 | 0.6957 | 24.1536 | 0.5669 | 2.3881 | 0.3430 | 0.7048 | 1.1677 |
2021-09-25 | 63090 | 0.278449 | 1.0746 | 6.4601 | 5.6140 | 2493617.00 | 0.3147 | 0.4178 | 0.3108 | 0.5783 | … | 0.2697 | 1.5007 | 0.5019 | 0.5775 | 21.6909 | 0.5822 | 1.9903 | 0.3577 | 0.6656 | 1.2191 |
2020-09-26 | 65339 | 0.360710 | 1.3636 | 3.7818 | 3.2754 | 2041191.00 | 0.3205 | 0.3823 | 0.3046 | 0.5530 | … | 0.1773 | 0.8787 | 0.3217 | 0.3715 | 15.6615 | 0.6177 | 1.7309 | 0.3492 | 0.6338 | 1.2527 |
2019-09-28 | 90488 | 0.462022 | 1.5401 | 3.4378 | 2.9714 | 1071363.00 | 0.3070 | 0.3782 | 0.2712 | 0.4624 | … | 0.1632 | 0.6106 | 0.2783 | 0.3220 | 13.9909 | 0.6218 | 1.1940 | 0.3192 | 0.5442 | 1.3648 |
2018-09-29 | 107147 | 0.223525 | 1.1329 | 3.5449 | 2.9766 | 1186220.00 | 0.3417 | 0.3834 | 0.2563 | 0.4229 | … | 0.1628 | 0.5556 | 0.2686 | 0.3199 | 13.2797 | 0.6166 | 1.0685 | 0.3130 | 0.5166 | 1.4144 |
2017-09-30 | 134047 | 0.201252 | 1.2761 | 2.9202 | 2.3017 | 889855.00 | 0.3345 | 0.3847 | 0.2590 | 0.3893 | … | 0.1288 | 0.3607 | 0.1936 | 0.2456 | 10.9123 | 0.6153 | 0.8630 | 0.3082 | 0.4632 | 1.5556 |
2016-09-24 | 128249 | 0.259271 | 1.3527 | 2.7282 | 2.0766 | 674795.00 | 0.3662 | 0.3908 | 0.2345 | 0.3504 | … | 0.1420 | 0.3562 | 0.2122 | 0.2788 | 9.8013 | 0.6092 | 0.6786 | 0.2705 | 0.4043 | 1.6630 |
2015-09-26 | 119355 | 0.262002 | 1.1088 | 3.0740 | 2.3042 | 670483.00 | 0.3969 | 0.4006 | 0.1837 | 0.2903 | … | 0.1839 | 0.4474 | 0.2907 | 0.3878 | 10.0861 | 0.5994 | 0.5390 | 0.2216 | 0.3502 | 1.6980 |
2014-09-27 | 111547 | 0.218194 | 1.0801 | 2.1438 | 1.6132 | 608712.00 | 0.3803 | 0.3859 | 0.1250 | 0.1974 | … | 0.1704 | 0.3542 | 0.2691 | 0.3575 | 7.4638 | 0.6141 | 0.3164 | 0.1522 | 0.2404 | 1.9273 |
2013-09-28 | 123549 | 0.326607 | 1.6786 | 1.8784 | 1.4198 | 435827.00 | 0.3671 | 0.3762 | 0.0819 | 0.1207 | … | 0.1789 | 0.2998 | 0.2636 | 0.3487 | 6.5518 | 0.6238 | 0.1373 | 0.0819 | 0.1207 | 2.4805 |
2012-09-29 | 118210 | 0.278813 | 1.4958 | 2.0870 | 1.5767 | 607370.00 | 0.3850 | 0.4387 | 0.0000 | 0.0000 | … | 0.2370 | 0.3530 | 0.3530 | 0.4673 | 5.9129 | 0.5613 | 0.0000 | 0.0000 | 0.0000 | 3.0432 |
2011-09-24 | 76615 | 0.350912 | 1.6084 | 1.2884 | 0.9884 | 364842.00 | 0.4155 | 0.4048 | 0.0000 | 0.0000 | … | 0.2228 | 0.3383 | 0.3383 | 0.4410 | 4.1275 | 0.5952 | 0.0000 | 0.0000 | 0.0000 | 2.9271 |
2010-09-25 | 47791 | 0.543432 | 2.0113 | 0.7101 | 0.5412 | 254736.00 | 0.3176 | 0.3938 | 0.0000 | 0.0000 | … | 0.1864 | 0.2932 | 0.2932 | 0.3847 | 2.5191 | 0.6062 | 0.0000 | 0.0000 | 0.0000 | 2.7447 |
2009-09-26 | 31640 | 0.457414 | 2.7425 | 0.4623 | 0.3243 | 161495.00 | 0.2651 | 0.4014 | 0.0000 | 0.0000 | … | 0.1734 | 0.2603 | 0.2603 | 0.3710 | 1.6894 | 0.5986 | 0.0000 | 0.0000 | 0.0000 | 2.9948 |
2008-09-27 | 22297 | 1.045242 | 2.6411 | 0.3297 | 0.2422 | 81372.30 | 0.2879 | 0.3520 | 0.0000 | 0.0000 | … | 0.1692 | 0.2744 | 0.2744 | 0.3735 | 1.4842 | 0.6480 | 0.0000 | 0.0000 | 0.0000 | 2.6071 |
2007-09-29 | 14532 | 1.007759 | 2.3659 | 0.1770 | 0.1404 | 126607.00 | 0.2627 | 0.3317 | 0.0000 | 0.0000 | … | 0.1379 | 0.2405 | 0.2405 | 0.3033 | 0.9871 | 0.6683 | 0.0000 | 0.0000 | 0.0000 | 2.3437 |
2006-09-30 | 9984 | 0.992084 | 2.2519 | 0.0998 | 0.0809 | 57277.00 | 0.1490 | 0.2898 | 0.0000 | 0.0000 | … | 0.1156 | 0.1992 | 0.1992 | 0.2457 | 0.7861 | 0.7102 | 0.0000 | 0.0000 | 0.0000 | 2.3826 |
2005-09-24 | 7428 | 1.001147 | 2.9538 | 0.0685 | 0.0553 | 41187.00 | 0.2006 | 0.2901 | 0.0000 | 0.0000 | … | 0.1153 | 0.1788 | 0.1788 | 0.2212 | 0.5806 | 0.7099 | 0.0000 | 0.0000 | 0.0000 | 2.8170 |
2004-09-25 | 5076 | 1.119955 | 2.6613 | 0.0144 | 0.0123 | 11589.80 | 0.1341 | 0.2726 | 0.0000 | 0.0000 | … | 0.0330 | 0.0524 | 0.0524 | 0.0617 | 0.3816 | 0.7274 | 0.0000 | 0.0000 | 0.0000 | 2.7068 |
2003-09-27 | 4223 | 1.440815 | 2.4977 | 0.0000 | 0.0034 | 4629.27 | 0.0730 | 0.2752 | 0.0000 | 0.0000 | … | 0.0101 | 0.0163 | 0.0152 | NaN | 0.3050 | 0.7248 | 0.0720 | 0.0446 | 0.0672 | 2.6292 |
2002-09-28 | 4095 | 1.358263 | 3.2497 | 0.0008 | 0.0032 | 3267.85 | 0.0458 | 0.2792 | 0.0502 | 0.0716 | … | 0.0103 | 0.0159 | 0.0147 | 0.0039 | 0.2834 | 0.7208 | 0.0772 | 0.0502 | 0.0716 | 2.8588 |
2001-09-29 | 3920 | 1.521739 | 3.3880 | -0.0178 | -0.0013 | 3384.29 | 0.0778 | 0.2303 | 0.0526 | 0.0748 | … | NaN | NaN | NaN | NaN | 0.2771 | 0.7697 | 0.0809 | 0.0526 | 0.0748 | 2.8658 |
2000-09-30 | 4107 | 0.616141 | 2.8076 | 0.0259 | 0.0390 | 6989.50 | 0.1265 | 0.2713 | 0.0441 | 0.0681 | … | 0.1155 | 0.1914 | 0.1784 | 0.1184 | 0.3956 | 0.7287 | 0.0730 | 0.0441 | 0.0681 | 2.5234 |
23 rows × 31 columns
Pretty handy… we have 23 years’ worth of annual company fundamentals.
Let’s do the same thing with the quarterly series. First, what metrics do we get?
# point-in-time values
display(financials['series']['quarterly'].keys())
dict_keys(['bookValue', 'cashRatio', 'currentRatio', 'ebitPerShare', 'eps', 'ev', 'fcfMargin', 'fcfPerShareTTM', 'grossMargin', 'longtermDebtTotalAsset', 'longtermDebtTotalCapital', 'longtermDebtTotalEquity', 'netDebtToTotalCapital', 'netDebtToTotalEquity', 'netMargin', 'operatingMargin', 'pb', 'peTTM', 'pfcfTTM', 'pretaxMargin', 'psTTM', 'quickRatio', 'roaTTM', 'roeTTM', 'roicTTM', 'rotcTTM', 'salesPerShare', 'sgaToSale', 'totalDebtToEquity', 'totalDebtToTotalAsset', 'totalDebtToTotalCapital', 'totalRatio'])
Let’s stick this data into a dataframe:
# Make a list of the available quarterly metrics
quarterly_metric_names = list(financials['series']['quarterly'].keys())
# Make a list of dataframes indexed by period and named by metric
dfs = [
    pd.DataFrame(financials['series']['quarterly'][metric]) \
        .rename(columns={'v': metric}) \
        .set_index(keys='period') \
    for metric in quarterly_metric_names
]
# Join dataframes on index
quarterly_metrics = dfs[0].join(dfs[1:], how='outer')
quarterly_metrics
| period | bookValue | cashRatio | currentRatio | ebitPerShare | eps | ev | fcfMargin | fcfPerShareTTM | grossMargin | longtermDebtTotalAsset | … | roaTTM | roeTTM | roicTTM | rotcTTM | salesPerShare | sgaToSale | totalDebtToEquity | totalDebtToTotalAsset | totalDebtToTotalCapital | totalRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2022-12-31 | 56727 | 0.149578 | 0.9380 | 2.2572 | 1.8801 | 2157515.00 | 0.3226 | 7.6303 | 0.4296 | 0.2873 | … | 0.2746 | 1.6345 | 0.5409 | 0.6478 | 7.3424 | 0.5704 | 1.9587 | 0.3204 | 0.6620 | 1.1956 |
2022-09-24 | 50672 | 0.153563 | 0.8794 | 1.5444 | 1.2855 | 2513943.00 | 0.3041 | 8.3332 | 0.4226 | 0.2805 | … | 0.2810 | 1.6090 | 0.5464 | 0.6539 | 5.5927 | 0.5774 | 2.3695 | 0.3404 | 0.7032 | 1.1677 |
2022-06-25 | 58107 | 0.211761 | 0.8646 | 1.4190 | 1.1955 | 2384979.00 | 0.3013 | 8.0064 | 0.4326 | 0.2816 | … | 0.2808 | 1.5297 | 0.5324 | 0.6323 | 5.1013 | 0.5674 | 2.0598 | 0.3559 | 0.6732 | 1.2089 |
2022-03-26 | 67399 | 0.220363 | 0.9268 | 1.8277 | 1.5247 | 2957573.00 | 0.3154 | 7.8393 | 0.4375 | 0.2947 | … | 0.2886 | 1.5288 | 0.5387 | 0.6309 | 5.9305 | 0.5625 | 1.7802 | 0.3422 | 0.6403 | 1.2379 |
2021-12-25 | 71932 | 0.251528 | 1.0378 | 2.5115 | 2.0964 | 3032429.00 | 0.4015 | 7.5044 | 0.4376 | 0.2797 | … | 0.2875 | 1.4981 | 0.5290 | 0.6151 | 7.5032 | 0.5624 | 1.7071 | 0.3221 | 0.6306 | 1.2326 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2000-12-30 | 3712 | 1.061087 | 3.0092 | -0.0222 | -0.0103 | 3570.24 | 0.0089 | 0.0314 | -0.0209 | 0.0520 | … | NaN | NaN | NaN | NaN | 0.0533 | 1.0209 | 0.0838 | 0.0520 | 0.0773 | 2.6324 |
2000-09-30 | 4107 | 0.616141 | 2.8076 | 0.0041 | 0.0084 | 6989.50 | 0.1422 | 0.0537 | 0.2497 | 0.0441 | … | 0.1110 | 0.1859 | 0.1736 | 0.1153 | 0.0923 | 0.7503 | 0.0730 | 0.0441 | 0.0681 | 2.5234 |
2000-07-01 | 4176 | 0.717565 | 2.7533 | 0.0083 | 0.0099 | 16308.50 | 0.1342 | NaN | 0.2975 | 0.0433 | … | NaN | NaN | NaN | NaN | 0.0901 | 0.7025 | 0.0718 | 0.0433 | 0.0670 | 2.5152 |
2000-04-01 | 4215 | 0.896924 | 2.6508 | 0.0083 | 0.0114 | 20122.50 | 0.0452 | NaN | 0.2823 | 0.0428 | … | NaN | NaN | NaN | NaN | 0.0954 | 0.7177 | 0.0712 | 0.0428 | 0.0664 | 2.5097 |
2000-01-01 | 4414 | 0.807125 | 2.4982 | 0.0050 | 0.0092 | 16753.70 | 0.1754 | NaN | 0.2591 | 0.0395 | … | NaN | NaN | NaN | NaN | 0.1172 | 0.7409 | 0.0680 | 0.0395 | 0.0636 | 2.3916 |
93 rows × 32 columns
Looks like we have essentially the same data as for the annual series, but at a quarterly resolution. Neat!
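As a quick sanity check, it’s easy to plot one of these series over time – a throwaway sketch using grossMargin, one of the keys we saw above:

# Plot a single quarterly series to eyeball the history
gm = quarterly_metrics['grossMargin'].copy()
gm.index = pd.to_datetime(gm.index)  # 'period' strings -> datetimes
gm.sort_index().plot(title='AAPL quarterly gross margin')
plt.show()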
Restatements
One thing to be aware of with this sort of data is that sometimes companies will “restate” their reported financials after the initial filing, usually to correct an error.
Using a dataset based on restatements can introduce some bias into an analysis because you’d be using data that you didn’t have available at the time you thought you had it.
To correct for that, you’d need an “as reported” dataset – one that records the information that was available at each point in time, without retrospective corrections.
The finnhub.io API docs don’t specify whether the endpoint I used here contains restatements. However, I suspect that it doesn’t because there’s an additional endpoint called “Financials As Reported”, which isn’t available on the free tier.
Alternative data
finnhub.io serves some interesting alternative data:
- Insider sentiment
- Social media sentiment
- Senate lobbying disclosures
- Government spending
- Visa applications submitted by companies
- Many others
Example: insider sentiment
The “Insider Sentiment” endpoint serves the Monthly Share Purchase Ratio (MSPR), which is a measure of whether insider transactions are dominated by buying (a value closer to 100) or selling (a value closer to -100) over the previous month. You can read about the MSPR and how it’s calculated here.
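As a loose illustration of a purchase-ratio style metric – not necessarily finnhub.io’s exact formula, see their docs linked above – a net-buying measure bounded between -100 and 100 can be built like this:

# Illustrative only: a generic purchase ratio of the form
# 100 * (buys - sells) / (buys + sells), bounded between -100 and 100.
# Finnhub's exact MSPR construction may differ.
def purchase_ratio(shares_bought, shares_sold):
    total = shares_bought + shares_sold
    return 100 * (shares_bought - shares_sold) / total if total else 0.0

print(purchase_ratio(3000, 1000))  # 50.0 -> net buying
print(purchase_ratio(500, 4500))   # -80.0 -> net selling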
For now, let’s just grab 10 years’ worth of this data for a single stock and plot it against the stock’s forward returns, on the understanding that we don’t have enough data to say anything meaningful (around 120 observations total).
This API can understand human-readable date strings, unlike the prices API.
# Get ten years' worth of insider sentiment data
# api query parameters
symbol = 'AAPL'
from_date = '2013-01-01'
to_date = '2023-02-21'
is_response = finnhub_client.stock_insider_sentiment(
    symbol,
    from_date,
    to_date
)
# Print the first few records
display(is_response['data'][:4])
[{'symbol': 'AAPL', 'year': 2013, 'month': 1, 'change': 246, 'mspr': 0.2078967}, {'symbol': 'AAPL', 'year': 2013, 'month': 2, 'change': 3934, 'mspr': 42.06587}, {'symbol': 'AAPL', 'year': 2013, 'month': 3, 'change': -26460, 'mspr': -20.025429}, {'symbol': 'AAPL', 'year': 2013, 'month': 4, 'change': -9818, 'mspr': -21.024454}]
# Make a dataframe from insider sentiment data
is_df = pd.DataFrame(is_response['data'])
is_df.head(10)
| | symbol | year | month | change | mspr |
|---|---|---|---|---|---|
0 | AAPL | 2013 | 1 | 246 | 0.207897 |
1 | AAPL | 2013 | 2 | 3934 | 42.065870 |
2 | AAPL | 2013 | 3 | -26460 | -20.025429 |
3 | AAPL | 2013 | 4 | -9818 | -21.024454 |
4 | AAPL | 2013 | 6 | -321301 | -28.045816 |
5 | AAPL | 2013 | 7 | -79734 | -48.365260 |
6 | AAPL | 2013 | 8 | -75200 | -34.034233 |
7 | AAPL | 2013 | 9 | -28171 | -31.069471 |
8 | AAPL | 2013 | 10 | 1930 | 3.064367 |
9 | AAPL | 2013 | 11 | -4486 | -100.000000 |
Notice that we have some missing data – above we don’t have a record for month 5. We’ll need to deal with that later.
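One quick way to see exactly which months are missing is to compare against a complete monthly range (a small convenience check, not part of the original response):

# Find the year/month combinations with no insider sentiment record
have = pd.to_datetime(is_df[['year', 'month']].assign(day=1)).dt.to_period('M')
full_range = pd.period_range('2013-01', '2023-02', freq='M')
print(full_range.difference(pd.PeriodIndex(have)))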
I’ll also need to lag the mspr field (the sentiment score) to align it with the date on which it would have been actionable. We’ll do this once we have some price data.
Next we’ll grab some monthly price data and munge it for our needs:
resolution = 'M'
unix_from_date = unix_timestamp_from_date(from_date)
unix_to_date = unix_timestamp_from_date(to_date)
price_response = finnhub_client.stock_candles(
    symbol,
    resolution,
    unix_from_date,
    unix_to_date
)
# Ditch the status code
try:
    price_response.pop('s')
except KeyError as e:
    print("Already ditched status code")

# Create dataframe from remaining records
price_df = pd.DataFrame(price_response) \
    .rename(columns={
        'c': 'close',
        'h': 'high',
        'l': 'low',
        'o': 'open',
        't': 'timestamp',
        'v': 'volume'
    })
# Ditch stuff we won't need
price_df = price_df[['timestamp', 'close']]
# Convert timestamp to human-readable date format
price_df['date'] = pd.to_datetime(price_df['timestamp'], unit='s')
# create year and month fields
price_df['year'] = price_df['date'].dt.year
price_df['month'] = price_df['date'].dt.month
price_df.head()
| | timestamp | close | date | year | month |
|---|---|---|---|---|---|
0 | 1356998400 | 16.267479 | 2013-01-01 | 2013 | 1 |
1 | 1359676800 | 15.764271 | 2013-02-01 | 2013 | 2 |
2 | 1362096000 | 15.809271 | 2013-03-01 | 2013 | 3 |
3 | 1364774400 | 15.813561 | 2013-04-01 | 2013 | 4 |
4 | 1367366400 | 16.061950 | 2013-05-01 | 2013 | 5 |
You can see that our data isn’t going to overlap neatly. There are some months in our sentiment dataframe that are missing from our prices dataframe and vice versa.
I’ll merge the two dataframes using an outer join – this will preserve all of our data and insert NaN into any record that’s missing.
From there, I can lag the sentiment score appropriately. Only then can I remove records that contain NaN (doing so prior would misalign some of our data).
df = pd.merge(price_df, is_df, on=['year', 'month'], how='outer') \
    .sort_values(['year', 'month']) \
    .set_index(keys='date')
df.head(10)
| date | timestamp | close | year | month | symbol | change | mspr |
|---|---|---|---|---|---|---|---|
2013-01-01 | 1356998400 | 16.267479 | 2013 | 1 | AAPL | 246.0 | 0.207897 |
2013-02-01 | 1359676800 | 15.764271 | 2013 | 2 | AAPL | 3934.0 | 42.065870 |
2013-03-01 | 1362096000 | 15.809271 | 2013 | 3 | AAPL | -26460.0 | -20.025429 |
2013-04-01 | 1364774400 | 15.813561 | 2013 | 4 | AAPL | -9818.0 | -21.024454 |
2013-05-01 | 1367366400 | 16.061950 | 2013 | 5 | NaN | NaN | NaN |
2013-06-01 | 1370044800 | 14.161771 | 2013 | 6 | AAPL | -321301.0 | -28.045816 |
2013-07-01 | 1372636800 | 16.161771 | 2013 | 7 | AAPL | -79734.0 | -48.365260 |
2013-08-01 | 1375315200 | 17.400550 | 2013 | 8 | AAPL | -75200.0 | -34.034233 |
2013-09-01 | 1377993600 | 17.026771 | 2013 | 9 | AAPL | -28171.0 | -31.069471 |
2013-10-01 | 1380585600 | 18.667911 | 2013 | 10 | AAPL | 1930.0 | 3.064367 |
Now we can calculate close-to-close monthly returns and lag the sentiment score:
# Calculate monthly close-to-close returns
df['return'] = np.log(df.close/df.close.shift())
# Lag sentiment score
df['mspr_lag'] = df['mspr'].shift()
df.head(10)
| date | timestamp | close | year | month | symbol | change | mspr | return | mspr_lag |
|---|---|---|---|---|---|---|---|---|---|
2013-01-01 | 1356998400 | 16.267479 | 2013 | 1 | AAPL | 246.0 | 0.207897 | NaN | NaN |
2013-02-01 | 1359676800 | 15.764271 | 2013 | 2 | AAPL | 3934.0 | 42.065870 | -0.031422 | 0.207897 |
2013-03-01 | 1362096000 | 15.809271 | 2013 | 3 | AAPL | -26460.0 | -20.025429 | 0.002850 | 42.065870 |
2013-04-01 | 1364774400 | 15.813561 | 2013 | 4 | AAPL | -9818.0 | -21.024454 | 0.000271 | -20.025429 |
2013-05-01 | 1367366400 | 16.061950 | 2013 | 5 | NaN | NaN | NaN | 0.015585 | -21.024454 |
2013-06-01 | 1370044800 | 14.161771 | 2013 | 6 | AAPL | -321301.0 | -28.045816 | -0.125907 | NaN |
2013-07-01 | 1372636800 | 16.161771 | 2013 | 7 | AAPL | -79734.0 | -48.365260 | 0.132102 | -28.045816 |
2013-08-01 | 1375315200 | 17.400550 | 2013 | 8 | AAPL | -75200.0 | -34.034233 | 0.073853 | -48.365260 |
2013-09-01 | 1377993600 | 17.026771 | 2013 | 9 | AAPL | -28171.0 | -31.069471 | -0.021715 | -34.034233 |
2013-10-01 | 1380585600 | 18.667911 | 2013 | 10 | AAPL | 1930.0 | 3.064367 | 0.092019 | -31.069471 |
An easy way to look for a relationship between this sentiment metric and next month’s returns is to just scatter plot them:
sns.lmplot(data=df, x='mspr_lag', y='return', height=4, aspect=1.5);
Unfortunately, we don’t see much of any kind of relationship.
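To put a rough number on that, we can look at the rank correlation between the lagged sentiment score and the next month’s return (pandas handles the NaNs pairwise):

# Spearman rank correlation between lagged sentiment and monthly return
print(df[['mspr_lag', 'return']].corr(method='spearman'))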
The sentiment metric does jump around quite a bit – let’s see if smoothing it with a 3-month moving average reveals anything interesting:
df['mspr_roll'] = df['mspr_lag'].rolling(window=3).mean()
df
| date | timestamp | close | year | month | symbol | change | mspr | return | mspr_lag | mspr_roll |
|---|---|---|---|---|---|---|---|---|---|---|
2013-01-01 | 1356998400 | 16.267479 | 2013 | 1 | AAPL | 246.0 | 0.207897 | NaN | NaN | NaN |
2013-02-01 | 1359676800 | 15.764271 | 2013 | 2 | AAPL | 3934.0 | 42.065870 | -0.031422 | 0.207897 | NaN |
2013-03-01 | 1362096000 | 15.809271 | 2013 | 3 | AAPL | -26460.0 | -20.025429 | 0.002850 | 42.065870 | NaN |
2013-04-01 | 1364774400 | 15.813561 | 2013 | 4 | AAPL | -9818.0 | -21.024454 | 0.000271 | -20.025429 | 7.416113 |
2013-05-01 | 1367366400 | 16.061950 | 2013 | 5 | NaN | NaN | NaN | 0.015585 | -21.024454 | 0.338662 |
… | … | … | … | … | … | … | … | … | … | … |
2022-10-01 | 1664582400 | 153.340000 | 2022 | 10 | AAPL | -1297681.0 | -30.305650 | 0.103956 | 100.000000 | NaN |
2022-11-01 | 1667260800 | 148.030000 | 2022 | 11 | AAPL | -20200.0 | -100.000000 | -0.035243 | -30.305650 | -3.333313 |
2022-12-01 | 1669852800 | 129.930000 | 2022 | 12 | NaN | NaN | NaN | -0.130419 | -100.000000 | -10.101883 |
2023-01-01 | 1672531200 | 144.290000 | 2023 | 1 | NaN | NaN | NaN | 0.104829 | NaN | NaN |
2023-02-01 | 1675209600 | 152.550000 | 2023 | 2 | AAPL | -1685.0 | -5.882353 | 0.055667 | NaN | NaN |
122 rows × 10 columns
sns.lmplot(data=df, x='mspr_roll', y='return', height=4, aspect=1.5);
The relationship is still super noisy. If anything, it’s a little opposite to what we’d expect, especially at the extreme values of our smoothed metric.
Maybe the SEC is succeeding in stamping out insider trading.
The reality is that insiders buy and sell for all sorts of reasons, many of which have nothing to do with their views on the stock itself.
As I mentioned, there aren’t a whole lot of data points here – perhaps this would be interesting to revisit using a broader universe of stocks.
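If you wanted to take that idea further, a simple loop over a list of tickers would do it. Here’s a rough sketch – the universe is hypothetical, and the sleep keeps us inside the free tier’s 50-calls-per-minute limit:

# Hedged sketch: gather insider sentiment across a small universe of stocks
symbols = ['AAPL', 'MSFT', 'AMZN', 'GOOGL']  # hypothetical universe
frames = []
for sym in symbols:
    resp = finnhub_client.stock_insider_sentiment(sym, '2013-01-01', '2023-02-21')
    frames.append(pd.DataFrame(resp['data']))
    time.sleep(1.5)  # respect the rate limit
universe_df = pd.concat(frames, ignore_index=True)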
Conclusion
finnhub.io is a modern market data service with a well-documented REST API, open source client libraries, broad coverage, interesting data sets, and a generous free tier.
What’s your preferred source of market data and why? Let me know in the comments!
Thanks Kris. I enjoyed this blog and found it interesting how much data there is and how reasonably easy it is to fetch, process, and do some visual analysis of it.
Looking forward to your next blog.
Cheers,
Alen
Thanks for reading Alen! Same here – it’s such a luxury having all that data at your fingertips and so easy to work with.
Thanks Kris, enjoyed the post. Suss on the value of the alternative data, but we really won’t know till it’s tested.
Hey Steven, yeah I agree. My personal experience is that much of that stuff tends to be correlated with contemporaneous returns, but not very useful in trying to predict forward returns. I’m sure there are exceptions out there, as well as novel use cases.
Thank you!
Very useful – I’m kind of standing on two legs these days. I learned a ton of R coding during your classes in RW, but I also see the use of doing stuff in Python from time to time. I’m currently working with Airflow (thanks to Ajet of RW), where Python comes in handy as well, although running R code with Airflow also works using a sidecar.
Wes
Hey Wes! Thanks for the comment. I’m with you on that one… I like doing data analysis in R more than any other language I’ve tried, but for data plumbing type applications, Python is usually my first choice.