Remote Data Access¶
Warning
In pandas 0.17.0, the sub-package pandas.io.data
will be removed in favor of a separately installable pandas-datareader package. This will allow the data modules to be independently updated to your pandas installation. The API for pandas-datareader v0.1.1
is the same as in pandas v0.16.1
. (GH8961)
You should replace the imports of the following:
from pandas.io import data, wb
With:
from pandas_datareader import data, wb
Functions from pandas.io.data
and pandas.io.ga
extract data from various Internet sources into a DataFrame. Currently the following sources are supported:
It should be noted, that various sources support different kinds of data, so not all sources implement the same methods and the data elements returned might also differ.
Yahoo! Finance¶
In [1]: import pandas.io.data as web
In [2]: import datetime
In [3]: start = datetime.datetime(2010, 1, 1)
In [4]: end = datetime.datetime(2013, 1, 27)
In [5]: f = web.DataReader("F", 'yahoo', start, end)
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-5-679b50ab962b> in <module>()
----> 1 f = web.DataReader("F", 'yahoo', start, end)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
89 return get_data_yahoo(symbols=name, start=start, end=end,
90 adjust_price=False, chunksize=25,
---> 91 retry_count=retry_count, pause=pause)
92 elif data_source == "google":
93 return get_data_google(symbols=name, start=start, end=end,
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_yahoo(symbols, start, end, retry_count, pause, adjust_price, ret_index, chunksize, interval)
436 raise ValueError("Invalid interval: valid values are 'd', 'w', 'm' and 'v'")
437 return _get_data_from(symbols, start, end, interval, retry_count, pause,
--> 438 adjust_price, ret_index, chunksize, 'yahoo')
439
440
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_data_from(symbols, start, end, interval, retry_count, pause, adjust_price, ret_index, chunksize, source)
377 # If a single symbol, (e.g., 'GOOG')
378 if isinstance(symbols, (compat.string_types, int)):
--> 379 hist_data = src_fn(symbols, start, end, interval, retry_count, pause)
380 # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
381 elif isinstance(symbols, DataFrame):
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_hist_yahoo(sym, start, end, interval, retry_count, pause)
220 '&g=%s' % interval +
221 '&ignore=.csv')
--> 222 return _retry_read_url(url, retry_count, pause, 'Yahoo!')
223
224
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _retry_read_url(url, retry_count, pause, name)
197
198 raise IOError("after %d tries, %s did not "
--> 199 "return a 200 for url %r" % (retry_count, name, url))
200
201
IOError: after 3 tries, Yahoo! did not return a 200 for url 'http://ichart.finance.yahoo.com/table.csv?s=F&a=0&b=1&c=2010&d=0&e=27&f=2013&g=d&ignore=.csv'
In [6]: f.ix['2010-01-04']
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-a881e6f45410> in <module>()
----> 1 f.ix['2010-01-04']
AttributeError: 'file' object has no attribute 'ix'
Yahoo! Finance Options¶
*Experimental*
The Options
class allows the download of options data from Yahoo! Finance.
The get_all_data
method downloads and caches option data for all expiry months
and provides a formatted DataFrame
with a hierarchical index, so it is easy to get
to the specific option you want.
In [7]: from pandas.io.data import Options
In [8]: aapl = Options('aapl', 'yahoo')
In [9]: data = aapl.get_all_data()
---------------------------------------------------------------------------
RemoteDataError Traceback (most recent call last)
<ipython-input-9-a2dc6d91ec5f> in <module>()
----> 1 data = aapl.get_all_data()
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_all_data(self, call, put)
1130
1131 try:
-> 1132 expiry_dates = self.expiry_dates
1133 except AttributeError:
1134 expiry_dates, _ = self._get_expiry_dates_and_links()
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
1161 expiry_dates = self._expiry_dates
1162 except AttributeError:
-> 1163 expiry_dates, _ = self._get_expiry_dates_and_links()
1164 return expiry_dates
1165
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
1176
1177 url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178 root = self._parse_url(url)
1179
1180 try:
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
1208 except _network_error_classes:
1209 raise RemoteDataError("Unable to parse URL "
-> 1210 "{0!r}".format(url))
1211 else:
1212 root = doc.getroot()
RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'
In [10]: data.iloc[0:5, 0:5]
Out[10]:
id Date Col_1 Col_2 Col_3
0 26 2010-10-18 X 27.50 True
1 42 2010-10-19 Y -12.50 False
2 63 2010-10-20 Z 5.73 True
# Show the $100 strike puts at all expiry dates:
In [11]: data.loc[(100, slice(None), 'put'),:].iloc[0:5, 0:5]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-b8b1bd8f16d0> in <module>()
----> 1 data.loc[(100, slice(None), 'put'),:].iloc[0:5, 0:5]
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in __getitem__(self, key)
1223 def __getitem__(self, key):
1224 if type(key) is tuple:
-> 1225 return self._getitem_tuple(key)
1226 else:
1227 return self._getitem_axis(key, axis=0)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
736 def _getitem_tuple(self, tup):
737 try:
--> 738 return self._getitem_lowerdim(tup)
739 except IndexingError:
740 pass
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_lowerdim(self, tup)
861 for i, key in enumerate(tup):
862 if is_label_like(key) or isinstance(key, tuple):
--> 863 section = self._getitem_axis(key, axis=i)
864
865 # we have yielded a scalar ?
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
1359 raise ValueError('Cannot index with multidimensional key')
1360
-> 1361 return self._getitem_iterable(key, axis=axis)
1362
1363 # nested tuple slicing
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_iterable(self, key, axis)
969 def _getitem_iterable(self, key, axis=0):
970 if self._should_validate_iterable(axis):
--> 971 self._has_valid_type(key, axis)
972
973 labels = self.obj._get_axis(axis)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _has_valid_type(self, key, axis)
1304
1305 # TODO: don't check the entire key unless necessary
-> 1306 if len(key) and np.all(ax.get_indexer_for(key) < 0):
1307
1308 raise KeyError("None of [%s] are in the [%s]" %
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer_for(self, target, **kwargs)
2004 """ guaranteed return of an indexer even when non-unique """
2005 if self.is_unique:
-> 2006 return self.get_indexer(target, **kwargs)
2007 indexer, _ = self.get_indexer_non_unique(target, **kwargs)
2008 return indexer
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
1887 target = target.astype(object)
1888 return this.get_indexer(target, method=method, limit=limit,
-> 1889 tolerance=tolerance)
1890
1891 if not self.is_unique:
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
1905 'backfill or nearest reindexing')
1906
-> 1907 indexer = self._engine.get_indexer(target._values)
1908
1909 return com._ensure_platform_int(indexer)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_indexer (pandas/index.c:5840)()
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.lookup (pandas/hashtable.c:13174)()
TypeError: unhashable type
# Show the volume traded of $100 strike puts at all expiry dates:
In [12]: data.loc[(100, slice(None), 'put'),'Vol'].head()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-12ae0bf819cc> in <module>()
----> 1 data.loc[(100, slice(None), 'put'),'Vol'].head()
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in __getitem__(self, key)
1223 def __getitem__(self, key):
1224 if type(key) is tuple:
-> 1225 return self._getitem_tuple(key)
1226 else:
1227 return self._getitem_axis(key, axis=0)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
736 def _getitem_tuple(self, tup):
737 try:
--> 738 return self._getitem_lowerdim(tup)
739 except IndexingError:
740 pass
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_lowerdim(self, tup)
861 for i, key in enumerate(tup):
862 if is_label_like(key) or isinstance(key, tuple):
--> 863 section = self._getitem_axis(key, axis=i)
864
865 # we have yielded a scalar ?
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
1359 raise ValueError('Cannot index with multidimensional key')
1360
-> 1361 return self._getitem_iterable(key, axis=axis)
1362
1363 # nested tuple slicing
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_iterable(self, key, axis)
969 def _getitem_iterable(self, key, axis=0):
970 if self._should_validate_iterable(axis):
--> 971 self._has_valid_type(key, axis)
972
973 labels = self.obj._get_axis(axis)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _has_valid_type(self, key, axis)
1304
1305 # TODO: don't check the entire key unless necessary
-> 1306 if len(key) and np.all(ax.get_indexer_for(key) < 0):
1307
1308 raise KeyError("None of [%s] are in the [%s]" %
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer_for(self, target, **kwargs)
2004 """ guaranteed return of an indexer even when non-unique """
2005 if self.is_unique:
-> 2006 return self.get_indexer(target, **kwargs)
2007 indexer, _ = self.get_indexer_non_unique(target, **kwargs)
2008 return indexer
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
1887 target = target.astype(object)
1888 return this.get_indexer(target, method=method, limit=limit,
-> 1889 tolerance=tolerance)
1890
1891 if not self.is_unique:
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
1905 'backfill or nearest reindexing')
1906
-> 1907 indexer = self._engine.get_indexer(target._values)
1908
1909 return com._ensure_platform_int(indexer)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_indexer (pandas/index.c:5840)()
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.lookup (pandas/hashtable.c:13174)()
TypeError: unhashable type
If you don’t want to download all the data, more specific requests can be made.
In [13]: import datetime
In [14]: expiry = datetime.date(2016, 1, 1)
In [15]: data = aapl.get_call_data(expiry=expiry)
---------------------------------------------------------------------------
RemoteDataError Traceback (most recent call last)
<ipython-input-15-15c0ecd77bfd> in <module>()
----> 1 data = aapl.get_call_data(expiry=expiry)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_call_data(self, month, year, expiry)
833 for the expiry of the options.
834 """
--> 835 expiry = self._try_parse_dates(year, month, expiry)
836 return self._get_data_in_date_range(expiry, call=True, put=False)
837
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _try_parse_dates(self, year, month, expiry)
993 expiry = [self._validate_expiry(exp) for exp in expiry]
994 else:
--> 995 expiry = [self._validate_expiry(expiry)]
996
997 if len(expiry) == 0:
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _validate_expiry(self, expiry)
1017 If the expiry date does not have options that expire on that day, return next expiry"""
1018
-> 1019 expiry_dates = self.expiry_dates
1020 expiry = to_datetime(expiry)
1021 if hasattr(expiry, 'date'):
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
1161 expiry_dates = self._expiry_dates
1162 except AttributeError:
-> 1163 expiry_dates, _ = self._get_expiry_dates_and_links()
1164 return expiry_dates
1165
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
1176
1177 url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178 root = self._parse_url(url)
1179
1180 try:
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
1208 except _network_error_classes:
1209 raise RemoteDataError("Unable to parse URL "
-> 1210 "{0!r}".format(url))
1211 else:
1212 root = doc.getroot()
RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'
In [16]: data.iloc[0:5:, 0:5]
Out[16]:
id Date Col_1 Col_2 Col_3
0 26 2010-10-18 X 27.50 True
1 42 2010-10-19 Y -12.50 False
2 63 2010-10-20 Z 5.73 True
Note that if you call get_all_data
first, this second call will happen much faster,
as the data is cached.
If a given expiry date is not available, data for the next available expiry will be returned (January 15, 2015 in the above example).
Available expiry dates can be accessed from the expiry_dates
property.
In [17]: aapl.expiry_dates
---------------------------------------------------------------------------
RemoteDataError Traceback (most recent call last)
<ipython-input-17-c392ccd6fc72> in <module>()
----> 1 aapl.expiry_dates
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
1161 expiry_dates = self._expiry_dates
1162 except AttributeError:
-> 1163 expiry_dates, _ = self._get_expiry_dates_and_links()
1164 return expiry_dates
1165
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
1176
1177 url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178 root = self._parse_url(url)
1179
1180 try:
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
1208 except _network_error_classes:
1209 raise RemoteDataError("Unable to parse URL "
-> 1210 "{0!r}".format(url))
1211 else:
1212 root = doc.getroot()
RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'
In [18]: data = aapl.get_call_data(expiry=aapl.expiry_dates[0])
---------------------------------------------------------------------------
RemoteDataError Traceback (most recent call last)
<ipython-input-18-110cc070174a> in <module>()
----> 1 data = aapl.get_call_data(expiry=aapl.expiry_dates[0])
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
1161 expiry_dates = self._expiry_dates
1162 except AttributeError:
-> 1163 expiry_dates, _ = self._get_expiry_dates_and_links()
1164 return expiry_dates
1165
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
1176
1177 url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178 root = self._parse_url(url)
1179
1180 try:
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
1208 except _network_error_classes:
1209 raise RemoteDataError("Unable to parse URL "
-> 1210 "{0!r}".format(url))
1211 else:
1212 root = doc.getroot()
RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'
In [19]: data.iloc[0:5:, 0:5]
Out[19]:
id Date Col_1 Col_2 Col_3
0 26 2010-10-18 X 27.50 True
1 42 2010-10-19 Y -12.50 False
2 63 2010-10-20 Z 5.73 True
A list-like object containing dates can also be passed to the expiry parameter, returning options data for all expiry dates in the list.
In [20]: data = aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3])
---------------------------------------------------------------------------
RemoteDataError Traceback (most recent call last)
<ipython-input-20-c65c968dea4d> in <module>()
----> 1 data = aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3])
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
1161 expiry_dates = self._expiry_dates
1162 except AttributeError:
-> 1163 expiry_dates, _ = self._get_expiry_dates_and_links()
1164 return expiry_dates
1165
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
1176
1177 url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178 root = self._parse_url(url)
1179
1180 try:
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
1208 except _network_error_classes:
1209 raise RemoteDataError("Unable to parse URL "
-> 1210 "{0!r}".format(url))
1211 else:
1212 root = doc.getroot()
RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'
In [21]: data.iloc[0:5:, 0:5]
Out[21]:
id Date Col_1 Col_2 Col_3
0 26 2010-10-18 X 27.50 True
1 42 2010-10-19 Y -12.50 False
2 63 2010-10-20 Z 5.73 True
The month
and year
parameters can be used to get all options data for a given month.
Google Finance¶
In [22]: import pandas.io.data as web
In [23]: import datetime
In [24]: start = datetime.datetime(2010, 1, 1)
In [25]: end = datetime.datetime(2013, 1, 27)
In [26]: f = web.DataReader("F", 'google', start, end)
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-26-514f2d6b98c6> in <module>()
----> 1 f = web.DataReader("F", 'google', start, end)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
93 return get_data_google(symbols=name, start=start, end=end,
94 adjust_price=False, chunksize=25,
---> 95 retry_count=retry_count, pause=pause)
96 elif data_source == "fred":
97 return get_data_fred(name, start, end)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_google(symbols, start, end, retry_count, pause, adjust_price, ret_index, chunksize)
472 """
473 return _get_data_from(symbols, start, end, None, retry_count, pause,
--> 474 adjust_price, ret_index, chunksize, 'google')
475
476
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_data_from(symbols, start, end, interval, retry_count, pause, adjust_price, ret_index, chunksize, source)
377 # If a single symbol, (e.g., 'GOOG')
378 if isinstance(symbols, (compat.string_types, int)):
--> 379 hist_data = src_fn(symbols, start, end, interval, retry_count, pause)
380 # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
381 elif isinstance(symbols, DataFrame):
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_hist_google(sym, start, end, interval, retry_count, pause)
241 "enddate": end.strftime('%b %d, %Y'),
242 "output": "csv"}))
--> 243 return _retry_read_url(url, retry_count, pause, 'Google')
244
245
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _retry_read_url(url, retry_count, pause, name)
197
198 raise IOError("after %d tries, %s did not "
--> 199 "return a 200 for url %r" % (retry_count, name, url))
200
201
IOError: after 3 tries, Google did not return a 200 for url 'http://www.google.com/finance/historical?q=F&startdate=Jan+01%2C+2010&enddate=Jan+27%2C+2013&output=csv'
In [27]: f.ix['2010-01-04']
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-27-a881e6f45410> in <module>()
----> 1 f.ix['2010-01-04']
AttributeError: 'file' object has no attribute 'ix'
FRED¶
In [28]: import pandas.io.data as web
In [29]: import datetime
In [30]: start = datetime.datetime(2010, 1, 1)
In [31]: end = datetime.datetime(2013, 1, 27)
In [32]: gdp=web.DataReader("GDP", "fred", start, end)
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-32-ee552a87dd44> in <module>()
----> 1 gdp=web.DataReader("GDP", "fred", start, end)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
95 retry_count=retry_count, pause=pause)
96 elif data_source == "fred":
---> 97 return get_data_fred(name, start, end)
98 elif data_source == "famafrench":
99 return get_data_famafrench(name)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_fred(name, start, end)
511 "a valid FRED series.".format(name))
512 raise
--> 513 df = concat([fetch_data(url, n) for url, n in zip(urls, names)],
514 axis=1, join='outer')
515 return df
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in fetch_data(url, name)
500
501 def fetch_data(url, name):
--> 502 with urlopen(url) as resp:
503 data = read_csv(resp, index_col=0, parse_dates=True,
504 header=None, skiprows=1, names=["DATE", name],
/usr/lib/python2.7/contextlib.pyc in __enter__(self)
15 def __enter__(self):
16 try:
---> 17 return self.gen.next()
18 except StopIteration:
19 raise RuntimeError("generator didn't yield")
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/common.pyc in urlopen(*args, **kwargs)
49 @contextmanager
50 def urlopen(*args, **kwargs):
---> 51 with closing(_urlopen(*args, **kwargs)) as f:
52 yield f
53
/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):
/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
429 req = meth(req)
430
--> 431 response = self._open(req, data)
432
433 # post-process response
/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
447 protocol = req.get_type()
448 result = self._call_chain(self.handle_open, protocol, protocol +
--> 449 '_open', req)
450 if result:
451 return result
/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
407 func = getattr(handler, meth_name)
408
--> 409 result = func(*args)
410 if result is not None:
411 return result
/usr/lib/python2.7/urllib2.pyc in http_open(self, req)
1225
1226 def http_open(self, req):
-> 1227 return self.do_open(httplib.HTTPConnection, req)
1228
1229 http_request = AbstractHTTPHandler.do_request_
/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
1195 except socket.error, err: # XXX what error?
1196 h.close()
-> 1197 raise URLError(err)
1198 else:
1199 try:
URLError: <urlopen error [Errno 111] Connection refused>
In [33]: gdp.ix['2013-01-01']
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-33-4fda7a1657eb> in <module>()
----> 1 gdp.ix['2013-01-01']
NameError: name 'gdp' is not defined
# Multiple series:
In [34]: inflation = web.DataReader(["CPIAUCSL", "CPILFESL"], "fred", start, end)
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-34-c7b3843168f6> in <module>()
----> 1 inflation = web.DataReader(["CPIAUCSL", "CPILFESL"], "fred", start, end)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
95 retry_count=retry_count, pause=pause)
96 elif data_source == "fred":
---> 97 return get_data_fred(name, start, end)
98 elif data_source == "famafrench":
99 return get_data_famafrench(name)
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_fred(name, start, end)
511 "a valid FRED series.".format(name))
512 raise
--> 513 df = concat([fetch_data(url, n) for url, n in zip(urls, names)],
514 axis=1, join='outer')
515 return df
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in fetch_data(url, name)
500
501 def fetch_data(url, name):
--> 502 with urlopen(url) as resp:
503 data = read_csv(resp, index_col=0, parse_dates=True,
504 header=None, skiprows=1, names=["DATE", name],
/usr/lib/python2.7/contextlib.pyc in __enter__(self)
15 def __enter__(self):
16 try:
---> 17 return self.gen.next()
18 except StopIteration:
19 raise RuntimeError("generator didn't yield")
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/common.pyc in urlopen(*args, **kwargs)
49 @contextmanager
50 def urlopen(*args, **kwargs):
---> 51 with closing(_urlopen(*args, **kwargs)) as f:
52 yield f
53
/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):
/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
429 req = meth(req)
430
--> 431 response = self._open(req, data)
432
433 # post-process response
/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
447 protocol = req.get_type()
448 result = self._call_chain(self.handle_open, protocol, protocol +
--> 449 '_open', req)
450 if result:
451 return result
/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
407 func = getattr(handler, meth_name)
408
--> 409 result = func(*args)
410 if result is not None:
411 return result
/usr/lib/python2.7/urllib2.pyc in http_open(self, req)
1225
1226 def http_open(self, req):
-> 1227 return self.do_open(httplib.HTTPConnection, req)
1228
1229 http_request = AbstractHTTPHandler.do_request_
/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
1195 except socket.error, err: # XXX what error?
1196 h.close()
-> 1197 raise URLError(err)
1198 else:
1199 try:
URLError: <urlopen error [Errno 111] Connection refused>
In [35]: inflation.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-35-7b22de5ccbc6> in <module>()
----> 1 inflation.head()
NameError: name 'inflation' is not defined
Fama/French¶
Dataset names are listed at Fama/French Data Library.
In [36]: import pandas.io.data as web
In [37]: ip = web.DataReader("5_Industry_Portfolios", "famafrench")
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-37-5a3e59d5b894> in <module>()
----> 1 ip = web.DataReader("5_Industry_Portfolios", "famafrench")
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
97 return get_data_fred(name, start, end)
98 elif data_source == "famafrench":
---> 99 return get_data_famafrench(name)
100
101
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_famafrench(name)
523 zip_file_path = '{0}/{1}_TXT.zip'.format(_FAMAFRENCH_URL, name)
524
--> 525 with urlopen(zip_file_path) as url:
526 raw = url.read()
527
/usr/lib/python2.7/contextlib.pyc in __enter__(self)
15 def __enter__(self):
16 try:
---> 17 return self.gen.next()
18 except StopIteration:
19 raise RuntimeError("generator didn't yield")
/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/common.pyc in urlopen(*args, **kwargs)
49 @contextmanager
50 def urlopen(*args, **kwargs):
---> 51 with closing(_urlopen(*args, **kwargs)) as f:
52 yield f
53
/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):
/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
429 req = meth(req)
430
--> 431 response = self._open(req, data)
432
433 # post-process response
/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
447 protocol = req.get_type()
448 result = self._call_chain(self.handle_open, protocol, protocol +
--> 449 '_open', req)
450 if result:
451 return result
/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
407 func = getattr(handler, meth_name)
408
--> 409 result = func(*args)
410 if result is not None:
411 return result
/usr/lib/python2.7/urllib2.pyc in http_open(self, req)
1225
1226 def http_open(self, req):
-> 1227 return self.do_open(httplib.HTTPConnection, req)
1228
1229 http_request = AbstractHTTPHandler.do_request_
/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
1195 except socket.error, err: # XXX what error?
1196 h.close()
-> 1197 raise URLError(err)
1198 else:
1199 try:
URLError: <urlopen error [Errno 111] Connection refused>
In [38]: ip[4].ix[192607]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-38-1cb52c705436> in <module>()
----> 1 ip[4].ix[192607]
NameError: name 'ip' is not defined
World Bank¶
pandas
users can easily access thousands of panel data series from the
World Bank’s World Development Indicators
by using the wb
I/O functions.
Indicators¶
Either from exploring the World Bank site, or using the search function included, every world bank indicator is accessible.
For example, if you wanted to compare the Gross Domestic Products per capita in
constant dollars in North America, you would use the search
function:
In [1]: from pandas.io import wb
In [2]: wb.search('gdp.*capita.*const').iloc[:,:2]
Out[2]:
id name
3242 GDPPCKD GDP per Capita, constant US$, millions
5143 NY.GDP.PCAP.KD GDP per capita (constant 2005 US$)
5145 NY.GDP.PCAP.KN GDP per capita (constant LCU)
5147 NY.GDP.PCAP.PP.KD GDP per capita, PPP (constant 2005 internation...
Then you would use the download
function to acquire the data from the World
Bank’s servers:
In [3]: dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX'], start=2005, end=2008)
In [4]: print(dat)
NY.GDP.PCAP.KD
country year
Canada 2008 36005.5004978584
2007 36182.9138439757
2006 35785.9698172849
2005 35087.8925933298
Mexico 2008 8113.10219480083
2007 8119.21298908649
2006 7961.96818458178
2005 7666.69796097264
United States 2008 43069.5819857208
2007 43635.5852068142
2006 43228.111147107
2005 42516.3934699993
The resulting dataset is a properly formatted DataFrame
with a hierarchical
index, so it is easy to apply .groupby
transformations to it:
In [6]: dat['NY.GDP.PCAP.KD'].groupby(level=0).mean()
Out[6]:
country
Canada 35765.569188
Mexico 7965.245332
United States 43112.417952
dtype: float64
Now imagine you want to compare GDP to the share of people with cellphone contracts around the world.
In [7]: wb.search('cell.*%').iloc[:,:2]
Out[7]:
id name
3990 IT.CEL.SETS.FE.ZS Mobile cellular telephone users, female (% of ...
3991 IT.CEL.SETS.MA.ZS Mobile cellular telephone users, male (% of po...
4027 IT.MOB.COV.ZS Population coverage of mobile cellular telepho...
Notice that this second search was much faster than the first one because
pandas
now has a cached list of available data series.
In [13]: ind = ['NY.GDP.PCAP.KD', 'IT.MOB.COV.ZS']
In [14]: dat = wb.download(indicator=ind, country='all', start=2011, end=2011).dropna()
In [15]: dat.columns = ['gdp', 'cellphone']
In [16]: print(dat.tail())
gdp cellphone
country year
Swaziland 2011 2413.952853 94.9
Tunisia 2011 3687.340170 100.0
Uganda 2011 405.332501 100.0
Zambia 2011 767.911290 62.0
Zimbabwe 2011 419.236086 72.4
Finally, we use the statsmodels
package to assess the relationship between
our two variables using ordinary least squares regression. Unsurprisingly,
populations in rich countries tend to use cellphones at a higher rate:
In [17]: import numpy as np
In [18]: import statsmodels.formula.api as smf
In [19]: mod = smf.ols("cellphone ~ np.log(gdp)", dat).fit()
In [20]: print(mod.summary())
OLS Regression Results
==============================================================================
Dep. Variable: cellphone R-squared: 0.297
Model: OLS Adj. R-squared: 0.274
Method: Least Squares F-statistic: 13.08
Date: Thu, 25 Jul 2013 Prob (F-statistic): 0.00105
Time: 15:24:42 Log-Likelihood: -139.16
No. Observations: 33 AIC: 282.3
Df Residuals: 31 BIC: 285.3
Df Model: 1
===============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept 16.5110 19.071 0.866 0.393 -22.384 55.406
np.log(gdp) 9.9333 2.747 3.616 0.001 4.331 15.535
==============================================================================
Omnibus: 36.054 Durbin-Watson: 2.071
Prob(Omnibus): 0.000 Jarque-Bera (JB): 119.133
Skew: -2.314 Prob(JB): 1.35e-26
Kurtosis: 11.077 Cond. No. 45.8
==============================================================================
Country Codes¶
New in version 0.15.1.
The country
argument accepts a string or list of mixed
two or three character
ISO country codes, as well as dynamic World Bank exceptions to the ISO standards.
For a list of the the hard-coded country codes (used solely for error handling logic) see pandas.io.wb.country_codes
.
Problematic Country Codes & Indicators¶
Note
The World Bank’s country list and indicators are dynamic. As of 0.15.1,
wb.download()
is more flexible. To achieve this, the warning
and exception logic changed.
The world bank converts some country codes in their response, which makes error checking by pandas difficult. Retired indicators still persist in the search.
Given the new flexibility of 0.15.1, improved error handling by the user may be necessary for fringe cases.
To help identify issues:
There are at least 4 kinds of country codes:
- Standard (2/3 digit ISO) - returns data, will warn and error properly.
- Non-standard (WB Exceptions) - returns data, but will falsely warn.
- Blank - silently missing from the response.
- Bad - causes the entire response from WB to fail, always exception inducing.
There are at least 3 kinds of indicators:
- Current - Returns data.
- Retired - Appears in search results, yet won’t return data.
- Bad - Will not return data.
Use the errors
argument to control warnings and exceptions. Setting
errors to ignore or warn, won’t stop failed responses. (ie, 100% bad
indicators, or a single “bad” (#4 above) country code).
See docstrings for more info.
Google Analytics¶
The ga
module provides a wrapper for
Google Analytics API
to simplify retrieving traffic data.
Result sets are parsed into a pandas DataFrame with a shape and data types
derived from the source table.
Configuring Access to Google Analytics¶
The first thing you need to do is to setup accesses to Google Analytics API. Follow the steps below:
- In the Google Developers Console
- enable the Analytics API
- create a new project
- create a new Client ID for an “Installed Application” (in the “APIs & auth / Credentials section” of the newly created project)
- download it (JSON file)
- On your machine
- rename it to
client_secrets.json
- move it to the
pandas/io
module directory
- rename it to
The first time you use the read_ga()
function, a browser window will open to ask you to authentify to the Google API. Do proceed.
Using the Google Analytics API¶
The following will fetch users and pageviews (metrics) data per day of the week, for the first semester of 2014, from a particular property.
import pandas.io.ga as ga
ga.read_ga(
account_id = "2360420",
profile_id = "19462946",
property_id = "UA-2360420-5",
metrics = ['users', 'pageviews'],
dimensions = ['dayOfWeek'],
start_date = "2014-01-01",
end_date = "2014-08-01",
index_col = 0,
filters = "pagePath=~aboutus;ga:country==France",
)
The only mandatory arguments are metrics,
dimensions
and start_date
. We strongly recommend that you always specify the account_id
, profile_id
and property_id
to avoid accessing the wrong data bucket in Google Analytics.
The index_col
argument indicates which dimension(s) has to be taken as index.
The filters
argument indicates the filtering to apply to the query. In the above example, the page URL has to contain aboutus
AND the visitors country has to be France.
Detailed information in the following: