Remote Data Access

Warning

In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable pandas-datareader package. This will allow the data modules to be independently updated to your pandas installation. The API for pandas-datareader v0.1.1 is the same as in pandas v0.16.1. (GH8961)

You should replace the imports of the following:

from pandas.io import data, wb

With:

from pandas_datareader import data, wb

Functions from pandas.io.data and pandas.io.ga extract data from various Internet sources into a DataFrame. Currently the following sources are supported:

It should be noted, that various sources support different kinds of data, so not all sources implement the same methods and the data elements returned might also differ.

Yahoo! Finance

In [1]: import pandas.io.data as web

In [2]: import datetime

In [3]: start = datetime.datetime(2010, 1, 1)

In [4]: end = datetime.datetime(2013, 1, 27)

In [5]: f = web.DataReader("F", 'yahoo', start, end)
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-5-679b50ab962b> in <module>()
----> 1 f = web.DataReader("F", 'yahoo', start, end)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
     89         return get_data_yahoo(symbols=name, start=start, end=end,
     90                               adjust_price=False, chunksize=25,
---> 91                               retry_count=retry_count, pause=pause)
     92     elif data_source == "google":
     93         return get_data_google(symbols=name, start=start, end=end,

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_yahoo(symbols, start, end, retry_count, pause, adjust_price, ret_index, chunksize, interval)
    436         raise ValueError("Invalid interval: valid values are 'd', 'w', 'm' and 'v'")
    437     return _get_data_from(symbols, start, end, interval, retry_count, pause,
--> 438                           adjust_price, ret_index, chunksize, 'yahoo')
    439 
    440 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_data_from(symbols, start, end, interval, retry_count, pause, adjust_price, ret_index, chunksize, source)
    377     # If a single symbol, (e.g., 'GOOG')
    378     if isinstance(symbols, (compat.string_types, int)):
--> 379         hist_data = src_fn(symbols, start, end, interval, retry_count, pause)
    380     # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    381     elif isinstance(symbols, DataFrame):

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_hist_yahoo(sym, start, end, interval, retry_count, pause)
    220            '&g=%s' % interval +
    221            '&ignore=.csv')
--> 222     return _retry_read_url(url, retry_count, pause, 'Yahoo!')
    223 
    224 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _retry_read_url(url, retry_count, pause, name)
    197 
    198     raise IOError("after %d tries, %s did not "
--> 199                   "return a 200 for url %r" % (retry_count, name, url))
    200 
    201 

IOError: after 3 tries, Yahoo! did not return a 200 for url 'http://ichart.finance.yahoo.com/table.csv?s=F&a=0&b=1&c=2010&d=0&e=27&f=2013&g=d&ignore=.csv'

In [6]: f.ix['2010-01-04']
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-a881e6f45410> in <module>()
----> 1 f.ix['2010-01-04']

AttributeError: 'file' object has no attribute 'ix'

Yahoo! Finance Options

*Experimental*

The Options class allows the download of options data from Yahoo! Finance.

The get_all_data method downloads and caches option data for all expiry months and provides a formatted DataFrame with a hierarchical index, so it is easy to get to the specific option you want.

In [7]: from pandas.io.data import Options

In [8]: aapl = Options('aapl', 'yahoo')

In [9]: data = aapl.get_all_data()
---------------------------------------------------------------------------
RemoteDataError                           Traceback (most recent call last)
<ipython-input-9-a2dc6d91ec5f> in <module>()
----> 1 data = aapl.get_all_data()

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_all_data(self, call, put)
   1130 
   1131         try:
-> 1132             expiry_dates = self.expiry_dates
   1133         except AttributeError:
   1134             expiry_dates, _ = self._get_expiry_dates_and_links()

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
   1161             expiry_dates = self._expiry_dates
   1162         except AttributeError:
-> 1163             expiry_dates, _ = self._get_expiry_dates_and_links()
   1164         return expiry_dates
   1165 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
   1176 
   1177         url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178         root = self._parse_url(url)
   1179 
   1180         try:

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
   1208         except _network_error_classes:
   1209             raise RemoteDataError("Unable to parse URL "
-> 1210                                   "{0!r}".format(url))
   1211         else:
   1212             root = doc.getroot()

RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'

In [10]: data.iloc[0:5, 0:5]
Out[10]: 
   id       Date Col_1  Col_2  Col_3
0  26 2010-10-18     X  27.50   True
1  42 2010-10-19     Y -12.50  False
2  63 2010-10-20     Z   5.73   True

# Show the $100 strike puts at all expiry dates:
In [11]: data.loc[(100, slice(None), 'put'),:].iloc[0:5, 0:5]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-b8b1bd8f16d0> in <module>()
----> 1 data.loc[(100, slice(None), 'put'),:].iloc[0:5, 0:5]

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in __getitem__(self, key)
   1223     def __getitem__(self, key):
   1224         if type(key) is tuple:
-> 1225             return self._getitem_tuple(key)
   1226         else:
   1227             return self._getitem_axis(key, axis=0)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
    736     def _getitem_tuple(self, tup):
    737         try:
--> 738             return self._getitem_lowerdim(tup)
    739         except IndexingError:
    740             pass

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_lowerdim(self, tup)
    861         for i, key in enumerate(tup):
    862             if is_label_like(key) or isinstance(key, tuple):
--> 863                 section = self._getitem_axis(key, axis=i)
    864 
    865                 # we have yielded a scalar ?

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1359                     raise ValueError('Cannot index with multidimensional key')
   1360 
-> 1361                 return self._getitem_iterable(key, axis=axis)
   1362 
   1363             # nested tuple slicing

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_iterable(self, key, axis)
    969     def _getitem_iterable(self, key, axis=0):
    970         if self._should_validate_iterable(axis):
--> 971             self._has_valid_type(key, axis)
    972 
    973         labels = self.obj._get_axis(axis)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _has_valid_type(self, key, axis)
   1304 
   1305             # TODO: don't check the entire key unless necessary
-> 1306             if len(key) and np.all(ax.get_indexer_for(key) < 0):
   1307 
   1308                 raise KeyError("None of [%s] are in the [%s]" %

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer_for(self, target, **kwargs)
   2004         """ guaranteed return of an indexer even when non-unique """
   2005         if self.is_unique:
-> 2006             return self.get_indexer(target, **kwargs)
   2007         indexer, _ = self.get_indexer_non_unique(target, **kwargs)
   2008         return indexer

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
   1887             target = target.astype(object)
   1888             return this.get_indexer(target, method=method, limit=limit,
-> 1889                                     tolerance=tolerance)
   1890 
   1891         if not self.is_unique:

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
   1905                                  'backfill or nearest reindexing')
   1906 
-> 1907             indexer = self._engine.get_indexer(target._values)
   1908 
   1909         return com._ensure_platform_int(indexer)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_indexer (pandas/index.c:5840)()

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.lookup (pandas/hashtable.c:13174)()

TypeError: unhashable type

# Show the volume traded of $100 strike puts at all expiry dates:
In [12]: data.loc[(100, slice(None), 'put'),'Vol'].head()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-12ae0bf819cc> in <module>()
----> 1 data.loc[(100, slice(None), 'put'),'Vol'].head()

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in __getitem__(self, key)
   1223     def __getitem__(self, key):
   1224         if type(key) is tuple:
-> 1225             return self._getitem_tuple(key)
   1226         else:
   1227             return self._getitem_axis(key, axis=0)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup)
    736     def _getitem_tuple(self, tup):
    737         try:
--> 738             return self._getitem_lowerdim(tup)
    739         except IndexingError:
    740             pass

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_lowerdim(self, tup)
    861         for i, key in enumerate(tup):
    862             if is_label_like(key) or isinstance(key, tuple):
--> 863                 section = self._getitem_axis(key, axis=i)
    864 
    865                 # we have yielded a scalar ?

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1359                     raise ValueError('Cannot index with multidimensional key')
   1360 
-> 1361                 return self._getitem_iterable(key, axis=axis)
   1362 
   1363             # nested tuple slicing

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_iterable(self, key, axis)
    969     def _getitem_iterable(self, key, axis=0):
    970         if self._should_validate_iterable(axis):
--> 971             self._has_valid_type(key, axis)
    972 
    973         labels = self.obj._get_axis(axis)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _has_valid_type(self, key, axis)
   1304 
   1305             # TODO: don't check the entire key unless necessary
-> 1306             if len(key) and np.all(ax.get_indexer_for(key) < 0):
   1307 
   1308                 raise KeyError("None of [%s] are in the [%s]" %

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer_for(self, target, **kwargs)
   2004         """ guaranteed return of an indexer even when non-unique """
   2005         if self.is_unique:
-> 2006             return self.get_indexer(target, **kwargs)
   2007         indexer, _ = self.get_indexer_non_unique(target, **kwargs)
   2008         return indexer

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
   1887             target = target.astype(object)
   1888             return this.get_indexer(target, method=method, limit=limit,
-> 1889                                     tolerance=tolerance)
   1890 
   1891         if not self.is_unique:

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit, tolerance)
   1905                                  'backfill or nearest reindexing')
   1906 
-> 1907             indexer = self._engine.get_indexer(target._values)
   1908 
   1909         return com._ensure_platform_int(indexer)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_indexer (pandas/index.c:5840)()

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.lookup (pandas/hashtable.c:13174)()

TypeError: unhashable type

If you don’t want to download all the data, more specific requests can be made.

In [13]: import datetime

In [14]: expiry = datetime.date(2016, 1, 1)

In [15]: data = aapl.get_call_data(expiry=expiry)
---------------------------------------------------------------------------
RemoteDataError                           Traceback (most recent call last)
<ipython-input-15-15c0ecd77bfd> in <module>()
----> 1 data = aapl.get_call_data(expiry=expiry)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_call_data(self, month, year, expiry)
    833         for the expiry of the options.
    834         """
--> 835         expiry = self._try_parse_dates(year, month, expiry)
    836         return self._get_data_in_date_range(expiry, call=True, put=False)
    837 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _try_parse_dates(self, year, month, expiry)
    993                 expiry = [self._validate_expiry(exp) for exp in expiry]
    994             else:
--> 995                 expiry = [self._validate_expiry(expiry)]
    996 
    997             if len(expiry) == 0:

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _validate_expiry(self, expiry)
   1017         If the expiry date does not have options that expire on that day, return next expiry"""
   1018 
-> 1019         expiry_dates = self.expiry_dates
   1020         expiry = to_datetime(expiry)
   1021         if hasattr(expiry, 'date'):

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
   1161             expiry_dates = self._expiry_dates
   1162         except AttributeError:
-> 1163             expiry_dates, _ = self._get_expiry_dates_and_links()
   1164         return expiry_dates
   1165 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
   1176 
   1177         url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178         root = self._parse_url(url)
   1179 
   1180         try:

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
   1208         except _network_error_classes:
   1209             raise RemoteDataError("Unable to parse URL "
-> 1210                                   "{0!r}".format(url))
   1211         else:
   1212             root = doc.getroot()

RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'

In [16]: data.iloc[0:5:, 0:5]
Out[16]: 
   id       Date Col_1  Col_2  Col_3
0  26 2010-10-18     X  27.50   True
1  42 2010-10-19     Y -12.50  False
2  63 2010-10-20     Z   5.73   True

Note that if you call get_all_data first, this second call will happen much faster, as the data is cached.

If a given expiry date is not available, data for the next available expiry will be returned (January 15, 2015 in the above example).

Available expiry dates can be accessed from the expiry_dates property.

In [17]: aapl.expiry_dates
---------------------------------------------------------------------------
RemoteDataError                           Traceback (most recent call last)
<ipython-input-17-c392ccd6fc72> in <module>()
----> 1 aapl.expiry_dates

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
   1161             expiry_dates = self._expiry_dates
   1162         except AttributeError:
-> 1163             expiry_dates, _ = self._get_expiry_dates_and_links()
   1164         return expiry_dates
   1165 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
   1176 
   1177         url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178         root = self._parse_url(url)
   1179 
   1180         try:

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
   1208         except _network_error_classes:
   1209             raise RemoteDataError("Unable to parse URL "
-> 1210                                   "{0!r}".format(url))
   1211         else:
   1212             root = doc.getroot()

RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'

In [18]: data = aapl.get_call_data(expiry=aapl.expiry_dates[0])
---------------------------------------------------------------------------
RemoteDataError                           Traceback (most recent call last)
<ipython-input-18-110cc070174a> in <module>()
----> 1 data = aapl.get_call_data(expiry=aapl.expiry_dates[0])

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
   1161             expiry_dates = self._expiry_dates
   1162         except AttributeError:
-> 1163             expiry_dates, _ = self._get_expiry_dates_and_links()
   1164         return expiry_dates
   1165 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
   1176 
   1177         url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178         root = self._parse_url(url)
   1179 
   1180         try:

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
   1208         except _network_error_classes:
   1209             raise RemoteDataError("Unable to parse URL "
-> 1210                                   "{0!r}".format(url))
   1211         else:
   1212             root = doc.getroot()

RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'

In [19]: data.iloc[0:5:, 0:5]
Out[19]: 
   id       Date Col_1  Col_2  Col_3
0  26 2010-10-18     X  27.50   True
1  42 2010-10-19     Y -12.50  False
2  63 2010-10-20     Z   5.73   True

A list-like object containing dates can also be passed to the expiry parameter, returning options data for all expiry dates in the list.

In [20]: data = aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3])
---------------------------------------------------------------------------
RemoteDataError                           Traceback (most recent call last)
<ipython-input-20-c65c968dea4d> in <module>()
----> 1 data = aapl.get_near_stock_price(expiry=aapl.expiry_dates[0:3])

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in expiry_dates(self)
   1161             expiry_dates = self._expiry_dates
   1162         except AttributeError:
-> 1163             expiry_dates, _ = self._get_expiry_dates_and_links()
   1164         return expiry_dates
   1165 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_expiry_dates_and_links(self)
   1176 
   1177         url = self._OPTIONS_BASE_URL.format(sym=self.symbol)
-> 1178         root = self._parse_url(url)
   1179 
   1180         try:

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _parse_url(self, url)
   1208         except _network_error_classes:
   1209             raise RemoteDataError("Unable to parse URL "
-> 1210                                   "{0!r}".format(url))
   1211         else:
   1212             root = doc.getroot()

RemoteDataError: Unable to parse URL 'http://finance.yahoo.com/q/op?s=AAPL'

In [21]: data.iloc[0:5:, 0:5]
Out[21]: 
   id       Date Col_1  Col_2  Col_3
0  26 2010-10-18     X  27.50   True
1  42 2010-10-19     Y -12.50  False
2  63 2010-10-20     Z   5.73   True

The month and year parameters can be used to get all options data for a given month.

Google Finance

In [22]: import pandas.io.data as web

In [23]: import datetime

In [24]: start = datetime.datetime(2010, 1, 1)

In [25]: end = datetime.datetime(2013, 1, 27)

In [26]: f = web.DataReader("F", 'google', start, end)
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-26-514f2d6b98c6> in <module>()
----> 1 f = web.DataReader("F", 'google', start, end)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
     93         return get_data_google(symbols=name, start=start, end=end,
     94                                adjust_price=False, chunksize=25,
---> 95                                retry_count=retry_count, pause=pause)
     96     elif data_source == "fred":
     97         return get_data_fred(name, start, end)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_google(symbols, start, end, retry_count, pause, adjust_price, ret_index, chunksize)
    472     """
    473     return _get_data_from(symbols, start, end, None, retry_count, pause,
--> 474                           adjust_price, ret_index, chunksize, 'google')
    475 
    476 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_data_from(symbols, start, end, interval, retry_count, pause, adjust_price, ret_index, chunksize, source)
    377     # If a single symbol, (e.g., 'GOOG')
    378     if isinstance(symbols, (compat.string_types, int)):
--> 379         hist_data = src_fn(symbols, start, end, interval, retry_count, pause)
    380     # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    381     elif isinstance(symbols, DataFrame):

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _get_hist_google(sym, start, end, interval, retry_count, pause)
    241                                "enddate": end.strftime('%b %d, %Y'),
    242                                "output": "csv"}))
--> 243     return _retry_read_url(url, retry_count, pause, 'Google')
    244 
    245 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in _retry_read_url(url, retry_count, pause, name)
    197 
    198     raise IOError("after %d tries, %s did not "
--> 199                   "return a 200 for url %r" % (retry_count, name, url))
    200 
    201 

IOError: after 3 tries, Google did not return a 200 for url 'http://www.google.com/finance/historical?q=F&startdate=Jan+01%2C+2010&enddate=Jan+27%2C+2013&output=csv'

In [27]: f.ix['2010-01-04']
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-27-a881e6f45410> in <module>()
----> 1 f.ix['2010-01-04']

AttributeError: 'file' object has no attribute 'ix'

FRED

In [28]: import pandas.io.data as web

In [29]: import datetime

In [30]: start = datetime.datetime(2010, 1, 1)

In [31]: end = datetime.datetime(2013, 1, 27)

In [32]: gdp=web.DataReader("GDP", "fred", start, end)
---------------------------------------------------------------------------
URLError                                  Traceback (most recent call last)
<ipython-input-32-ee552a87dd44> in <module>()
----> 1 gdp=web.DataReader("GDP", "fred", start, end)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
     95                                retry_count=retry_count, pause=pause)
     96     elif data_source == "fred":
---> 97         return get_data_fred(name, start, end)
     98     elif data_source == "famafrench":
     99         return get_data_famafrench(name)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_fred(name, start, end)
    511                               "a valid FRED series.".format(name))
    512             raise
--> 513     df = concat([fetch_data(url, n) for url, n in zip(urls, names)],
    514                 axis=1, join='outer')
    515     return df

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in fetch_data(url, name)
    500 
    501     def fetch_data(url, name):
--> 502         with urlopen(url) as resp:
    503             data = read_csv(resp, index_col=0, parse_dates=True,
    504                             header=None, skiprows=1, names=["DATE", name],

/usr/lib/python2.7/contextlib.pyc in __enter__(self)
     15     def __enter__(self):
     16         try:
---> 17             return self.gen.next()
     18         except StopIteration:
     19             raise RuntimeError("generator didn't yield")

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/common.pyc in urlopen(*args, **kwargs)
     49     @contextmanager
     50     def urlopen(*args, **kwargs):
---> 51         with closing(_urlopen(*args, **kwargs)) as f:
     52             yield f
     53 

/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    152     else:
    153         opener = _opener
--> 154     return opener.open(url, data, timeout)
    155 
    156 def install_opener(opener):

/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    429             req = meth(req)
    430 
--> 431         response = self._open(req, data)
    432 
    433         # post-process response

/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
    447         protocol = req.get_type()
    448         result = self._call_chain(self.handle_open, protocol, protocol +
--> 449                                   '_open', req)
    450         if result:
    451             return result

/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    407             func = getattr(handler, meth_name)
    408 
--> 409             result = func(*args)
    410             if result is not None:
    411                 return result

/usr/lib/python2.7/urllib2.pyc in http_open(self, req)
   1225 
   1226     def http_open(self, req):
-> 1227         return self.do_open(httplib.HTTPConnection, req)
   1228 
   1229     http_request = AbstractHTTPHandler.do_request_

/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
   1195         except socket.error, err: # XXX what error?
   1196             h.close()
-> 1197             raise URLError(err)
   1198         else:
   1199             try:

URLError: <urlopen error [Errno 111] Connection refused>

In [33]: gdp.ix['2013-01-01']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-33-4fda7a1657eb> in <module>()
----> 1 gdp.ix['2013-01-01']

NameError: name 'gdp' is not defined

# Multiple series:
In [34]: inflation = web.DataReader(["CPIAUCSL", "CPILFESL"], "fred", start, end)
---------------------------------------------------------------------------
URLError                                  Traceback (most recent call last)
<ipython-input-34-c7b3843168f6> in <module>()
----> 1 inflation = web.DataReader(["CPIAUCSL", "CPILFESL"], "fred", start, end)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
     95                                retry_count=retry_count, pause=pause)
     96     elif data_source == "fred":
---> 97         return get_data_fred(name, start, end)
     98     elif data_source == "famafrench":
     99         return get_data_famafrench(name)

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_fred(name, start, end)
    511                               "a valid FRED series.".format(name))
    512             raise
--> 513     df = concat([fetch_data(url, n) for url, n in zip(urls, names)],
    514                 axis=1, join='outer')
    515     return df

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in fetch_data(url, name)
    500 
    501     def fetch_data(url, name):
--> 502         with urlopen(url) as resp:
    503             data = read_csv(resp, index_col=0, parse_dates=True,
    504                             header=None, skiprows=1, names=["DATE", name],

/usr/lib/python2.7/contextlib.pyc in __enter__(self)
     15     def __enter__(self):
     16         try:
---> 17             return self.gen.next()
     18         except StopIteration:
     19             raise RuntimeError("generator didn't yield")

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/common.pyc in urlopen(*args, **kwargs)
     49     @contextmanager
     50     def urlopen(*args, **kwargs):
---> 51         with closing(_urlopen(*args, **kwargs)) as f:
     52             yield f
     53 

/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    152     else:
    153         opener = _opener
--> 154     return opener.open(url, data, timeout)
    155 
    156 def install_opener(opener):

/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    429             req = meth(req)
    430 
--> 431         response = self._open(req, data)
    432 
    433         # post-process response

/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
    447         protocol = req.get_type()
    448         result = self._call_chain(self.handle_open, protocol, protocol +
--> 449                                   '_open', req)
    450         if result:
    451             return result

/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    407             func = getattr(handler, meth_name)
    408 
--> 409             result = func(*args)
    410             if result is not None:
    411                 return result

/usr/lib/python2.7/urllib2.pyc in http_open(self, req)
   1225 
   1226     def http_open(self, req):
-> 1227         return self.do_open(httplib.HTTPConnection, req)
   1228 
   1229     http_request = AbstractHTTPHandler.do_request_

/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
   1195         except socket.error, err: # XXX what error?
   1196             h.close()
-> 1197             raise URLError(err)
   1198         else:
   1199             try:

URLError: <urlopen error [Errno 111] Connection refused>

In [35]: inflation.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-35-7b22de5ccbc6> in <module>()
----> 1 inflation.head()

NameError: name 'inflation' is not defined

Fama/French

Dataset names are listed at Fama/French Data Library.

In [36]: import pandas.io.data as web

In [37]: ip = web.DataReader("5_Industry_Portfolios", "famafrench")
---------------------------------------------------------------------------
URLError                                  Traceback (most recent call last)
<ipython-input-37-5a3e59d5b894> in <module>()
----> 1 ip = web.DataReader("5_Industry_Portfolios", "famafrench")

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in DataReader(name, data_source, start, end, retry_count, pause)
     97         return get_data_fred(name, start, end)
     98     elif data_source == "famafrench":
---> 99         return get_data_famafrench(name)
    100 
    101 

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/data.pyc in get_data_famafrench(name)
    523     zip_file_path = '{0}/{1}_TXT.zip'.format(_FAMAFRENCH_URL, name)
    524 
--> 525     with urlopen(zip_file_path) as url:
    526         raw = url.read()
    527 

/usr/lib/python2.7/contextlib.pyc in __enter__(self)
     15     def __enter__(self):
     16         try:
---> 17             return self.gen.next()
     18         except StopIteration:
     19             raise RuntimeError("generator didn't yield")

/build/pandas-8apiR_/pandas-0.17.1/debian/python-pandas/usr/lib/python2.7/dist-packages/pandas/io/common.pyc in urlopen(*args, **kwargs)
     49     @contextmanager
     50     def urlopen(*args, **kwargs):
---> 51         with closing(_urlopen(*args, **kwargs)) as f:
     52             yield f
     53 

/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    152     else:
    153         opener = _opener
--> 154     return opener.open(url, data, timeout)
    155 
    156 def install_opener(opener):

/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    429             req = meth(req)
    430 
--> 431         response = self._open(req, data)
    432 
    433         # post-process response

/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
    447         protocol = req.get_type()
    448         result = self._call_chain(self.handle_open, protocol, protocol +
--> 449                                   '_open', req)
    450         if result:
    451             return result

/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    407             func = getattr(handler, meth_name)
    408 
--> 409             result = func(*args)
    410             if result is not None:
    411                 return result

/usr/lib/python2.7/urllib2.pyc in http_open(self, req)
   1225 
   1226     def http_open(self, req):
-> 1227         return self.do_open(httplib.HTTPConnection, req)
   1228 
   1229     http_request = AbstractHTTPHandler.do_request_

/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req, **http_conn_args)
   1195         except socket.error, err: # XXX what error?
   1196             h.close()
-> 1197             raise URLError(err)
   1198         else:
   1199             try:

URLError: <urlopen error [Errno 111] Connection refused>

In [38]: ip[4].ix[192607]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-38-1cb52c705436> in <module>()
----> 1 ip[4].ix[192607]

NameError: name 'ip' is not defined

World Bank

pandas users can easily access thousands of panel data series from the World Bank’s World Development Indicators by using the wb I/O functions.

Indicators

Either from exploring the World Bank site, or using the search function included, every world bank indicator is accessible.

For example, if you wanted to compare the Gross Domestic Products per capita in constant dollars in North America, you would use the search function:

In [1]: from pandas.io import wb

In [2]: wb.search('gdp.*capita.*const').iloc[:,:2]
Out[2]:
                     id                                               name
3242            GDPPCKD             GDP per Capita, constant US$, millions
5143     NY.GDP.PCAP.KD                 GDP per capita (constant 2005 US$)
5145     NY.GDP.PCAP.KN                      GDP per capita (constant LCU)
5147  NY.GDP.PCAP.PP.KD  GDP per capita, PPP (constant 2005 internation...

Then you would use the download function to acquire the data from the World Bank’s servers:

In [3]: dat = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX'], start=2005, end=2008)

In [4]: print(dat)
                      NY.GDP.PCAP.KD
country       year
Canada        2008  36005.5004978584
              2007  36182.9138439757
              2006  35785.9698172849
              2005  35087.8925933298
Mexico        2008  8113.10219480083
              2007  8119.21298908649
              2006  7961.96818458178
              2005  7666.69796097264
United States 2008  43069.5819857208
              2007  43635.5852068142
              2006   43228.111147107
              2005  42516.3934699993

The resulting dataset is a properly formatted DataFrame with a hierarchical index, so it is easy to apply .groupby transformations to it:

In [6]: dat['NY.GDP.PCAP.KD'].groupby(level=0).mean()
Out[6]:
country
Canada           35765.569188
Mexico            7965.245332
United States    43112.417952
dtype: float64

Now imagine you want to compare GDP to the share of people with cellphone contracts around the world.

In [7]: wb.search('cell.*%').iloc[:,:2]
Out[7]:
                     id                                               name
3990  IT.CEL.SETS.FE.ZS  Mobile cellular telephone users, female (% of ...
3991  IT.CEL.SETS.MA.ZS  Mobile cellular telephone users, male (% of po...
4027      IT.MOB.COV.ZS  Population coverage of mobile cellular telepho...

Notice that this second search was much faster than the first one because pandas now has a cached list of available data series.

In [13]: ind = ['NY.GDP.PCAP.KD', 'IT.MOB.COV.ZS']
In [14]: dat = wb.download(indicator=ind, country='all', start=2011, end=2011).dropna()
In [15]: dat.columns = ['gdp', 'cellphone']
In [16]: print(dat.tail())
                        gdp  cellphone
country   year
Swaziland 2011  2413.952853       94.9
Tunisia   2011  3687.340170      100.0
Uganda    2011   405.332501      100.0
Zambia    2011   767.911290       62.0
Zimbabwe  2011   419.236086       72.4

Finally, we use the statsmodels package to assess the relationship between our two variables using ordinary least squares regression. Unsurprisingly, populations in rich countries tend to use cellphones at a higher rate:

In [17]: import numpy as np
In [18]: import statsmodels.formula.api as smf
In [19]: mod = smf.ols("cellphone ~ np.log(gdp)", dat).fit()
In [20]: print(mod.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:              cellphone   R-squared:                       0.297
Model:                            OLS   Adj. R-squared:                  0.274
Method:                 Least Squares   F-statistic:                     13.08
Date:                Thu, 25 Jul 2013   Prob (F-statistic):            0.00105
Time:                        15:24:42   Log-Likelihood:                -139.16
No. Observations:                  33   AIC:                             282.3
Df Residuals:                      31   BIC:                             285.3
Df Model:                           1
===============================================================================
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept      16.5110     19.071      0.866      0.393       -22.384    55.406
np.log(gdp)     9.9333      2.747      3.616      0.001         4.331    15.535
==============================================================================
Omnibus:                       36.054   Durbin-Watson:                   2.071
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              119.133
Skew:                          -2.314   Prob(JB):                     1.35e-26
Kurtosis:                      11.077   Cond. No.                         45.8
==============================================================================

Country Codes

New in version 0.15.1.

The country argument accepts a string or list of mixed two or three character ISO country codes, as well as dynamic World Bank exceptions to the ISO standards.

For a list of the the hard-coded country codes (used solely for error handling logic) see pandas.io.wb.country_codes.

Problematic Country Codes & Indicators

Note

The World Bank’s country list and indicators are dynamic. As of 0.15.1, wb.download() is more flexible. To achieve this, the warning and exception logic changed.

The world bank converts some country codes in their response, which makes error checking by pandas difficult. Retired indicators still persist in the search.

Given the new flexibility of 0.15.1, improved error handling by the user may be necessary for fringe cases.

To help identify issues:

There are at least 4 kinds of country codes:

  1. Standard (2/3 digit ISO) - returns data, will warn and error properly.
  2. Non-standard (WB Exceptions) - returns data, but will falsely warn.
  3. Blank - silently missing from the response.
  4. Bad - causes the entire response from WB to fail, always exception inducing.

There are at least 3 kinds of indicators:

  1. Current - Returns data.
  2. Retired - Appears in search results, yet won’t return data.
  3. Bad - Will not return data.

Use the errors argument to control warnings and exceptions. Setting errors to ignore or warn, won’t stop failed responses. (ie, 100% bad indicators, or a single “bad” (#4 above) country code).

See docstrings for more info.

Google Analytics

The ga module provides a wrapper for Google Analytics API to simplify retrieving traffic data. Result sets are parsed into a pandas DataFrame with a shape and data types derived from the source table.

Configuring Access to Google Analytics

The first thing you need to do is to setup accesses to Google Analytics API. Follow the steps below:

  1. In the Google Developers Console
    1. enable the Analytics API
    2. create a new project
    3. create a new Client ID for an “Installed Application” (in the “APIs & auth / Credentials section” of the newly created project)
    4. download it (JSON file)
  2. On your machine
    1. rename it to client_secrets.json
    2. move it to the pandas/io module directory

The first time you use the read_ga() function, a browser window will open to ask you to authentify to the Google API. Do proceed.

Using the Google Analytics API

The following will fetch users and pageviews (metrics) data per day of the week, for the first semester of 2014, from a particular property.

import pandas.io.ga as ga
ga.read_ga(
    account_id  = "2360420",
    profile_id  = "19462946",
    property_id = "UA-2360420-5",
    metrics     = ['users', 'pageviews'],
    dimensions  = ['dayOfWeek'],
    start_date  = "2014-01-01",
    end_date    = "2014-08-01",
    index_col   = 0,
    filters     = "pagePath=~aboutus;ga:country==France",
)

The only mandatory arguments are metrics, dimensions and start_date. We strongly recommend that you always specify the account_id, profile_id and property_id to avoid accessing the wrong data bucket in Google Analytics.

The index_col argument indicates which dimension(s) has to be taken as index.

The filters argument indicates the filtering to apply to the query. In the above example, the page URL has to contain aboutus AND the visitors country has to be France.

Detailed information in the following: