The Datasets Package¶
statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.
Using Datasets from Stata¶
Using Datasets from R¶
The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The actual data is accessible via the data attribute. For example:
In [1]: import statsmodels.api as sm
In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
In [3]: print(duncan_prestige.__doc__)
In [4]: duncan_prestige.data.head(5)
R Datasets Function Reference¶
Available Datasets¶
Usage¶
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the proposal. The full dataset is available in the data attribute.
In [7]: data.data
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [8]: data.endog[:5]
In [9]: data.exog[:5,:]
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [10]: data.endog_name
In [11]: data.exog_name
If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.
In [12]: type(data.data)
In [13]: type(data.raw_data)
In [14]: data.names
Loading data as pandas objects¶
For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas method which returns a Dataset instance with the data readily available as pandas objects:
In [15]: data = sm.datasets.longley.load_pandas()
In [16]: data.exog
In [17]: data.endog
The full DataFrame is available in the data attribute of the Dataset object:
In [18]: data.data
With pandas integration in the estimation classes, the metadata will be attached to model results:
Extra Information¶
If you want to know more about the dataset itself, you can access the following module-level attributes, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
Additional information¶
- The idea for a datasets package was originally proposed by David Cournapeau and can be found here with updates by Skipper Seabold.
- To add datasets, see the notes on adding a dataset.