DataArray accessor
cerbere provides an accessor to the xarray DataArray class, called cb. This accessor offers a set of attributes and methods that enrich
those provided natively by xarray.
Standard attributes
The cb accessor gives access to standard variable attributes, based on CF
and other conventions, through a set of properties. For instance:
# assumes: import numpy as np; import xarray as xr
# create a DataArray
In [1]: da = xr.DataArray(data=100., dims=[],
...: attrs={'standard_name': 'water_temperature'})
...:
# import cerbere
In [2]: import cerbere
# get the standard_name attribute with the `standard_name` property of
# the cerbere `cb` accessor
In [3]: stdname = da.cb.standard_name
# or set this attribute on an existing DataArray
In [4]: da.cb.standard_name = 'sea_surface_temperature'
Using named attributes, instead of a free-form dictionary as in the attrs
property of the xarray DataArray class, helps improve the consistency of
datasets for data producers (avoiding variant names, typos, …) and makes
code more generic for data users (the caller can rely on fixed
properties).
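As a rough sketch of how such a property can sit on top of attrs, the following uses xarray's register_dataarray_accessor decorator. The accessor name demo and the DemoAccessor class are hypothetical and chosen only for illustration; this is not how cerbere itself is implemented:

```python
import xarray as xr


# "demo" is a hypothetical accessor name, chosen to avoid clashing with `cb`.
@xr.register_dataarray_accessor("demo")
class DemoAccessor:
    def __init__(self, xarray_obj):
        # xarray passes the DataArray the accessor is attached to.
        self._obj = xarray_obj

    @property
    def standard_name(self):
        """CF standard name, or None when the attribute is not set."""
        return self._obj.attrs.get("standard_name")

    @standard_name.setter
    def standard_name(self, value):
        # Always write under the same key, whatever the caller does.
        self._obj.attrs["standard_name"] = value


da = xr.DataArray(data=100.0, attrs={"standard_name": "water_temperature"})
before = da.demo.standard_name
da.demo.standard_name = "sea_surface_temperature"
after = da.attrs["standard_name"]
```

Because the key name lives in one place (the property), callers cannot introduce spelling variants such as stdname or standardName.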
Science dtype
The science_dtype attribute preserves the “true” data type of the quantity
stored in an array. It was introduced as a workaround for xarray
dtype changes, as illustrated below:
# let's init a DataArray from an integer numpy array.
In [5]: da = xr.DataArray(np.arange(10, dtype=np.int32))
# The created DataArray is of the same type as the initial numpy array
In [6]: da
Out[6]:
<xarray.DataArray (dim_0: 10)>
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
Dimensions without coordinates: dim_0
Now let’s create a DataArray from a numpy MaskedArray instead, and see what happens to the array dtype and masked values:
In [7]: arr = np.ma.masked_greater(np.arange(10, dtype=np.int32), 5)
In [8]: arr.set_fill_value(999)
In [9]: da = xr.DataArray(arr)
# The created DataArray was changed to float and the masked values to NaNs
In [10]: da
Out[10]:
<xarray.DataArray (dim_0: 10)>
array([ 0., 1., 2., 3., 4., 5., nan, nan, nan, nan])
Dimensions without coordinates: dim_0
Converting back to a numpy MaskedArray still returns a float array: the original dtype of the array (int32) is lost, and so is the fill value (999):
In [11]: da.to_masked_array()
Out[11]:
masked_array(data=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, --, --, --, --],
mask=[False, False, False, False, False, False, True, True,
True, True],
fill_value=1e+20)
cerbere provides a DataArray constructor that prevents this by
storing the original dtype (in science_dtype) and fill value (999):
# creates the DataArray with the cerbere constructor
In [12]: da = cerbere.new_array(arr)
# using the cerbere to_masked_array accessor function instead
In [13]: da.cb.to_masked_array()
Out[13]:
masked_array(data=[0, 1, 2, 3, 4, 5, --, --, --, --],
mask=[False, False, False, False, False, False, True, True,
True, True],
fill_value=999,
dtype=int32)
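The idea of remembering the dtype and fill value can be sketched with numpy alone. This is only an illustration under assumed names (to_float_with_nan and restore_masked are hypothetical helpers), not cerbere's actual implementation:

```python
import numpy as np


def to_float_with_nan(masked):
    """Mimic the lossy conversion: masked integers become float NaNs."""
    return masked.astype(np.float64).filled(np.nan)


def restore_masked(values, dtype, fill_value):
    """Rebuild a masked array of the remembered dtype from NaN-marked floats."""
    mask = np.isnan(values)
    # Replace NaNs before casting: casting NaN to an integer is undefined.
    data = np.where(mask, fill_value, values).astype(dtype)
    return np.ma.MaskedArray(data, mask=mask, fill_value=fill_value)


arr = np.ma.masked_greater(np.arange(10, dtype=np.int32), 5)
arr.set_fill_value(999)

# Remember what the float/NaN representation cannot carry.
science_dtype, fill_value = arr.dtype, arr.fill_value

floats = to_float_with_nan(arr)
restored = restore_masked(floats, science_dtype, fill_value)
```

The float array alone cannot tell an int32 quantity from a float one, which is why the dtype and fill value have to be stored alongside the data.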
Subsetting with isel
cerbere extends the isel methods of xarray, through a redefinition of these methods in the cb accessor, providing
additional arguments.
When extracting a subset from a DataArray beyond its limits, padding can be applied to return a new DataArray of the expected size. Let’s look at this example:
# let's init a DataArray from an integer numpy array.
In [14]: da = xr.DataArray(np.arange(10, dtype=np.int32), dims=['lat'])
# extracting a subset within the array limits. The output array has an
# expected size of 5
In [15]: subset = da.isel(lat=slice(0, 5))
In [16]: subset
Out[16]:
<xarray.DataArray (lat: 5)>
array([0, 1, 2, 3, 4], dtype=int32)
Dimensions without coordinates: lat
# now extracting a subset beyond the array limits. xarray automatically
# trims the output array, which now has a size of 2
In [17]: subset = da.isel(lat=slice(8, 13))
In [18]: subset
Out[18]:
<xarray.DataArray (lat: 2)>
array([8, 9], dtype=int32)
Dimensions without coordinates: lat
# now using the cerbere isel method, we get an output DataArray of size 5
# with padded values beyond the initial array limit
In [19]: subset = da.cb.isel(lat=slice(8, 13), padding=True)
In [20]: subset
Out[20]:
<xarray.DataArray (lat: 5)>
array([ 8., 9., nan, nan, nan])
Dimensions without coordinates: lat
# this works with negative indices too
In [21]: subset = da.cb.isel(lat=slice(-2, 3), padding=True)
In [22]: subset
Out[22]:
<xarray.DataArray (lat: 5)>
array([nan, nan, 0., 1., 2.])
Dimensions without coordinates: lat
Note that when padding, the array dtype is changed to float here, as xarray
would normally do with a numpy MaskedArray (see the Science dtype section above).
This can be avoided either by preserving the original array dtype (NaNs are then
replaced with fill values) using the as_science_dtype keyword, or by returning the
result as a numpy MaskedArray using the as_masked_array keyword:
# preserving the original data type
In [23]: subset = da.cb.isel(lat=slice(-2, 3), padding=True, as_science_dtype=True)
In [24]: subset
Out[24]:
<xarray.DataArray (lat: 5)>
array([nan, nan, 0., 1., 2.])
Dimensions without coordinates: lat
# returning the result as a MaskedArray
In [25]: subset = da.cb.isel(lat=slice(-2, 3), padding=True, as_masked_array=True)
In [26]: subset
Out[26]:
masked_array(data=[--, --, 0, 1, 2],
mask=[ True, True, False, False, False],
fill_value=-2147483648,
dtype=int32)
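The padding behaviour shown above can be approximated with plain numpy. The padded_slice helper below is hypothetical and only handles one dimension; it is a sketch of the technique, not the code behind cb.isel:

```python
import numpy as np


def padded_slice(values, start, stop):
    """Return values[start:stop] as a masked array of length stop - start.

    Positions falling outside the array are masked. Unlike Python slicing,
    a negative start means "before the first element", not "from the end".
    """
    values = np.asarray(values)
    n = values.shape[0]
    out = np.ma.masked_all(stop - start, dtype=values.dtype)
    # Overlap between the requested window and the valid index range.
    lo, hi = max(start, 0), min(stop, n)
    if lo < hi:
        # Assigning plain values unmasks the positions that hold real data.
        out[lo - start:hi - start] = values[lo:hi]
    return out


values = np.arange(10, dtype=np.int32)
tail = padded_slice(values, 8, 13)   # two valid values, then three masked
head = padded_slice(values, -2, 3)   # two masked, then three valid values

# Float view with NaN padding; the integer dtype is lost in this form.
head_as_float = head.astype(np.float64).filled(np.nan)
```

Keeping the result as a masked array preserves the int32 dtype, while the NaN-padded form forces a conversion to float.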