DataArray accessor

cerbere provides an accessor to the xarray DataArray class, called cb. This accessor offers a set of attributes and methods that enrich those provided natively by xarray.

Standard attributes

The cb accessor gives access to standard variable attributes, based on CF and other conventions, through a set of properties. For instance:

# create a DataArray (assumes `import xarray as xr` and `import numpy as np`)
In [1]: da = xr.DataArray(data=100., dims=[],
   ...:     attrs={'standard_name': 'water_temperature'})
   ...: 

# import cerbere
In [2]: import cerbere

# get the standard_name attribute with the `standard_name` property of
# the cerbere `cb` accessor
In [3]: stdname = da.cb.standard_name

# or set this attribute on an existing DataArray
In [4]: da.cb.standard_name = 'sea_surface_temperature'

Using named attributes, rather than the free-form dictionary exposed by the attrs property of the xarray DataArray class, improves dataset consistency for data producers (no divergent names, variants, or typos) and code genericity for data users (callers can rely on a fixed set of properties).
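As a rough illustration of how named properties can sit on top of attrs, the sketch below registers a hypothetical accessor called demo using xarray's register_dataarray_accessor. The accessor name and the DemoAccessor class are assumptions made for this example; this is not cerbere's actual implementation.

```python
import xarray as xr


@xr.register_dataarray_accessor("demo")  # hypothetical accessor name
class DemoAccessor:
    """Illustrative accessor exposing one attribute as a named property."""

    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    @property
    def standard_name(self):
        # read from the underlying attrs dictionary, None if absent
        return self._obj.attrs.get("standard_name")

    @standard_name.setter
    def standard_name(self, value):
        # write back to attrs so the attribute is stored under a fixed key
        self._obj.attrs["standard_name"] = value


da = xr.DataArray(100.0, attrs={"standard_name": "water_temperature"})
da.demo.standard_name = "sea_surface_temperature"
```

Because the property name is fixed in code, a misspelled attribute name fails on access instead of silently creating a new key in attrs.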

Science dtype

The science_dtype attribute preserves the “true” data type of the quantity stored in an array. It was introduced as a workaround for the dtype changes that xarray applies to masked data, as illustrated below:

# let's init a DataArray from an integer numpy array.
In [5]: da = xr.DataArray(np.arange(10, dtype=np.int32))

# The created DataArray is of the same type as the initial numpy array
In [6]: da
Out[6]: 
<xarray.DataArray (dim_0: 10)>
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
Dimensions without coordinates: dim_0

Now let’s create a DataArray from a numpy MaskedArray instead, and see what happens to the array dtype and masked values:

In [7]: arr = np.ma.masked_greater(np.arange(10, dtype=np.int32), 5)

In [8]: arr.set_fill_value(999)

In [9]: da = xr.DataArray(arr)

# The created DataArray was changed to float and the masked values to NaNs
In [10]: da
Out[10]: 
<xarray.DataArray (dim_0: 10)>
array([ 0.,  1.,  2.,  3.,  4.,  5., nan, nan, nan, nan])
Dimensions without coordinates: dim_0

Converting back to a numpy MaskedArray still returns a float array: the original dtype of the array (int32) has been lost, and so has the fill value:

In [11]: da.to_masked_array()
Out[11]: 
masked_array(data=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, --, --, --, --],
             mask=[False, False, False, False, False, False,  True,  True,
                    True,  True],
       fill_value=1e+20)

cerbere provides a DataArray constructor that prevents this by storing the original dtype (in science_dtype) and the fill value (here 999):

 # creates the DataArray with the cerbere constructor
In [12]: da = cerbere.new_array(arr)

 # using the cerbere to_masked_array accessor function instead
In [13]: da.cb.to_masked_array()
Out[13]: 
masked_array(data=[0, 1, 2, 3, 4, 5, --, --, --, --],
             mask=[False, False, False, False, False, False,  True,  True,
                    True,  True],
       fill_value=999,
            dtype=int32)
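To make the idea concrete, here is a minimal sketch of the bookkeeping involved, written with plain numpy and xarray. It only illustrates the general approach and is not cerbere's implementation: the helper restore_masked is hypothetical, and the dtype and fill value are simply kept in local variables.

```python
import numpy as np
import xarray as xr


def restore_masked(da, dtype, fill_value):
    """Rebuild a MaskedArray of the given dtype from a NaN-filled DataArray.

    Hypothetical helper, for illustration only.
    """
    values = np.asarray(da.values)
    mask = np.isnan(values)
    # replace NaNs before casting, since NaN has no integer representation
    filled = np.where(mask, fill_value, values).astype(dtype)
    return np.ma.masked_array(filled, mask=mask, fill_value=fill_value)


arr = np.ma.masked_greater(np.arange(10, dtype=np.int32), 5)
arr.set_fill_value(999)

# remember what the conversion to DataArray is about to discard
original_dtype, original_fill = arr.dtype, arr.fill_value

da = xr.DataArray(arr)  # converted to float, masked values become NaN
restored = restore_masked(da, original_dtype, original_fill)
```

Keeping the dtype and fill value alongside the array is what allows the round trip to recover an int32 MaskedArray instead of a float one.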

Subsetting with isel

cerbere extends the isel method of xarray through a redefinition in the cb accessor that provides additional arguments.

When a requested subset extends beyond the limits of a DataArray, padding can be applied so that the returned DataArray has the expected size. Let’s look at this example:

 # let's init a DataArray from an integer numpy array.
In [14]: da = xr.DataArray(np.arange(10, dtype=np.int32), dims=['lat'])

 # extracting a subset within the array limits. The output array has an
 # expected size of 5
In [15]: subset = da.isel(lat=slice(0, 5))

In [16]: subset
Out[16]: 
<xarray.DataArray (lat: 5)>
array([0, 1, 2, 3, 4], dtype=int32)
Dimensions without coordinates: lat

 # now extracting a subset beyond the array limits. xarray automatically
 # trims the output array, which now has a size of 2
In [17]: subset = da.isel(lat=slice(8, 13))

In [18]: subset
Out[18]: 
<xarray.DataArray (lat: 2)>
array([8, 9], dtype=int32)
Dimensions without coordinates: lat

 # now using the cerbere isel method, we get an output array of size 5
 # with padded values beyond the initial array limit
In [19]: subset = da.cb.isel(lat=slice(8, 13), padding=True)

In [20]: subset
Out[20]: 
<xarray.DataArray (lat: 5)>
array([ 8.,  9., nan, nan, nan])
Dimensions without coordinates: lat

 # this works with negative indices too
In [21]: subset = da.cb.isel(lat=slice(-2, 3), padding=True)

In [22]: subset
Out[22]: 
<xarray.DataArray (lat: 5)>
array([nan, nan,  0.,  1.,  2.])
Dimensions without coordinates: lat

Note that padding changes the array dtype to float here, as xarray normally does with a numpy MaskedArray (see the Science dtype section above). To avoid this, either preserve the original array dtype with the as_science_dtype keyword (NaNs are then replaced with fill values), or return the result as a numpy MaskedArray with the as_masked_array keyword:

 # preserving the original data type
In [23]: subset = da.cb.isel(lat=slice(-2, 3), padding=True, as_science_dtype=True)

In [24]: subset
Out[24]: 
<xarray.DataArray (lat: 5)>
array([nan, nan,  0.,  1.,  2.])
Dimensions without coordinates: lat

 # returning the result as a MaskedArray
In [25]: subset = da.cb.isel(lat=slice(-2, 3), padding=True, as_masked_array=True)

In [26]: subset
Out[26]: 
masked_array(data=[--, --, 0, 1, 2],
             mask=[ True,  True, False, False, False],
       fill_value=-2147483648,
            dtype=int32)
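The padding logic illustrated in this section can be approximated with plain numpy. The sketch below is an assumption about the general technique, not cerbere's code: padded_slice is a hypothetical helper that treats negative start values as positions before the first element (as in the examples above, and unlike Python's usual wrap-around indexing) and masks every position outside the array.

```python
import numpy as np


def padded_slice(values, start, stop):
    """Return a MaskedArray of length stop - start.

    Positions that fall outside `values` are masked; the dtype is preserved.
    Hypothetical helper, for illustration only.
    """
    n = values.shape[0]
    out = np.ma.masked_all(stop - start, dtype=values.dtype)
    # part of the requested window that overlaps the source array
    src_start, src_stop = max(start, 0), min(stop, n)
    if src_start < src_stop:
        # assignment unmasks the copied positions
        out[src_start - start:src_stop - start] = values[src_start:src_stop]
    return out


values = np.arange(10, dtype=np.int32)
beyond_end = padded_slice(values, 8, 13)    # two valid values, then three masked
before_start = padded_slice(values, -2, 3)  # two masked, then three valid values
```

Using a MaskedArray for the padded positions keeps the integer dtype, whereas padding with NaN forces a conversion to float.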