Dataframe

Helper functions for working with pandas.DataFrame.

gordo_client.dataframe.dataframe_from_dict(data: dict) → DataFrame

The inverse of dataframe_to_dict(): reconstructs a pandas.MultiIndex column dataframe from a previously serialized one.

Expects data to be a nested dictionary where each top level key has a value capable of being loaded from pandas.DataFrame.from_dict()

Parameters: data – Data to be loaded into a MultiIndex column dataframe
Return type: MultiIndex column dataframe.

Examples

>>> serialized = {
... 'feature0': {'sub-feature-0': {'2019-01-01': 0, '2019-02-01': 4},
...              'sub-feature-1': {'2019-01-01': 1, '2019-02-01': 5}},
... 'feature1': {'sub-feature-0': {'2019-01-01': 2, '2019-02-01': 6},
...              'sub-feature-1': {'2019-01-01': 3, '2019-02-01': 7}}
... }
>>> dataframe_from_dict(serialized)
                feature0                    feature1
           sub-feature-0 sub-feature-1 sub-feature-0 sub-feature-1
2019-01-01             0             1             2             3
2019-02-01             4             5             6             7
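
As a rough illustration of how such a reconstruction could work, the sketch below loads each top-level value with pandas.DataFrame.from_dict() and joins the pieces under a two-level column index. The helper name rebuild_multiindex_frame is hypothetical and this is not necessarily how gordo_client implements dataframe_from_dict().

```python
import pandas as pd


def rebuild_multiindex_frame(data: dict) -> pd.DataFrame:
    """Hypothetical helper: rebuild a two-level column dataframe from a nested dict."""
    # Each top-level value is a {column: {index: value}} mapping, which
    # pandas.DataFrame.from_dict() loads as an ordinary single-level dataframe.
    frames = {key: pd.DataFrame.from_dict(value) for key, value in data.items()}
    # Concatenating along the columns with `keys` places the top-level names
    # as the outer level of a MultiIndex.
    return pd.concat(list(frames.values()), axis=1, keys=list(frames.keys()))


serialized = {
    "feature0": {"sub-feature-0": {"2019-01-01": 0, "2019-02-01": 4},
                 "sub-feature-1": {"2019-01-01": 1, "2019-02-01": 5}},
    "feature1": {"sub-feature-0": {"2019-01-01": 2, "2019-02-01": 6},
                 "sub-feature-1": {"2019-01-01": 3, "2019-02-01": 7}},
}
df = rebuild_multiindex_frame(serialized)
```

In this sketch the index keeps the string labels from the input; converting it with pandas.to_datetime() would be a separate step.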

gordo_client.dataframe.dataframe_from_parquet_bytes(buf: bytes) → DataFrame

Convert bytes representing a parquet table into a pandas dataframe.

Parameters: buf – Bytes representing a parquet table. Can be the direct result from gordo.server.utils.dataframe_into_parquet_bytes
Return type: pandas.DataFrame

gordo_client.dataframe.dataframe_into_parquet_bytes(df: DataFrame, compression: str = 'snappy') → bytes

Convert a dataframe into bytes representing a parquet table.

Parameters:

df – DataFrame to be compressed
compression – Compression to use, passed to pyarrow.parquet.write_table()

Return type:

bytes

gordo_client.dataframe.dataframe_to_dict(df: DataFrame) → dict

Convert a dataframe that may have a pandas.MultiIndex as columns into a dict.

Each key is a top-level column name, and its value is a dict of the columns under that name. If it’s a simple dataframe, pandas.DataFrame.to_dict() will be used.

This makes the result compatible with json.dumps(). Calling pandas.DataFrame.to_dict() directly on such a multi-level column dataframe would produce keys that are tuple() objects, which are not JSON serializable. Each top-level value of the result can still be loaded with pandas.DataFrame.from_dict().
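
To make the serialization problem concrete, the short sketch below calls pandas.DataFrame.to_dict() on a small multi-level column dataframe and then tries json.dumps() on the result. The variable names are illustrative only.

```python
import json

import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("feature0", "sub-feature-0"), ("feature0", "sub-feature-1")]
)
df = pd.DataFrame([[0, 1], [2, 3]], columns=columns)

# Plain to_dict() keys the result by column, and each MultiIndex column
# label is a tuple such as ('feature0', 'sub-feature-0').
plain = df.to_dict()
tuple_keys = all(isinstance(key, tuple) for key in plain)

try:
    json.dumps(plain)
    json_ok = True
except TypeError:
    # json.dumps() rejects dict keys that are tuples.
    json_ok = False
```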

Parameters: df – Dataframe expected to have columns of type pandas.MultiIndex, 2 levels deep.
Return type: dict. Nested dictionary representing the dataframe, keyed by top-level column name.

Examples

>>> import pprint
>>> import pandas as pd
>>> import numpy as np
>>> columns = pd.MultiIndex.from_tuples((f"feature{i}", f"sub-feature-{ii}") for i in range(2) for ii in range(2))
>>> index = pd.date_range('2019-01-01', '2019-02-01', periods=2)
>>> df = pd.DataFrame(np.arange(8).reshape((2, 4)), columns=columns, index=index)
>>> df  
                feature0                    feature1
           sub-feature-0 sub-feature-1 sub-feature-0 sub-feature-1
2019-01-01             0             1             2             3
2019-02-01             4             5             6             7
>>> serialized = dataframe_to_dict(df)
>>> pprint.pprint(serialized)
{'feature0': {'sub-feature-0': {'2019-01-01': 0, '2019-02-01': 4},
              'sub-feature-1': {'2019-01-01': 1, '2019-02-01': 5}},
 'feature1': {'sub-feature-0': {'2019-01-01': 2, '2019-02-01': 6},
              'sub-feature-1': {'2019-01-01': 3, '2019-02-01': 7}}}