cma.experimentation.DataDict

class documentation

class DataDict(defaultdict):

Constructor: DataDict(dict_)

A (default) dictionary like parameter_value: list_of_measures, e.g. with a float parameter or dimension as key and runtimes as values. This class is used under the hood in the Results class.

The main functionality is the method clean, which joins all entries that have almost equal keys. This makes it possible to use a float parameter as key.

This class provides simple computations on this kind of data, like x, y = .xy_arrays(), which gives sorted(keys), sp1(values).

If the dictionary values are not lists, one may get rather unexpected results or exceptions.

Details: this class allows using float values as keys when clean_key and set_clean are used to access the data in the dict. Because the class inherits from defaultdict with list as default value, the syntax:

data = DataDict()
data[first_key] += [first_data_point]

without initialization of the key value works perfectly fine.
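The behavior comes from the standard library and can be tried without cma. The following minimal sketch uses a plain collections.defaultdict(list) as a stand-in for DataDict; the keys and values are made up for illustration.

```python
from collections import defaultdict

data = defaultdict(list)  # stand-in for DataDict(), assumed to behave alike here

data[0.5] += [1.2]  # the key 0.5 does not exist yet, an empty list is created first
data[0.5] += [1.4]  # appends to the existing list
data[2.0] += [3.1]

# caution: merely reading a missing key also creates an (empty) entry
_ = data[7.0]
number_of_keys = len(data)  # 3, including the empty entry for 7.0
```

As a consequence, membership should be checked with key in data rather than by indexing when empty entries are not wanted.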

Caveat: keys close to zero are considered the same key even if they differ by orders of magnitude, because the default comparison uses an absolute tolerance of 1e-6. Either use a different comparison via the equal keyword parameter, or use 1 / key_value or log(key_value) as key.
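The caveat can be reproduced without the class. The sketch below copies the default comparison from the method signatures and contrasts it with a relative comparison; relative_equal and its rtol parameter are hypothetical names for illustration and not part of cma.

```python
# default comparison as it appears in the signatures of clean, clean_key and get_near
default_equal = lambda x, y: x - 1e-06 < y < x + 1e-06

def relative_equal(x, y, rtol=1e-6):
    """hypothetical alternative: tolerance scales with the magnitude of the keys"""
    return abs(x - y) <= rtol * max(abs(x), abs(y))

small_a, small_b = 1e-8, 1e-7  # differ by a factor of ten

merged_by_default = default_equal(small_a, small_b)    # both lie within 1e-6 of each other
merged_by_relative = relative_equal(small_a, small_b)  # the relative gap is large
```

Given the equal keyword in the signatures, such a function could presumably be passed like data.clean(equal=relative_equal).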

TODO: consider numpy.allclose for almost equal comparison?

Method	`__init__`	Use `dict(dict_.data or dict_)`, and `dict_.meta_data` for initialization.
Method	`__repr__`	Undocumented
Method	`aggregated`	WIP return a `dict` with `(agg(data), number_of_samples)` per key.
Method	`argmin`	slack_index_shift can be +-1, looking to the right/left of argmin
Method	`clean`	merge keys which have almost the same value
Method	`clean_key`	set similar key values all to be `key`, return `key`.
Method	`get_near`	get the merged values list of all nearby keys.
Method	`percentile`	TODO:review percentile based on bootstrapping
Method	`samples`	return a `dict` with the number of values per key
Method	`set_clean`	join all entries with similar `key` and return the new value, a joined list of all respective values.
Method	`test`	return p-value of the `mannwhitneyu` test
Method	`tests`	return p-values of the `mannwhitneyu` test of entries adjacent with `gap`
Method	`update`	update data lists from a `dict` of lists (and only a `dict`)
Method	`xy_arrays`	return two arrays ready to be plotted like `plot(*xy_arrays)`.
Instance Variable	`meta_data`	Undocumented
Property	`successes`	return a class instance with attributes `x` (i.e. keys), `n`, `nsucc`, and `rate` as arrays.
Method	`_near_key`	return a key in self which is `equal` to `key` and otherwise `key`.

def __init__(self, dict_=None):

Use dict(dict_.data or dict_), and dict_.meta_data for initialization.

Details: dict_.meta_data are assigned as a reference.

def __repr__(self):

Undocumented

def aggregated(self, agg=sp1, relative=False, by=None):

WIP return a dict with (agg(data), number_of_samples) per key.

For example, to get the 10%tile:

.aggregated(lambda x: np.percentile(x, 10))
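To make the return format concrete, here is a minimal sketch of the described (agg(data), number_of_samples) mapping. The function aggregate is a hypothetical stand-in that ignores the relative and by parameters, and it defaults to statistics.median only to stay within the standard library, whereas the method defaults to sp1.

```python
from collections import defaultdict
from statistics import median

def aggregate(data, agg=median):
    """hypothetical stand-in returning {key: (agg(values), number_of_samples)}"""
    return {key: (agg(values), len(values)) for key, values in data.items()}

data = defaultdict(list)
data[10] += [3.0, 1.0, 2.0]
data[20] += [8.0, 4.0]

result = aggregate(data)        # median and sample count per key
fastest = aggregate(data, min)  # any callable on a list can serve as agg
```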

def argmin(self, agg=sp1, slack=1.0, slack_index_shift=+1):

slack_index_shift can be +-1, looking to the right/left of argmin

def clean(self, equal=(lambda x, y: x - 1e-06 < y < x + 1e-06)):

merge keys which have almost the same value
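The idea of merging almost equal keys can be sketched as follows. This is not the implementation used by the class: merge_close_keys is a hypothetical function that returns a new dictionary and, as an assumption of this sketch, keeps the smallest key of each group.

```python
from collections import defaultdict

def merge_close_keys(data, equal=lambda x, y: x - 1e-06 < y < x + 1e-06):
    """return a new defaultdict(list) in which almost equal keys are joined"""
    merged = defaultdict(list)
    for key in sorted(data):
        for kept in merged:
            if equal(kept, key):  # key belongs to an already kept key
                merged[kept] += data[key]
                break
        else:  # no kept key is close enough, keep this one
            merged[key] += data[key]
    return merged

data = defaultdict(list)
data[0.1 + 0.2] += [12.0]  # the key is 0.30000000000000004, not 0.3
data[0.3] += [11.0, 13.0]
data[0.5] += [20.0]

merged = merge_close_keys(data)  # two keys remain: 0.3 and 0.5
```

The example shows why such merging matters for float keys: 0.1 + 0.2 and 0.3 are different dictionary keys although they denote the same parameter value.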

def clean_key(self, key, equal=(lambda x, y: x - 1e-06 < y < x + 1e-06)):

set similar key values all to be key, return key.

Use method set_clean to access and change the clean-key dictionary value more conveniently.

def get_near(self, key, equal=(lambda x, y: x - 1e-06 < y < x + 1e-06)):

get the merged values list of all nearby keys.

Caveat: the returned value is a new list

See Also
`clean`, `set_clean`.
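The practical consequence of the caveat is that changing the returned list does not change the stored data. A minimal sketch of this behavior, where near_values is a hypothetical stand-in for get_near written against a plain defaultdict(list):

```python
from collections import defaultdict

def near_values(data, key, equal=lambda x, y: x - 1e-06 < y < x + 1e-06):
    """hypothetical stand-in: join the value lists of all keys near `key`"""
    joined = []  # a new list, not one of the lists stored in `data`
    for k in sorted(data):
        if equal(key, k):
            joined += data[k]
    return joined

data = defaultdict(list)
data[2.0] += [1.5]
data[2.0 + 1e-9] += [2.5]  # a distinct but nearby key

values = near_values(data, 2.0)
values.append(99.0)  # changes only the returned list, `data` is untouched
```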

def percentile(self, prctile, agg=sp1, samples=100):

TODO:review percentile based on bootstrapping

def samples(self):

return a dict with the number of values per key

def set_clean(self, key):

join all entries with similar key and return the new value, a joined list of all respective values.

This is the same as clean_key, which, however, returns the new key, not the values.

Example:

data.set_clean(key).append(new_data_point)

# same as
data[data.clean_key(key)] += [new_data_point]

# or more explicit, however with a different order of the data
data[key] += [new_data_point]
data.clean_key(key)  # joins data, however in the "wrong" order

# similar to
data[key] += [new_data_point]
data.clean()  # cleans *all* keys

def test(self, key, key2=None, method='auto'):

return p-value of the mannwhitneyu test

def tests(self, gap=1, method='auto'):

return p-values of the mannwhitneyu test of entries adjacent with gap

def update(self, dict_):

update data lists from a dict of lists (and only a dict)

def xy_arrays(self, agg=sp1, type_=np.asarray):

return two arrays ready to be plotted like plot(*xy_arrays).

The x-array contains the sorted keys, the y-array contains the respectively aggregated values.

For example, to be used like:

plot(*self.xy_arrays())

Parameter agg determines the function to aggregate data values, by default sp1 which is the mean corrected for missing data. To show dispersion, we can use agg=lambda x: np.percentile(x, 10) and ..., 90).
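The pairing of sorted keys with aggregated values can be sketched with the standard library alone. Here xy_lists is a hypothetical stand-in that returns lists instead of arrays and defaults to statistics.mean instead of sp1; min and max take the place of the percentile aggregators to show dispersion.

```python
from collections import defaultdict
from statistics import mean

def xy_lists(data, agg=mean):
    """hypothetical stand-in: sorted keys and the aggregated values per key"""
    keys = sorted(data)
    return keys, [agg(data[k]) for k in keys]

data = defaultdict(list)
data[40] += [900.0, 1100.0]  # e.g. dimension -> runtimes, in arbitrary insertion order
data[10] += [100.0, 140.0]
data[20] += [300.0]

x, y = xy_lists(data)          # central tendency per key
_, y_low = xy_lists(data, min)   # lower envelope
_, y_high = xy_lists(data, max)  # upper envelope
```

With a plotting library at hand, the three y-lists could be drawn against the same x to show a curve with its range.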

meta_data

Undocumented

@property
successes

return a class instance with attributes x (i.e. keys), n, nsucc, and rate as arrays.

TODO: consider using cma.utilities.utils.DictClass2 instead of namedtuple?

def _near_key(self, key, equal=(lambda x, y: x - 1e-06 < y < x + 1e-06), exclude=None):

return a key in self which is equal to key and otherwise key.