Example Datasets

The package contains a few static datasets that are intended to serve as toy examples.

Warning

This section may be subject to larger changes. In the future, the datasets may be provided by JuliaML/MLDatasets.jl instead.

Fisher’s Iris data set

The Iris data set has become one of the most recognizable machine learning example datasets. It was originally published by Ronald Fisher [FISHER1936] and contains four different kinds of measurements (which we call features) for 150 observations of a plant called Iris. The dataset includes these measurements for three different species of Iris (50 observations each), which is why it is commonly used to showcase classification or clustering algorithms.

load_iris([n]) → Tuple

Loads the first n observations from the Iris flower data set introduced by Ronald Fisher (1936).

Parameters:	n (Int) – default `150`. Specifies how many of the total 150 observations should be returned (in their native order).
Returns:	A tuple of three arrays, as the following code snippet shows. The 4 by `n` matrix `X` contains the numeric measurements, where each column denotes one observation. The vector `y` contains the class labels as strings. The optional vector `names` contains the names of the features (i.e. the rows of `X`).

    X, y, names = load_iris(n)
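
Because each column of `X` is one observation, per-class statistics are computed by selecting columns rather than rows. The following is a minimal sketch of that layout; `class_means` is a hypothetical helper written for illustration and is not part of the package.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical helper (not part of the package): per-class feature means
# for a matrix whose columns are observations.
function class_means(X::AbstractMatrix, y::AbstractVector)
    size(X, 2) == length(y) ||
        throw(DimensionMismatch("expected one label per column of X"))
    # Select the columns that belong to each label and average along them.
    Dict(label => vec(mean(X[:, y .== label], dims=2)) for label in unique(y))
end

# Intended use, assuming the package is loaded:
#   X, y, names = load_iris()
#   class_means(X, y)   # one vector of four feature means per species
```

The returned dictionary maps each class label to a vector with one entry per row of `X`, so its entries line up with `names`.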

See the Wikipedia entry on the Iris flower data set for more information about the dataset.

[FISHER1936]

Fisher, Ronald A. “The use of multiple measurements in taxonomic problems.” Annals of Eugenics 7.2 (1936): 179-188.

Noisy Line Example

This is a static, predefined toy dataset. To generate a noisy line from your own parameters, take a look at noisy_function().

https://cloud.githubusercontent.com/assets/10854026/13020766/75b321d4-d1d7-11e5-940d-25974efa0710.png

load_line() → Tuple

Loads an artificial example dataset for a noisy line. It is particularly useful to explain under- and overfitting.

Returns:	The vector `x` contains 11 equally spaced points between 0 and 1. The vector `y` contains `x ./ 2 .+ 1` plus some Gaussian noise. The optional vector `names` contains descriptive names for `x` and `y`.

    x, y, names = load_line()
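
To make the description concrete, the sketch below builds data of the same shape using only the standard library. It does not reproduce the stored dataset: the seed and the noise level are assumptions chosen for illustration.

```julia
using Random  # standard library

rng = MersenneTwister(42)                      # fixed seed so the sketch is reproducible
x = collect(range(0, stop=1, length=11))       # 11 equally spaced points in [0, 1]
noise_level = 0.1                              # assumed value, not taken from the package
y = x ./ 2 .+ 1 .+ noise_level .* randn(rng, length(x))  # line plus Gaussian noise
```

Both vectors have 11 elements, and `y` scatters around the line with slope 1/2 and intercept 1.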

Noisy Sine Example

This is a static, predefined toy dataset. To generate a noisy sine wave from your own parameters, take a look at noisy_sin().

https://cloud.githubusercontent.com/assets/10854026/13020842/eb6f2f30-d1d7-11e5-8a2c-a264fc14c861.png

load_sin() → Tuple

Loads an artificial example dataset for a noisy sine wave. It is particularly useful to explain under- and overfitting.

Returns:	The vector `x` contains equally spaced points between 0 and 2π. The vector `y` contains `sin(x)` plus some Gaussian noise. The optional vector `names` contains descriptive names for `x` and `y`.

    x, y, names = load_sin()
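
One way to see under- and overfitting on data like this is to fit polynomials of increasing degree by least squares. The sketch below generates its own noisy sine samples instead of calling the package; the number of points, the noise level, the seed, and the helper names `design`, `fit_poly`, and `mse` are all assumptions made for illustration.

```julia
using LinearAlgebra, Random  # standard library only

# Design matrix with columns 1, x, x^2, ..., x^degree.
design(x, degree) = [xi^p for xi in x, p in 0:degree]

# Least-squares polynomial coefficients, lowest order first.
fit_poly(x, y, degree) = design(x, degree) \ y

# Mean squared error of a fitted polynomial against targets `t` at points `x`.
mse(coef, x, t) = sum(abs2, design(x, length(coef) - 1) * coef .- t) / length(t)

rng = MersenneTwister(1)                          # fixed seed for reproducibility
x = collect(range(0, stop=2pi, length=20))        # 20 points: an assumption
y = sin.(x) .+ 0.2 .* randn(rng, length(x))       # noise level: an assumption
xdense = collect(range(0, stop=2pi, length=200))  # grid for comparing with the true curve

for degree in (1, 3, 9)
    coef = fit_poly(x, y, degree)
    println("degree ", degree,
            ": error on noisy samples = ", mse(coef, x, y),
            ", error against sin = ", mse(coef, xdense, sin.(xdense)))
end
```

A straight line (degree 1) cannot follow the curve and underfits. The error on the noisy samples can only shrink as the degree grows, whereas the error against the underlying sine need not, which is the signature of overfitting.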

Noisy Polynomial Example

This is a static, predefined toy dataset. To generate a noisy polynomial from your own parameters, take a look at noisy_poly().

https://cloud.githubusercontent.com/assets/10854026/13020955/9628c120-d1d8-11e5-91f3-c16367de5aad.png

load_poly() → Tuple

Loads an artificial example dataset for a noisy quadratic function. It is particularly useful to explain under- and overfitting.

Returns:	The vector `x` contains 50 points between 0 and 4. The vector `y` contains `2.6 * x.^2 + 0.8 * x` plus some Gaussian noise. The optional vector `names` contains descriptive names for `x` and `y`.

    x, y, names = load_poly()
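
As a worked example of what the quadratic relationship implies, the sketch below generates data of the same form and recovers the coefficients with a least-squares fit. It uses only the standard library; the equal spacing of `x`, the noise level, and the seed are assumptions and do not reproduce the stored dataset.

```julia
using LinearAlgebra, Random  # standard library only

rng = MersenneTwister(7)                       # fixed seed for reproducibility
x = collect(range(0, stop=4, length=50))       # equal spacing is an assumption
noise = 0.5 .* randn(rng, length(x))           # noise level is an assumption
y = 2.6 .* x.^2 .+ 0.8 .* x .+ noise

A = [ones(length(x)) x x.^2]                   # columns: 1, x, x^2
c0, c1, c2 = A \ y                             # least-squares coefficients
println("fitted: ", c2, " * x^2 + ", c1, " * x + ", c0)
```

With a model of the correct degree, the fitted quadratic coefficient should land close to 2.6. The linear and constant terms are estimated less precisely at this noise level.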