# Data Access Pattern¶

Tip

This section just serves as a very concise overview of the available functionality that is provided by MLDataPattern.jl. Take a look at the full documentation for a far more detailed treatment.

If there is one requirement that almost all machine learning experiments have in common, it is that they have to interact with “data” in one way or the other. After all, the goal is for a program to learn from the implicit information contained in that data. Consequently, it is of no surprise that over time a number of particularly useful pattern emerged for how to utilize this data effectively. For instance, we learned that we should leave a subset of the available data out of the training process in order to spot and subsequently prevent over-fitting.

## Terms and Definitions¶

In the context of this package we differentiate between two
categories of data sources based on some useful properties. A
“data source”, by the way, is simply any Julia type that can
provide data. We need not be more precise with this definition,
since it is of little practical consequence. The definitions that
matter are for the two sub-categories of data sources that this
package can actually interact with: **Data Containers** and
**Data Iterators**. These abstractions will allow us to interact
with many different types of data using a coherent and
non-invasive interface.

- Data Container
For a data source to belong in this category it needs to be able to provide two things:

- The total number of observations \(N\), that the data source contains.
- A way to query a specific observation or sequence of observations. This must be done using indices, where every observation has a unique index \(i \in I\) assigned from the set of indices \(I = \{1, 2, ..., N\}\).

- Data Iterator
To belong to this group, a data source must implement Julia’s iterator interface. The data source may or may not know the total amount of observations it can provide, which means that knowing \(N\) is not necessary.

The key requirement for a iteration-based data source is that every iteration consistently returns either a single observation or a batch of observations.

The more flexible of the two categories are what we call data
containers. A good example for such a type is a plain Julia
`Array`

or a `DataFrame`

. Well, almost. To be considered a
data container, the type has to implement the required interface.
In particular, a data container has to implement the functions
`getobs()`

and `nobs()`

. For convenience both of those
implementations are already provided for `Array`

and
`DataFrame`

out of the box. Thus on package import each of
these types becomes a data container type. For more details on
the required interface take a look at the section on Data
Container in the `MLDataPattern`

documentation.

## Working with Data Container¶

Consider the following toy feature matrix `X`

, which has 2 rows
and 6 columns. We can use `nobs()`

to query the number of
observations it contains, and `getobs()`

to query one or more
specific observation(s).

```
julia> X = rand(2, 6)
2×6 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814 0.11202
0.504629 0.522172 0.0997825 0.722906 0.245457 0.000341996
julia> nobs(X)
6
julia> getobs(X, 2) # query the second observation
2-element Array{Float64,1}:
0.933372
0.522172
julia> getobs(X, [4, 1]) # create a batch with observation 4 and 1
2×2 Array{Float64,2}:
0.0443222 0.226582
0.722906 0.504629
```

As you may have noticed, the two functions make a pretty strong
assumption about how to interpret the shape of `X`

. In
particular, they assume that each column denotes a single
observation. This may not be what we want. Given that `X`

has
two dimensions that we could assign meaning to, we should have
the opportunity to choose which dimension enumerates the
observations. After all, we can think of `X`

as a data
container that has 6 observations with 2 features each, or as a
data container that has 2 observations with 6 features each. To
allow for that choice, all relevant functions accept the optional
parameter `obsdim`

. For more information take a look at the
section on Observation Dimension.

```
julia> nobs(X, obsdim = 1)
2
julia> getobs(X, 2, obsdim = 1)
6-element Array{Float64,1}:
0.504629
0.522172
0.0997825
0.722906
0.245457
0.000341996
```

While arrays are very useful to work with, there are not the only
type of data container that is supported by this package.
Consider the following toy `DataFrame`

.

```
julia> df = DataFrame(x1 = rand(4), x2 = rand(4))
4×2 DataFrames.DataFrame
│ Row │ x1 │ x2 │
├─────┼──────────┼───────────┤
│ 1 │ 0.226582 │ 0.505208 │
│ 2 │ 0.504629 │ 0.0997825 │
│ 3 │ 0.933372 │ 0.0443222 │
│ 4 │ 0.522172 │ 0.722906 │
julia> nobs(df)
4
julia> getobs(df, 2)
1×2 DataFrames.DataFrame
│ Row │ x1 │ x2 │
├─────┼──────────┼───────────┤
│ 1 │ 0.504629 │ 0.0997825 │
```

## Subsetting and Shuffling¶

Every data container can be subsetted manually using the
low-level function `datasubset()`

. Its signature is identical
to `getobs()`

, but instead of copying the data it returns a
lazy subset. A lot of the higher-level functions use
`datasubset()`

internally to provide their functionality.
This allows for delaying the actual data access until the data is
actually needed. For arrays the returned subset is in the form of
a `SubArray`

. For more information take a look at the section
on Data Subsets.

```
julia> datasubset(X, 2)
2-element SubArray{Float64,1,Array{Float64,2},Tuple{Colon,Int64},true}:
0.933372
0.522172
julia> datasubset(X, [4, 1])
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
0.0443222 0.226582
0.722906 0.504629
julia> datasubset(X, 2, obsdim = 1)
6-element SubArray{Float64,1,Array{Float64,2},Tuple{Int64,Colon},true}:
0.504629
0.522172
0.0997825
0.722906
0.245457
0.000341996
```

This is of course also true for any `DataFrame`

, in which case
the function returns a `SubDataFrame`

.

```
julia> datasubset(df, 2)
1×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x1 │ x2 │
├─────┼──────────┼───────────┤
│ 1 │ 0.504629 │ 0.0997825 │
julia> datasubset(df, [4, 1])
2×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x1 │ x2 │
├─────┼──────────┼──────────┤
│ 1 │ 0.522172 │ 0.722906 │
│ 2 │ 0.226582 │ 0.505208 │
```

Note that a data subset doesn’t strictly have to be a true
“subset” of the data set. For example, the function
`shuffleobs()`

returns a lazy data subset, which contains
exactly the same observations, but in a randomly permuted order.

```
julia> shuffleobs(X)
2×6 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
0.0443222 0.812814 0.226582 0.11202 0.505208 0.933372
0.722906 0.245457 0.504629 0.000341996 0.0997825 0.522172
julia> shuffleobs(df)
4×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x1 │ x2 │
├─────┼──────────┼───────────┤
│ 1 │ 0.226582 │ 0.505208 │
│ 2 │ 0.933372 │ 0.0443222 │
│ 3 │ 0.522172 │ 0.722906 │
│ 4 │ 0.504629 │ 0.0997825 │
```

Since this function is non-deterministic, it raises the question
of what to do when our data set is made up of multiple variables.
It is not uncommon, for example, that the targets of a labeled
data set are stored in a separate `Vector`

. To support such a
scenario, all relevant functions also accept a `Tuple`

as the
data argument. If that is the case, then all elements of the
given tuple will be processed in the exact same manner. The
return value will then again be a tuple with the individual
results. As you can see in the following code snippet, the
observation-link between `x`

and `y`

is preserved after the
shuffling. For more information about grouping data containers in
a `Tuple`

, take a look at the section on Tuples and Labeled
Data.

```
julia> x = collect(1:6);
julia> y = [:a, :b, :c, :d, :e, :f];
julia> xs, ys = shuffleobs((x, y))
([6,1,4,5,3,2],Symbol[:f,:a,:d,:e,:c,:b])
```

## Splitting into Train / Test¶

A common requirement in a machine learning experiment is to split
the data set into a training and a test portion. While we could
already do this manually using `datasubset()`

, this package
also provides a high-level convenience function `splitobs()`

.

```
julia> y1, y2 = splitobs(y, at = 0.6)
(Symbol[:a,:b,:c,:d],Symbol[:e,:f])
julia> train, test = splitobs(df)
(3×2 DataFrames.SubDataFrame{UnitRange{Int64}}
│ Row │ x1 │ x2 │
├─────┼──────────┼───────────┤
│ 1 │ 0.226582 │ 0.505208 │
│ 2 │ 0.504629 │ 0.0997825 │
│ 3 │ 0.933372 │ 0.0443222 │,
1×2 DataFrames.SubDataFrame{UnitRange{Int64}}
│ Row │ x1 │ x2 │
├─────┼──────────┼──────────┤
│ 1 │ 0.522172 │ 0.722906 │)
```

As we can see in the example above, the function `splitobs()`

performs a static “split” of the given data at the relative
position `at`

, and returns the result in the form of two data
subsets. It is also possible to specify multiple fractions, which
will cause the function to perform additional splits.

```
julia> y1, y2, y3 = splitobs(y, at = (0.5, 0.3))
(Symbol[:a,:b,:c],Symbol[:d,:e],Symbol[:f])
```

Of course, a simple static split isn’t always what we want. In
most situations we would rather partition the data set into two
disjoint subsets using random assignment. We can do this by
combining `splitobs()`

with `shuffleobs()`

. Since neither
of which copies actual data we do not pay any significant
performance penalty for nesting “subsetting” functions.

```
julia> y1, y2 = splitobs(shuffleobs(y), at = 0.6)
(Symbol[:c,:e,:f,:a],Symbol[:b,:d])
julia> y1, y2, y3 = splitobs(shuffleobs(y), at = (0.5, 0.3))
(Symbol[:b,:f,:e],Symbol[:d,:a],Symbol[:c])
```

It is also possible to call `splitobs()`

with two data
containers grouped in a `Tuple`

. While this is especially
useful for working with labeled data, neither implies the other.
That means that one can use tuples to group together unlabeled
data, or have a labeled data container that is not a tuple (see
Labeled Data Container for some examples). For instance, since
the function `splitobs()`

performs a static split, it doesn’t
actually care if the given `Tuple`

describes a labeled data
set. In fact, it makes no difference.

```
julia> X = rand(2, 6)
2×6 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814 0.11202
0.504629 0.522172 0.0997825 0.722906 0.245457 0.000341996
julia> y = ["a", "a", "b", "b", "b", "b"]
6-element Array{String,1}:
"a"
"a"
"b"
"b"
"b"
"b"
julia> (X1, y1), (X2, y2) = splitobs((X, y), at = 0.5);
julia> y1, y2
(String["a","a","b"],String["b","b","b"])
```

## Stratified Sampling¶

Usually it is a good idea to make sure that we actively try to preserve the class distribution for every data subset. This will help to make sure that the data subsets are similar in structure and more likely to be representative of the full data set.

```
julia> (X1, y1), (X2, y2) = stratifiedobs((X, y), p = 0.5);
julia> y1, y2
(String["b","a","b"],String["b","b","a"])
```

Note how both, `y1`

and `y2`

, contain twice as many `"b"`

as `"a"`

, just like `y`

does. For more information on
stratified sampling, take a look at Stratified Sampling

## Over- and Undersampling¶

On the other hand, some functions require the presence of targets
to perform their respective tasks. In such a case, it is always
assumed that the last tuple element contains the targets. Two
such functions are `undersample()`

and `oversample()`

,
which can be used to re-sample a labeled data container in such a
way, that the resulting class distribution is uniform.

```
julia> undersample(y)
4-element SubArray{String,1,Array{String,1},Tuple{Array{Int64,1}},false}:
"a"
"b"
"b"
"a"
julia> Xnew, ynew = undersample((X, y), shuffle = false)
([0.226582 0.933372 0.812814 0.11202; 0.504629 0.522172 0.245457 0.000341996],
String["a","b","b","a"])
julia> Xnew, ynew = oversample((X, y), shuffle = true)
([0.11202 0.933372 … 0.505208 0.0443222; 0.000341996 0.522172 … 0.0997825 0.722906],
String["a","b","a","a","b","a","b","b"])
```

If need be, all functions that require a labeled data container
accept a target-extraction-function as an optional first
parameter. If such a function is provided, it will be applied to
each observation individually. In the following example the
function `indmax`

will be applied to each column slice of `Y`

in order to derive a class label, which is then used for
down-sampling. For more information take a look at the section on
Labeled Data Container.

```
julia> Y = [1. 0. 0. 0. 0. 1.; 0. 1. 1. 1. 1. 0.]
2×6 Array{Float64,2}:
1.0 0.0 0.0 0.0 0.0 1.0
0.0 1.0 1.0 1.0 1.0 0.0
julia> Xnew, Ynew = undersample(indmax, (X, Y));
julia> Ynew
2×4 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
1.0 0.0 0.0 1.0
0.0 1.0 1.0 0.0
```

Special support is provided for `DataFrame`

where the first
parameter can also be a `Symbol`

that denotes which column
contains the targets.

```
julia> df = DataFrame(x1 = rand(5), x2 = rand(5), y = [:a,:a,:b,:a,:b])
5×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ y │
├─────┼──────────┼───────────┼───┤
│ 1 │ 0.226582 │ 0.0997825 │ a │
│ 2 │ 0.504629 │ 0.0443222 │ a │
│ 3 │ 0.933372 │ 0.722906 │ b │
│ 4 │ 0.522172 │ 0.812814 │ a │
│ 5 │ 0.505208 │ 0.245457 │ b │
julia> undersample(:y, df)
4×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ x1 │ x2 │ y │
├─────┼──────────┼───────────┼───┤
│ 1 │ 0.226582 │ 0.0997825 │ a │
│ 2 │ 0.933372 │ 0.722906 │ b │
│ 3 │ 0.522172 │ 0.812814 │ a │
│ 4 │ 0.505208 │ 0.245457 │ b │
```

## K-Folds Repartitioning¶

This package also provides functions to perform re-partitioning strategies. These result in vector-like views that can be iterated over, in which each element is a different partition of the original data. Note again that all partitions are just lazy subsets, which means that no data is copied. For more information take a look at Repartitioning Strategies.

```
julia> x = collect(1:10);
julia> folds = kfolds(x, k = 5)
5-fold MLDataPattern.FoldsView of 10 observations:
data: 10-element Array{Int64,1}
training: 8 observations/fold
validation: 2 observations/fold
obsdim: :last
julia> train, val = folds[1] # access first fold
([3,4,5,6,7,8,9,10],[1,2])
```

## Data Views and Iterators¶

Such “views” also exist for other purposes. For example, the
function `obsview()`

will create a decorator around some data
container, that makes the given data container appear as a vector
of individual observations. This “vector” can then be indexed
into or iterated over.

```
julia> X = rand(2, 6)
2×6 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814 0.11202
0.504629 0.522172 0.0997825 0.722906 0.245457 0.000341996
julia> ov = obsview(X)
6-element obsview(::Array{Float64,2}, ObsDim.Last()) with element type SubArray{...}:
[0.226582,0.504629]
[0.933372,0.522172]
[0.505208,0.0997825]
[0.0443222,0.722906]
[0.812814,0.245457]
[0.11202,0.000341996]
```

Similarly, the function `batchview()`

creates a decorator
that makes the given data container appear as a vector of equally
sized mini-batches.

```
julia> bv = batchview(X, size = 2)
3-element batchview(::Array{Float64,2}, 2, 3, ObsDim.Last()) with element type SubArray{...}
[0.226582 0.933372; 0.504629 0.522172]
[0.505208 0.0443222; 0.0997825 0.722906]
[0.812814 0.11202; 0.245457 0.000341996]
```

A third but conceptually different kind of view is provided by
`slidingwindow()`

. This function is particularly useful for
preparing sequence data for various training tasks. For more
information take a look at the section on Data Views.

```
julia> data = split("The quick brown fox jumps over the lazy dog")
9-element Array{SubString{String},1}:
"The"
"quick"
"brown"
"fox"
"jumps"
"over"
"the"
"lazy"
"dog"
julia> A = slidingwindow(i->i+2, data, 2, stride=1)
7-element slidingwindow(::##9#10, ::Array{SubString{String},1}, 2, stride = 1) with element type Tuple{...}:
(["The", "quick"], "brown")
(["quick", "brown"], "fox")
(["brown", "fox"], "jumps")
(["fox", "jumps"], "over")
(["jumps", "over"], "the")
(["over", "the"], "lazy")
(["the", "lazy"], "dog")
julia> A = slidingwindow(i->[i-2:i-1; i+1:i+2], data, 1)
5-element slidingwindow(::##11#12, ::Array{SubString{String},1}, 1) with element type Tuple{...}:
(["brown"], ["The", "quick", "fox", "jumps"])
(["fox"], ["quick", "brown", "jumps", "over"])
(["jumps"], ["brown", "fox", "over", "the"])
(["over"], ["fox", "jumps", "the", "lazy"])
(["the"], ["jumps", "over", "lazy", "dog"])
```

Aside from data containers, there is also another sub-category of
data sources, called **data iterators**, that can not be indexed
into. For example the following code creates an object that when
iterated over, continuously and indefinitely samples a random
observation (with replacement) from the given data container.

```
julia> iter = RandomObs(X)
RandomObs(::Array{Float64,2}, ObsDim.Last())
Iterator providing Inf observations
```

To give a second example for a data iterator, the type
`RandomBatches`

generates randomly sampled mini-batches
for a fixed size. For more information on that topic, take a look
at the section on Data Iterators.

```
julia> iter = RandomBatches(X, size = 10)
RandomBatches(::Array{Float64,2}, 10, ObsDim.Last())
Iterator providing Inf batches of size 10
julia> iter = RandomBatches(X, count = 50, size = 10)
RandomBatches(::Array{Float64,2}, 10, 50, ObsDim.Last())
Iterator providing 50 batches of size 10
```

## Putting it all together¶

Let us round out this introduction by taking a look at a “hello world” example (with little explanation) to get a feeling for how to combine the various functions of this package in a typical ML scenario.

```
# X is a matrix; Y is a vector
X, Y = rand(4, 150), rand(150)
# The iris dataset is ordered according to their labels,
# which means that we should shuffle the dataset before
# partitioning it into training- and test-set.
Xs, Ys = shuffleobs((X, Y))
# Notice how we use tuples to group data.
# We leave out 15 % of the data for testing
(cv_X, cv_Y), (test_X, test_Y) = splitobs((Xs, Ys); at = 0.85)
# Next we partition the data using a 10-fold scheme.
# Notice how we do not need to splat train into X and Y
for (train, (val_X, val_Y)) in kfolds((cv_X, cv_Y); k = 10)
for epoch = 1:100
# Iterate over the data using mini-batches of 5 observations each
for (batch_X, batch_Y) in eachbatch(train, size = 5)
# ... train supervised model on minibatches here
end
end
end
```

In the above code snippet, the inner loop for `eachbatch()`

is the only place where data other than indices is actually being
copied. That is because `cv_X`

, `test_X`

, `val_X`

, etc. are
all array views of type `SubArray`

(the same applies to all the
Y’s of course). In contrast to this, `batch_X`

and `batch_Y`

will be of type `Array`

. Naturally, array views only work for
arrays, but we provide a generalization of such a data subset for
any type of data container.

Furthermore both, `batch_X`

and `batch_Y`

, will be the same
instances each iteration with only their values changed. In other
words, they both are preallocated buffers that will be reused
each iteration and filled with the data for the current batch.
Naturally, one is not required to work with buffers like this, as
stateful iterators can have undesired side-effects when used
without care. For example `collect(eachbatch(X))`

would result
in an array that has the exact same batch in each position.
Oftentimes, though, reusing buffers is preferable. This package
provides different alternatives for different use-cases.