# Getting Started¶

MLDataUtils is intended an end-user friendly interface to all the data related functionality in the JuliaML ecosystem. These include MLLabelUtils.jl and MLDataPattern.jl. Aside from reexporting their functionality, MLDataUtils also provides some additional glue code to improve the end-user experience.

## Installation¶

To install MLDataUtils.jl, start up Julia and type the following code-snipped into the REPL. It makes use of the native Julia package manger.

```
Pkg.add("MLDataUtils")
```

Additionally, for example if you encounter any sudden issues, or in the case you would like to contribute to the package, you can manually choose to be on the latest (untagged) version.

```
Pkg.checkout("MLDataUtils")
```

## Beginner Tutorial¶

To get a better feeling for what you can to with this package,
let us take a look at a simple machine learning experiment. More
concretely, let us use this package to implement a linear
soft-margin classifier (often referred to as a “linear support
vector machine”) that can distinguish between the two species of
Iris `setosa`

and `versicolor`

using the *sepal length* and
the *sepal width* as features. To that end we will use u subset
of the famous **Iris flower dataset** published by Ronald Fisher
[FISHER1936]. Check out the wikipedia entry for more
information about the dataset.

As a first step, let us load the iris data set using the function
`load_iris()`

. This function accepts an optional integer
argument that can be used to specify how many observations should
be loaded. Because we are only interested in two of the three
classes, we will only load the first `100`

observations.

```
using MLDataUtils
X, Y, fnames = load_iris(100);
```

As we can see, the function `load_iris()`

returns three
variables.

The variable

`X`

contains all our**features**, sometimes called*independent variables*or*predictors*. Therefore`X`

is often referred to as*feature matrix*. Each column of the matrix corresponds to a single**observation**, or*sample*. For this data set it means that every observation has 4 features each. These features represent some quantitative information known about the corresponding observation. In fact, they are distance measurements in centimeters (more on that later).julia> X 4×100 Array{Float64,2}: 5.1 4.9 4.7 4.6 5.0 5.4 4.6 … 5.0 5.6 5.7 5.7 6.2 5.1 5.7 3.5 3.0 3.2 3.1 3.6 3.9 3.4 2.3 2.7 3.0 2.9 2.9 2.5 2.8 1.4 1.4 1.3 1.5 1.4 1.7 1.4 3.3 4.2 4.2 4.2 4.3 3.0 4.1 0.2 0.2 0.2 0.2 0.2 0.4 0.3 1.0 1.3 1.2 1.3 1.3 1.1 1.3 julia> getobs(X, 2) # query second observations 4-element Array{Float64,1}: 4.9 3.0 1.4 0.2

The variable

`Y`

contains the**labels**(also often called*classes*or*categories*) of each observation. These terms are usually used in the context of predicting categorical variables, such as we do in this example. The more general term for`Y`

, which also includes the case of numerical outcomes, is**targets**,*responses*, or*dependent variables*. For this particular example,`Y`

denotes our classification targets in the form of a string vector where each element is one of two possible strings,`"setosa"`

and`"versicolor"`

.julia> Y 100-element Array{String,1}: "setosa" "setosa" "setosa" ⋮ "versicolor" "versicolor" "versicolor" julia> label(Y) 2-element Array{String,1}: "setosa" "versicolor"

The variable

`fnames`

is really just for convenience, and denotes short descriptive names for the four different features. Here we can see that the four features are distance measurements of various widths and heights.julia> fnames 4-element Array{String,1}: "Sepal length" "Sepal width" "Petal length" "Petal width"

Together, `X`

and `Y`

represent our data set. Both variables
contain 100 observations. More importantly, the individual
elements of the two variables are linked together through the
corresponding observation-index. For example, the following code
snippet shows how to access the 30-th observation of the full
data set.

```
julia> getobs((X, Y), 30)
([4.7,3.2,1.6,0.2],"setosa")
```

This link is important and has to be preserved. See the section on Tuples and Labeled Data from the MLDataPattern documentation for more information.

Note

As you may have noticed we chose to work with the Iris data in
the form of a `Matrix`

and a `Vector`

, instead of
something like a `DataFrame`

. The reason for this is simply
didactic convenience. In case you prefer working with a
`DataFrame`

, however, note that most of the functions that
this package provides can also deal with `DataFrames`

equally well. You can use the RDatasets package to
load the iris data in `DataFrame`

form.

```
julia> using RDatasets
julia> iris = dataset("datasets", "iris")
150×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ "setosa" │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ "setosa" │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ "setosa" │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ "setosa" │
⋮
│ 147 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ "virginica" │
│ 148 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ "virginica" │
│ 149 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ "virginica" │
│ 150 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ "virginica" │
julia> getobs(iris, 30)
1×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1 │ 4.7 │ 3.2 │ 1.6 │ 0.2 │ "setosa" │
```

The first thing we will do for our experiment, is restrict the
features in `X`

to just `"Sepal length"`

and ```
"Sepal
width"
```

. We do this for the sole reason of convenient
visualisation in a 2D plot. This will make this little tutorial a
lot more intuitive. Furthermore, we will add a row of ones to the
matrix. This will serve as a feature that all observations share,
and thus allow the model to learn an offset that applies to all
observations equally.

```
julia> X = vcat(X[1:2,:], ones(1,100))
3×100 Array{Float64,2}:
5.1 4.9 4.7 4.6 5.0 5.4 4.6 … 5.0 5.6 5.7 5.7 6.2 5.1 5.7
3.5 3.0 3.2 3.1 3.6 3.9 3.4 2.3 2.7 3.0 2.9 2.9 2.5 2.8
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
julia> fnames = fnames[1:2]
2-element Array{String,1}:
"Sepal length"
"Sepal width"
```

Alright, now we have our complete example data set! While this is
just a part of the Iris flower data, it will be the full data set
in the context of this specific tutorial. Let us use the
Plots.jl package to
visualize it. To do that, we can use the function
`labelmap()`

to loop through all the classes and their
observations. This way we can plot the observations with
different colors and labels.

```
using Plots
pyplot()
# Create empty plot with xlabel and ylabel
plt = plot(xguide = fnames[1], yguide = fnames[2])
# Loop through labels and their indices and plot the points
for (lbl, idx) in labelmap(Y)
scatter!(plt, X[1,idx], X[2,idx], label = lbl)
end
plt
```

The resulting plot can be seen in Fig. 1. As we can see, the classes seem decently well separated. Our goal is now to write a program that can learn how to separate those classes itself. That is, given the coordinates of an observation, predict which class that observation belongs to. While this is the ultimate goal of this tutorial, let’s not get ahead of ourselves just yet.

Figure 1. The full example data set colored according to the class label |

Tip

You may have noted how we used the function `labelmap()`

in a `for`

loop. This is convenient, because it returns a
dictionary that has one key-value pair per label, where each
key is a label, and each value is an array of all the
observation-indices that belong to that label.

```
julia> labelmap(Y)
Dict{String,Array{Int64,1}} with 2 entries:
"setosa" => [1,2,3,4,5,6,7,8,9,10 … 41,42,43,44,45,46,47,48,49,50]
"versicolor" => [51,52,53,54,55,56,57,58,59,60 … 91,92,93,94,95,96,97,98,99,100]
```

Before we can train anything, we have to first think about how a solution should be represented. In other words, how does the prediction work? We have three numbers as input for a single observation, and we would like an output that we can easily interpret as one of the two classes. What we are looking for is an appropriate prediction model.

A **prediction model** is a family of functions that restricts
the potential solution to a specific formalism / representation.
Often such a model will also pose a restriction on the complexity
of the solution, which further limits the search space of
potential solutions. So in a way a prediction model can be
thought of as the manifestation of our assumptions about the
problem, because it restricts the solution to a specific family
of functions.

For our example, we will choose a linear model. Given that
restriction, we can now represent any solution as just three
numbers: the **coefficients**, denoted as \(\theta \in
\mathbb{R}^3\). What does that mean? Well, remember how a
prediction model is a family of functions. For our example it is
the family of linear functions with three coefficients (because
we have three features). More formally, a linear prediction model
\(h_\theta\) for three features, is the family of functions
that map an input from the feature space \(X \in
\mathbb{R}^3\) to the real numbers \(\mathbb{R}\) using the
some fixed coefficients \(\theta \in \mathbb{R}^3\).

The first question one might ask is, why isn’t \(\theta\)
simply a parameter of \(h\), instead of an odd-looking
subscript. Well, remember how \(h_\theta\) is a *family* of
functions, not a function. That means that in order to have an
actual **prediction function**, we first need to choose three
numbers for the coefficient vector \(\theta\). Think of these
numbers as hard coded information for that function. In other
words, they are not parameters of that function, instead once
chosen they are an intrinsic part of that function. The goal of a
learning algorithm is then to find the “best” function from that
family, by systematically trying out different \(\theta\).

Let us see what this means in terms of actual code. First, let’s
define our prediction *model*.

```
immutable LinearFunction
θ::Vector{Float64}
end
(h::LinearFunction)(x::AbstractVector) = dot(x, h.θ)
```

We explicitly said *model*, yet the type is called
`LinearFunction`

. This is no accident. The prediction model in
this case is the type, while an instance of this type is a
concrete prediction function.

```
julia> LinearFunction # the type is the "family of functions"
LinearFunction
julia> h = LinearFunction([1., 1., 1.]) # an instance is a "function"
LinearFunction([1.0,1.0,1.0])
```

We can now use `h`

just like we would use any other Julia
function. For example we can pass the first observation of our
data set to it. We can query the first observation using the
function `getobs()`

. That said, `h`

doesn’t know nor care
if we pass an actual observation from `X`

to it. What matters
is that it has the right structure (i.e. three numeric features).
That is a good thing, because in general we want to learn `h`

in such a way that we can use it for new data points that weren’t
known before.

```
julia> h(getobs(X, 1))
9.6
julia> h([1.0, 1.0, 1.0]) # made up observation
3.0
```

Note that the number we get as output does not mean anything yet. We haven’t even specified how we want to interpret the output of our linear function. We only defined its representation and how it works.

Now we have to think about **interpretation**. A useful way to
think about the output is in terms of a separating *point*; yes,
point, not line. We just saw in our last code snippet how the
output of a prediction function is a real number. What if we say,
that we would like to interpret this output in the following way.
Let \(class\) be our decision function that we use to classify
the output of the prediction function \(h\). Furthermore, let
said output of the prediction function be denoted as
\(\hat{y}\) (pronounced “why hat”).

What \(class\) does is impose a decision boundary at
\(0\). If the output of our prediction function is greater
than \(0\), we will interpret it as a prediction for the
class `"versicolor"`

, while if the output is smaller than
\(0\), we will interpret it as a prediction for the class
`"setosa"`

. This is called a **margin-based** interpretation of
the output. We can implement the function `class`

using
`convertlabel()`

to transform a number from a margin-based
interpretation to our problem-specific representation.

```
julia> const class_labels = ["versicolor", "setosa"]
julia> class(yhat) = convertlabel(class_labels, yhat, LabelEnc.MarginBased())
julia> class(h([1, 1, 1])) # try it out
"versicolor"
julia> class(0.5)
"versicolor"
julia> class(-0.1)
"setosa"
```

Tip

Using `convertlabel()`

like this is really just a
convenient shortcut for a two step process. Usually what one
does is to first classify the output according to its
interpretation, which in this case is “margin-based”. We can
do this using the function `classify()`

, which transforms
the output to the correct label of the same interpretation.

```
julia> classify(0.3, LabelEnc.MarginBased())
1.0
```

The output of `classify()`

is then either the positive or
the negative label of the given label encoding. The next step
is to convert from one label encoding to another using the
function `convertlabel()`

.

```
julia> convertlabel(["positive", "negative"], 1.0, LabelEnc.MarginBased())
"positive"
```

Given that this is such a common use case, it is possible to
perform both steps at once by using the pre-classified
prediction \(\hat{y}\) in `convertlabel()`

directly.

How can we visualize this? Well if we think about it we could just plot a contour surface where we compute the output of our prediction function for a large grid of input numbers. The line, where this contour surface is zero, is then our decision boundary for that specific prediction function, where each side corresponds to one predicted class label.

Warning

Wait a second, now it is a “line”? Didn’t we say “point” a few paragraphs ago? Yes and yes. While the decision boundary is a point in the output space, it manifests as a hyper-plane (here a line) in the input space. Since our plot will be in input space (the x-axis and y-axis are our features), it will be a line.

Let’s do it. Consider the prediction function for some fixed coefficient vector \(\theta\). Here we “cheated” and chose a somehow known set of numbers that give a good solution to our prediction problem.

```
θ = [1.15, -1.0, -3]
h = LinearModel(θ)
contour!(deepcopy(plt), 4:0.5:7, 2:0.5:5, (x1, x2) -> h([x1, x2, 1]),
fill=true, levels=-7:7, fillalpha=0.5, color=:bluesreds)
```

The resulting plot can be seen in Fig. 2. Of special interest is
the contour line with value \(0\), because that is what is
known as the *separating hyperplane*. It is the decision boundary
where everything on the red side will be classified as
`"versicolor"`

and everything on the blue side will be
classified as `"setosa"`

. Note how for the chosen set of
coefficients, all the observations would be classified correctly.

Figure 2. The prediction surface for a good set of manually chosen coefficients. |

The emphasis of the caption in Fig. 2 is very important. The plot
shows the contours of the **prediction surface**, and not the
cost/error surface. At this point in the tutorial we don’t even
know what a “error surface” is supposed to be.

Tip

If you are curious about the influence of the three coefficients in \(\theta\) on the separating hyperplane, try exploring their values in a Jupyter notebook using the great package Interact.jl.

```
using Interact
gr()
@manipulate for θ₁ = 0.5:0.05:2,
θ₂ = -3.0:0.05:-1.0,
θ₃ = -3:0.05:0
h = LinearModel([θ₁, θ₂, θ₃])
contour!(deepcopy(plt), 4:0.5:7, 2:0.5:5, (x1, x2) -> h([x1, x2, 1]),
colorbar=false, zlims=(-6,6), levels=3, color=:greens)
end
```

You will see that their relationship to the line is quite unintuitive, because they aren’t the coefficients that denote the line itself. Instead they describe an off-setted vector (\(\theta_3\) is the offset) that is normal to the displayed line. This is very unlike linear regression.

Warning

**TO BE CONTINUED**

This tutorial is still a work in progress.

## How to ... ?¶

Chances are you ended up here with a very specific use-case in mind. This section outlines a number of different but common scenarios and explains how this package can be utilized to solve them. Before we get started, however, we need to bring MLDataUtils into scope. Once installed the package can be imported just as any other Julia package.

```
using MLDataUtils
```

- [docs] Infer which encoding some classification targets use.

```
julia> enc = labelenc([-1,1,1,-1,1])
MLLabelUtils.LabelEnc.MarginBased{Int64}()
```

- [docs] Assert if some classification targets are of the encoding I need them in.

```
julia> islabelenc([0,1,1,0,1], LabelEnc.MarginBased)
false
```

- [docs] Convert targets into a specific encoding that my model requires.

```
julia> convertlabel(LabelEnc.OneOfK{Float32}, [-1,1,-1,1,1,-1])
2×6 Array{Float32,2}:
0.0 1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 0.0 0.0 1.0
```

- [docs] Work with matrices in which the user can choose of the rows or the columns denote the observations.

```
julia> convertlabel(LabelEnc.OneOfK{Float32}, Int8[-1,1,-1,1,1,-1], obsdim = 1)
6×2 Array{Float32,2}:
0.0 1.0
1.0 0.0
0.0 1.0
1.0 0.0
1.0 0.0
0.0 1.0
```

- [docs] Group observations according to their class-label.

```
julia> labelmap([0, 1, 1, 0, 0])
Dict{Int64,Array{Int64,1}} with 2 entries:
0 => [1,4,5]
1 => [2,3]
```

- [docs] Classify model predictions into class labels appropriate for the encoding of the targets.

```
julia> classify(-0.3, LabelEnc.MarginBased())
-1.0
```

- [docs] Create a lazy data subset of some data.

```
julia> X = rand(2, 6)
2×6 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814 0.11202
0.504629 0.522172 0.0997825 0.722906 0.245457 0.000341996
julia> datasubset(X, 2:3)
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
0.933372 0.505208
0.522172 0.0997825
```

- [docs] Shuffle the observations of a data container.

```
julia> shuffleobs(X)
2×6 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
0.505208 0.812814 0.11202 0.0443222 0.933372 0.226582
0.0997825 0.245457 0.000341996 0.722906 0.522172 0.504629
```

- [docs] Split data into train/test subsets.

```
julia> train, test = splitobs(X, at = 0.7);
julia> train
2×4 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
0.226582 0.933372 0.505208 0.0443222
0.504629 0.522172 0.0997825 0.722906
julia> test
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
0.812814 0.11202
0.245457 0.000341996
```

- [docs] Partition data into train/test subsets using stratified sampling.

```
julia> train, test = stratifiedobs([:a,:a,:b,:b,:b,:b], p = 0.5)
(Symbol[:b,:b,:a],Symbol[:b,:a,:b])
julia> train
3-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
:b
:b
:a
julia> test
3-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
:b
:a
:b
```

- [docs] Group multiple variables together and treat them as a single data set.

```
julia> shuffleobs(([1,2,3], [:a,:b,:c]))
([3,1,2],Symbol[:c,:a,:b])
```

- [docs] Support my own custom user-defined data container type.

```
julia> using DataTables, LearnBase
julia> LearnBase.nobs(dt::AbstractDataTable) = nrow(dt)
julia> LearnBase.getobs(dt::AbstractDataTable, idx) = dt[idx,:]
julia> LearnBase.datasubset(dt::AbstractDataTable, idx, ::ObsDim.Undefined) = view(dt, idx)
```

- [docs] Over- or undersample an imbalanced labeled data set.

```
julia> undersample([:a,:b,:b,:a,:b,:b])
4-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
:a
:b
:b
:a
```

- [docs] Repartition a data container using a k-folds scheme.

```
julia> folds = kfolds([1,2,3,4,5,6,7,8,9,10], k = 5)
5-element MLDataPattern.FoldsView{Tuple{SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false},SubArray{Int64,1,Array{Int64,1},Tuple{UnitRange{Int64}},true}},Array{Int64,1},LearnBase.ObsDim.Last,Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}:
([3,4,5,6,7,8,9,10],[1,2])
([1,2,5,6,7,8,9,10],[3,4])
([1,2,3,4,7,8,9,10],[5,6])
([1,2,3,4,5,6,9,10],[7,8])
([1,2,3,4,5,6,7,8],[9,10])
```

- [docs] Iterate over my data one observation or batch at a time.

```
julia> obsview(([1 2 3; 4 5 6], [:a, :b, :c]))
3-element MLDataPattern.ObsView{Tuple{SubArray{Int64,1,Array{Int64,2},Tuple{Colon,Int64},true},SubArray{Symbol,0,Array{Symbol,1},Tuple{Int64},false}},Tuple{Array{Int64,2},Array{Symbol,1}},Tuple{LearnBase.ObsDim.Last,LearnBase.ObsDim.Last}}:
([1,4],:a)
([2,5],:b)
([3,6],:c)
```

## Getting Help¶

To get help on specific functionality you can either look up the
information here, or if you prefer you can make use of Julia’s
native doc-system.
The following example shows how to get additional information on
`DataSubset`

within Julia’s REPL:

```
?DataSubset
```

If you find yourself stuck or have other questions concerning the
package you can find us at gitter or the *Machine Learning*
domain on discourse.julialang.org

If you encounter a bug or would like to participate in the further development of this package come find us on Github.