Getting Started
MLDataUtils is intended as an end-user friendly interface to all the data-related functionality in the JuliaML ecosystem. This includes MLLabelUtils.jl and MLDataPattern.jl. Aside from reexporting their functionality, MLDataUtils also provides some additional glue code to improve the end-user experience.
Installation
To install MLDataUtils.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.
Pkg.add("MLDataUtils")
Additionally, if you encounter any sudden issues or would like to contribute to the package, you can manually choose to be on the latest (untagged) version.
Pkg.checkout("MLDataUtils")
Beginner Tutorial
To get a better feeling for what you can do with this package,
let us take a look at a simple machine learning experiment. More
concretely, let us use this package to implement a linear
soft-margin classifier (often referred to as a “linear support
vector machine”) that can distinguish between the two Iris
species setosa and versicolor using the sepal length and the
sepal width as features. To that end we will use a subset of the
famous Iris flower dataset published by Ronald Fisher
[FISHER1936]. Check out the Wikipedia entry for more information
about the dataset.
As a first step, let us load the iris data set using the function
load_iris(). This function accepts an optional integer argument
that can be used to specify how many observations should be
loaded. Because we are only interested in two of the three
classes, we will only load the first 100 observations.
using MLDataUtils
X, Y, fnames = load_iris(100);
As we can see, the function load_iris() returns three variables.
The variable X contains all our features, sometimes called independent variables or predictors. Therefore X is often referred to as the feature matrix. Each column of the matrix corresponds to a single observation, or sample. For this data set it means that every observation has 4 features. These features represent some quantitative information known about the corresponding observation. In fact, they are distance measurements in centimeters (more on that later).
julia> X
4×100 Array{Float64,2}:
5.1 4.9 4.7 4.6 5.0 5.4 4.6 … 5.0 5.6 5.7 5.7 6.2 5.1 5.7
3.5 3.0 3.2 3.1 3.6 3.9 3.4 2.3 2.7 3.0 2.9 2.9 2.5 2.8
1.4 1.4 1.3 1.5 1.4 1.7 1.4 3.3 4.2 4.2 4.2 4.3 3.0 4.1
0.2 0.2 0.2 0.2 0.2 0.4 0.3 1.0 1.3 1.2 1.3 1.3 1.1 1.3
julia> getobs(X, 2) # query the second observation
4-element Array{Float64,1}:
4.9
3.0
1.4
0.2
The variable Y contains the labels (also often called classes or categories) of each observation. These terms are usually used in the context of predicting categorical variables, as we do in this example. The more general terms for Y, which also cover the case of numerical outcomes, are targets, responses, or dependent variables. For this particular example, Y denotes our classification targets in the form of a string vector where each element is one of two possible strings, "setosa" and "versicolor".
julia> Y
100-element Array{String,1}:
"setosa"
"setosa"
"setosa"
⋮
"versicolor"
"versicolor"
"versicolor"
julia> label(Y)
2-element Array{String,1}:
"setosa"
"versicolor"
The variable fnames is really just for convenience, and denotes short descriptive names for the four different features. Here we can see that the four features are length and width measurements of the sepals and petals.
julia> fnames
4-element Array{String,1}:
"Sepal length"
"Sepal width"
"Petal length"
"Petal width"
Together, X and Y represent our data set. Both variables contain
100 observations. More importantly, the individual elements of
the two variables are linked together through the corresponding
observation-index. For example, the following code snippet shows
how to access the 30th observation of the full data set.
julia> getobs((X, Y), 30)
([4.7,3.2,1.6,0.2],"setosa")
This link is important and has to be preserved. See the section on Tuples and Labeled Data from the MLDataPattern documentation for more information.
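To illustrate what preserving the link means in practice, here is a minimal sketch in plain Julia (1.x syntax) that does not use the package at all. The names X_toy and Y_toy and their values are made up for illustration; the point is only that both containers must always be indexed or reordered with the same observation indices.

```julia
using Random  # provides randperm() on Julia 1.x

# Toy stand-ins for X and Y: three observations stored column-wise, plus labels
X_toy = [5.1 4.9 4.7;
         3.5 3.0 3.2]
Y_toy = ["a", "b", "c"]

# Access one observation of the combined data set: same index into both
obs2 = (X_toy[:, 2], Y_toy[2])

# Reorder both containers with ONE shared permutation so pairs stay intact
perm = randperm(3)
X_shuf, Y_shuf = X_toy[:, perm], Y_toy[perm]

# The features that belonged to label "b" still sit next to it after shuffling
j = findfirst(==("b"), Y_shuf)
println(X_shuf[:, j] == [4.9, 3.0])
```

Shuffling X_toy and Y_toy with two independent permutations would silently pair features with the wrong labels, which is the kind of mistake that treating the tuple as one data set is meant to prevent.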
Note
As you may have noticed, we chose to work with the Iris data in
the form of a Matrix and a Vector, instead of something like a
DataFrame. The reason for this is simply didactic convenience. In
case you prefer working with a DataFrame, however, note that most
of the functions that this package provides can deal with
DataFrames equally well. You can use the RDatasets package to
load the iris data in DataFrame form.
julia> using RDatasets
julia> iris = dataset("datasets", "iris")
150×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ "setosa" │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ "setosa" │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ "setosa" │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ "setosa" │
⋮
│ 147 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ "virginica" │
│ 148 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ "virginica" │
│ 149 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ "virginica" │
│ 150 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ "virginica" │
julia> getobs(iris, 30)
1×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1 │ 4.7 │ 3.2 │ 1.6 │ 0.2 │ "setosa" │
The first thing we will do for our experiment is restrict the
features in X to just "Sepal length" and "Sepal width". We do
this for the sole reason of convenient visualisation in a 2D
plot. This will make this little tutorial a lot more intuitive.
Furthermore, we will add a row of ones to the matrix. This will
serve as a feature that all observations share, and thus allow
the model to learn an offset that applies to all observations
equally.
julia> X = vcat(X[1:2,:], ones(1,100))
3×100 Array{Float64,2}:
5.1 4.9 4.7 4.6 5.0 5.4 4.6 … 5.0 5.6 5.7 5.7 6.2 5.1 5.7
3.5 3.0 3.2 3.1 3.6 3.9 3.4 2.3 2.7 3.0 2.9 2.9 2.5 2.8
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
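To see why the row of ones lets a linear model learn an offset, consider the following minimal sketch in plain Julia (1.x syntax). The coefficient values are hypothetical and chosen purely for illustration; only the observation is taken from the first column printed above.

```julia
using LinearAlgebra  # provides dot() on Julia 1.x

# Hypothetical coefficients: two weights and one offset (illustrative values only)
θ = [2.0, -1.0, 0.5]

# First observation from above (sepal length, sepal width) plus the constant 1.0
x = [5.1, 3.5, 1.0]

# Because the third feature is always 1, θ[3] is added unchanged to every prediction
with_ones = dot(x, θ)
explicit  = θ[1] * x[1] + θ[2] * x[2] + θ[3]

println(with_ones ≈ explicit)  # the two ways of computing the output agree
```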
julia> fnames = fnames[1:2]
2-element Array{String,1}:
"Sepal length"
"Sepal width"
Alright, now we have our complete example data set! While this is
just a part of the Iris flower data, it will be the full data set
in the context of this specific tutorial. Let us use the Plots.jl
package to visualize it. To do that, we can use the function
labelmap() to loop through all the classes and their
observations. This way we can plot the observations with
different colors and labels.
using Plots
pyplot()
# Create empty plot with xlabel and ylabel
plt = plot(xguide = fnames[1], yguide = fnames[2])
# Loop through labels and their indices and plot the points
for (lbl, idx) in labelmap(Y)
    scatter!(plt, X[1,idx], X[2,idx], label = lbl)
end
plt
The resulting plot can be seen in Fig. 1. As we can see, the classes seem decently well separated. Our goal is now to write a program that can learn how to separate those classes itself. That is, given the coordinates of an observation, predict which class that observation belongs to. While this is the ultimate goal of this tutorial, let’s not get ahead of ourselves just yet.
Figure 1. The full example data set colored according to the class label
Tip
You may have noticed how we used the function labelmap() in a
for loop. This is convenient, because it returns a dictionary
that has one key-value pair per label, where each key is a label,
and each value is an array of all the observation-indices that
belong to that label.
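The dictionary described above can be mimicked in a few lines of plain Julia. The helper labelmap_sketch below is a hypothetical stand-in written for this explanation; it is not the library's implementation and only reproduces the shape of the result.

```julia
# Hypothetical helper that mimics the structure labelmap() returns;
# it is not the library's implementation.
function labelmap_sketch(labels::AbstractVector)
    result = Dict{eltype(labels), Vector{Int}}()
    for (i, lbl) in enumerate(labels)
        # create the index list on first sight of a label, then append
        push!(get!(result, lbl, Int[]), i)
    end
    return result
end

lm = labelmap_sketch(["setosa", "versicolor", "setosa"])
println(lm["setosa"])      # indices of all "setosa" observations
println(lm["versicolor"])  # indices of all "versicolor" observations
```

The output of the real labelmap() for the tutorial data is shown next.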
julia> labelmap(Y)
Dict{String,Array{Int64,1}} with 2 entries:
"setosa" => [1,2,3,4,5,6,7,8,9,10 … 41,42,43,44,45,46,47,48,49,50]
"versicolor" => [51,52,53,54,55,56,57,58,59,60 … 91,92,93,94,95,96,97,98,99,100]
Before we can train anything, we have to first think about how a solution should be represented. In other words, how does the prediction work? We have three numbers as input for a single observation, and we would like an output that we can easily interpret as one of the two classes. What we are looking for is an appropriate prediction model.
A prediction model is a family of functions that restricts the potential solution to a specific formalism / representation. Often such a model will also pose a restriction on the complexity of the solution, which further limits the search space of potential solutions. So in a way a prediction model can be thought of as the manifestation of our assumptions about the problem, because it restricts the solution to a specific family of functions.
For our example, we will choose a linear model. Given that restriction, we can now represent any solution as just three numbers: the coefficients, denoted as \(\theta \in \mathbb{R}^3\). What does that mean? Well, remember how a prediction model is a family of functions. For our example it is the family of linear functions with three coefficients (because we have three features). More formally, a linear prediction model \(h_\theta\) for three features is the family of functions that map an input from the feature space \(X \in \mathbb{R}^3\) to the real numbers \(\mathbb{R}\) using some fixed coefficients \(\theta \in \mathbb{R}^3\).
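Written out, and consistent with the code further down (which computes a dot product), this family can be stated as follows. The symbol \(x\) for a single input vector is notation introduced here for illustration.

\[
h_\theta : \mathbb{R}^3 \to \mathbb{R}, \qquad h_\theta(x) = \theta^\top x = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3
\]

Since the third feature in this tutorial is always \(1\), the term \(\theta_3 x_3\) reduces to the constant offset \(\theta_3\).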
The first question one might ask is, why isn’t \(\theta\) simply a parameter of \(h\), instead of an odd-looking subscript. Well, remember how \(h_\theta\) is a family of functions, not a function. That means that in order to have an actual prediction function, we first need to choose three numbers for the coefficient vector \(\theta\). Think of these numbers as hard coded information for that function. In other words, they are not parameters of that function, instead once chosen they are an intrinsic part of that function. The goal of a learning algorithm is then to find the “best” function from that family, by systematically trying out different \(\theta\).
Let us see what this means in terms of actual code. First, let’s define our prediction model.
immutable LinearFunction
    θ::Vector{Float64}
end
(h::LinearFunction)(x::AbstractVector) = dot(x, h.θ)
We explicitly said model, yet the type is called LinearFunction.
This is no accident. The prediction model in this case is the
type, while an instance of this type is a concrete prediction
function.
julia> LinearFunction # the type is the "family of functions"
LinearFunction
julia> h = LinearFunction([1., 1., 1.]) # an instance is a "function"
LinearFunction([1.0,1.0,1.0])
We can now use h just like we would use any other Julia function.
For example, we can pass the first observation of our data set to
it. We can query the first observation using the function
getobs(). That said, h neither knows nor cares whether we pass an
actual observation from X to it. What matters is that the input
has the right structure (i.e. three numeric features). That is a
good thing, because in general we want to learn h in such a way
that we can use it for new data points that weren’t known before.
julia> h(getobs(X, 1))
9.6
julia> h([1.0, 1.0, 1.0]) # made up observation
3.0
Note that the number we get as output does not mean anything yet. We haven’t even specified how we want to interpret the output of our linear function. We only defined its representation and how it works.
Now we have to think about interpretation. A useful way to think about the output is in terms of a separating point; yes, point, not line. We just saw in our last code snippet how the output of a prediction function is a real number. Suppose we interpret this output in the following way. Let \(class\) be our decision function that we use to classify the output of the prediction function \(h\). Furthermore, let said output of the prediction function be denoted as \(\hat{y}\) (pronounced “why hat”).
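One way to write such a decision function, matching the description that follows, is sketched here. How the tie \(\hat{y} = 0\) is resolved is a convention that the text does not specify, so that case is deliberately left out.

\[
class(\hat{y}) =
\begin{cases}
\text{versicolor} & \text{if } \hat{y} > 0 \\
\text{setosa} & \text{if } 0 > \hat{y}
\end{cases}
\qquad \text{where } \hat{y} = h_\theta(x)
\]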
What \(class\) does is impose a decision boundary at \(0\). If
the output of our prediction function is greater than \(0\), we
will interpret it as a prediction for the class "versicolor",
while if the output is smaller than \(0\), we will interpret it
as a prediction for the class "setosa". This is called a
margin-based interpretation of the output. We can implement the
function class using convertlabel() to transform a number from a
margin-based interpretation to our problem-specific
representation.
julia> const class_labels = ["versicolor", "setosa"]
julia> class(yhat) = convertlabel(class_labels, yhat, LabelEnc.MarginBased())
julia> class(h([1, 1, 1])) # try it out
"versicolor"
julia> class(0.5)
"versicolor"
julia> class(-0.1)
"setosa"
Tip
Using convertlabel() like this is really just a convenient
shortcut for a two-step process. Usually what one does is to
first classify the output according to its interpretation, which
in this case is “margin-based”. We can do this using the function
classify(), which transforms the output to the correct label of
the same interpretation.
julia> classify(0.3, LabelEnc.MarginBased())
1.0
The output of classify() is then either the positive or the
negative label of the given label encoding. The next step is to
convert from one label encoding to another using the function
convertlabel().
julia> convertlabel(["positive", "negative"], 1.0, LabelEnc.MarginBased())
"positive"
Given that this is such a common use case, it is possible to
perform both steps at once by passing the raw, not yet classified
prediction \(\hat{y}\) to convertlabel() directly.
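The two-step process can be sketched in plain Julia without the package. The functions classify_sketch, convert_sketch, and class_sketch below are hypothetical stand-ins, not the MLLabelUtils functions, and treating an output of exactly 0 as positive is an assumption made only for this sketch.

```julia
# Hypothetical stand-ins for the two steps; these are not the MLLabelUtils functions.

# Step 1: map a raw margin-based output to the positive (+1) or negative (-1) label.
# Treating exactly 0 as positive is an assumption made for this sketch.
classify_sketch(yhat::Real) = yhat >= 0 ? 1.0 : -1.0

# Step 2: translate the margin-based label into a problem-specific one.
# As in the example above, the first entry is taken to be the positive label.
convert_sketch(labels, y::Real) = y > 0 ? labels[1] : labels[2]

# Both steps chained, analogous to passing the raw prediction directly
class_sketch(yhat) = convert_sketch(["versicolor", "setosa"], classify_sketch(yhat))

println(class_sketch(0.5))   # prints versicolor
println(class_sketch(-0.1))  # prints setosa
```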
How can we visualize this? Well if we think about it we could just plot a contour surface where we compute the output of our prediction function for a large grid of input numbers. The line, where this contour surface is zero, is then our decision boundary for that specific prediction function, where each side corresponds to one predicted class label.
Warning
Wait a second, now it is a “line”? Didn’t we say “point” a few paragraphs ago? Yes and yes. While the decision boundary is a point in the output space, it manifests as a hyper-plane (here a line) in the input space. Since our plot will be in input space (the x-axis and y-axis are our features), it will be a line.
Let’s do it. Consider the prediction function for some fixed coefficient vector \(\theta\). Here we “cheated” and chose a set of numbers that is already known to give a good solution to our prediction problem.
θ = [1.15, -1.0, -3]
h = LinearFunction(θ)
contour!(deepcopy(plt), 4:0.5:7, 2:0.5:5, (x1, x2) -> h([x1, x2, 1]),
         fill=true, levels=-7:7, fillalpha=0.5, color=:bluesreds)
The resulting plot can be seen in Fig. 2. Of special interest is
the contour line with value \(0\), because that is what is known
as the separating hyperplane. It is the decision boundary where
everything on the red side will be classified as "versicolor" and
everything on the blue side will be classified as "setosa". Note
how for the chosen set of coefficients, all the observations
would be classified correctly.
Figure 2. The prediction surface for a good set of manually chosen coefficients.
The emphasis of the caption in Fig. 2 is very important. The plot shows the contours of the prediction surface, and not the cost/error surface. At this point in the tutorial we don’t even know what an “error surface” is supposed to be.
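As a small sanity check of the claim that the chosen coefficients separate the classes, the sketch below evaluates the prediction function on the 14 observations that are visible in the truncated printout of X earlier in this tutorial. It uses plain Julia 1.x and a bare dot product instead of the LinearFunction type, so it covers only those 14 columns and not the full data set.

```julia
using LinearAlgebra  # provides dot() on Julia 1.x

θ = [1.15, -1.0, -3.0]

# The 14 columns visible in the truncated printout of X above, each with
# the constant feature 1.0 appended. The first seven belong to "setosa"
# (observations 1-7), the last seven to "versicolor" (observations 94-100).
setosa = [[5.1, 3.5, 1.0], [4.9, 3.0, 1.0], [4.7, 3.2, 1.0], [4.6, 3.1, 1.0],
          [5.0, 3.6, 1.0], [5.4, 3.9, 1.0], [4.6, 3.4, 1.0]]
versicolor = [[5.0, 2.3, 1.0], [5.6, 2.7, 1.0], [5.7, 3.0, 1.0], [5.7, 2.9, 1.0],
              [6.2, 2.9, 1.0], [5.1, 2.5, 1.0], [5.7, 2.8, 1.0]]

# Margin-based reading: negative output means "setosa", positive "versicolor"
println(all(x -> 0 > dot(x, θ), setosa))
println(all(x -> dot(x, θ) > 0, versicolor))
```

Both checks hold for these 14 observations; the remaining 86 are hidden by the truncated printout and are not covered by this sketch.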
Tip
If you are curious about the influence of the three coefficients in \(\theta\) on the separating hyperplane, try exploring their values in a Jupyter notebook using the great package Interact.jl.
using Interact
gr()
@manipulate for θ₁ = 0.5:0.05:2,
                θ₂ = -3.0:0.05:-1.0,
                θ₃ = -3:0.05:0
    h = LinearFunction([θ₁, θ₂, θ₃])
    contour!(deepcopy(plt), 4:0.5:7, 2:0.5:5, (x1, x2) -> h([x1, x2, 1]),
             colorbar=false, zlims=(-6,6), levels=3, color=:greens)
end
You will see that their relationship to the line is quite unintuitive, because they aren’t the coefficients that denote the line itself. Instead, \(\theta_1\) and \(\theta_2\) describe a vector that is normal to the displayed line, while \(\theta_3\) acts as the offset that shifts the line. This is very unlike linear regression.
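To make the geometry concrete, the zero contour can be solved for \(x_2\). This is a short derivation under the assumption \(\theta_2 \neq 0\), using the tutorial's convention that the third feature is the constant \(1\).

\[
\theta_1 x_1 + \theta_2 x_2 + \theta_3 = 0
\quad\Longleftrightarrow\quad
x_2 = -\frac{\theta_1}{\theta_2}\, x_1 - \frac{\theta_3}{\theta_2}
\]

For the coefficients \(\theta = (1.15, -1, -3)\) used for Fig. 2 this gives \(x_2 = 1.15\, x_1 - 3\). Slope and intercept are both ratios of the coefficients, which is one reason why changing a single \(\theta_i\) moves the line in a way that is hard to anticipate.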
Warning
TO BE CONTINUED
This tutorial is still a work in progress.
How to … ?
Chances are you ended up here with a very specific use-case in mind. This section outlines a number of different but common scenarios and explains how this package can be utilized to solve them. Before we get started, however, we need to bring MLDataUtils into scope. Once installed, the package can be imported just like any other Julia package.
using MLDataUtils
- [docs] Infer which encoding some classification targets use.
julia> enc = labelenc([-1,1,1,-1,1])
MLLabelUtils.LabelEnc.MarginBased{Int64}()
- [docs] Assert that some classification targets use the encoding I need them in.
julia> islabelenc([0,1,1,0,1], LabelEnc.MarginBased)
false
- [docs] Convert targets into a specific encoding that my model requires.
julia> convertlabel(LabelEnc.OneOfK{Float32}, [-1,1,-1,1,1,-1])
2×6 Array{Float32,2}:
0.0 1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 0.0 0.0 1.0
- [docs] Work with matrices in which the user can choose whether the rows or the columns denote the observations.
julia> convertlabel(LabelEnc.OneOfK{Float32}, Int8[-1,1,-1,1,1,-1], obsdim = 1)
6×2 Array{Float32,2}:
0.0 1.0
1.0 0.0
0.0 1.0
1.0 0.0
1.0 0.0
0.0 1.0
- [docs] Group observations according to their class-label.
julia> labelmap([0, 1, 1, 0, 0])
Dict{Int64,Array{Int64,1}} with 2 entries:
0 => [1,4,5]
1 => [2,3]
- [docs] Classify model predictions into class labels appropriate for the encoding of the targets.
julia> classify(-0.3, LabelEnc.MarginBased())
-1.0
- [docs] Create a lazy data subset of some data.
julia> X = rand(2, 6)
2×6 Array{Float64,2}:
0.226582 0.933372 0.505208 0.0443222 0.812814 0.11202
0.504629 0.522172 0.0997825 0.722906 0.245457 0.000341996
julia> datasubset(X, 2:3)
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
0.933372 0.505208
0.522172 0.0997825
- [docs] Shuffle the observations of a data container.
julia> shuffleobs(X)
2×6 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,Array{Int64,1}},false}:
0.505208 0.812814 0.11202 0.0443222 0.933372 0.226582
0.0997825 0.245457 0.000341996 0.722906 0.522172 0.504629
- [docs] Split data into train/test subsets.
julia> train, test = splitobs(X, at = 0.7);
julia> train
2×4 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
0.226582 0.933372 0.505208 0.0443222
0.504629 0.522172 0.0997825 0.722906
julia> test
2×2 SubArray{Float64,2,Array{Float64,2},Tuple{Colon,UnitRange{Int64}},true}:
0.812814 0.11202
0.245457 0.000341996
- [docs] Partition data into train/test subsets using stratified sampling.
julia> train, test = stratifiedobs([:a,:a,:b,:b,:b,:b], p = 0.5)
(Symbol[:b,:b,:a],Symbol[:b,:a,:b])
julia> train
3-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
:b
:b
:a
julia> test
3-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
:b
:a
:b
- [docs] Group multiple variables together and treat them as a single data set.
julia> shuffleobs(([1,2,3], [:a,:b,:c]))
([3,1,2],Symbol[:c,:a,:b])
- [docs] Support my own custom user-defined data container type.
julia> using DataTables, LearnBase
julia> LearnBase.nobs(dt::AbstractDataTable) = nrow(dt)
julia> LearnBase.getobs(dt::AbstractDataTable, idx) = dt[idx,:]
julia> LearnBase.datasubset(dt::AbstractDataTable, idx, ::ObsDim.Undefined) = view(dt, idx)
- [docs] Over- or undersample an imbalanced labeled data set.
julia> undersample([:a,:b,:b,:a,:b,:b])
4-element SubArray{Symbol,1,Array{Symbol,1},Tuple{Array{Int64,1}},false}:
:a
:b
:b
:a
- [docs] Repartition a data container using a k-folds scheme.
julia> folds = kfolds([1,2,3,4,5,6,7,8,9,10], k = 5)
5-element MLDataPattern.FoldsView{Tuple{SubArray{Int64,1,Array{Int64,1},Tuple{Array{Int64,1}},false},SubArray{Int64,1,Array{Int64,1},Tuple{UnitRange{Int64}},true}},Array{Int64,1},LearnBase.ObsDim.Last,Array{Array{Int64,1},1},Array{UnitRange{Int64},1}}:
([3,4,5,6,7,8,9,10],[1,2])
([1,2,5,6,7,8,9,10],[3,4])
([1,2,3,4,7,8,9,10],[5,6])
([1,2,3,4,5,6,9,10],[7,8])
([1,2,3,4,5,6,7,8],[9,10])
- [docs] Iterate over my data one observation or batch at a time.
julia> obsview(([1 2 3; 4 5 6], [:a, :b, :c]))
3-element MLDataPattern.ObsView{Tuple{SubArray{Int64,1,Array{Int64,2},Tuple{Colon,Int64},true},SubArray{Symbol,0,Array{Symbol,1},Tuple{Int64},false}},Tuple{Array{Int64,2},Array{Symbol,1}},Tuple{LearnBase.ObsDim.Last,LearnBase.ObsDim.Last}}:
([1,4],:a)
([2,5],:b)
([3,6],:c)
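For the k-folds entry in the list above, the index bookkeeping behind the displayed result can be sketched in plain Julia. The helper kfold_indices_sketch is hypothetical and is not the MLDataPattern implementation: it returns eager index vectors rather than lazy views, and it assumes the number of observations is divisible by k.

```julia
# Hypothetical helper, not the MLDataPattern implementation: split the
# indices 1:n into k contiguous validation folds of equal size and use
# the remaining indices for training. Assumes n is divisible by k.
function kfold_indices_sketch(n::Int, k::Int)
    n % k == 0 || throw(ArgumentError("this sketch needs n divisible by k"))
    sz = div(n, k)
    folds = Tuple{Vector{Int}, Vector{Int}}[]
    for i in 1:k
        val = collect(((i - 1) * sz + 1):(i * sz))
        train = setdiff(1:n, val)   # everything not in the validation fold
        push!(folds, (train, val))
    end
    return folds
end

for (train, val) in kfold_indices_sketch(10, 5)
    println(train, " ", val)
end
```

With 10 observations and k = 5 this yields the same train/validation index pairs as the kfolds example in the list.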
Getting Help
To get help on specific functionality you can either look up the
information here, or if you prefer you can make use of Julia’s
native doc-system.
The following example shows how to get additional information on
DataSubset within Julia’s REPL:
?DataSubset
If you find yourself stuck or have other questions concerning the package, you can find us on Gitter or in the Machine Learning domain on discourse.julialang.org.
If you encounter a bug or would like to participate in the further development of this package, come find us on GitHub.