Label Encodings

Tip

This section serves only as a concise overview of the functionality provided by MLLabelUtils.jl. Take a look at the full documentation for a far more detailed treatment.

It is a common requirement in Machine Learning experiments to encode the classification targets of some supervised dataset in a very specific way. There are multiple conventions, each with its own merits and reasons to exist. Some models, such as the probabilistic version of logistic regression, require the targets in the form of numbers in the set \(\{1,0\}\). Margin-based classifiers, such as SVMs, on the other hand, expect the targets to be in the set \(\{1,-1\}\).

This package provides the functionality needed to deal with these different scenarios in an efficient, consistent, and convenient manner. In particular, the underlying back-end MLLabelUtils.jl is designed with package developers in mind who require their classification targets to be in a specific format. To that end, the core goal is to provide all the tools needed to deal with classification targets of arbitrary format. This includes asserting whether the targets are of a desired encoding, inferring the concrete encoding the targets are in and how many classes they represent, and converting from their native encoding to the desired one.

Working with Targets

For starters, the library provides a few utility functions to compute various properties of the target array. These include the number of labels (see nlabel()), the labels themselves (see label()), and a mapping from label to the elements of the target array (see labelmap() and labelfreq()).

julia> true_targets = [0, 1, 1, 0, 0];

julia> label(true_targets)
2-element Array{Int64,1}:
 1
 0

julia> nlabel(true_targets)
2

julia> labelmap(true_targets)
Dict{Int64,Array{Int64,1}} with 2 entries:
  0 => [1,4,5]
  1 => [2,3]

julia> labelfreq(true_targets)
Dict{Int64,Int64} with 2 entries:
  0 => 3
  1 => 2

Tip

Because labelfreq() utilizes a Dict to store its result, it is straightforward to visualize the class distribution (using the absolute frequencies) right in the REPL with the UnicodePlots.jl package.

julia> using UnicodePlots
julia> barplot(labelfreq([:yes,:no,:no,:maybe,:yes,:yes]), symb="#")
        ┌────────────────────────────────────────┐
    yes │##################################### 3 │
  maybe │############ 1                          │
     no │######################### 2             │
        └────────────────────────────────────────┘

Deriving and Asserting Encodings

If you find yourself writing a custom function intended to train some specific supervised model, chances are you want to assert whether the given targets are in the encoding that the model requires. We provide a few functions for such a scenario, namely labelenc() and islabelenc().

julia> true_targets = [0, 1, 1, 0, 0];

julia> labelenc(true_targets) # determine encoding using heuristics
MLLabelUtils.LabelEnc.ZeroOne{Int64,Float64}(0.5)

julia> islabelenc(true_targets, LabelEnc.ZeroOne)
true

julia> islabelenc(true_targets, LabelEnc.ZeroOne(Int))
true

julia> islabelenc(true_targets, LabelEnc.ZeroOne(Float32))
false

julia> islabelenc(true_targets, LabelEnc.MarginBased)
false
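In a custom training function these checks are typically used as a guard, converting only when necessary. Below is a minimal sketch of that pattern; the function name train_svm and its trivial body are hypothetical, only the islabelenc/convertlabel calls are from the library.

```julia
using MLLabelUtils

# Hypothetical training routine that requires margin-based targets {-1, 1}.
function train_svm(targets)
    # Pass the targets through unchanged if they already comply,
    # otherwise convert them to the required encoding first.
    t = islabelenc(targets, LabelEnc.MarginBased) ? targets :
        convertlabel(LabelEnc.MarginBased, targets)
    # ... actual training would happen here ...
    return t
end

train_svm([0, 1, 1, 0, 0])   # zero/one targets get converted to {-1, 1}
```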

Converting between Encodings

If it turns out that the given targets are in the wrong encoding, you may want to convert them into the format you require. For that purpose we expose the function convertlabel().

julia> true_targets = [0, 1, 1, 0, 0];

julia> convertlabel(LabelEnc.MarginBased, true_targets)
5-element Array{Int64,1}:
 -1
  1
  1
 -1
 -1

julia> convertlabel(LabelEnc.MarginBased(Float64), true_targets)
5-element Array{Float64,1}:
 -1.0
  1.0
  1.0
 -1.0
 -1.0

julia> convertlabel([:yes,:no], true_targets)
5-element Array{Symbol,1}:
 :no
 :yes
 :yes
 :no
 :no

julia> convertlabel(LabelEnc.OneOfK, true_targets)
2×5 Array{Int64,2}:
 0  1  1  0  0
 1  0  0  1  1

julia> convertlabel(LabelEnc.OneOfK{Bool}, true_targets)
2×5 Array{Bool,2}:
 false   true   true  false  false
  true  false  false   true   true

julia> convertlabel(LabelEnc.OneOfK{Float64}, true_targets, obsdim=1)
5×2 Array{Float64,2}:
 0.0  1.0
 1.0  0.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

It is worth pointing out explicitly that we provide LabelEnc.OneVsRest to conveniently reduce a multi-class problem to a two-class problem.

julia> convertlabel(LabelEnc.OneVsRest(:yes), [:yes,:no,:no,:maybe,:yes,:yes])
6-element Array{Symbol,1}:
 :yes
 :not_yes
 :not_yes
 :not_yes
 :yes
 :yes

julia> convertlabel(LabelEnc.ZeroOne, [:yes,:no,:no,:maybe,:yes,:yes], LabelEnc.OneVsRest(:yes))
6-element Array{Float64,1}:
 1.0
 0.0
 0.0
 0.0
 1.0
 1.0

Classifying Predictions

Some encodings come with an implicit contract about what the raw predictions of a model should look like and how to classify a raw prediction into a predicted class label. For that purpose we provide the function classify() and its mutating version classify!().

For LabelEnc.ZeroOne the convention is that the raw prediction lies between 0 and 1 and represents the degree of certainty that the observation belongs to the positive class. That means that in order to classify a raw prediction as either positive or negative, one needs to define a “threshold” parameter, which determines at which degree of certainty a prediction is “good enough” to be classified as positive.

julia> classify(0.3f0, 0.5); # equivalent to below
julia> classify(0.3f0, LabelEnc.ZeroOne) # preserves type
0.0f0

julia> classify(0.3f0, LabelEnc.ZeroOne(0.5)) # defaults to Float64
0.0

julia> classify(0.3f0, LabelEnc.ZeroOne(Int,0.2))
1

julia> classify.([0.3,0.5], LabelEnc.ZeroOne(Int,0.4))
2-element Array{Int64,1}:
 0
 1
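When classifying many raw predictions repeatedly, the mutating version classify! can write the results into a preallocated buffer instead of allocating a new array each time. A minimal sketch, assuming a Float64 output buffer of matching length:

```julia
using MLLabelUtils

raw = [0.3, 0.8, 0.1]    # raw predictions, each in [0, 1]
out = similar(raw)       # preallocated output buffer
# Classify in-place using a threshold of 0.5.
classify!(out, raw, LabelEnc.ZeroOne(0.5))
out                      # → [0.0, 1.0, 0.0]
```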

For LabelEnc.MarginBased, on the other hand, the decision boundary is predefined at 0: any raw prediction greater than or equal to zero is considered a positive prediction, while any negative raw prediction is considered a negative prediction.

julia> classify(0.3f0, LabelEnc.MarginBased) # preserves type
1.0f0

julia> classify(-0.3f0, LabelEnc.MarginBased()) # defaults to Float64
-1.0

julia> classify.([-2.3,6.5], LabelEnc.MarginBased(Int))
2-element Array{Int64,1}:
 -1
  1

The encoding LabelEnc.OneOfK is special in that it is matrix-based, so the concept of ObsDim applies, i.e. the freedom to choose which array dimension denotes the observations. The classified prediction is the index of the largest element of each observation. By default the “obsdim” is the last array dimension.

julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
3×4 Array{Float64,2}:
 0.1  0.4  0.3  0.2
 0.8  0.3  0.6  0.2
 0.1  0.3  0.1  0.6

julia> classify(pred_output, LabelEnc.OneOfK)
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify(pred_output', LabelEnc.OneOfK, obsdim=1) # note the transpose
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK) # single observation
3
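The indices returned by classify() for LabelEnc.OneOfK refer to the rows of the one-of-K matrix, which follow the order of the label vector returned by label(). They can therefore be mapped back to the original class labels with plain array indexing. A small sketch; the variable names and the raw prediction matrix are made up for illustration:

```julia
using MLLabelUtils

targets = [:yes, :no, :no, :yes]
labels  = label(targets)                          # the label vector for this problem
oneofk  = convertlabel(LabelEnc.OneOfK, targets)  # 2×4 indicator matrix, rows ordered like `labels`
pred    = [0.2 0.9; 0.8 0.1]                      # two raw prediction columns (one per observation)
idx     = classify(pred, LabelEnc.OneOfK)         # winning row index per observation
labels[idx]                                       # indices mapped back to class labels
```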