Naive Bayes Classifiers are formidable because they can learn so much about a dataset based on relatively scarce information. In this module, we will build one from scratch, using (mostly) methods from Julia’s standard library.
The logic under Naive Bayes Classifiers (NBC) is really simple: there is a sample (an instance), defined by a series of values (the features), called $\mathbf{x}$. These values are
raw_seeds_file = download(
"https://archive.ics.uci.edu/ml/" *
"machine-learning-databases/00236/seeds_dataset.txt",
)
"/tmp/jl_UfOVqRMxq5"
The original dataset has a few issues, notably related to the fact that some
rows have two tabulation characters where they should only have one. In order
to correct this, we will replace every instance of \t\t
with \t
, then save
and load the correct dataset – do check out the documentation for open
and
write
if you need a refresher on how they operate.
seeds_file = tempname()
open(seeds_file, "w") do f
return write(f, replace(String(read(raw_seeds_file)), "\t\t" => "\t"))
end
9287
Now that we do have the dataset in a temporary file, we can use
DelimitedFiles
to load it, and specify that the field separator is \t
:
using DelimitedFiles
seeds = readdlm(seeds_file, '\t')
seeds[1:3, :]
3×8 Matrix{Float64}:
15.26 14.84 0.871 5.763 3.312 2.221 5.22 1.0
14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1.0
14.29 14.09 0.905 5.291 3.337 2.699 4.825 1.0
The first seven columns are features, and the last column is the class. The features are well documented on the dataset page. It does make sense to extract the features as their own matrix, and the classes as their own vector:
features = seeds[:, 1:(end - 1)];
classes = Int.(vec(seeds[:, end]));
With this is mind, we can start thinking about the way to setup our distributions to build our NBC. We might fit the “correct” distribution for each instance. For the purpose of this exercise, we will simply assume that we can get a good summary distribution by assuming that we know the mean and standard deviation for each column (within each cultivar), and use a Normal distribution.
We know the total number of dimensions, so we can initialize an empty array of
Normal{Float64
distributions:
using Distributions
D = Matrix{Normal{Float64}}(undef, size(features, 2), length(unique(classes)))
size(D)
(7, 3)
With this done, we can start updating the values of our Normal distributions:
using Statistics
for class in unique(classes)
subfeatures = features[findall(isequal(class), classes), :]
μ = mean(subfeatures; dims = 1)
σ = std(subfeatures; dims = 1)
for feature in 1:size(features, 2)
D[feature, class] = Normal(μ[feature], σ[feature])
end
end
We can have a look at, e.g., the first four features for the first cultivar:
D[1:4, 1]
4-element Vector{Distributions.Normal{Float64}}:
Distributions.Normal{Float64}(μ=14.334428571428573, σ=1.215703572415347)
Distributions.Normal{Float64}(μ=14.294285714285714, σ=0.5765830669784657)
Distributions.Normal{Float64}(μ=0.8800699999999998, σ=0.016190929201432423)
Distributions.Normal{Float64}(μ=5.508057142857142, σ=0.23150802945440865)
And just like this, we are almost ready to run our analysis. Recall from the very brief introduction to NBC that the point is to (i) measure the pdf of a point in the distribution of its feature within the class, (ii) multiply these values across features, and (iii) take the argmax across classes to get an answer.
We can grab a point to test:
test_idx = rand(1:size(features, 1))
x = features[test_idx, :]
y = classes[test_idx]
println("Point $(test_idx) with class $(y)")
Point 206 with class 3
We can check the score that we would get for this test point on class 3 with a really nifty one-liner:
pdf.(D[:, 3], x) |> prod
3.026334444506711
In order to scale this to the entire matrix D
, we can use the mapslices
function, which will apply a function to slices of an array:
scores = vec(mapslices(d -> prod(pdf.(d, x)), D; dims = 1))
3-element Vector{Float64}:
0.018044932974258788
2.471794431573645e-19
3.026334444506711
These scores do not sum to one (because we do not really use the Bayes formula). In this situation, it is probably a good idea to pass the output through the softmax function:
softmax(v::Vector{T}) where {T <: Real} = exp.(v) ./ sum(exp.(v))
scores |> softmax
3-element Vector{Float64}:
0.044974453133697086
0.04417017057781453
0.9108553762884883
Remember that the information we want to get is our argmax, which is to say:
scores |> softmax |> argmax
3
The output of argmax
is the predicted class of our sample.
Of course, we can now perform this process on all points in the dataset. Before we do so, we will create a confusion table, in which we will track the predicted v. observed scores:
C = zeros(Int64, length(unique(classes)), length(unique(classes)))
3×3 Matrix{Int64}:
0 0 0
0 0 0
0 0 0
And we loop
for i in axes(features, 1)
local x, y, scores
x = features[i, :]
y = classes[i]
scores = vec(mapslices(d -> prod(pdf.(d, x)), D; dims = 1))
ŷ = argmax(softmax(scores))
C[y, ŷ] += 1
end
We can check the performance of the NBC approach for this dataset:
C
3×3 Matrix{Int64}:
59 3 8
5 65 0
3 0 67
Check the accuracy
using LinearAlgebra
accuracy = sum(diag(C)) / sum(C)
println("Leave-One-Out (not really) accuracy: $(round(Int64, accuracy*100))%")
Leave-One-Out (not really) accuracy: 91%