How To Craft A Deep Learning Model in Java (Without Sorcery, aka Frameworks)

Written by cleuton-sampaio | Published 2020/01/13
Tech Story Tags: java-top-story | deep-learning-top-story | how-to-create-ml-perceptron | deep-learning-training-modules | backpropagation | partial-derivatives-in-ml | machine-learning-top-story | hackernoon-top-story

TLDR: An example of an MLP (Multilayer Perceptron) capable of classifying iris flowers using the four features described in the IRIS dataset. The model is created just before we call fit(), and I was inspired by the Keras API when designing it. Training consists of repeating the network calculation several times (epochs), taking each record and getting an output. We correct each weight according to its "responsibility" for the final error: more responsible weights receive a greater correction. Backpropagation calculates the error and adjusts the node (and bias) weights.

To really learn neural networks, you need to strip the frameworks from your code, in order to understand how they work under the hood.
Nowadays we have several specialized neural network frameworks. There are frameworks specific to one programming language, and general frameworks that have bindings for several languages, such as TensorFlow.
The example I am going to present to you has the sole purpose of demonstrating the main techniques of neural networks and deep learning, such as layers, activation functions, hyperparameters, forward propagation, and backpropagation.
But this example is not intended for optimal performance or accuracy. This needs to be clear before you proceed.
I will not bother you again with details about neural networks. Basically, I will create an MLP (Multilayer Perceptron) capable of classifying iris flowers, using the four features described in the IRIS dataset. Something like this:
We may have several output neurons (or nodes) if the classification problem we want to solve is multiclass. An example of this is the IRIS dataset, a classic survey that divides iris flower samples into 3 species according to 4 measured characteristics.
To classify flowers, we need the nonlinearity provided by an MLP.
The full source code is available on GitHub.
I also created a spreadsheet to understand and test the model. It's very complicated, but by double-clicking on the cells you can see the calculations I've done.
We need our model to "learn" the weights used to estimate a flower's class based on its 4 characteristics. This learning is driven by this code snippet from the IrisClassifier file:
		int irisElementos = 150;
		int categorias = 3;
		int variaveis = 4;
		int epochs = 1000;
		double learningRate = 0.01;
		
		double [][] iris = loadIris(irisElementos, categorias, variaveis);		
		
		model.fit(iris, 120, epochs,learningRate);
Sorry it's in Portuguese, but I'll provide a translation:
  • irisElementos: number of iris samples (150);
  • categorias: iris flower categories (3);
  • variaveis: iris flower features (4);
The fit() method trains the model using the training data (120 records out of a total of 150). But which model? The model is created just before we call fit(), and I was inspired by the Keras API when creating this one:
model.layers.add(new Layer(4,null,model)); // The input layer has no activation
model.layers.add(new Layer(8,new Sigmoid(),model));
model.layers.add(new Layer(3,new Sigmoid(),model));
Note that the first layer does not have an activation function.
A 4-node input layer (four input variables), an 8-node hidden layer using Sigmoid as the activation function, and a 3-node output layer (three classes), also using Sigmoid.
I am using MSE (mean squared error) as the cost function, and gradient descent as the learning method.
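Concretely, the MSE for one record is just the mean of the squared differences between the expected (one-hot) values and the network outputs. A minimal sketch of that calculation (illustrative names, not the repo's exact code):
	static double meanSquaredError(double[] target, double[] output) {
		double sum = 0.0;
		for (int i = 0; i < target.length; i++) {
			double diff = target[i] - output[i];   // expected minus estimated
			sum += diff * diff;
		}
		return sum / target.length;                // mean of the squared errors
	}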
Forward propagation
Training consists of repeating the network calculation several times (epochs), taking each record and getting an output.
For each record, we observe the difference between the value estimated by the model and the actual value, and we accumulate it to calculate the MSE.
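Roughly, this is the shape of the training loop (a sketch with hypothetical helper names and signatures, not the repo's actual fit() implementation):
	for (int epoch = 0; epoch < epochs; epoch++) {
		double sumSquaredError = 0.0;
		for (double[] sample : trainingRecords) {              // hypothetical List<double[]>
			double[] output = forwardPropagate(sample);        // run the network on this record
			double[] target = extractTarget(sample);           // the one-hot class columns
			for (int i = 0; i < output.length; i++) {
				double diff = target[i] - output[i];
				sumSquaredError += diff * diff;                // accumulate for the MSE
			}
			backPropagate(target, learningRate);               // adjust the weights (next section)
		}
		double mse = sumSquaredError / trainingRecords.size(); // the value logged at each epoch
	}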
For example, the input for node b1 (netb1) is the sum of the input node values multiplied by their weights, plus the bias weight:
netb1 = sum(ai * wi) + bw1
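Here is a minimal standalone sketch of this calculation for one node, using made-up weight values just to illustrate:
	double[] a = {5.1, 3.5, 1.4, 0.2};        // input-layer node values (one iris record)
	double[] w = {0.12, -0.30, 0.45, 0.08};   // weights into node b1 (example values)
	double bw1 = 0.05;                         // bias weight for b1

	double netb1 = bw1;
	for (int i = 0; i < a.length; i++) {
		netb1 += a[i] * w[i];                  // sum(ai * wi)
	}
	double b1 = 1.0 / (1.0 + Math.exp(-netb1)); // node value = sigmoid(netb1)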
Backpropagation
We need to calculate the error and adjust the node (and bias) weights to make our model "learn". How do we do it? We correct each weight according to its "responsibility" for the final error. More responsible weights receive a greater correction.
In fact, we want to optimize the cost function by finding its minimum value (preferably global):
A quick translation:
  • Perda: Loss;
  • Mínimo local: Local minimum;
  • Mínimo global: Global minimum;
  • Pesos: Weights;
Our cost function is the MSE, so we want to get the weight values that provide the lowest possible MSE.
Backpropagation consists of finding the gradient (the rate of change of the error with respect to each weight) and modifying the weights according to the learning rate and the gradient itself. Little by little we change the weights until training is over. We may or may not reach the global minimum, but we stop when we consider the error (the MSE value) reasonable.
Partial Derivatives - We need to calculate the partial derivatives of the error with respect to each weight in the network.
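For an output-layer weight, the chain rule expands to dE/dw = dE/dout * dout/dnet * dnet/dw. With a sigmoid output node and a squared-error term this becomes (output - target) * output * (1 - output) * incomingValue, which is exactly the structure of the code below. A tiny numerical sketch (illustrative values only):
	double output = 0.73, target = 1.0;    // output-node value and desired value
	double hiddenValue = 0.58;             // value of the hidden node feeding this weight
	double weight = 0.21, learningRate = 0.01;

	double error      = output - target;                  // dE/dout (up to a constant factor)
	double sigmoidDer = output * (1.0 - output);          // dout/dnet, the sigmoid derivative
	double gradient   = error * sigmoidDer * hiddenValue; // dnet/dw is the incoming node value
	weight = weight - learningRate * gradient;            // the gradient descent step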
The Java code does this in the Model class, inside the backPropagation() method:
...
for (int l=(indiceUltima-1); l>=0; l--) { // walk backwards, from the layer before the output down to the input
	Layer layer = this.layers.get(l);
	Layer proxima = this.layers.get(l+1);
	for (Node node : layer.nodes) {
		if (l == (indiceUltima - 1)) {
			for (Sinapse sinapse : node.sinapses) {
				double erro = outputErrors[sinapse.finalNode.nodeNumber-1];
				sinapse.gradient = erro * proxima.activation.calcularDerivada(sinapse.finalNode.value) * node.value;
			}
		}
		else {
			// sum the deltaz
			for (Sinapse sinapse : node.sinapses) {
				double valorFinal = 0.0;
				for (Sinapse s2 : sinapse.finalNode.sinapses) {
					double deltaz = outputErrors[s2.finalNode.nodeNumber-1]*outputs[s2.finalNode.nodeNumber-1]*(1-outputs[s2.finalNode.nodeNumber-1]);
					valorFinal += (deltaz * s2.weight);
				}
				sinapse.gradient = valorFinal * proxima.activation.calcularDerivada(sinapse.finalNode.value) * node.value;
			}					
		}
	}
	// bias weight
	if (l == (indiceUltima - 1)) {
		for (Sinapse sinapse : layer.bias.sinapses) {
			double erro = sinapse.finalNode.value - target[sinapse.finalNode.nodeNumber-1];
			sinapse.gradient = erro * layer.activation.calcularDerivada(sinapse.finalNode.value);
		}
	}
	else {
		for (Sinapse sinapse : layer.bias.sinapses) {
			double valorFinal = 0.0;
			for (Sinapse s2 : sinapse.finalNode.sinapses) {
				double deltaz = outputErrors[s2.finalNode.nodeNumber-1]*outputs[s2.finalNode.nodeNumber-1]*(1-outputs[s2.finalNode.nodeNumber-1]);
				valorFinal += (deltaz * s2.weight);
			}
			sinapse.gradient = valorFinal * proxima.activation.calcularDerivada(sinapse.finalNode.value);
		}				
	}
}
// Update weights
for (int la=0;la<this.layers.size()-1;la++) {
	Layer layer = this.layers.get(la);
	for (Node node : layer.nodes) {
		for (Sinapse sinapse : node.sinapses) {
			sinapse.weight = sinapse.weight - learningRate * sinapse.gradient;
		}
	}
}
...
I know... it's complex, because every layer needs to be treated independently. Note that to calculate the layer weights I use the derivative of the activation function. This is why each of my activation function classes (Sigmoid and ReLU) knows how to calculate its own derivative.
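For reference, this is roughly what such a class can look like. The calcularDerivada method name comes from the snippet above; the interface and the forward method name are assumptions, shown only to make the idea concrete:
	interface Activation {
		double calcular(double x);                 // assumed forward (activation) method
		double calcularDerivada(double value);     // derivative, as used in backPropagation()
	}

	class Sigmoid implements Activation {
		public double calcular(double x) {
			return 1.0 / (1.0 + Math.exp(-x));
		}
		public double calcularDerivada(double value) {
			// Takes the already-activated value, so s'(x) = s(x) * (1 - s(x))
			return value * (1.0 - value);
		}
	}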

Results

The model converges quickly, reaching very good accuracy (approaching 100%). As I randomize the order of the records, each training run can behave differently. Here is a run that made a couple of test errors, just to show you:
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 989 MSE: 0.043820135018546216
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 990 MSE: 0.04380586269505675
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 991 MSE: 0.04379162632120411
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 992 MSE: 0.04377742576230514
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 993 MSE: 0.04376326088432783
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 994 MSE: 0.043749131553887405
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 995 MSE: 0.04373503763824272
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 996 MSE: 0.04372097900529245
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 997 MSE: 0.04370695552357118
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 998 MSE: 0.04369296706224569
[main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 999 MSE: 0.04367901349111168
Entrada: [4.6, 3.6, 1.0, 0.2, 1.0, 0.0, 0.0]
Calculado: [0.9676715531587202, 0.04892961607756042, 0.0028958463500566075]
Entrada: [6.3, 3.3, 6.0, 2.5, 0.0, 0.0, 1.0]
Calculado: [0.003301264975616561, 0.043796515686338924, 0.9664163220261917]
Entrada: [4.9, 3.1, 1.5, 0.1, 1.0, 0.0, 0.0]
Calculado: [0.9613944104625847, 0.05702128231090371, 0.003092538946809996]
Entrada: [5.6, 2.7, 4.2, 1.3, 0.0, 1.0, 0.0]
Calculado: [0.027063443180807525, 0.8924148666249649, 0.06834712010378666]
Entrada: [5.5, 4.2, 1.4, 0.2, 1.0, 0.0, 0.0]
Calculado: [0.9681229230438231, 0.04813698554996353, 0.0028608703410195335]
Entrada: [5.2, 4.1, 1.5, 0.1, 1.0, 0.0, 0.0]
Calculado: [0.9675876845548526, 0.0479936971402869, 0.00290996625033745]
Entrada: [5.6, 3.0, 4.1, 1.3, 0.0, 1.0, 0.0]
Calculado: [0.03623797729642797, 0.9342440617137804, 0.034642417058752914]
Entrada: [5.1, 3.8, 1.5, 0.3, 1.0, 0.0, 0.0]
Calculado: [0.9660137302474517, 0.04951629593304033, 0.002979306468215438]
Entrada: [7.9, 3.8, 6.4, 2.0, 0.0, 0.0, 1.0]
Calculado: [0.011578929459422252, 0.508196282619192, 0.4821865786994383]
Entrada: [4.9, 2.4, 3.3, 1.0, 0.0, 1.0, 0.0]
Calculado: [0.04884019203370984, 0.9410220061917559, 0.025492474895976246]
Entrada: [4.9, 2.5, 4.5, 1.7, 0.0, 0.0, 1.0]
Calculado: [0.004324982037369483, 0.07272569633631648, 0.9425723048380175]
Entrada: [4.6, 3.2, 1.4, 0.2, 1.0, 0.0, 0.0]
Calculado: [0.9625727172147658, 0.05442856232056892, 0.003092228077310065]
Entrada: [4.8, 3.0, 1.4, 0.1, 1.0, 0.0, 0.0]
Calculado: [0.9615319363567879, 0.05725363687320617, 0.0030838556858076792]
Entrada: [4.8, 3.1, 1.6, 0.2, 1.0, 0.0, 0.0]
Calculado: [0.9577173406305236, 0.06143355127699689, 0.0032019979452733364]
Entrada: [6.5, 3.0, 5.2, 2.0, 0.0, 0.0, 1.0]
Calculado: [0.0061258823668374875, 0.16047995808216792, 0.8615163025239576]
Entrada: [7.4, 2.8, 6.1, 1.9, 0.0, 0.0, 1.0]
Calculado: [0.0052531316342232905, 0.10401458592015204, 0.925512168147189]
Entrada: [5.2, 3.5, 1.5, 0.2, 1.0, 0.0, 0.0]
Calculado: [0.9649622435472344, 0.05169358058627654, 0.0029911741695141923]
Entrada: [4.4, 3.0, 1.3, 0.2, 1.0, 0.0, 0.0]
Calculado: [0.9612837335587132, 0.05660770801265668, 0.0031282220962347857]
Entrada: [5.4, 3.0, 4.5, 1.5, 0.0, 1.0, 0.0]
Calculado: [0.014445624822966404, 0.647255672733979, 0.28381235845137126]
Entrada: [4.8, 3.4, 1.6, 0.2, 1.0, 0.0, 0.0]
Calculado: [0.9623350162252382, 0.054035630261833896, 0.0031084726270826915]
Entrada: [4.7, 3.2, 1.3, 0.2, 1.0, 0.0, 0.0]
Calculado: [0.9639896332797553, 0.053174883840758407, 0.0030324148857225814]
Entrada: [6.2, 2.2, 4.5, 1.5, 0.0, 1.0, 0.0]
Calculado: [0.012296002495372998, 0.5008620135510218, 0.5077911287012173]
Entrada: [6.1, 2.9, 4.7, 1.4, 0.0, 1.0, 0.0]
Calculado: [0.023456020883534715, 0.8623336250347347, 0.09727948855000569]
Entrada: [7.3, 2.9, 6.3, 1.8, 0.0, 0.0, 1.0]
Calculado: [0.0050819166784951166, 0.09748757227638581, 0.9306699555178484]
Entrada: [6.3, 2.8, 5.1, 1.5, 0.0, 0.0, 1.0]
Calculado: [0.011596224350588174, 0.485448746623515, 0.5130889451215297]
Entrada: [6.2, 2.8, 4.8, 1.8, 0.0, 0.0, 1.0]
Calculado: [0.00893328236272492, 0.334086150824026, 0.6704820968813707]
Entrada: [5.9, 3.0, 5.1, 1.8, 0.0, 0.0, 1.0]
Calculado: [0.005723107798993621, 0.1395352905294087, 0.8803030775031335]
Entrada: [6.4, 2.9, 4.3, 1.3, 0.0, 1.0, 0.0]
Calculado: [0.042183922253660654, 0.9574718464983897, 0.024434102016789052]
Entrada: [6.7, 3.0, 5.0, 1.7, 0.0, 1.0, 0.0]
Calculado: [0.018090291088175836, 0.7663488669666162, 0.18941319744715307]
Entrada: [6.0, 3.0, 4.8, 1.8, 0.0, 0.0, 1.0]
Calculado: [0.009271528709536842, 0.367959902898695, 0.6180135323215016]
Testes: 30 erros: 2 acurácia: 93.33333333333333%
  • Testes: tests;
  • Erros: Errors;
  • Acurácia: accuracy;
  • Entrada: Input;
  • Calculado: calculated;
Before checking the result, I round each output value to zero or one, since the Sigmoid output is a value between 0 and 1 and is unlikely to be exactly either.
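A small sketch of that check (illustrative values, not the repo's exact code):
	double[] calculado = {0.9676, 0.0489, 0.0028}; // model output for one test record
	double[] target    = {1.0, 0.0, 0.0};          // one-hot expected class
	boolean correct = true;
	for (int i = 0; i < calculado.length; i++) {
		if (Math.round(calculado[i]) != Math.round(target[i])) { // round each output to 0 or 1
			correct = false;
			break;
		}
	}
	// a mismatch counts as one error in the accuracy reported above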

Conclusion

This is a fully Java-based MLP model, built without any framework or library, for demonstration purposes. We could improve its accuracy by tuning the hyperparameters or by normalizing the values of the input variables, but I think it is fine for a demonstration.
I do not recommend that you build a neural network model by hand. The hassle of designing and testing it isn't worth it, and it makes your model inflexible for data scientists. For example, what if we wanted to use the ADAM optimizer? Or to use categorical cross-entropy as the cost function? What if we want to use stochastic gradient descent? These modifications, common in data science work, would require a lot of programming effort to implement.
Another reason is performance. It is well known that running on a GPU is the recommended way to do deep learning, but do you know how parallel programming for GPUs works?
We need to create kernels that can be parallelized across the various GPU cores, using either CUDA (NVIDIA) or OpenCL (other vendors). For this, our calculation must be expressed as matrix operations (linear algebra) rather than procedurally, with multiple loops and ifs, as I did here (see the sketch below). But that is not my goal; I just want to show you the mechanics of training a neural network.
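For example, one layer's forward pass can be written as a matrix-vector product plus a bias vector, which is the form that maps naturally onto GPU kernels. A rough sketch in plain Java (illustrative, not from the repo):
	static double[] forwardLayer(double[][] weights, double[] bias, double[] input) {
		double[] out = new double[weights.length];
		for (int j = 0; j < weights.length; j++) {      // one weight-matrix row per output node
			double net = bias[j];
			for (int i = 0; i < input.length; i++) {
				net += weights[j][i] * input[i];        // the matrix-vector product
			}
			out[j] = 1.0 / (1.0 + Math.exp(-net));      // sigmoid applied element-wise
		}
		return out;
	}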
(Originally published here)

Written by cleuton-sampaio | Founder: "pythondrops.com". Full-stack dev/ AI Engineer/ Professional Writer/ M.Sc. Rio de Janeiro
Published by HackerNoon on 2020/01/13