Cleuton Sampaio, M.Sc.

To really learn neural networks, you need to strip the frameworks from your code in order to understand how they work under the hood. Nowadays we have several specialized neural network frameworks. There are frameworks specific to one programming language, and general frameworks that have bindings for other languages, such as Tensorflow.

The example I am going to present has the sole purpose of demonstrating the main techniques of neural networks and deep learning, such as layers, activation functions, hyperparameters, forward propagation, and backpropagation. It is not intended for optimal performance or accuracy. This needs to be clear before you proceed. I will not bother you again with details about neural networks.

Basically, I will create an MLP (Multilayer Perceptron) capable of classifying iris flowers using the four features described in the IRIS dataset.

We may have several output neurons (or nodes) if the classification problem we want to solve is multiclass. An example of this is the IRIS dataset, a classification survey that divided iris flower species into 3 categories according to 4 characteristics. To classify the flowers, we need the nonlinearity provided by an MLP.

The full source code is on GitHub. I also created a spreadsheet to help understand and test the model. It is quite complex, but by double-clicking on the cells you can see the calculations I've done.

We need our model to "learn" the weights used to estimate a flower's class based on its 4 characteristics.
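To make the data layout concrete: each record the model consumes can be pictured as four feature values followed by a one-hot vector for the class. The snippet below only illustrates that encoding; the array literal and the helper classIndex are my own, not the loader's actual code:

```java
public class IrisRow {
    // sepal length, sepal width, petal length, petal width, then a
    // one-hot class vector: setosa, versicolor, virginica
    static double[] row = {5.1, 3.5, 1.4, 0.2, 1.0, 0.0, 0.0};

    // Finds which class position holds 1.0 in the one-hot part of a record
    static int classIndex(double[] r, int variaveis, int categorias) {
        for (int c = 0; c < categorias; c++) {
            if (r[variaveis + c] == 1.0) {
                return c;
            }
        }
        return -1; // no class marked
    }
}
```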
This learning is done by this code snippet from the IrisClassifier file:

    int irisElementos = 150;
    int categorias = 3;
    int variaveis = 4;
    int epochs = 1000;
    double learningRate = 0.01;
    double[][] iris = loadIris(irisElementos, categorias, variaveis);
    model.fit(iris, 120, epochs, learningRate);

Sorry it's in Portuguese, but I'll provide a translation: irisElementos: number of iris samples (150); categorias: iris flower categories (3); variaveis: iris flower features (4).

The fit() method trains the model using the training data (120 records out of a total of 150; the remaining 30 are used for testing). But which model? The model is created just before we call fit(), and I was inspired by the Keras model to create this API:

    model.layers.add(new Layer(4, null, model)); // Input layer has no activation
    model.layers.add(new Layer(8, new Sigmoid(), model));
    model.layers.add(new Layer(3, new Sigmoid(), model));

Note that the first layer does not have an activation function. We have a 4-node input layer (four input variables), an 8-node hidden layer using Sigmoid as the activation function, and a 3-node output layer (three classes), also using Sigmoid.

I am using MSE (mean squared error) as the cost function and Gradient Descent as the learning method.

Forward propagation

The training consists of repeating the network calculation for several epochs, taking each record and getting an output. For each record, we observe the difference between the value estimated by the model and the actual value, and we accumulate it to calculate the MSE. For example, the input for node b1 (netb1) is the combination of the input node values multiplied by their weights, plus the bias weight:

    netb1 = sum(ai * wi) + bw1

Backpropagation

We need to calculate the error and adjust the node (and bias) weights to make our model "learn". How do we do it? We correct each weight according to its "responsibility" for the final error: more responsible weights receive greater correction.
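The forward step just described can be sketched in plain Java. This is an illustrative, self-contained snippet: the weight and bias values, and the helper names net and sigmoid, are my own assumptions, not the article's actual API:

```java
public class ForwardSketch {
    // Sigmoid activation: squashes the net input into (0, 1)
    static double sigmoid(double net) {
        return 1.0 / (1.0 + Math.exp(-net));
    }

    // Net input for one node: weighted sum of inputs plus the bias weight,
    // i.e. netb1 = sum(ai * wi) + bw1
    static double net(double[] inputs, double[] weights, double biasWeight) {
        double sum = biasWeight;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {5.1, 3.5, 1.4, 0.2};   // one iris record (4 features)
        double[] w = {0.1, -0.2, 0.05, 0.3}; // illustrative weights
        double netb1 = net(a, w, 0.1);
        System.out.println(sigmoid(netb1));  // node output after activation
    }
}
```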
In fact, we want to optimize the cost function by finding its minimum value (preferably the global one). A quick translation of the figure labels: Perda: loss; Mínimo local: local minimum; Mínimo global: global minimum; Pesos: weights.

Our cost function is the MSE, so we want to find the weight values that give the lowest possible MSE. The backpropagation method finds the gradient (the rate of change of the error with respect to each weight) and modifies the weights according to the learning rate and the gradient itself. Little by little we change the weights until the training is over. We may or may not find the global minimum, but we stop when we consider the error (the MSE value) reasonable.

Partial Derivatives

We need to calculate the partial derivatives of the error with respect to each network weight. The Java code does this in the Model class, inside the backPropagation() method:

    ...
    for (int l = (indiceUltima - 1); l >= 0; l--) {
        Layer layer = this.layers.get(l);
        Layer proxima = this.layers.get(l + 1);
        for (Node node : layer.nodes) {
            if (l == (indiceUltima - 1)) {
                for (Sinapse sinapse : node.sinapses) {
                    double erro = outputErrors[sinapse.finalNode.nodeNumber - 1];
                    sinapse.gradient = erro
                        * proxima.activation.calcularDerivada(sinapse.finalNode.value)
                        * node.value;
                }
            } else {
                for (Sinapse sinapse : node.sinapses) {
                    double valorFinal = 0.0; // sum the deltaz
                    for (Sinapse s2 : sinapse.finalNode.sinapses) {
                        double deltaz = outputErrors[s2.finalNode.nodeNumber - 1]
                            * outputs[s2.finalNode.nodeNumber - 1]
                            * (1 - outputs[s2.finalNode.nodeNumber - 1]);
                        valorFinal += (deltaz * s2.weight);
                    }
                    sinapse.gradient = valorFinal
                        * proxima.activation.calcularDerivada(sinapse.finalNode.value)
                        * node.value;
                }
            }
        }
        // bias weight
        if (l == (indiceUltima - 1)) {
            for (Sinapse sinapse : layer.bias.sinapses) {
                double erro = sinapse.finalNode.value
                    - target[sinapse.finalNode.nodeNumber - 1];
                sinapse.gradient = erro
                    * layer.activation.calcularDerivada(sinapse.finalNode.value);
            }
        } else {
            for (Sinapse sinapse : layer.bias.sinapses) {
                double valorFinal = 0.0;
                for (Sinapse s2 : sinapse.finalNode.sinapses) {
                    double deltaz = outputErrors[s2.finalNode.nodeNumber - 1]
                        * outputs[s2.finalNode.nodeNumber - 1]
                        * (1 - outputs[s2.finalNode.nodeNumber - 1]);
                    valorFinal += (deltaz * s2.weight);
                }
                sinapse.gradient = valorFinal
                    * proxima.activation.calcularDerivada(sinapse.finalNode.value);
            }
        }
    }
    // Update weights
    for (int la = 0; la < this.layers.size() - 1; la++) {
        Layer layer = this.layers.get(la);
        for (Node node : layer.nodes) {
            for (Sinapse sinapse : node.sinapses) {
                sinapse.weight = sinapse.weight - learningRate * sinapse.gradient;
            }
        }
    }
    ...

I know... it's complex, because every layer needs to be treated independently. Note that to calculate the layer weights I use the derivative of the activation function. This is why each of my activation function classes (Sigmoid and ReLU) knows how to calculate its own derivative.

Results

The model converges quickly, reaching very good accuracy (around 100%). As I randomize the order of the records, each training run can behave differently.
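Since backpropagation calls calcularDerivada() on node values that have already been passed through the activation, the Sigmoid class can be sketched along these lines. This is an illustrative version, not the repository's actual class (which may implement a shared activation interface); the key point is that the sigmoid's derivative can be computed directly from its output as s * (1 - s):

```java
public class Sigmoid {
    // Activation value for a net input
    public double calcular(double net) {
        return 1.0 / (1.0 + Math.exp(-net));
    }

    // Derivative expressed in terms of the already-activated value:
    // if s = sigmoid(net), then d(sigmoid)/d(net) = s * (1 - s)
    public double calcularDerivada(double activatedValue) {
        return activatedValue * (1.0 - activatedValue);
    }
}
```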
I captured some of the training error logs here, just to show you:

    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 989 MSE: 0.043820135018546216
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 990 MSE: 0.04380586269505675
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 991 MSE: 0.04379162632120411
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 992 MSE: 0.04377742576230514
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 993 MSE: 0.04376326088432783
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 994 MSE: 0.043749131553887405
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 995 MSE: 0.04373503763824272
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 996 MSE: 0.04372097900529245
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 997 MSE: 0.04370695552357118
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 998 MSE: 0.04369296706224569
    [main] INFO com.neuraljava.samples.mlpgen.api.IrisClassifier - Epoch: 999 MSE: 0.04367901349111168

And here is the test output:

    Entrada: [4.6, 3.6, 1.0, 0.2, 1.0, 0.0, 0.0] Calculado: [0.9676715531587202, 0.04892961607756042, 0.0028958463500566075]
    Entrada: [6.3, 3.3, 6.0, 2.5, 0.0, 0.0, 1.0] Calculado: [0.003301264975616561, 0.043796515686338924, 0.9664163220261917]
    Entrada: [4.9, 3.1, 1.5, 0.1, 1.0, 0.0, 0.0] Calculado: [0.9613944104625847, 0.05702128231090371, 0.003092538946809996]
    Entrada: [5.6, 2.7, 4.2, 1.3, 0.0, 1.0, 0.0] Calculado: [0.027063443180807525, 0.8924148666249649, 0.06834712010378666]
    Entrada: [5.5, 4.2, 1.4, 0.2, 1.0, 0.0, 0.0] Calculado: [0.9681229230438231, 0.04813698554996353, 0.0028608703410195335]
    Entrada: [5.2, 4.1, 1.5, 0.1, 1.0, 0.0, 0.0] Calculado: [0.9675876845548526, 0.0479936971402869, 0.00290996625033745]
    Entrada: [5.6, 3.0, 4.1, 1.3, 0.0, 1.0, 0.0] Calculado: [0.03623797729642797, 0.9342440617137804, 0.034642417058752914]
    Entrada: [5.1, 3.8, 1.5, 0.3, 1.0, 0.0, 0.0] Calculado: [0.9660137302474517, 0.04951629593304033, 0.002979306468215438]
    Entrada: [7.9, 3.8, 6.4, 2.0, 0.0, 0.0, 1.0] Calculado: [0.011578929459422252, 0.508196282619192, 0.4821865786994383]
    Entrada: [4.9, 2.4, 3.3, 1.0, 0.0, 1.0, 0.0] Calculado: [0.04884019203370984, 0.9410220061917559, 0.025492474895976246]
    Entrada: [4.9, 2.5, 4.5, 1.7, 0.0, 0.0, 1.0] Calculado: [0.004324982037369483, 0.07272569633631648, 0.9425723048380175]
    Entrada: [4.6, 3.2, 1.4, 0.2, 1.0, 0.0, 0.0] Calculado: [0.9625727172147658, 0.05442856232056892, 0.003092228077310065]
    Entrada: [4.8, 3.0, 1.4, 0.1, 1.0, 0.0, 0.0] Calculado: [0.9615319363567879, 0.05725363687320617, 0.0030838556858076792]
    Entrada: [4.8, 3.1, 1.6, 0.2, 1.0, 0.0, 0.0] Calculado: [0.9577173406305236, 0.06143355127699689, 0.0032019979452733364]
    Entrada: [6.5, 3.0, 5.2, 2.0, 0.0, 0.0, 1.0] Calculado: [0.0061258823668374875, 0.16047995808216792, 0.8615163025239576]
    Entrada: [7.4, 2.8, 6.1, 1.9, 0.0, 0.0, 1.0] Calculado: [0.0052531316342232905, 0.10401458592015204, 0.925512168147189]
    Entrada: [5.2, 3.5, 1.5, 0.2, 1.0, 0.0, 0.0] Calculado: [0.9649622435472344, 0.05169358058627654, 0.0029911741695141923]
    Entrada: [4.4, 3.0, 1.3, 0.2, 1.0, 0.0, 0.0] Calculado: [0.9612837335587132, 0.05660770801265668, 0.0031282220962347857]
    Entrada: [5.4, 3.0, 4.5, 1.5, 0.0, 1.0, 0.0] Calculado: [0.014445624822966404, 0.647255672733979, 0.28381235845137126]
    Entrada: [4.8, 3.4, 1.6, 0.2, 1.0, 0.0, 0.0] Calculado: [0.9623350162252382, 0.054035630261833896, 0.0031084726270826915]
    Entrada: [4.7, 3.2, 1.3, 0.2, 1.0, 0.0, 0.0] Calculado: [0.9639896332797553, 0.053174883840758407, 0.0030324148857225814]
    Entrada: [6.2, 2.2, 4.5, 1.5, 0.0, 1.0, 0.0] Calculado: [0.012296002495372998, 0.5008620135510218, 0.5077911287012173]
    Entrada: [6.1, 2.9, 4.7, 1.4, 0.0, 1.0, 0.0] Calculado: [0.023456020883534715, 0.8623336250347347, 0.09727948855000569]
    Entrada: [7.3, 2.9, 6.3, 1.8, 0.0, 0.0, 1.0] Calculado: [0.0050819166784951166, 0.09748757227638581, 0.9306699555178484]
    Entrada: [6.3, 2.8, 5.1, 1.5, 0.0, 0.0, 1.0] Calculado: [0.011596224350588174, 0.485448746623515, 0.5130889451215297]
    Entrada: [6.2, 2.8, 4.8, 1.8, 0.0, 0.0, 1.0] Calculado: [0.00893328236272492, 0.334086150824026, 0.6704820968813707]
    Entrada: [5.9, 3.0, 5.1, 1.8, 0.0, 0.0, 1.0] Calculado: [0.005723107798993621, 0.1395352905294087, 0.8803030775031335]
    Entrada: [6.4, 2.9, 4.3, 1.3, 0.0, 1.0, 0.0] Calculado: [0.042183922253660654, 0.9574718464983897, 0.024434102016789052]
    Entrada: [6.7, 3.0, 5.0, 1.7, 0.0, 1.0, 0.0] Calculado: [0.018090291088175836, 0.7663488669666162, 0.18941319744715307]
    Entrada: [6.0, 3.0, 4.8, 1.8, 0.0, 0.0, 1.0] Calculado: [0.009271528709536842, 0.367959902898695, 0.6180135323215016]
    Testes: 30 erros: 2 acurácia: 93.33333333333333%

A quick translation: Testes: tests; erros: errors; acurácia: accuracy; Entrada: input; Calculado: calculated.

Before checking the result, I round each output value to zero or one, since a Sigmoid output is never exactly 0 or 1.

Conclusion

This is a fully Java-based MLP model, with no frameworks or libraries, built for demonstration purposes. We can improve its accuracy by modifying the hyperparameters or by normalizing the values of the input variables, but I think it is fine as a demonstration.

I do not recommend building a neural network model by hand. The hassle of designing and testing it isn't worth it, and it makes your model inflexible for data scientists. For example, what if we wanted to use the ADAM optimizer? Or Categorical Cross Entropy as the cost function? What if we wanted to use Stochastic Gradient Descent? These modifications, common in data science work, would require a lot of programming effort to implement.

Another reason is performance. It is well known that running on a GPU is the recommended way to do Deep Learning, but do you know how parallel programming for a GPU works? We need to create kernels that can be parallelized across the many GPU cores, using either CUDA (NVIDIA) or OpenCL (other vendors). For this, our calculation must be expressed as matrix operations (linear algebra) rather than procedurally, as I did with multiple loops and ifs. But that is not my goal here; I just want to show you the mechanics of training a neural network.

(Originally published here.)