In the previous article, I discussed whether it is possible to use machine learning (in particular, face and mask detection) in the browser, approaches to detection, and optimization of all the processes. Today I want to give the technical details of the implementation.

Technologies

The primary language for development is TypeScript. The client application is written in React.js.

The application uses several neural networks to detect different events: face detection and mask detection. Each model/network runs in a separate thread (Web Worker). The neural networks are launched with TensorFlow.js, using WebAssembly or WebGL as a backend, which allows the code to execute at close to native speed. The choice of backend depends on the size of the model (small models work faster on WebAssembly), but you should always test and choose what is faster for a particular model.

The video stream is received and displayed using WebRTC, and the OpenCV.js library is used to work with images.

The following approach was implemented: the main thread only orchestrates all the processes. It does not load the heavy OpenCV library and does not use TensorFlow.js. It gets images from the video stream and sends them for processing to the web workers. A new image is not sent to a worker until the worker informs the main thread that it is free and can process the next image. Thus no queue builds up, and we always process the most recent image. The image is first sent for face detection; only if a face is detected is it then sent for mask detection. Each worker result is saved and can be displayed on the UI.
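The "send only when the worker is free" rule can be sketched as a small gate on the main thread. This is a hypothetical sketch (the FrameGate name and API are mine, not the project's): frames arriving while the worker is busy overwrite each other, so only the latest one is processed next.

```typescript
// A minimal sketch of the "process only the latest frame" rule:
// a frame is handed to the worker only when the worker has reported
// that it is free; frames arriving in the meantime replace each other.
class FrameGate<T> {
  private busy = false;
  private pending: T | null = null;

  constructor(private send: (frame: T) => void) {}

  // Called for every frame grabbed from the video stream.
  push(frame: T): void {
    if (this.busy) {
      this.pending = frame; // an older pending frame is simply dropped
      return;
    }
    this.busy = true;
    this.send(frame);
  }

  // Called when the worker posts its result back.
  onWorkerFree(): void {
    this.busy = false;
    if (this.pending !== null) {
      const next = this.pending;
      this.pending = null;
      this.push(next);
    }
  }
}

// Usage: frames 1..3 arrive while the worker is busy with frame 0;
// only the latest one (3) is sent next.
const processed: number[] = [];
const gate = new FrameGate<number>((f) => processed.push(f));
gate.push(0);
gate.push(1);
gate.push(2);
gate.push(3);
gate.onWorkerFree();
console.log(processed); // [0, 3]
```

With this scheme the worker always sees the freshest camera frame, at the cost of dropping intermediate ones.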
Performance

- Receiving an image from the stream - 31 ms
- Face detection preprocessing - 0-1 ms
- Face detection - 51 ms
- Face detection post-processing - 8 ms
- Mask detection preprocessing - 2 ms
- Mask detection - 11 ms
- Mask detection post-processing - 0-1 ms

Total:
- Face detection - 60 ms + 31 ms = 91 ms
- Mask detection - 14 ms

In ~105 ms we have all the information from the image.

Face detection preprocessing is getting an image from the stream and sending it to the web worker. Face detection post-processing is saving the result from the face detection worker and drawing it on the canvas. Mask detection preprocessing is preparing a canvas with an aligned face image and transferring it to the web worker. Mask detection post-processing is saving the results of mask detection.

Each model (face detection and mask detection) runs in a separate web worker, which loads the necessary libraries (OpenCV.js, TensorFlow.js, models). We have three web workers:

- Face detection
- Mask detection
- A helper worker that transforms images using heavy methods from OpenCV and TensorFlow.js, for example, to build a calibration matrix for multiple cameras

Features and tricks that helped us in development and optimization

Web workers and how to work with them

A web worker is a way to run a script on a separate thread. Web workers allow heavy processes to run in parallel with the main thread without blocking the UI. The main thread executes the orchestration logic; all heavy computation runs in the web workers. Web workers are supported in almost all browsers.
Features and limitations of web workers

Features:
- Access only to a subset of JavaScript features
- Access to the navigator object
- Read-only access to the location object
- Possibility to use XMLHttpRequest
- Possibility to use setTimeout() / clearTimeout() and setInterval() / clearInterval()
- Application Cache
- Importing external scripts using importScripts()
- Creating other web workers

Limitations:
- No access to the DOM
- No access to window
- No access to document
- No access to parent

Communication between the main thread and the web workers happens through the postMessage method and the onmessage event handler.

If you look at the specification of the postMessage() method, you will notice that it accepts not only data but also a second argument - a transferable object:

```typescript
worker.postMessage(message, [transfer]);
```

Let's see how using it helps us. A transferable object is an object that can be passed between different execution contexts, such as the main thread and web workers, without copying. This interface is implemented by:

- ImageBitmap
- OffscreenCanvas
- ArrayBuffer
- MessagePort

If we want to transfer 500 MB of data to a worker, we can do it without the second argument, but the difference will be in the transfer time and memory usage. Sending the data without the transfer argument takes 149 ms and 1042 MB in Google Chrome, and even more in other browsers. With the transfer argument it takes 1 ms and halves the memory consumption. Since images are transferred from the main thread to the web workers very often, it is important for us to do this as quickly and memory-efficiently as possible, and this feature helps a lot.

Using OffscreenCanvas

A web worker does not have access to the DOM, so you cannot use a canvas directly. OffscreenCanvas comes to the rescue.
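The effect of the transfer list is easy to demonstrate: a transferred ArrayBuffer is moved, not copied, and is left detached (zero-length) on the sending side. The sketch below uses structuredClone, which follows the same transfer semantics as postMessage, so it runs outside a worker too:

```typescript
// A transferred ArrayBuffer is moved, not copied: after the transfer
// the sending side is left with a detached (zero-byte) buffer.
const buffer = new ArrayBuffer(16 * 1024 * 1024); // 16 MB payload
console.log(buffer.byteLength); // 16777216

const moved = structuredClone(buffer, { transfer: [buffer] });

console.log(moved.byteLength);  // 16777216 - the receiving side now owns the bytes
console.log(buffer.byteLength); // 0 - the original is detached

// With a worker the call is analogous:
//   worker.postMessage({ image: buffer }, [buffer]);
```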
Advantages of OffscreenCanvas:
- Fully detached from the DOM
- It can be used both in the main thread and in web workers
- It has a transferable interface and does not load the main thread if rendering runs in a web worker

Advantages of using requestAnimationFrame

requestAnimationFrame allows you to receive images from the stream at maximum performance (60 FPS); it is limited only by the camera's capabilities, and not all cameras send video at such a frequency.

The main advantages are:
- The browser optimizes requestAnimationFrame calls together with other animations and drawings
- Less power consumption, which is very important for mobile devices
- It works without a call stack and doesn't create a call queue
- Minimum call interval of 16.67 ms (1000 ms / 60 fps = 16.67 ms)
- The call frequency can be controlled manually

Application metrics

At first, using stats.js seemed a good idea for displaying application metrics, but once the number of metrics grew past 20, the main thread of the application began to slow down due to the specifics of the browser. Each metric uses a canvas on which a graph is drawn (data arrive there very often), and the browser re-renders it at high frequency, which negatively affects the application. As a result, the metrics are underestimated. To avoid this problem, it is better to abandon the "beauty" and simply display the current value and the computed all-time average as text. Updating a value in the DOM is much faster than rendering graphics.

Memory leak control

Quite often during development we encountered memory leaks on mobile devices, while on a desktop the application could run for a very long time. It is impossible to know how much memory a web worker actually consumes (performance.memory does not work in web workers). Because of this, we made it possible to launch our application both through web workers and entirely in the main thread.
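The last point - manual control of the call frequency - can be implemented with a simple time gate inside the rAF callback. A sketch with hypothetical names (shouldGrabFrame, grabFrame), driven here by fake timestamps so it runs outside the browser:

```typescript
// Manual throttling of a requestAnimationFrame loop: rAF fires at up to
// 60 FPS, but we only grab a frame when at least `minIntervalMs` has
// passed. The decision itself is a pure function.
function shouldGrabFrame(nowMs: number, lastGrabMs: number, minIntervalMs: number): boolean {
  return nowMs - lastGrabMs >= minIntervalMs;
}

// In the browser the loop would look like this (hypothetical handler names):
// let lastGrab = 0;
// const loop = (now: number) => {
//   if (shouldGrabFrame(now, lastGrab, 100)) { // at most ~10 FPS
//     lastGrab = now;
//     grabFrame();
//   }
//   requestAnimationFrame(loop);
// };
// requestAnimationFrame(loop);

// rAF ticks arrive every ~16.67 ms; with a 100 ms budget only every
// seventh tick passes the gate.
const ticks = Array.from({ length: 13 }, (_, i) => i * 16.67);
let last = -Infinity;
const grabbed: number[] = [];
for (const t of ticks) {
  if (shouldGrabFrame(t, last, 100)) {
    last = t;
    grabbed.push(t);
  }
}
console.log(grabbed.length); // 3
```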
By running all our detection models on the main thread, we can take memory consumption metrics, see where the memory leak is, and fix it.

The main code of the models in web workers

We have covered the main tricks used in the application; now let's look at the implementation.

Initially, comlink-loader was used for working with web workers. It's a very handy library that lets you work with a worker as a class object, without using the onmessage and postMessage methods, and control the asynchronous code using async-await. All this was convenient until the application was launched on a tablet (Samsung Galaxy Tab S7), where it suddenly crashed after 2 minutes. After analyzing all the code, no memory leaks were found, except for the black box of this library. For some reason, the launched TensorFlow.js models were not cleared and were stored somewhere inside the library. It was decided to switch to worker-loader, which allows you to work with web workers from pure JS without unnecessary layers. This solved the problem; the application now runs for days without crashes.
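Working with workers "from pure JS" means routing messages by their type field yourself. A minimal sketch of such a dispatcher (createDispatcher and the handler wiring are hypothetical, not the project's code):

```typescript
// A worker receives messages of different types ('init', 'detectFace', ...)
// and must route each one to the right handler by hand.
type Handler = (message: any) => any;

function createDispatcher(handlers: Record<string, Handler>) {
  return async (message: { type: string }): Promise<any> => {
    const handler = handlers[message.type];
    if (!handler) {
      // Mirrors the `Type=... is not supported` error used in the main thread
      throw new Error(`Type=${message.type} is not supported`);
    }
    return handler(message);
  };
}

// Inside a worker it would be wired up roughly as:
//   const dispatch = createDispatcher({ init, detectFace });
//   self.onmessage = async (e) => self.postMessage(await dispatch(e.data));

const dispatch = createDispatcher({
  init: () => ({ type: 'init', status: true }),
});

dispatch({ type: 'init' }).then((r) => console.log(r)); // { type: 'init', status: true }
```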
Face detection worker

Create the web worker:

```typescript
this.faceDetectionWorker = workers.FaceRgbDetectionWorkerFactory.createWebWorker();
```

Create a message handler for the worker in the main thread:

```typescript
this.faceDetectionWorker.onmessage = async (event) => {
  if (event.data.type === 'load') {
    this.faceDetectionWorker.postMessage({
      type: 'init',
      backend,
      streamSettings,
      faceDetectionSettings,
      imageRatio: this.imageRatio,
    });
  } else if (event.data.type === 'init') {
    this.isFaceWorkerInit = event.data.status;

    // Only when both workers are initialized do we start to grab and process frames
    if (this.isFaceWorkerInit && this.isMaskWorkerInit) {
      await this.grabFrame();
    }
  } else if (event.data.type === 'faceResults') {
    this.onFaceDetected(event);
  } else {
    throw new Error(`Type=${event.data.type} is not supported by RgbVideo for FaceRgbDetectionWorker`);
  }
};
```

Sending an image for face processing:

```typescript
this.faceDetectionWorker.postMessage(
  {
    type: 'detectFace',
    originalImageToProcess: this.lastImage,
    lastIndex: lastItem!.index,
  },
  [this.lastImage], // transferable object
);
```

Face detection web worker code

The init method initializes all the models, libraries, and canvases that are needed for the work:

```typescript
export const init = async (data) => {
  const { backend, streamSettings, faceDetectionSettings, imageRatio } = data;

  flipHorizontal = streamSettings.flipHorizontal;
  faceMinWidth = faceDetectionSettings.faceMinWidth;
  faceMinWidthConversionFactor = faceDetectionSettings.faceMinWidthConversionFactor;
  predictionIOU = faceDetectionSettings.predictionIOU;
  recommendedLocation = faceDetectionSettings.useRecommendedLocation
    ? faceDetectionSettings.recommendedLocation
    : null;
  detectedFaceThumbnailSize = faceDetectionSettings.detectedFaceThumbnailSize;
  srcImageRatio = imageRatio;

  await tfc.setBackend(backend);
  await tfc.ready();

  [blazeModel] = await Promise.all([
    blazeface.load({
      maxFaces: faceDetectionSettings.maxFaces, // The maximum number of faces returned by the model
      inputWidth: faceDetectionSettings.faceDetectionImageMinWidth, // The width of the input image
      inputHeight: faceDetectionSettings.faceDetectionImageMinHeight, // The height of the input image
      iouThreshold: faceDetectionSettings.iouThreshold, // The threshold for deciding whether boxes overlap too much
      scoreThreshold: faceDetectionSettings.scoreThreshold, // The threshold for deciding when to remove boxes based on score
    }),
    isOpenCvLoaded(),
  ]);

  faceDetection = new FaceDetection();

  originalImageToProcessCanvas = new OffscreenCanvas(srcImageRatio.videoWidth, srcImageRatio.videoHeight);
  originalImageToProcessCanvasCtx = originalImageToProcessCanvas.getContext('2d');

  resizedImageToProcessCanvas = new OffscreenCanvas(
    srcImageRatio.faceDetectionImageWidth,
    srcImageRatio.faceDetectionImageHeight,
  );
  resizedImageToProcessCanvasCtx = resizedImageToProcessCanvas.getContext('2d');

  return blazeModel;
};
```

The isOpenCvLoaded method waits for OpenCV to load:

```typescript
export const isOpenCvLoaded = () => {
  let timeoutId;

  const resolveOpenCvPromise = (resolve) => {
    if (timeoutId) {
      clearTimeout(timeoutId);
    }

    try {
      // eslint-disable-next-line no-undef
      if (cv && cv.Mat) {
        return resolve();
      } else {
        timeoutId = setTimeout(() => {
          resolveOpenCvPromise(resolve);
        }, OpenCvLoadedTimeoutInMs);
      }
    } catch {
      timeoutId = setTimeout(() => {
        resolveOpenCvPromise(resolve);
      }, OpenCvLoadedTimeoutInMs);
    }
  };

  return new Promise((resolve) => {
    resolveOpenCvPromise(resolve);
  });
};
```

Face detection method:

```typescript
export const detectFace = async (data, faceModel) => {
  const { originalImageToProcess, lastIndex } = data;
  let facesThumbnailsImageData = [];

  // Resize the original image to the recommended BlazeFace resolution
  resizedImageToProcessCanvasCtx.drawImage(
    originalImageToProcess,
    0,
    0,
    srcImageRatio.faceDetectionImageWidth,
    srcImageRatio.faceDetectionImageHeight,
  );

  // Get the resized image
  const resizedImageDataToProcess = resizedImageToProcessCanvasCtx.getImageData(
    0,
    0,
    srcImageRatio.faceDetectionImageWidth,
    srcImageRatio.faceDetectionImageHeight,
  );

  // Detect faces with BlazeFace
  let predictions = await faceModel.estimateFaces(
    resizedImageDataToProcess, // The image to classify. Can be a tensor, DOM element image, video, or canvas
    returnTensors, // Whether to return tensors as opposed to values
    flipHorizontal, // Whether to flip/mirror the facial keypoints horizontally. Should be true for videos that are flipped by default (e.g. webcams)
    annotateBoxes, // Whether to annotate bounding boxes with additional properties such as landmarks and probability. Pass in `false` for faster inference if annotations are not needed
  );

  // Normalize predictions
  predictions = faceDetection.normalizePredictions(
    predictions,
    returnTensors,
    annotateBoxes,
    srcImageRatio.faceDetectionImageRatio,
  );
  // Filter predictions by the criterion that all landmarks should be in the area of interest
  predictions = faceDetection.filterPredictionsByFullLandmarks(
    predictions,
    srcImageRatio.videoWidth,
    srcImageRatio.videoHeight,
  );
  // Filter predictions by min face width
  predictions = faceDetection.filterPredictionsByMinWidth(predictions, faceMinWidth, faceMinWidthConversionFactor);
  // Filter predictions by recommended location
  predictions = faceDetection.filterPredictionsByRecommendedLocation(predictions, predictionIOU, recommendedLocation);

  // If there are any predictions, face thumbnail extraction starts according to the configured size
  if (predictions && predictions.length > 0) {
    // Draw the initial original image
    originalImageToProcessCanvasCtx.drawImage(originalImageToProcess, 0, 0);
    const originalImageDataToProcess = originalImageToProcessCanvasCtx.getImageData(
      0,
      0,
      originalImageToProcess.width,
      originalImageToProcess.height,
    );

    // eslint-disable-next-line no-undef
    let srcImageData = cv.matFromImageData(originalImageDataToProcess);

    try {
      for (let i = 0; i < predictions.length; i++) {
        const prediction = predictions[i];
        const facesOriginalLandmarks = JSON.parse(JSON.stringify(prediction.originalLandmarks));

        if (flipHorizontal) {
          for (let j = 0; j < facesOriginalLandmarks.length; j++) {
            facesOriginalLandmarks[j][0] = srcImageRatio.videoWidth - facesOriginalLandmarks[j][0];
          }
        }

        // eslint-disable-next-line no-undef
        let dstImageData = new cv.Mat();

        try {
          // eslint-disable-next-line no-undef
          let thumbnailSize = new cv.Size(detectedFaceThumbnailSize, detectedFaceThumbnailSize);
          let transformation = getOneToOneFaceTransformationByTarget(detectedFaceThumbnailSize);
          const similarityTransformation = getSimilarityTransformation(facesOriginalLandmarks, transformation);
          // eslint-disable-next-line no-undef
          let similarityTransformationMatrix = cv.matFromArray(3, 3, cv.CV_64F, similarityTransformation.data);

          try {
            // eslint-disable-next-line no-undef
            cv.warpPerspective(
              srcImageData,
              dstImageData,
              similarityTransformationMatrix,
              thumbnailSize,
              cv.INTER_LINEAR,
              cv.BORDER_CONSTANT,
              new cv.Scalar(127, 127, 127, 255),
            );

            facesThumbnailsImageData.push(
              new ImageData(
                new Uint8ClampedArray(dstImageData.data, dstImageData.cols, dstImageData.rows),
                detectedFaceThumbnailSize,
                detectedFaceThumbnailSize,
              ),
            );
          } finally {
            similarityTransformationMatrix.delete();
            similarityTransformationMatrix = null;
          }
        } finally {
          dstImageData.delete();
          dstImageData = null;
        }
      }
    } finally {
      srcImageData.delete();
      srcImageData = null;
    }
  }

  return { resizedImageDataToProcess, predictions, facesThumbnailsImageData, lastIndex };
};
```

The input is an image and an index that is later used to match the face with the mask detection result.
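The normalizePredictions step maps coordinates from the downscaled detection image back to the original video resolution. A simplified, self-contained sketch of that mapping (function and parameter names are assumptions, not the project's API):

```typescript
// Landmarks come back in the coordinate space of the small image the
// detector ran on, so each point must be scaled back to the video size.
type Landmark = [number, number];

function restoreLandmarks(
  landmarks: Landmark[],
  detectionWidth: number,  // e.g. 128
  detectionHeight: number, // e.g. 128
  videoWidth: number,      // e.g. 1280
  videoHeight: number,     // e.g. 720
): Landmark[] {
  const sx = videoWidth / detectionWidth;
  const sy = videoHeight / detectionHeight;
  return landmarks.map(([x, y]): Landmark => [x * sx, y * sy]);
}

const original = restoreLandmarks([[64, 64], [32, 16]], 128, 128, 1280, 720);
console.log(original); // [[640, 360], [320, 90]]
```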
Since BlazeFace accepts images with a maximum size of 128 px, the image from the camera must be downscaled. Calling faceModel.estimateFaces starts the image analysis with BlazeFace, and the predicted coordinates of the face, nose, ears, eyes, and mouth area are returned to the main thread. Before working with them, you need to restore the coordinates relative to the original image, because we compressed it to 128 px. You can then use these data to decide whether the face is in the desired area and whether it meets the minimum face size needed for subsequent identification. The face is cut out of the image and aligned for mask identification using OpenCV methods.

Mask detection

Model initialization and the WebAssembly backend:

```typescript
export const init = async (data) => {
  const { backend, streamSettings, maskDetectionsSettings, imageRatio } = data;

  flipHorizontal = streamSettings.flipHorizontal;
  detectedMaskThumbnailSize = maskDetectionsSettings.detectedMaskThumbnailSize;
  srcImageRatio = imageRatio;

  await tfc.setBackend(backend);
  await tfc.ready();

  [maskModel] = await Promise.all([
    tfconv.loadGraphModel(
      `/rgb_mask_classification_first/MobileNetV${maskDetectionsSettings.mobileNetVersion}_${maskDetectionsSettings.mobileNetWeight}/${maskDetectionsSettings.mobileNetType}/model.json`,
    ),
  ]);

  detectedMaskThumbnailCanvas = new OffscreenCanvas(detectedMaskThumbnailSize, detectedMaskThumbnailSize);
  detectedMaskThumbnailCanvasCtx = detectedMaskThumbnailCanvas.getContext('2d');

  return maskModel;
};
```

Mask detection requires the coordinates of the eyes, ears, nose, and mouth, as well as the aligned image, which are returned by the face detection worker.
```typescript
this.maskDetectionWorker.postMessage({
  type: 'detectMask',
  prediction: lastItem!.data.predictions[0],
  imageDataToProcess,
  lastIndex: lastItem!.index,
});
```

Detection method:

```typescript
export const detectMask = async (data, maskModel) => {
  const { prediction, imageDataToProcess, lastIndex } = data;
  let masksScores = [];

  const maskLandmarks = JSON.parse(JSON.stringify(prediction.landmarks));

  if (flipHorizontal) {
    for (let j = 0; j < maskLandmarks.length; j++) {
      maskLandmarks[j][0] = srcImageRatio.faceDetectionImageWidth - maskLandmarks[j][0];
    }
  }

  // Draw the thumbnail with the mask
  detectedMaskThumbnailCanvasCtx.putImageData(imageDataToProcess, 0, 0);

  // Detect the mask via the neural network
  let predictionTensor = tfc.tidy(() => {
    let maskDetectionSnapshotFromPixels = tfc.browser.fromPixels(detectedMaskThumbnailCanvas);
    let maskDetectionSnapshotFromPixelsFlot32 = tfc.cast(maskDetectionSnapshotFromPixels, 'float32');
    let expandedDims = maskDetectionSnapshotFromPixelsFlot32.expandDims(0);

    return maskModel.predict(expandedDims);
  });

  // Put the mask detection result into the returned array
  try {
    masksScores.push(predictionTensor.dataSync()[0].toFixed(4));
  } finally {
    predictionTensor.dispose();
    predictionTensor = null;
  }

  return {
    masksScores,
    lastIndex,
  };
};
```

The result of the neural network is the probability that a mask is present, which is returned from the worker. This makes it possible to raise or lower the mask detection threshold. By lastIndex, we can match the face with the presence of a mask and display information about a specific person on the screen.

Conclusion

I hope this article helps you learn about the possibilities of working with ML in the browser and ways to optimize it. Most applications can be optimized using the tricks described above.