The workflow for building machine learning models often ends at the evaluation stage: you have achieved an acceptable accuracy, and ta-da! Mission Accomplished. Beyond that, it might just be sufficient to get those nice-looking graphs for your paper or for your internal documentation. In fact, going the extra mile to put your model into production is not always needed. And even when it is, this task is delegated to a system administrator.

However, nowadays, many researchers and engineers find themselves responsible for handling the complete flow from conceiving the models to serving them to the outside world. Whether it is a university project or a personal experiment, demonstrating our work is generally a great way to get a wider audience interested. Few people would make the extra effort to use a system whose value is not instantly perceived.

In this article, we'll go together through this workflow, a process that I had to do repeatedly myself. The assumption is that you have already built a machine learning or deep learning model using your favorite framework (scikit-learn, Keras, TensorFlow, PyTorch, etc.). Now you want to serve it to the world at scale via an API. This entails serving multiple requests at a time, spawning new processes as the load increases, and reducing the number of processes as the load decreases. You also want the additional guarantee that your server will be restarted after unexpected system failures.

By "at scale", we're not talking about the industrial scale of a huge company. The goal is to make the best out of that server with lots of CPUs and a large amount of RAM sitting idly at your institution or in the cloud. If that is what you have in mind, let's go through it together. We'll be considering the context of Python-based frameworks on Linux servers.

Our setup will involve:

- Anaconda: for managing package installation and creating an isolated Python 3 environment.
- Keras: a high-level neural networks API, capable of running on top of TensorFlow, CNTK, or Theano.
- Flask: a minimalistic Python framework for building RESTful APIs. Despite being easy to use, Flask's built-in server serves only one request at a time by default; hence it is not suitable on its own for deployment in production.
- nginx: the highly stable web server, which provides benefits such as load balancing, SSL configuration, etc.
- uWSGI: a highly configurable WSGI (Web Server Gateway Interface) server that allows forking multiple workers to serve multiple requests at a time.
- systemd: an init system used in multiple Linux distributions to manage system processes after booting.

Nginx will be our interface to the internet, and it will be the one handling clients' requests. Nginx has native support for the binary uWSGI protocol, and the two communicate via Unix sockets. In turn, the uWSGI server will invoke a callable object within our Flask application directly. That is the way requests will be served.

A few notes before we start this tutorial:

- Most components above can be easily replaced by equivalent components with little to no change in the rest of the steps. For example, Keras can be easily replaced with PyTorch, Flask can be easily replaced with Bottle, and so on.
- We will only consider the case of serving models over CPU. The typical case is having access to a server with a lot of CPU cores and trying to make use of these cores to serve the models. GPUs, on the other hand, are more expensive to get in large numbers. Moreover, depending on your application, the speed gain you get from using a GPU at prediction time might not be that significant (especially in NLP applications).
Setting up the Environment

To begin with, we need to install the systemd and nginx packages:

```bash
sudo apt-get install systemd nginx
```

Next, we have to install Anaconda by following the instructions on the official site, which consist of downloading the executable, running it, and adding Anaconda to your system's PATH. Below, we will assume that Anaconda is installed under the home directory.

All the code and configuration files in this article are available from the accompanying GitHub repository (github.com/harkous/production_ml). But make sure you follow the steps below to get the full workflow.

Next, let's create the isolated Anaconda environment from the environment.yml file. Here is what this file looks like (it already contains several of the frameworks we'll be using):

```yaml
name: production_ml_env
channels:
  - conda-forge
dependencies:
  - python=3.6
  - keras
  - flask
  - uwsgi
  - numpy
  - pip
  - pip:
    - uwsgitop
```

We run the following to create the environment:

```bash
conda env create --file environment.yml
```

When we want to activate this environment, we run:

```bash
source activate production_ml_env
```

By now, we have Keras installed, alongside flask, uwsgi, uwsgitop, etc. So we are ready to get started.

Building the Flask Web App

For the purposes of this tutorial, we will not dive deeply into how to build your ML model. Instead, we will adapt the example of topic classification using the Reuters newswire dataset bundled within Keras. This is the code for building the classifier:

```python
'''Trains and evaluates a simple MLP on the Reuters newswire
topic classification task.
'''
from __future__ import print_function

import os

import numpy as np
import keras
from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint

MODEL_DIR = './models'
max_words = 1000
batch_size = 32
epochs = 5

print('Loading data...')
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words,
                                                         test_split=0.2)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

num_classes = np.max(y_train) + 1
print(num_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

print('Building model...')
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)

# save the model with the best validation accuracy after each epoch
mcp = ModelCheckpoint(os.path.join(MODEL_DIR, 'reuters_model.hdf5'),
                      monitor="val_acc",
                      save_best_only=True)

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1,
                    callbacks=[mcp])
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])
```

To replicate the setup we use here, simply run the following to train a model without a GPU:

```bash
export CUDA_VISIBLE_DEVICES=-1
KERAS_BACKEND=theano python build_classifier.py
```

This will create a model file reuters_model.hdf5 in the folder models.
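Before wrapping the model in a web app, it can help to sanity-check the saved file from a plain Python shell. The following is a minimal sketch (not part of the original repository; run it with the same KERAS_BACKEND and CUDA_VISIBLE_DEVICES variables as above if needed) that loads reuters_model.hdf5 and classifies one sample string, mirroring the preprocessing we will use in the Flask app below:

```python
# Minimal smoke test for the saved model (illustrative only).
import os

from keras.models import load_model
from keras.datasets import reuters
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

MODEL_DIR = './models'
max_words = 1000

model = load_model(os.path.join(MODEL_DIR, 'reuters_model.hdf5'))
# the Reuters word index maps words to the integer ids used during training
word_index = reuters.get_word_index()
tokenizer = Tokenizer(num_words=max_words)

text = 'this is a news sample text about sports and football in specific'
words = text_to_word_sequence(text)
# unknown words fall back to index 0
indices = [[word_index[w] if w in word_index else 0 for w in words]]
x = tokenizer.sequences_to_matrix(indices, mode='binary')

print('predicted class:', model.predict(x)[0].argmax(axis=-1))
```

If this prints a class index without errors, the model file is ready to be served.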
Now, we are ready to serve the model via Flask on port 4444. In the code below, we provide a single REST endpoint, /predict, that supports GET requests, where the text to classify is provided as a parameter. The returned JSON is of the form {"prediction": "N"}, where N is an integer representing the predicted class.

```python
from flask import Flask
from flask import request
from flask import jsonify
from keras.models import load_model
from keras.datasets import reuters
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
import os

MODEL_DIR = './models'
max_words = 1000

app = Flask(__name__)

print("Loading model")
model = load_model(os.path.join(MODEL_DIR, 'reuters_model.hdf5'))
# we need the word index to map words to indices
word_index = reuters.get_word_index()
tokenizer = Tokenizer(num_words=max_words)


def preprocess_text(text):
    word_sequence = text_to_word_sequence(text)
    indices_sequence = [[word_index[word] if word in word_index else 0
                         for word in word_sequence]]
    x = tokenizer.sequences_to_matrix(indices_sequence, mode='binary')
    return x


@app.route('/predict', methods=['GET'])
def predict():
    try:
        text = request.args.get('text')
        x = preprocess_text(text)
        y = model.predict(x)
        predicted_class = y[0].argmax(axis=-1)
        print(predicted_class)
        return jsonify({'prediction': str(predicted_class)})
    except:
        response = jsonify({'error': 'problem predicting'})
        response.status_code = 400
        return response


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=4444)
```

To start the Flask application server, we run:

```bash
python app.py
```

Voila! Now we have the simple, lightweight server running. You can test the server with your favorite REST client (e.g., Postman) or by simply going to this URL in your web browser (replace your_server_url by your server's URL):

http://your_server_url:4444/predict?text=this is a news sample text about sports and football in specific

You should get back a response like:

```json
{"prediction": "11"}
```
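You can also query the endpoint programmatically. The snippet below is a small sketch using the requests library (it is not listed in environment.yml, so you would install it separately, e.g. with pip install requests):

```python
# Query the /predict endpoint from Python (illustrative sketch).
import requests

params = {'text': 'this is a news sample text about sports and football in specific'}
# replace your_server_url with your server's URL or IP
response = requests.get('http://your_server_url:4444/predict', params=params)
print(response.status_code, response.json())  # e.g. 200 {'prediction': '11'}
```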
Configuring the uWSGI Server

Now, we are off to scaling our simple application server. uWSGI will be the key here. It communicates with our Flask application by invoking the callable object app in the file app.py, and it includes most of the parallelization features we are after. Its configuration file, uwsgi.ini, looks as follows:

```ini
[uwsgi]
# placeholders that you have to change
my_app_folder = /home/harkous/Development/production_ml
my_user = harkous

socket = %(my_app_folder)/production_ml.sock
chdir = %(my_app_folder)
file = app.py
callable = app

# environment variables
env = CUDA_VISIBLE_DEVICES=-1
env = KERAS_BACKEND=theano
env = PYTHONPATH=%(my_app_folder):$PYTHONPATH

master = true
processes = 5
# allows nginx (and all users) to read and write on this socket
chmod-socket = 666
# remove the socket when the process stops
vacuum = true

# loads your application one time per worker
# will very probably consume more memory,
# but will run in a more consistent and clean environment.
lazy-apps = true

uid = %(my_user)
gid = %(my_user)

# uWSGI will kill the process instead of reloading it
die-on-term = true

# socket file for getting stats about the workers
stats = %(my_app_folder)/stats.production_ml.sock

# Scaling the server with the Cheaper subsystem
# set cheaper algorithm to use, if not set default will be used
cheaper-algo = spare
# minimum number of workers to keep at all times
cheaper = 5
# number of workers to spawn at startup
cheaper-initial = 5
# maximum number of workers that can be spawned
workers = 50
# how many workers should be spawned at a time
cheaper-step = 3
```

On your side, you have to modify the option my_app_folder to be the folder of your own app directory and the option my_user to be your own username. Depending on your needs and file locations, you might need to modify or add other options too.

One important section in uwsgi.ini is the part where we use the Cheaper subsystem in uWSGI, which allows us to run multiple workers in parallel to serve multiple concurrent requests. This is one of the cool features of uWSGI, where dynamically scaling up and down is attainable with a few parameters. With the above configuration, we will have at least 5 workers at all times. Upon load increase, Cheaper will allocate 3 additional workers at a time until all the requests find available workers. The maximum number of workers above is set to 50.

In your case, the best configuration options depend on the number of cores in the server, the total memory available, and the memory consumption of your application. Take a look at the official docs for advanced deployment options.
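As a rough illustration of that trade-off, here is a hypothetical back-of-the-envelope calculation. The numbers and the two-workers-per-core cap are placeholder assumptions, not recommendations from this tutorial: with lazy-apps enabled, every worker loads its own copy of the model, so the memory budget often caps the worker count before the CPU count does.

```python
# Hypothetical sizing helper; replace the constants with your own measurements.
import multiprocessing

per_worker_memory_gb = 0.5   # memory of one worker (model + overhead), measured on your server
memory_budget_gb = 16.0      # RAM you are willing to dedicate to uWSGI workers
cpu_cores = multiprocessing.cpu_count()

max_workers_by_memory = int(memory_budget_gb / per_worker_memory_gb)

print('CPU cores available:  ', cpu_cores)
print('workers memory allows:', max_workers_by_memory)
# a conservative value for the "workers" option in uwsgi.ini
print('suggested workers cap:', min(max_workers_by_memory, 2 * cpu_cores))
```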
Connecting uWSGI with nginx

We're almost there. If we start uWSGI now (which we'll do in a while), it will take care of invoking the app from the file app.py, and we will benefit from all the scaling features it provides. However, in order to get REST requests from the internet and pass them to the Flask app via uWSGI, we will be configuring nginx.

Here is a barebones configuration file for nginx, with just the part we depend on for this application. Of course, nginx can additionally be used for configuring SSL or serving static files, but that is out of the scope of this article.

```nginx
server {
    listen 4444;
    # change this to your server name or IP
    server_name YOUR_SERVER_NAME_OR_IP;

    location / {
        include uwsgi_params;
        # change this to the location of the uWSGI socket file (set in uwsgi.ini)
        uwsgi_pass unix:/home/harkous/Development/production_ml/production_ml.sock;
    }
}
```

We place this file in /etc/nginx/sites-available/nginx_production_ml (you will need sudo access for that). Then, to enable this nginx configuration, we link it to the sites-enabled directory:

```bash
sudo ln -s /etc/nginx/sites-available/nginx_production_ml /etc/nginx/sites-enabled
```

We then restart nginx:

```bash
sudo service nginx restart
```

Configuring the systemd Service

Finally, we will launch the uWSGI server we configured earlier. However, in order to ensure that our server does not die forever after system restarts or unexpected failures, we will launch it as a systemd service. Here is our service configuration file, which we place in the /etc/systemd/system directory using:

```bash
sudo vi /etc/systemd/system/production_ml.service
```

```ini
[Unit]
Description=uWSGI instance to serve production_ml service

[Service]
User=harkous
Group=harkous
WorkingDirectory=/home/harkous/Development/production_ml/
ExecStart=/home/harkous/anaconda3/envs/production_ml_env/bin/uwsgi --ini /home/harkous/Development/production_ml/uwsgi.ini
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

We start the service with:

```bash
sudo systemctl start production_ml.service
```

To allow this service to start when the machine is rebooted:

```bash
sudo systemctl enable production_ml.service
```

At this stage, our service should start successfully. If we update the service later, we simply have to restart it:

```bash
sudo systemctl restart production_ml.service
```

Monitoring the Service

To keep an eye on the service and see the load per worker, we can use uwsgitop. In uwsgi.ini, we have already configured a stats socket within our application folder. To see the stats, execute the following command in that folder:

```bash
uwsgitop stats.production_ml.sock
```

When the load increases, uwsgitop will show the additional workers that have been spawned. To simulate such a heavy load on your side, even with simple tasks, you can artificially add a time.sleep(3) in the prediction code. One way to send concurrent requests to your server is using curl (remember to replace YOUR_SERVER_NAME_OR_IP by your server's URL or IP address):

```bash
#!/usr/bin/env bash
url="http://YOUR_SERVER_NAME_OR_IP:4444/predict?text=this%20is%20a%20news%20sample%20text%20about%20sports,%20and%20football%20in%20specific"
# add more URLs here

for i in {0..10}
do
    # run the curl job in the background so we can start another job
    # and disable the progress bar (-s)
    echo "fetching $url"
    curl -s "$url" &
done
wait # wait for all background jobs to terminate
```
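If you prefer to stay in Python, here is an equivalent sketch using concurrent.futures and the requests library (again, requests is an extra dependency you would install yourself):

```python
# Fire a batch of concurrent requests at the server (illustrative sketch).
from concurrent.futures import ThreadPoolExecutor

import requests

URL = 'http://YOUR_SERVER_NAME_OR_IP:4444/predict'
PARAMS = {'text': 'this is a news sample text about sports and football in specific'}


def fetch(_):
    return requests.get(URL, params=PARAMS).json()


with ThreadPoolExecutor(max_workers=10) as pool:
    # keep 10 requests in flight at the same time
    for result in pool.map(fetch, range(10)):
        print(result)
```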
In order to monitor the log of the application itself, we can make use of journalctl:

```bash
sudo journalctl -u production_ml.service -f
```

Final Notes

If you have reached this stage and your application has run successfully, then this article has served its purpose. Some additional notes are worth mentioning though at this stage:

- Lazy-apps mode in uWSGI: To keep this article general enough, we have used the lazy-apps mode in uWSGI, which loads the application one time per worker. According to the docs, this will require O(n) time to load (where n is the number of workers). It also likely requires more memory, but it results in a clean environment per worker. By default, uWSGI loads the whole application differently: it starts with one process and then forks itself multiple times for additional workers. This results in more memory savings. However, it does not play well with all of the ML frameworks. For example, the TensorFlow backend in Keras fails without the lazy-apps mode (e.g., check this, this, and this). The best approach could be to try first without lazy-apps = true and to shift to it if you encounter similar issues.
- Parameters of the Flask app: Because uWSGI invokes app as a callable, the parameters of the app itself should not be passed via the command line. You're better off using a configuration file read with the likes of configparser for such parameters.
- Scaling across multiple servers: The guide above does not discuss the case of multiple servers. Luckily enough, this can be achieved without a significant change in our setup. Benefiting from the load balancing feature in nginx, you can set up multiple machines, each with the uWSGI setup we described above. Then, you can configure nginx to route the requests to the different servers. nginx comes with multiple methods to distribute the load, ranging from a simple round-robin to accounting for the number of connections or the average latency.
- Port selection: The above guide uses port 4444 for illustration purposes. Feel free to adapt it to your own ports. And make sure that you open these ports in the firewall or ask your institution's administrators to do so.
- Socket permissions: We have been permissive in the socket permissions by giving write access to all users. Feel free to adjust these permissions to your purposes and to run the service with a different user/group. Make sure that your nginx and uWSGI can still talk to each other successfully after your changes.

So that's it! I hope this guide and the associated repository will be helpful for all those trying to deploy their models into production as part of a web application or as an API. If you have any feedback, feel free to drop a comment below.

And thanks for reading! You might also be interested in checking my other articles on my Medium page, or in seeing what I tweet about on Twitter (@hamzaharkous).