Synopsis

Combinatorial mutagenesis (CM) is an established approach to protein engineering in pharma and industrial settings. As an extremely laborious process, CM relies on human intuition (rational engineering inspired by existing 3D structures of a protein target) or environmental pressure (directed evolution) to guide the development of new functional variants (mutants) of a desired protein target, with the goal of enhancing a specific property, e.g. thermal stability, solubility or aggregation propensity.

The time inefficiencies, human labor and material costs of CM may translate into major issues in big pharma: the average drug costs $2.6B to develop, with a 5% success rate for small-molecule drugs and a 13% success rate for protein therapeutics. Ultimately, $77B in revenue was lost (2011–2012) due to late-stage terminations of drug candidates.

Idea

The aim of this project was to develop an automated pipeline for rapid, AI-powered assessment of small peptide developability, as a function of structural disorder and its relationship to protein aggregation behaviour.

Execution

We gathered all 9-amino-acid protein fragments known to be expressed (3,900,078 out of 512,000,000,000 theoretically possible combinations), together with 5-amino-acid fragments (3,200,000) and 3-amino-acid fragments (8,000).

We used our proprietary AI structural disorder predictors (trained on dspp-keras, https://github.com/PeptoneInc/dspp-keras) to predict the residual disorder probability for each molecule and cross-correlate it with known solubility data.

Our models were trained on AWS p3.2xlarge instances running custom Deep Learning Ubuntu 16.04 LTS versions and equipped with NVIDIA Tesla V100 SXM2 accelerator cards. We used c5d.18xlarge nodes with 36-core Xeon Platinum processors as the CPU baseline against which we benchmarked our calculations.

We ran proprietary tSNE algorithms developed specifically for NVIDIA GPUs, namely the Tesla V100-SXM2 cards available on p3.2xlarge nodes.
The tSNE procedures were written using CUDA 9.0 libraries with support for Compute Capability 7.0. The algorithms allowed us to complete calculations for a [4,000,000 x 50] classification problem in under 2 hours, a 200x to 1000x performance gain over state-of-the-art CPU-only nodes.

With the tSNE calculations done for 7.1M+ peptides, we clustered the data using the Facebook AI library faiss, compiled with support for NVIDIA Volta-architecture GPUs.

Subsequently, we built an interactive, massively parallel visualization of the data graphs, which runs under the control of Kubernetes and utilizes EC2 instances. The frontend of the graph visualization (still under development) uses WebGL. We are using NVIDIA Titan Xp GPUs to inspect graphs of 100k+ nodes in real time.

Why does it matter?

With this data in hand, our clients will be able to make rapid and accurate research decisions about the commercial developability of a given protein fragment lead and the upfront R&D capital that may need to be invested.

Why is it unique?

We are the first to offer accurate and rapid prediction of protein properties that are of fundamental importance for protein solubility engineering and commercial developability assessment on such a scale (the underlying network data graph contains ~4M molecules).

Through this project we assessed the horizontal scalability of our AI platform and found that, given existing AWS and NVIDIA solutions, we can easily apply our approach to protein families as large as 1,000,000,000 (one billion) molecules.

What’s next

We are aiming to assess the relationships among the 122M known and annotated proteins.
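
As a rough check on the fragment counts quoted under Execution, the sketch below recomputes the theoretical sequence space for each fragment length. It assumes an alphabet of the 20 standard amino acids, which is not stated in the text but is consistent with the quoted figure of 512,000,000,000 nine-residue combinations. The names `sequence_space`, `coverage` and `OBSERVED` are hypothetical helpers introduced for illustration and are not part of the pipeline.

```python
# Sketch: sequence-space sizes for short peptides.
# Assumption: an alphabet of the 20 standard amino acids (the text
# only quotes the totals, not the alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 one-letter codes


def sequence_space(length: int, alphabet_size: int = len(AMINO_ACIDS)) -> int:
    """Number of distinct peptides of the given length."""
    return alphabet_size ** length


# Fragment counts as quoted in the Execution section, keyed by length.
OBSERVED = {9: 3_900_078, 5: 3_200_000, 3: 8_000}


def coverage(length: int, observed: int) -> float:
    """Fraction of the theoretical sequence space that was gathered."""
    return observed / sequence_space(length)


if __name__ == "__main__":
    for length, observed in OBSERVED.items():
        total = sequence_space(length)
        frac = coverage(length, observed)
        print(f"{length}-mers: {observed:,} of {total:,} ({frac:.4%})")
    print(f"total fragments: {sum(OBSERVED.values()):,}")
```

Under this assumption the 5- and 3-residue sets match the full theoretical space, the 9-residue set covers well under 0.001% of it, and the three sets together account for the 7.1M+ peptides mentioned above.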