Microsoft embraces Kubernetes to auto-scale deep learning training 

Microsoft recently collaborated with Litbit, a San Jose-based startup, to build a new auto-scaling system that uses Kubernetes for deep learning training.

Litbit provides a platform to package, automate, and magnify internet of things (IoT) skills into AI personas with superhuman senses.

To scale containerized, distributed deep learning workloads, Microsoft opted for Kubernetes because of its strength in cluster management.

Using the new project, organizations can augment IoT data (sight, sound, and touch sensors) into conscious personas that can learn, think and do many helpful things. It will help them to identify the capabilities of their employees for specific situations.

The new project can be used to run several machine learning workloads by creating a Kubernetes cluster with GPU support on Microsoft Azure. Microsoft is using acs-engine, an open source tool, to generate the Azure Resource Manager (ARM) template for a GPU-enabled Kubernetes cluster.

“Some of these training jobs (e.g., Spark ML) make heavy use of CPUs, while others (e.g., TensorFlow) make heavy use of GPUs. In the latter case, some jobs retrain a single layer of the neural net and finish very quickly, while others need to train an entire new neural net and can take several hours to days,” Microsoft said in a blog post.

The applications of AI personas vary by use case, which can lead to bursty and unpredictable training loads.

To handle this bursty demand cost-effectively, Litbit has put together generalized instructions that explain how to auto-scale different types of virtual machines using the acs-engine auto-scaler.

“This solution is ideal for use cases where you need to scale different types of VMs up and down based on demand,” Microsoft added.
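
The idea of scaling different VM types independently can be illustrated with a minimal Python sketch. It is not the acs-engine auto-scaler's code: the `Pool` class, the `desired_nodes` function, the jobs-per-node figures, and the demand numbers are all hypothetical, and the 40-node ceiling is borrowed from the figure reported in this article purely for illustration. The sketch only shows how a CPU pool and a GPU pool could each be sized from their own queue of training jobs.

```python
import math
from dataclasses import dataclass


@dataclass
class Pool:
    """One group of identical VMs (hypothetical model, not an acs-engine type)."""
    name: str
    jobs_per_node: int  # assumed number of training jobs one VM can host
    min_nodes: int      # floor kept warm even when idle
    max_nodes: int      # ceiling that caps cost


def desired_nodes(pool: Pool, pending_jobs: int, running_jobs: int) -> int:
    """Return the node count needed to host all pending and running jobs."""
    total = pending_jobs + running_jobs
    needed = math.ceil(total / pool.jobs_per_node)
    # Clamp so the pool never drops below its floor or exceeds its ceiling.
    return max(pool.min_nodes, min(pool.max_nodes, needed))


# Assumed pools: CPU nodes pack several jobs, GPU nodes run one job each.
pools = {
    "cpu": Pool("cpu", jobs_per_node=4, min_nodes=1, max_nodes=40),
    "gpu": Pool("gpu", jobs_per_node=1, min_nodes=0, max_nodes=40),
}

# A burst of GPU jobs arrives while CPU demand stays low.
demand = {"cpu": (2, 1), "gpu": (6, 3)}  # (pending, running) per pool
targets = {name: desired_nodes(pools[name], *demand[name]) for name in pools}
print(targets)  # {'cpu': 1, 'gpu': 9}
```

Because each pool is sized from its own demand, a burst of GPU jobs grows only the GPU pool, and an idle GPU pool can shrink to zero nodes without affecting the CPU pool.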

Earlier this month, Microsoft also announced the ONNX (Open Neural Network Exchange) format with the support of Microsoft Cognitive Toolkit to make artificial intelligence (AI) more accessible to developers.

Litbit has been using the project for the last four months, and Microsoft has now helped Litbit further develop its platform to scale up to 40 nodes at a time.
