Partner SW Engineering Manager

Job Overview

Location: London, England
Job Type: Full Time Job
Job ID: 39195
Date Posted: 1 year ago
Recruiter: John Apl
Job Views: 65

Job Description

Singularity is an AI-aware Azure PaaS service, now in its first iteration, that provides reliable, high-utilization, state-of-the-art (SotA) capabilities for running AI applications at extra-large scale. We have seen the advances in deep learning that came from transformers, GPT-3, and more recently Microsoft’s Turing models. The next innovations in AI will come from understanding these advances and making them practical through platform innovation. This is what Singularity is about. To do this, we partner with leading research in AI paradigms and systems.

The Azure Singularity team is looking for an engineering manager and technical leader to operate and reinvent the largest deep-learning infrastructure service at Microsoft. In this role you will build and lead a new team that brings the latest innovations in AI infrastructure onto the Singularity Kubernetes platform, while maintaining the service SLA in its current implementation and engaging deeply with customers to help them transition to the new platform.

You will partner with top engineering talent within Singularity and across Azure to put together Kubernetes-based cluster orchestration, provide operating system and containerization support, enable AI languages and runtimes, and deliver the other pieces necessary to bring distributed deep learning training and inferencing to life. In addition, you will own the infrastructure components required to build, deploy, monitor, and service the highly available and scalable Kubernetes (K8s) clusters under your care. You will lead development and customer support from the front line and establish architecture, service excellence guidelines, and a high quality bar.

Candidates must have a track record of leading small to mid-size teams that successfully deliver on their goals. You also need to deal well with ambiguity, define clear goals for your team, and keep those goals in focus.

Who We Are

We are the engineers on Singularity. We believe that building a planet-scale AI supercomputer from the ground up, one that addresses the fundamental pain points of data scientists and AI practitioners and takes AI to unprecedented scale, is the opportunity of a lifetime. If you share our dream, come join us!

What Is Singularity?

High-scale AI workloads are always testing the limits of the infrastructure stack. Large-scale model training and inferencing with huge volumes of training data on hundreds to thousands of GPUs is a true engineering challenge. Singularity is a globally distributed, multi-tenant service that provides robust, cost-effective, and competitive AI infrastructure (compute, networking, and storage) for AI training and inferencing. By abstracting workloads from the underlying infrastructure, Singularity creates a shared pool of resources that can be dynamically provisioned for full utilization of expensive GPU compute. This enables data scientists to productively build, scale, experiment with, and iterate on their models on top of a robust, performant, scalable, and cost-effective distributed infrastructure built for AI. In Singularity, we are constantly seeking to apply the best ideas from AI, machine learning, distributed systems, distributed databases, information retrieval, networking, and security.

Responsibilities

Lead, hire, and grow a team of around 30 engineers that will help deliver on key business results.

Grow the charter from initial execution on the live service and K8s to other areas.

Define the technical direction of the team and drive execution with special emphasis on ensuring quality and performance through sound engineering processes.

Establish the service SLA, define service excellence goals, and support customers and the live site in a DevOps model.

Deliver a robust container orchestration platform for Singularity on Kubernetes.

Deliver node management, fault detection and node repair as a service to improve job/model reliability.

Build change management systems that orchestrate and automatically ensure the safety and correctness of any change made to the production system.

Deliver world-class monitoring systems and telemetry pipelines to enhance service and job observability for both end-users and operators.

Codify security and compliance requirements by building and strengthening system defenses against malicious attacks and exploits.

Use data-driven and machine learning approaches to build quality and operational insights; leverage those insights to drive quality and operational excellence across pre- and post-production pipelines.

Design and implement performance and scalability infrastructure that focuses on methodically calibrating data at scale to ensure meaningful characterizations and comparisons.

Leverage performance and profiling tools to identify hot spots and bottlenecks across hardware and software boundaries, from CPU, GPU, microcode, OS, and networking to product code, and drive end-to-end job performance.

Qualifications

Required Qualifications:

15+ years of experience with coding in one of C#, Java, C or C++.

Experience working with the Linux operating system and Kubernetes cluster orchestration.

Experience with improving service operations or engineering fundamentals.

Excellent collaboration skills.

A Master’s degree (or a Bachelor’s degree with 5+ years of equivalent work experience) in computer science or a related field.

At least 5 years of experience building and shipping production software or services.

 

Preferred Qualifications:

Experience in development in the Kubernetes ecosystem

Experience in using / extending PyTorch / TensorFlow

Experience in building large scale cloud services, distributed systems, or operating systems

Experience programming GPUs (graphics processing units), CUDA/cuDNN/NCCL

 

Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

 

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances.  We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

 

 
