As a software engineer in the Azure Reliability SRE group you will work collaboratively and directly with our diverse product teams, SREs, Incident, and Crisis Management teams to understand and learn from operational outages and incidents. You will join an on-call rotation and share what you learn through written and verbal communications. Our team applies a growth-oriented mindset to understand challenges and opportunities in learning from incidents. Through this dynamic role, you will analyze the performance of these teams’ technology, processes, people, and organization through the lens of normal work, as well as incidents and outages.  
 
This role provides you the exciting opportunity to expand your knowledge and skills, and grow your connections with our diverse team. We’re looking for individuals who are inclusive, self-motivated, driven by curiosity, and continually looking for ways to communicate with excellence and clarity.  If you are enthusiastic about learning from incidents and growing your career in a team-oriented environment, we invite you to apply. 
 
We care deeply about our team and setting one another up for success. As an SRE joining us, it is essential that you share this value and thrive as an active, collaborative member of our team. Together, we have an unparalleled opportunity to make an organization-wide, global impact by discovering and communicating patterns and themes that can influence future investments.
Responsibilities
As a Software Engineer on the Azure Kubernetes SRE team, you will be responsible for improving the reliability of key Azure products.
Utilize the Opportunity Canvas and User Story Mapping to plan, coordinate, and communicate the why of what we are building, and to plan the work of the team according to the user’s journey.
Investigate and analyze production incidents quantitatively and qualitatively to discover themes and similarities.
Interview engineers about their experiences during incidents.
Facilitate open, inclusive, blame-free, cross-team and cross-service incident learning analyses.
Write accessible, engaging incident reports.
Identify and prioritize significant recommendations, decisions, and risks.
Communicate effectively and partner well with other disciplines of the project team to deliver high-quality solutions, from ideas to production code.
Write thorough design documents and code that exemplify quality, simplicity, and maintainability.
Be a mentor for design reviews, code, and test cases.
The Azure Reliability SRE key focus areas are:
Defining our systems’ reliability goals via Service Level Objectives (SLOs).
Facilitating open, inclusive, blame-free, cross-team and cross-service incident learning analyses.
Improving our systems’ production posture via targeted observability and operability enhancements (telemetry, alerting, incident management, change management, safe production changes).
Building reusable automation to empower multiple teams to achieve their reliability goals.
Influencing the product architecture and roadmap to make sure the customer-experienced reliability is always a key consideration when evolving the product.
Qualifications
At Microsoft, we emphasize and value intellectual curiosity, initiative, and collaboration. Although it is important that you have some experience with Kubernetes and with being on-call for a production service, as well as an understanding of SRE practices, what matters most is the desire to learn, grow, and continue building skills. This involves an open willingness to explore new ways of learning from everyday work, as well as from incidents. Qualitative understanding, curiosity, and dynamic cross-team communication skills are valued far more than bias or outdated beliefs about human error and automation. As a member of our team, you will work with a global team of professionals to improve the reliability of the AKS product.
Required qualifications include:
Knowledge of Microsoft Azure, AKS, and Kubernetes, or a learning plan for acquiring these skills.
1+ years of scripting and debugging experience.
1+ years of SRE experience with a distributed system.
Demonstrates humility yet confidently and consistently delivers improvements via pull requests.
The ability to articulate and communicate complex technical issues to team members and management.
Experience shipping production software.
An intellectual curiosity and high EQ (emotional intelligence) will serve the successful candidate well.
Preferred qualifications include:
Linux debugging and triaging experience.
This position is clearly cross-disciplinary, involving human factors, safety science, SRE, and research applications. Because of this, we encourage you to apply even if you are unsure whether you have exactly the ‘right’ background.
We value work-life balance, inclusivity, diversity, and equality. As an equal opportunity employer, Microsoft gives equal consideration to all candidates and employees regardless of age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex, or sexual orientation. If you need assistance or a reasonable disability accommodation during recruitment, please submit a request using the Accommodation request form.
Our vision is to improve the reliability and resilience of Azure Kubernetes Service (AKS), so that we can continue delivering world-class operational excellence, minimize risks to the Azure platform, and identify development opportunities.
 
The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.
Job ID: 31533
