Job Description
Want to do the best work of your life? With 24 million customers in 7 countries, make your mark at Europe’s leading media and entertainment brand. A workplace where you can proudly be yourself; our people make Sky a truly exciting and inclusive place to work.
Working within the Service Operations Team as a Site Reliability Engineer, your primary role is to bridge the gap between operations staff and developer teams, aiming to expedite development while retaining core resiliency. As AdTech continues to evolve towards a DevOps culture, you will be expected to leverage your experience in this area to help transition a traditional business application support team to a more agile service management ethos. The role includes an element of day-to-day technical and business application support for Sky Media’s business-critical technology systems, and you will be expected to provide technical consultancy directly to Sky Media as and when applicable.
What you’ll do:
- Propose, design and build automated monitoring solutions, including monitoring of error budgets, and identify ways to improve data reliability, efficiency and quality with relevant data quality (DQ) checks
- For any significant new project, ensure it will be supportable by contributing to the creation and execution of operational testing and by determining sign-off requirements and go-live activities, with cutting-edge alerting and monitoring and forward-thinking capacity planning built into the design.
- Provide input, operational guidance and sign-off on non-functional requirements, Service Level Agreements, Service Level Objectives and Key Performance Indicators, including all requirements needed to ensure the solution is fully compatible with Sky’s security standards
- Interact with Engineering, Architects and Analysis teams to assist in the migration of on-prem systems to a Cloud solution, to improve existing Cloud solutions, and to ensure we achieve stable and resilient systems delivered at optimal cost.
- Build automated operational solutions to drive efficiency, ensuring our testing and release procedures are efficient, automated and continually enhanced, and that manual processes are automated where possible and practical.
- Deal with incident escalations for areas in which you are a subject matter expert (SME), enabling informed decision-making and boosting service reliability. You will be expected to perform some data analytics across different systems and to lead the conversation, ensuring that solutions or acceptable workarounds are found to close the issue and that any required service improvements are carried through to completion.
What you’ll bring:
- Degree or equivalent qualification, with 5+ years of IT experience and 4+ years in SRE and/or application support, ideally with qualifications in agile service management, ITIL, etc.
- Proven track record of contributing to building resilient systems and capacity planning, and experience of building bespoke monitoring solutions, specifically on cloud platforms
- Experience with Google Cloud and related technologies, including many of the following: Cloud Functions, App Engine, Compute Engine, Cloud Storage, Firestore, BigQuery, machine learning, GKE, Kubernetes, Docker, Jenkins, Terraform, Ansible, AWS, Azure DevOps, Cloud Composer, Airflow DAGs, etc.
- Experience of working within an Agile environment and following Agile development practices, familiarity with Agile service management within a DevOps culture, and a desire to promote the DevOps mindset within the team.
- Hands-on experience developing and running scripts in some of the following languages: Perl, shell, Bash, Python and Go
- Experience of working in an advertising or broadcasting organisation would be advantageous but is not essential
Job ID: 52025