Job description

As a Site Reliability Operations Engineer within the Global Technical Engineering Operations (GTEO) SRC team, you will work with other SRC, TDO, SRE, DevOps and Engineering practitioners to pro-actively maintain the mission-critical infrastructure, cloud platforms, micro-services, tools, and processes that ensure the highest levels of availability and reliability across all our websites.


You’re right for the job if you are comfortable contributing to major incident response in a technical team of engineers laser-focused on restoring service across complex distributed architectures. You’ll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization. You will work directly with our SRE, Engineering and DevOps teams to support our next-generation “always up” cloud-based e-commerce platform.


The SRC Site Reliability Operations Engineer is responsible for pro-actively monitoring, detecting and resolving site issues before they affect customers or availability. Technically, you will understand the full end-to-end stack and use this knowledge to detect errors and failures and take corrective action to mitigate them. During a major incident, you will draw on your technical skills and knowledge to triage, differentiating between symptom and cause, to help resolve impacting issues. Your ability to continuously challenge yourself and develop a strong network within your peer group will see you excel in this role. Our goal is to protect the customer experience and deliver outstanding levels of availability. To do so, you will need strong skills in the following areas:

  • Understanding of incident management processes and procedures.
  • Calm under pressure when participating in major incident response.
  • Technical understanding of core infrastructure, cloud services, platforms and micro-services.
  • Ability to understand and capture key data from logs.
  • Ability to understand traffic flows and key dependencies between services.
  • Ability to effectively triage – be able to detect and determine symptom vs cause.
  • Detect and quantify impact.
  • Analyze trends to pro-actively prevent incidents.
  • Focus on immediate restoration vs root cause.
  • Research and recommend alternative actions for incident resolution.
  • Create and maintain procedural documentation.
  • Participate in and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
  • Absorb knowledge and understand complex distributed systems, with the ability to share this knowledge with your peer group.
  • Help build tools to improve visibility, pro-actively detect issues and restore system availability.
  • Help develop automation and self-healing with DevOps, Engineering and SRE partners.
  • Strong focus on collecting and inferring metrics.
  • Clear communication skills.

Additional responsibilities may include:


  • Actively provide data for and participate in root cause analysis.
  • Adhere to SRC onboarding process when accepting new systems into service.
  • Share knowledge globally between SRC teams.
  • Analyze systems and make recommendations to prevent possible incidents.
  • Strive for continuous improvement and make recommendations based on SRC process.
  • Other duties and responsibilities as assigned.

Qualifications:

  • 2+ years in an infrastructure, systems, engineering or development environment delivering operational excellence to highly complex distributed systems.
  • Bachelor’s Degree in Computer Science or a related field, or relevant work experience.
  • Strong incident management skills with relevant exposure in an enterprise organization.
  • Experience and exposure working in a 24/7 operations support environment.
  • Methodical and systematic problem-solving approach.
  • Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
  • Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell.
  • Experience administering Unix/Linux in a production environment.
  • Experience working with enterprise monitoring/tooling solutions like Grafana, Kibana, Splunk, Graphite, Nagios, New Relic, Dynatrace.
  • Experience with cloud technologies such as AWS, Azure and OpenStack.
  • Knowledge of Docker and Kubernetes.

Application deadline: 20-06-2024
