MEET THE TEAM
ISS (Infrastructure Shared Service) is an international organisation within Pure, responsible for all of Pure Storage's engineering infrastructure, development environments, and production services. We work with all internal engineering teams to provide reliable services used to develop new products and features, across many different environments: from our multiple data centers to various public clouds.
As a Reliability Engineer in ISS, you will work to improve the reliability and performance of Pure Storage's critical infrastructure applications. This means setting and owning SLO goals for uptime and latency, and helping colleagues leverage the features and workflows available to them, all with a focus on keeping backend web servers, load balancers, and database servers healthy and running smoothly.
We are looking for engineers who have a mix of software and systems skills, are passionate about reliability, performance, and efficiency, and have experience building tools, services, and automation to manage and improve production services.
Responsibilities
Engage in and improve the whole lifecycle of services, from inception and design through deployment and operation.
Design, operate, maintain, and troubleshoot enterprise systems such as databases, message queues, APIs, and distributed applications through the use of data and metrics such as SLOs and error budgets.
Establish and practice sustainable incident response and blameless postmortems to prevent problem recurrence.
Support services before they go live through activities such as system design, developing software platforms and frameworks, capacity planning, and launch reviews.
Scale systems sustainably through mechanisms like scripting and automation; evolve systems by pushing changes that improve their operational manageability, reliability, and velocity.
Work closely with development teams, infrastructure teams, and business stakeholders across multiple time zones to understand requirements and design solutions.
Ensure that hardware designs meet business and technical requirements, including performance, scalability, and reliability, as well as industry standards and best practices for data center infrastructure.
Create and maintain detailed documentation on system configurations, procedures, and operational policies.
Perform day-to-day administration of physical and virtual servers, storage, network configuration, and application support, including health and performance monitoring, while ensuring quick turnaround times, performance, availability, and security.
Deploy infrastructure both manually and via configuration management and automation platforms.
Troubleshoot hardware, software, and network-related issues, provide quick resolution to reported problems, and perform root cause analysis to prevent future occurrences.
Minimum Qualifications
Experience programming in Python or other languages.
Experience in designing, analysing, and troubleshooting large-scale distributed systems.
Able to work in a 24x7 on-call rotation (approx. 1 week every 2 months).
Systematic problem-solving approach, strong communication skills, and a sense of ownership and drive.
Working experience with observability platforms such as Elastic or Datadog.
Experience deploying and troubleshooting Linux systems (Red Hat/CentOS, Ubuntu) as well as VMware environments (ESXi, NSX, vSAN).
Experience working directly with end users to determine deployment and configuration requirements.
Ability to lift 15+ kilograms when working with storage equipment.
Preferred Qualifications
5+ years of experience as a Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer.
Understanding of Unix/Linux and, optionally, Windows operating systems.
Experience with Infrastructure as Code and automation tools (Ansible, Terraform, CloudFormation).
Well organised, with the ability to prioritise tasks independently, set goals, and follow through to completion.
Experience with containers and container orchestration systems such as Docker and/or Kubernetes.
Expertise with hybrid cloud environments (bare metal and public cloud; AWS and Azure preferred).
Experience with virtualisation technologies such as VMware.
Knowledge of storage technologies (SAN/NAS devices).