The IBM Technology Lifecycle Services (TLS) group is looking for a technically oriented, talented, innovative, and enthusiastic DevOps SRE to join our TLS AI DevOps team. As the SRE Lead, you will work in an agile, collaborative environment to build, deploy, configure, and maintain AI-based solutions for IBM internal and external clients. In this role, you will be responsible for ensuring the availability and responsiveness of our applications by setting up and maintaining policies, procedures, and tools. This includes leading the problem resolution process for our users, from analysis and troubleshooting to deploying workarounds or fixes. Working closely with our worldwide teams, you will have a unique opportunity to grow your experience in modern AI technologies, including watsonx.ai, as part of our AI4Infrastructure initiative. AI4Infrastructure consists of various use cases that help us increase productivity for our support engineers, reduce time-to-resolution for client cases, and increase support deflection rates. This role may require some attention outside normal working hours to troubleshoot and resolve production issues experienced by clients. It may also require shifting working hours on an as-needed basis for follow-the-sun coverage.
The undisputed priority for IBM is to scale watsonx as the AI for business platform for enterprise clients and establish IBM as the top-of-mind choice for AI and foundation models.
Our watsonx suite makes it possible for clients to build, train, tune and deploy AI across their business, leveraging critical, trusted data wherever it resides. TLS is aiming to become client zero in capturing the full potential of this platform.
As the SRE Lead, you will:
Ensure availability and responsiveness of applications by setting up and maintaining the required documentation, methods, and tools
Provide expertise and insights to project engineering teams and advise on the best approaches to avoid or resolve infrastructure and security challenges
Define roadmaps and milestones for DevOps tasks in support of multiple projects, handled on a monthly and quarterly basis
Handle resolution of blockers, escalation to stakeholders, and provisioning of resources
Meet with stakeholders and internal teams to communicate and agree on plans and manage notifications when issues arise
Document maintenance plans, schedules, and status, and communicate them to the leadership team and stakeholders
Manage Ansible, Jenkins, Tekton and other CI/CD solutions
Diagnose environmental issues and introduce/implement technologies to solve them
Provision and maintain DevOps infrastructure for projects
Monitor and support platform infrastructure and manage escalations
Look for enhancements and innovative solutions to help the services scale and to improve existing technical support tools, procedures, and processes
Develop troubleshooting techniques to effectively identify and investigate issues and provide advice and guidance to clients
Work in a global team, collaborating with IBMers to share recommendations, solutions and ideas
Potential on-duty rotation, including weekend and holiday support on an as-needed basis