The Senior Manager, Site Reliability Engineering (SRE) will lead the SRE organization to deliver reliable, scalable, and resilient platforms and services. This role will own the strategy, implementation, and continuous improvement of a unified observability platform that provides end-to-end visibility into infrastructure, applications, APM, and databases, enabling proactive issue detection, faster incident resolution, and improved customer experience.

The Senior Manager will drive practices around SLIs, SLOs, SLAs, and Error Budgets, embedding reliability into engineering culture. They will oversee incident management, RCA, proactive alerting, predictive analysis, and automation, while ensuring close collaboration with engineering, product, and platform teams.

Key Responsibilities

Leadership & Team Management

Hire, lead, and mentor a high-performing SRE team across geographies.

Define and execute the SRE vision, roadmap, and strategy in alignment with business and engineering objectives.

Establish a healthy 24x7 on-call model, ensuring coverage while promoting team well-being.

Drive a blameless culture through structured postmortems and RCA follow-up actions.

Unified Observability & Monitoring

Build and manage a unified observability platform leveraging tools such as New Relic, Datadog, CloudWatch, Prometheus, Grafana, Graylog, and OpenTelemetry.

Deliver holistic monitoring across infrastructure, applications, databases, APIs, and end-user experience.

Implement APM (Application Performance Monitoring) to trace performance across distributed systems.

Establish dashboards, metrics, and proactive alerting to identify anomalies early.

Drive adoption of AIOps and predictive analytics for proactive reliability improvements.

Reliability Engineering

Define and manage SLIs, SLOs, SLAs, and Error Budgets across services.

Partner with engineering teams to balance velocity with reliability, ensuring adherence to Error Budgets.

Reduce MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) through automation, faster detection, and better instrumentation.

Perform capacity planning, scalability reviews, and resiliency testing.

Incident & Problem Management

Lead major incident response, coordinating communications with executives and stakeholders.

Drive root cause analysis (RCA) and implement long-term fixes.

Partner with ITSM teams to align with incident, problem, and change management processes.

Ensure continuous improvement loops from incidents back into observability, automation, and engineering practices.

Collaboration & Cross-Functional Work

Collaborate with Engineering, Product, Security, Cloud, and DevOps teams to embed SRE practices.

Provide guidance on instrumentation, reliability design, and operational readiness for new services.

Partner with DBAs and data platform teams to monitor database health, replication, query performance, and failover readiness.

Champion reliability as a shared responsibility across development and operations.

Qualifications & Experience

Required

12+ years of experience in SRE, Operations, or Infrastructure Engineering, with 5+ years in leadership roles.

Proven expertise in unified observability, monitoring, and alerting across infra, apps, APM, and databases.

Strong knowledge of observability tools: New Relic, Datadog, Prometheus, Grafana, Graylog, CloudWatch, OpenTelemetry, SolarWinds.

Hands-on with incident response, RCA, MTTR/MTTD reduction, and on-call management.

Deep understanding of SLIs, SLOs, SLAs, and Error Budgets.

Strong AWS experience (EC2, ECS, EKS, networking, scaling groups).

Hands-on with containers & orchestration (Docker, Kubernetes).

Proficiency in Python, Java, C#, and shell scripting for automation.

Knowledge of networking fundamentals, distributed systems, and high-availability architectures.

Familiarity with ITIL/ITSM processes (incident, problem, change).

Strong leadership, stakeholder management, and communication skills.

Preferred

Experience in large-scale SaaS or product-driven environments.

Hands-on experience with databases: MongoDB, Elasticsearch, SQL Server, Oracle.

Experience with chaos engineering, resiliency testing, and disaster recovery planning.

Certifications: AWS Solutions Architect / DevOps Engineer, CKAD/CKA.

Experience managing global SRE teams across time zones.

Proven ability to embed reliability into engineering culture via SLOs and Error Budgets.

Estimated Salary Range: $143,000 - $191,000 plus bonus

The base salary range represents the anticipated low and high end of GHX’s salary range for this position. The base salary is one component of GHX’s total compensation package for employees. Other rewards and benefits include health, vision, and dental insurance, accident and life insurance, 401k matching, paid time off, and education reimbursement, to name a few. To view more details of our benefits, visit us here: https://www.ghx.com/about/careers/

GHX: It's the way you do business in healthcare
Global Healthcare Exchange (GHX) enables better patient care and billions in savings for the healthcare community by maximizing automation, efficiency and accuracy of business processes.

GHX is a healthcare business and data automation company, empowering healthcare organizations to enable better patient care and maximize industry savings using our world-class, cloud-based supply chain technology exchange platform, solutions, analytics, and services. We bring together healthcare providers, manufacturers, and distributors in North America and Europe, who rely on smart, secure healthcare-focused technology and comprehensive data to automate their business processes and make more informed decisions.

It is our passion and vision for a more operationally efficient healthcare supply chain, helping organizations reduce - not shift - the cost of doing business, paving the way to delivering patient care more effectively. Together we take more than a billion dollars out of the cost of delivering healthcare every year. GHX is privately owned, operates in the United States, Canada and Europe, and employs more than 1000 people worldwide. Our corporate headquarters is in Colorado, with additional offices in Europe.

Disclaimer
Global Healthcare Exchange, LLC and its North American subsidiaries (collectively, “GHX”) provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, national origin, sex, sexual orientation, gender identity, religion, age, genetic information, disability, veteran status or any other status protected by applicable law. All qualified applicants will receive consideration for employment without regard to any status protected by applicable law. This EEO policy applies to all terms, conditions, and privileges of employment, including hiring, training and development, promotion, transfer, compensation, benefits, educational assistance, termination, layoffs, social and recreational programs, and retirement.

GHX believes that employees should be provided with a working environment which enables each employee to be productive and to work to the best of his or her ability. We do not condone or tolerate an atmosphere of intimidation or harassment based on race, color, national origin, sex, sexual orientation, gender identity, religion, age, genetic information, disability, veteran status or any other status protected by applicable law. GHX expects and requires the cooperation of all employees in maintaining a discrimination and harassment-free atmosphere. Improper interference with the ability of GHX’s employees to perform their expected job duties is absolutely not tolerated.
