Director of SRE (FTE)
Remote (United States)
About the Role
This opportunity is for a Director of Site Reliability Engineering to own reliability, availability, and operational excellence across a full product and platform portfolio. This role leads a blended reliability organization that includes managed SRE services, an internal QA team, and a growing internal SRE capability, with full accountability for outcomes across all three areas.
This position is responsible for defining SRE strategy and roadmap, building SLA and SLO frameworks across products, owning the incident management process, and creating the observability and release gate infrastructure that engineering teams rely on to ship safely and confidently.
The role requires a leader with deep SRE and infrastructure expertise, strong operational management skills, and the ability to influence engineering culture across multiple teams. The position partners closely with Engineering, Security, and Product leadership and serves as the primary voice of reliability in cross-functional discussions.
Employment Type: Full-Time
Annual Salary: $175,000 - $200,000
Additional Compensation: Variable compensation and stock options
What You’ll Do
- Own and execute the SRE strategy and multi-quarter roadmap across reliability, observability, incident management, QA maturity, and release engineering.
- Define, measure, and continuously improve SLAs, SLOs, error budgets, uptime, performance, and operational health metrics across all products and services.
- Lead production reliability for the full platform, including monitoring, alerting, on-call operations, incident response, root cause analysis, and MTTR reduction.
- Establish release readiness standards, deployment safety controls, and quality gates to support stable and predictable product releases.
- Manage external SRE vendors and partners, including service delivery, SLA governance, escalations, performance reviews, and compliance expectations.
- Lead QA engineering strategy with a focus on automation, regression prevention, test coverage, and reducing escaped defects in production.
- Partner with Security and Engineering leaders to ensure cloud infrastructure, CI/CD pipelines, and operational tooling meet HIPAA, SOC 2, and internal security standards.
- Oversee core platform operations, including Azure Kubernetes Service (AKS) environments, Kubernetes, GitOps workflows, CI/CD pipelines, GitHub Actions, secrets management, access controls, and audit readiness.
- Drive observability maturity using tools such as Grafana, Prometheus, logging platforms, tracing tools, and automated alerting frameworks.
- Collaborate with Product, Platform, and Engineering teams to embed reliability and quality best practices throughout the software development lifecycle.
- Build, mentor, and scale high-performing SRE and QA teams while fostering a culture of ownership, accountability, learning, and continuous improvement.
- Drive adoption of AI-enabled automation and intelligent tooling to reduce manual toil, improve productivity, and strengthen operational excellence.
Technical Experience
- Strong hands-on experience with cloud infrastructure, preferably Microsoft Azure, including AKS, networking, storage, IAM, and security services.
- Deep expertise in Kubernetes, containerized workloads, and production-scale distributed systems.
- Experience building and managing CI/CD pipelines using GitHub Actions, ArgoCD, Terraform, or similar DevOps tooling.
- Strong background in monitoring, logging, tracing, and observability platforms such as Grafana, Prometheus, Datadog, Splunk, or equivalent tools.
- Experience with scripting and automation using Python, Bash, PowerShell, or similar languages.
- Strong understanding of release engineering, automated testing frameworks, QA tooling, and shift-left quality practices.
- Experience supporting SaaS applications with uptime, scalability, and security requirements in regulated industries such as healthcare.
- Knowledge of HIPAA, SOC 2, vulnerability management, access controls, and infrastructure security best practices.
- Familiarity with databases, APIs, networking, and troubleshooting across modern web application stacks.
- Exposure to AI-powered DevOps or AIOps tooling for incident management, automation, and engineering productivity is a plus.
Qualifications
- 12+ years of SRE, infrastructure, or platform engineering experience.
- 5+ years of experience in engineering leadership roles.
- Proven track record of owning site reliability for complex, multi-tenant SaaS platforms with demanding availability requirements.
- Demonstrated experience defining SLA and SLO frameworks, error budgets, and incident management processes at scale.
- Experience managing vendor relationships for managed infrastructure or SRE services, including SLA governance and performance management.
- Track record of leading QA or quality engineering functions, including test automation maturity and release gate ownership.
- Strong communication and cross-functional influence skills, with the ability to represent reliability to both technical and non-technical audiences.
Preferred Qualifications
- Experience in healthcare technology, HIPAA-compliant environments, or other highly regulated SaaS industries.
- Familiarity with FHIR-native or EMR/EHR platform architectures and their specific reliability requirements.
- Experience implementing AI-assisted SRE automation, including runbook generation, anomaly detection, or incident triage tooling.
- Background working with Playwright or equivalent test automation frameworks in a QA leadership capacity.
- Experience building internal SRE capability alongside a managed services provider.
What This Role Offers
- Opportunity to own and build the SRE function for a modern healthcare EMR platform from the ground up.
- Leadership of a blended team model combining managed services, internal QA, and internal SRE.
- Work on systems where reliability directly impacts clinical care delivery for vulnerable patient populations.
- Opportunity to shape engineering culture in an environment that actively embraces AI-assisted software development.
- Fully remote, collaborative engineering environment with direct access to executive leadership.