Site Reliability Engineer
Design, build, and operate a world-class real-time streaming cloud platform
We are building Redpanda, a real-time streaming engine for modern applications. Redpanda is used by Fortune 1000 enterprises pushing hundreds of terabytes a day, and by the solo dev prototyping a React application on her laptop. We go beyond the Kafka protocol into the future of streaming, with inline WASM transforms and geo-replicated hierarchical storage. Think of it as a data API platform that scales with you from the smallest projects to petabytes of data distributed across the globe.
We are on a mission to enable every developer to supercharge their real-time applications.
You will be a part of our cloud team, working with all of engineering on building new services, automating infrastructure lifecycle on Kubernetes, and monitoring our services with the goal of offering a reliable, scalable and high-performance SaaS. One of our primary goals is to run a managed, cloud-based streaming-as-a-service with 99.5% uptime or better, and this role is critical for that goal.
Build & design Vectorized’s cloud infrastructure with reliability and performance in mind.
Build tools & services to allow automated infrastructure management and self-healing, including deployments and upgrades.
Be in charge of end-to-end monitoring of our cloud. Layer observability into our Kubernetes operators. Prioritize what metrics to collect, drive analysis of those metrics, and influence our roadmap based on that analysis.
Participate in on-call rotations, working to keep customer workloads running and incident-free.
You’ll be part of a diverse team with members in both the US (New York City, San Francisco, San Diego, Austin, Denver) and international locations, including Colombia, the United Kingdom, Russia, Poland, and growing!
3+ years of experience in an SRE-like role
Comfortable working with a 100% distributed engineering team, collaborating on GitHub, in the open
Strong experience with public cloud providers
Experience running highly-scalable production workloads reliably on Kubernetes
Experience with monitoring at scale
Experience managing infrastructure predictably through GitOps and IaC
Solid programming skills
Willingness to participate in an on-call rotation
Excellent written communication skills
A BS in Computer Science or equivalent experience
Strong understanding of Go and Kubernetes
Experience operating a SaaS platform
Fluency in a couple of programming languages (for example, Go or Python)
Experience operating or using streaming platforms, as either a user or a provider
Experience with the Prometheus monitoring stack