This is an exciting time to join a growing company. They are unique in what they do and have a strong track record of growth whilst disrupting the market with their innovative software.
About the Role
As a Site Reliability Engineering Manager (SRM), you will own the end-to-end availability and performance of my client's services. You'll also lead by example, develop your team and establish credibility through the quality of your team's technical execution.
On the SRE team, you'll build solutions to enhance the availability, performance and stability of my client's products, and automate away repetitive work. You'll also respond to pings, pages, and alerts to investigate and dive into issues in the platform.
You'll be working on non-production and production environments, monitoring, data collection and configuration management, as well as disaster recovery planning, capacity engineering, reliability improvement initiatives, and platform automation. The best person for this role is someone who has a collaborative spirit - in a word, it's not about being a hero and having all the answers; it's about sometimes saying "I don't know" and working on finding solutions rather than starting with an assumption.
You'll be strategically minded, thinking about best practice, industry standards, continuous improvement and better ways for us to achieve our goals. The team needs someone who can ask questions, learn from others and turn chaos into order.
What you'll have
- Experience leading a team of Software/Systems Engineers;
- Software development experience with C#;
- Automation experience, ideally in Python or PowerShell;
- Ability to manage the end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence;
- Ability to lead by example, care for the team, and establish credibility through the quality of the team's technical execution;
- Experience managing on-call rotations across continents, using a follow-the-sun model;
- Ability to design, write and deliver software to improve the availability, scalability, latency, and efficiency of RecordPoint's services;
- Understanding of the incident management process;
- Experience in monitoring distributed systems;
- Experience with container management tools such as Docker and Kubernetes, and with microservices architectures;
- Experience with metrics, monitoring and logging software such as AppInsights, Grafana, Prometheus, StatsD and Datadog;
- Experience with infrastructure as code, ideally Terraform.
Contact Ged Wilson for more information.