Lead Software Engineer; Site Reliability
Listed on 2026-05-27
IT/Tech
Systems Engineer, IT Support, Cloud Computing
Join Personal Investor Tech's Site Reliability Engineering team and lead cutting-edge SRE initiatives that impact hundreds of applications and millions of investors. You'll architect and build enterprise-scale resiliency solutions, driving our ambitious 2026 roadmap. This is an opportunity to combine deep technical expertise with strategic influence: designing OpenTelemetry integrations, implementing distributed tracing at scale, automating incident responses, and pioneering AI-enhanced diagnostics and analysis.
Work alongside a collaborative, technically focused team where your innovations in resilience engineering will shape Vanguard's next generation of client experiences.
At Vanguard, we pride ourselves on delivering an exceptional client experience to all investors; at the core of this experience are systems that reside in a technically complex and constantly evolving resiliency landscape. Passionate, technically skilled engineers are at the center of our resiliency operations, and we are looking to grow our team.
We are seeking an experienced engineer with broad, end-to-end software development experience, including operating applications in a microservices environment in production. This role goes beyond feature implementation; it requires someone who can design, build, and support resilient systems from the ground up.
As a Senior Reliability Engineer at Vanguard, you will play a critical role in solving impactful operational problems. You are curious and take a proactive approach to identifying problems and making improvements. You balance innovative thinking with pragmatism and understand the long-term impacts of technical decisions. You communicate complex ideas clearly and collaborate effectively to deliver scalable solutions.
Core Responsibilities
- Improve resiliency engineering practices across platforms and applications, including resilient application design patterns, system observability, and deployment strategies
- Detect, troubleshoot, and resolve incidents
- Develop automation for incident response and infrastructure management
- Develop and support OpenTelemetry integrations for multiple application platforms (browser, ECS, Lambda, etc.) and languages (JavaScript, Java)
- Contribute to architectural decisions and support implementation of solutions.
- Deep knowledge of Java or JavaScript. Practical experience developing and operating software in distributed systems environments.
- Problem‑solving and analytical thinking: ability to diagnose complex issues and propose efficient solutions. Strong debugging and optimization skills for performance and scalability.
- Cloud platforms: hands‑on experience with AWS services and cloud infrastructure.
- System architecture and design: ability to design scalable, secure, and maintainable systems.
- Working knowledge of Python (or similar scripting language).
- Strong knowledge of resiliency engineering techniques for both platforms and applications.
- Experience troubleshooting complex production issues and implementing effective mitigations.
- Familiarity with the OpenTelemetry specification and core APIs.
Vanguard is not offering visa sponsorship for this position.