Site Reliability Engineer; SRE
Listed on 2026-05-21
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing
Bright Vision Technologies is a forward‑thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting‑edge technologies to create scalable, secure, and user‑friendly applications.
Job Title: Site Reliability Engineer (SRE)
Location: 100% Remote (Continental United States)
Employment Type: Full‑time, direct W2 with Bright Vision Technologies (no C2C, 1099, or third‑party arrangements)
Job Summary
We are seeking an experienced Site Reliability Engineer to ensure the availability, performance, and operational excellence of large‑scale distributed systems in production. In this role you will bridge development and operations, applying software engineering principles to infrastructure and operations problems to continually improve platform reliability.
A technical coding assessment is required for all applicants.
Key Responsibilities
- Define, instrument, and refine service‑level objectives (SLOs), service‑level indicators (SLIs), and error budgets for critical services, using those metrics to drive engineering decisions.
- Lead incident response and resolution, acting as incident commander when needed, and produce high‑quality post‑incident reviews.
- Design and implement monitoring, logging, and tracing strategies (Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar).
- Build and maintain on‑call processes, runbooks, and escalation paths to reduce mean time to detect and mean time to resolve.
- Automate operational toil with production‑grade tooling in Python, Go, Bash, or similar languages.
- Architect and operate large‑scale Kubernetes clusters and container workloads, including autoscaling, capacity planning, network policy, and service mesh integration.
- Design CI/CD pipelines that support safe, frequent, and observable releases with automated testing, canary deployments, feature flags, and progressive roll‑out strategies.
- Lead capacity planning and performance engineering activities, validating models with load testing and chaos experiments.
- Partner with application teams to embed reliability practices early in design, such as failure‑mode analysis and graceful degradation.
- Strengthen platform resiliency through chaos engineering, fault injection, retries, timeouts, circuit breakers, and well‑tested failover paths.
- Drive continuous improvement of security posture in collaboration with security teams, including patch management and vulnerability remediation.
- Contribute to the technical roadmap for reliability tooling, observability platforms, and developer‑experience improvements.
- Mentor colleagues on SRE practices and foster a blameless culture of operational excellence.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large‑scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java.
- Deep experience operating Linux at scale, including networking, performance tuning, and systems‑level troubleshooting.
- Production experience operating Kubernetes and container‑based workloads.
- Proficiency with observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands‑on experience designing and operating CI/CD pipelines for infrastructure and applications.
- Solid understanding of distributed system design, including consistency models, partitioning, and failure semantics.
- Demonstrated experience leading incident response and post‑incident reviews.
- Excellent communication and documentation skills.
- Experience defining and operationalizing SLOs and error budgets in production environments.
- Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
- Hands‑on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Background in capacity planning, performance engineering, or large‑scale load testing.
- Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.
Bright Vision Technologies (BV Teck) is committed to equal employment opportunity (EEO) for all employees and applicants without regard to race, color, religion, sex, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, veteran status, or any other protected status as defined by applicable federal, state, or local laws. This commitment extends to all aspects of employment, including recruitment, hiring, training, compensation, promotion, transfer, leaves of absence, termination, layoffs, and recall.
BV Teck expressly prohibits any form of workplace harassment or discrimination. Improper interference with an employee’s ability to perform their job duties may result in disciplinary action up to and including termination of employment.