Platform Engineer - Reliability

Job in Houston, Harris County, Texas, 77246, USA
Listing for: Squarepoint
Full Time position
Listed on 2026-02-16
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing
Salary/Wage Range or Industry Benchmark: 100000 - 125000 USD Yearly
Job Description & How to Apply Below

As a Platform Reliability Specialist at Squarepoint, you will play a critical role in ensuring the stability, performance, and day‑to‑day reliability of the shared platform services. You will work with a diverse group of stakeholders, including developers, researchers, and infrastructure teams, to maintain highly reliable systems and drive proactive improvements.

You will be responsible for reducing operational toil, improving how we respond to and learn from production issues, and evolving our reliability practices. This role blends software engineering, platform and operational ownership, and long‑term architectural thinking to enhance our production systems. While you may have deep expertise in one or more areas, you will contribute across the platform.

Key areas include:
  • Operations & Toil Reduction:
    Own and improve day‑to‑day platform operations by streamlining workflows and enhancing on‑call ergonomics through better automation and runbooks.
  • Reliability Engineering & Hardening:
    Work with service owners to apply engineering principles to improve resilience and performance: harden critical services against degradation and outages.
  • Tooling & Automation:
    Build and maintain platform tools, automation, and GitOps workflows that make it easy for teams to deploy, operate, and observe their services with minimal friction and operational overhead.
  • Knowledge & Standards:
    Capture and share reliability knowledge through documentation, runbooks, and post‑incident reviews. Help define and evolve reliability standards and best practices across the platform.
Required qualifications
  • 4+ years in SRE, Production Engineering, or Reliability Engineering roles with direct ownership of production systems.
  • Experience with system administration and troubleshooting (Linux, Bash, containers).
  • Software development experience with Python, version control (Git), and CI/CD systems.
  • Hands‑on experience with observability systems including metrics, tracing, log pipelines, and alert design.
  • Demonstrated experience running systems at scale, including performance tuning, HA/DR architectures, and resilience engineering.
Nice to have
  • Expertise in a modern observability stack (e.g., Prometheus, Grafana, ELK, VictoriaMetrics).
  • Experience operating enterprise platform software such as Kubernetes clusters, GitLab at scale, or Slurm environments.
  • Familiarity with messaging systems (Kafka/RabbitMQ), service discovery (Consul), and databases (PostgreSQL, ClickHouse, Redis).
  • Experience authoring runbooks, running failure/chaos experiments, and participating in DR exercises.
  • Infrastructure automation and configuration management experience (e.g., Ansible, Terraform, Puppet).