Operations-Focused Engineer
Listed on 2026-03-01
IT/Tech
Systems Engineer, IT Support, Cybersecurity
Overview
The Testing Consultancy (TTC) is a global specialist software testing company with a focus on helping organizations transform the way they deliver quality software. We have broad capabilities across a wide range of testing areas that enable our clients to increase the speed and quality of software development while reducing risk and cost.
Perks of working for TTC
- 401(k) with company match
- Paid Time Off
- Paid Holidays
- Work Life Balance
- Growth and Development Opportunities
We’re looking for an operations-focused engineer to join our team. This role owns the day-to-day reliability and operational excellence of a portfolio of business-critical third-party enterprise platforms and integrations, partnering closely with engineering and cross-functional infrastructure teams to keep systems healthy, scalable, and secure.
Responsibilities
- Serve in an on-call rotation and lead incident response for production issues: triage, mitigation, escalation, and restoration.
- Drive operational excellence: improve alert quality, reduce toil, document runbooks, and create repeatable operational processes.
- Perform root cause analysis for incidents and recurring issues; drive corrective and preventive actions to completion.
- Execute and coordinate maintenance activities (upgrades, patching, configuration changes) with minimal risk and downtime.
- Build and maintain monitoring, dashboards, and health checks to detect issues early and reduce mean time to recovery.
- Automate routine operational workflows using scripts and small tools; improve reliability through safe incremental change.
- Partner cross-functionally (security, networking, storage, compute, vendor/third-party partners) to resolve complex issues.
- Maintain accurate system documentation, operational standards, and service ownership practices across supported platforms.
- 3+ years of experience in production operations, SRE, systems engineering, or production support for enterprise services.
- Experience participating in or leading on-call and handling production incidents with clear communication.
- Proficiency in scripting/automation (e.g., Python and/or shell) and comfort with change management / peer review workflows.
- Strong written and verbal communication; able to write clear runbooks and incident summaries.
- Experience operating third-party enterprise platforms (integration middleware, identity/auth systems, web/app tiers, databases, batch/scheduled jobs).
- Familiarity with vulnerability remediation and patch management practices in production environments.
- Demonstrated track record reducing operational toil and improving reliability metrics (MTTR, alert noise, incident recurrence).
- Experience coordinating complex incidents across multiple teams and stakeholders.
- Experience using Capirca for network provisioning, Chef for configuration management, and Infrastructure as Code and Containers for deployment.
- Ramp to primary on-call ownership for supported systems.
- Demonstrate ability to independently troubleshoot common failure modes and follow operational playbooks.
- Deliver at least 1–2 measurable reliability improvements (toil reduction, alert cleanup, monitoring gap closure, recurring issue fix).
- Calm under pressure, structured problem-solver, prioritizes reliability and safety.
- Proactive communicator who keeps stakeholders informed during incidents and planned work.
- “Automate and document” mindset: reduces repeated manual work and makes operations scalable.
Diversity and equal employment opportunity are foundational to TTC. We encourage applications from candidates of all backgrounds and experiences.