Operations-Focused Engineer Job: Houston area, Texas, USA, IT/Tech

Overview

The Testing Consultancy (TTC) is a global specialist software testing company with a focus on helping organizations transform the way they deliver quality software. We have broad capabilities across a wide range of testing areas that enable our clients to increase the speed and quality of software development while reducing risk and cost.

Perks of working for TTC

401(k) with company match
Paid Time Off
Paid Holidays
Work-Life Balance
Growth and Development Opportunities

Role summary

We’re looking for an operations-focused engineer to join our team. This role owns the day-to-day reliability and operational excellence of a portfolio of business-critical third-party enterprise platforms and integrations, partnering closely with engineering and cross-functional infrastructure teams to keep systems healthy, scalable, and secure.

Responsibilities

Serve in an on-call rotation and lead incident response for production issues: triage, mitigation, escalation, and restoration.
Drive operational excellence: improve alert quality, reduce toil, document runbooks, and create repeatable operational processes.
Perform root cause analysis for incidents and recurring issues; drive corrective and preventive actions to completion.
Execute and coordinate maintenance activities (upgrades, patching, configuration changes) with minimal risk and downtime.
Build and maintain monitoring, dashboards, and health checks to detect issues early and reduce mean time to recovery.
Automate routine operational workflows using scripts and small tools; improve reliability through safe incremental change.
Partner cross-functionally (security, networking, storage, compute, vendor/third-party partners) to resolve complex issues.
Maintain accurate system documentation, operational standards, and service ownership practices across supported platforms.

Qualifications

3+ years of experience in production operations, SRE, systems engineering, or production support for enterprise services.
Experience participating in or leading on-call and handling production incidents with clear communication.
Proficiency in scripting/automation (e.g., Python and/or shell) and comfort with change management / peer review workflows.
Strong written and verbal communication; able to write clear runbooks and incident summaries.

Preferred qualifications

Experience operating third-party enterprise platforms (integration middleware, identity/auth systems, web/app tiers, databases, batch/scheduled jobs).
Familiarity with vulnerability remediation and patch management practices in production environments.
Demonstrated track record reducing operational toil and improving reliability metrics (MTTR, alert noise, incident recurrence).
Experience coordinating complex incidents across multiple teams and stakeholders.
Experience using Capirca for network provisioning, Chef for configuration management, and infrastructure as code and containers for deployment.

Success in the first 60–90 days

Ramp to primary on-call ownership for supported systems.
Demonstrate ability to independently troubleshoot common failure modes and follow operational playbooks.
Deliver one or two measurable reliability improvements (toil reduction, alert cleanup, monitoring gap closure, recurring issue fix).

Working style

Calm under pressure, structured problem-solver, prioritizes reliability and safety.
Proactive communicator who keeps stakeholders informed during incidents and planned work.
“Automate and document” mindset: reduces repeated manual work and makes operations scalable.

Diversity and equal employment opportunity are foundational to TTC. We encourage applications from candidates of all backgrounds and experiences.
