Lead AI Infrastructure Engineer, HOAi
Listed on 2025-12-02
IT/Tech
Systems Engineer, Cloud Computing
Overview
The Lead AI Infrastructure Engineer at HOAi is responsible for scaling and maintaining the infrastructure that powers our AI-driven products and services. This role sits at the intersection of infrastructure engineering, machine learning operations, and product development, ensuring our AI systems operate with exceptional reliability, performance, and efficiency. The ideal candidate is someone who gets excited about making AI systems fundamentally faster and more scalable.
You'll work directly with our engineering and product teams to build the foundational infrastructure that enables HOAi to deliver the most advanced AI product in the community association management industry.
Key Responsibilities
- Infrastructure Ownership: Design, build, and maintain the cloud architecture, model serving infrastructure, and ML pipelines that power HOAi's products
- Performance Optimization: Profile and optimize AI workloads to achieve sub-second inference latency while managing costs effectively
- Scalability & Reliability: Build auto-scaling systems, implement robust failover mechanisms, and ensure 99.99% uptime for mission-critical AI services
- MLOps Excellence: Develop and maintain CI/CD pipelines for model deployment, monitoring, and versioning across development and production environments
- Developer Enablement: Create tooling and infrastructure that allows product engineers to deploy AI features quickly and safely
- Security & Compliance: Implement security best practices and ensure compliance requirements are met across all AI infrastructure
Success Metrics
- Infrastructure uptime and reliability
- AI inference latency (p95, p99) and throughput metrics
- Infrastructure cost efficiency and optimization (cost per inference, GPU utilization)
- Time to deploy new models and workflows (deployment velocity)
- Developer satisfaction and productivity using AI infrastructure tools
- System observability and incident response time
Performance & Scalability
- Profile and optimize database queries, API endpoints, and ML inference pipelines
- Implement caching strategies, connection pooling, and distributed systems for scale
- Monitor and optimize GPU utilization, memory usage, and compute costs
- Design load balancing and auto-scaling policies for variable AI workloads
- Build disaster recovery systems with redundancy
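The caching strategies mentioned above can take many forms; one common building block is a small in-process cache that bounds both entry lifetime and memory. The sketch below is purely illustrative (the class name and parameters are not from the posting), showing an LRU cache with per-entry TTL of the kind often placed in front of repeated inference or database calls:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Tiny LRU cache with per-entry TTL. An illustrative sketch of the
    caching strategies named above, not a production implementation."""

    def __init__(self, max_size=1024, ttl_seconds=60.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict and treat as a miss
            return None
        self._store.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```

In practice a shared cache such as Redis usually sits behind the same get/put interface; the eviction and TTL logic above is what those systems provide out of the box.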
MLOps & Deployment
- Build and maintain CI/CD pipelines specifically for model deployment
- Implement model versioning, A/B testing infrastructure, and rollout mechanisms
- Create automated testing frameworks for model quality and performance regression
- Develop infrastructure for model monitoring, drift detection, and retraining workflows
- Manage experiment tracking and model registry systems
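The A/B testing and rollout mechanisms above typically boil down to splitting traffic between model versions by weight. A minimal sketch of weighted routing, assuming a mapping of version ids to traffic fractions (all names here are hypothetical):

```python
import random

def route_request(versions, rng=random.random):
    """Pick a model version by rollout weight.

    `versions` maps version id -> traffic fraction; fractions sum to 1.0.
    Illustrative sketch of a weighted A/B rollout, not a specific framework.
    """
    r = rng()  # uniform draw in [0, 1)
    cumulative = 0.0
    for version, weight in versions.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # fall through on floating-point rounding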
Observability & Reliability
- Implement comprehensive monitoring, logging, and alerting across the AI stack
- Refine dashboards for real-time visibility into system health and performance
- Conduct post-mortems and implement reliability improvements
- Design circuit breakers, retry logic, and graceful degradation for critical services
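The circuit-breaker pattern named in the last bullet wraps a flaky dependency so that, after repeated failures, callers fail fast instead of piling up on a dying service. A minimal sketch under assumed parameters (class and argument names are illustrative):

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors the circuit opens and calls
    fail fast until `reset_after` seconds pass. Illustrative sketch of the
    pattern, not a production implementation."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

Retry logic with exponential backoff usually sits on the caller's side of the breaker, so retries stop immediately once the circuit is open.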
Security & Compliance
- Refine security best practices for AI infrastructure and data handling
- Ensure compliance with data privacy regulations and industry standards
- Manage credentials and access control across infrastructure
- Support security audits and vulnerability assessments
Collaboration & Documentation
- Work closely with the Product and Engineering teams to understand infrastructure needs and enable fast, safe feature deployment
- Document infrastructure architecture, runbooks, and operational procedures
- Mentor team members on infrastructure best practices and tooling
- Contribute to technical strategy and architectural decisions
Required Experience
- 3-7 years of experience in infrastructure engineering, DevOps, or SRE
- Strong cloud platform expertise
- Experience building and maintaining deployment pipelines
- Experience with PostgreSQL, Redis, or other production databases
- Experience with APM tools, metrics, logging, and alerting
- Familiarity with vector databases, model serving frameworks, and cross-system observability and traceability
- Managing and optimizing GPU workloads
- Real-time inference with low-latency serving…