AI Infrastructure Engineer, San Francisco area, California, USA (Software Development)

About the Role

This role sits at the intersection of platform engineering, site reliability, and applied ML systems. The function owns the reliability, scalability, and operability of Meshy’s AI model serving stack, along with core engineering infrastructure. The team operates conventional production infrastructure (CI/CD, build systems, deployment, runtime environments) and develops a model‑serving platform that connects the models built by our Research Team to product‑facing backend systems.

The position is systems‑heavy, production‑oriented, and focused on turning experimental model artifacts into robust, observable, and cost‑efficient services.

Job Responsibilities

Design, develop, and optimize core capabilities for the AI inference platform, including inference services, task scheduling, service orchestration, elastic scaling, and release governance.
Develop CPU/GPU resource‑management systems to optimize stability, resource utilization, and cost efficiency where online inference and training share a cluster.
Drive unified management and scheduling of GPU resources, exploring implementation of MIG, MPS, time‑sharing, and virtualization in real‑world operations.
Continuously optimize throughput, latency, and availability of the inference pipeline, refining engineering quality in complex pipelines, multi‑model collaboration, and high‑concurrency scenarios.
Focus on R&D efficiency, resource and cost management, online stability, and disaster recovery architecture to drive performance, reliability, and maintainability.
Explore AI‑native infrastructure and automated operations to make infrastructure smarter and more user‑friendly during rapid startup expansion.

Qualifications

Bachelor’s degree or higher; majors in Computer Science, Software Engineering, Artificial Intelligence, Telecommunications, or related fields are preferred.
1 to 3 years of experience in backend development, infrastructure, cloud‑native platforms, machine learning platforms, or AI platforms.
Proficiency in Go or Python, with solid software engineering skills and a strong commitment to code quality.
Understanding of fundamental principles in Linux, operating systems, computer networks, and distributed systems; ability to independently identify and resolve complex engineering issues.
Practical development experience with Kubernetes, Docker, microservices, or distributed systems, with a basic understanding of production system stability.
Real‑world project experience in model inference, task orchestration, resource scheduling, and service stability, going beyond conceptual understanding.
Self‑motivated, curious, and a fast learner; willing to take on greater ownership and broader responsibilities in a startup environment while continuously learning and quickly adopting new technologies.

Nice to Have

Experience with GPU inference platforms, Kubernetes schedulers, Device Plugins, or related platform development.
Familiarity with Ray and Ray Serve, or experience developing and optimizing model serving, distributed inference, and task orchestration frameworks.
Familiarity with MIG, MPS, vGPU, partitioned GPUs, or GPU resource reuse, and experience balancing performance and stability.
Engineering experience in observability, SRE, capacity planning, cost governance, canary deployments, and automated rollbacks.
Open‑source projects, technical blogs, personal projects, or other achievements that demonstrate learning agility and growth potential.
Ongoing interest and hands‑on experience in emerging areas such as AI infrastructure, inference systems, and AI agent toolchains.

Our Values

Brain:
We value intelligence and the pursuit of knowledge. Our team is composed of some of the brightest minds in the industry.
Heart:
We care deeply about our work, our users, and each other. Empathy and passion drive us forward.
Gut:
We trust our instincts and are not afraid to take bold risks. Innovation requires courage.
Taste:
We have a keen eye for quality and aesthetics. Our products are not just functional but also beautiful.

Why Join Meshy?

Competitive salary, equity, and benefits package.
Opportunity to work with a talented and passionate team at the forefront of AI and 3D technology.
Flexible work environment, with options for remote and on‑site work.
Opportunities for fast professional growth and development.
An inclusive culture that values creativity, innovation, and collaboration.
Unlimited, flexible time off.

Benefits

Stock options available for core team members.
401(k) plan for employees.
Comprehensive health, dental, and vision insurance.
The latest and best office equipment.
