Senior Solutions Architect, AI Factory Observability and Visualization - NVIS

Job in Austin, Travis County, Texas, 78701, USA
Listing for: Nvidia
Full Time position
Listed on 2026-07-01
Job specializations:
  • IT/Tech
    Systems Engineer
Salary/Wage Range or Industry Benchmark: 184,000 USD yearly
Job Description & How to Apply Below
NVIDIA's Infrastructure Specialists team is hiring a Senior Solutions Architect - AI Factory Observability & Visualization! This remote role builds full-spectrum visibility that keeps HPC systems and AI factories running smoothly, transforming intricate telemetry across network and compute into clear, actionable insights.
The role requires a complete, end-to-end understanding of the HPC/AI system: running and interpreting microbenchmarks and workloads to confirm system readiness, then establishing the observability that maintains that state. The work involves collaborating across NVIDIA teams to help partners see, understand, and respond to HPC system and AI factory performance, from hardware to workload.
What You Will Be Doing:
Run AI factory validation tools, microbenchmarks, and workloads provided by the team, and interpret results to assess system health and performance.
Gain a comprehensive understanding of the system from start to finish, including network topology, interconnects, and compute.
Establish what "healthy" means across the stack: the metrics, logs, and signals that confirm a system is functioning well, and the thresholds that show when it isn't.
Build and extend the telemetry surface across hardware, fabric, and workload, designing how data is collected, transformed, stored, and surfaced.
Serve as the observability expert, investigating gaps in visibility to ensure that telemetry reflects true system behavior.
Develop automation (Python, Shell) for collecting, transforming, and presenting system and network data.
Recommend improvements to system visibility, data sources, and reporting that give teams clearer insight.
Collaborate with hardware, software, networking, datacenter, and product groups to ready HPC systems and AI factories for customer deployment, contributing documentation and readiness materials throughout the process.
What We Need to See:
Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.
6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings.
Hands-on experience with the architecture of multi-GPU and/or multi-node clusters, including networking and interconnects.
Solid grasp of how HPC and AI factory systems fit together end to end, from network fabric through compute.
Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
Practical experience working with observability systems (e.g., Prometheus, Grafana, Loki, or similar), including building custom exporters or collectors, setting up alerts, and managing metric cardinality and retention at scale.
Experience transforming metrics, logs, and traces into clear, actionable insight for complex distributed environments.
Familiarity with GPU and fabric telemetry (e.g., DCGM, NVLink, InfiniBand/Ethernet fabric counters) and using it to diagnose performance regressions.
Strong communication skills and the ability to work effectively with cross-functional teams.
Ways to Stand Out From the Crowd:

Experience with AI factory or large-scale AI infrastructure build, deployment, or operations.
Background in HPC systems engineering, SRE, or systems analysis for GPU-accelerated environments.
Experience building automation and data pipelines that feed dashboards and reporting.
Demonstrated desire to use AI to solve practical problems, improve workflows, and guide data-driven decisions.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until June 28, 2026.
NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Position Requirements
10+ Years work experience