Principal Software Engineer, AI Training Platform Job California Missouri USA, Software Development

Position: Principal Staff Software Engineer, AI Training Platform
Location: California

Company Description

LinkedIn is the world's largest professional network, built to create economic opportunity for every member of the global workforce. Our products help people make powerful connections, discover exciting opportunities, build necessary skills, and gain valuable insights every day. We're also committed to providing transformational opportunities for our own employees by investing in their growth. We aspire to create a culture that's built on trust, care, inclusion, and fun, where everyone can succeed.

Job Description

This role will be based in Mountain View, CA. At LinkedIn, we trust each other to do our best work where it works best for us and our teams. This role offers hybrid work options, meaning you can work from home and commute to a LinkedIn office, depending on what's best for you and when your team needs to be together.

As part of LinkedIn's AI Platform group, the AI Training team is responsible for developing and maintaining highly available and scalable deep learning training solutions to power our rapidly growing AI use cases. The team is responsible for scaling LinkedIn's AI model training to models with hundreds of billions of parameters across all AI use cases, from recommendation models and large language models (Generative AI) to computer vision models.

We optimize training performance across algorithms, AI frameworks, infrastructure software, and hardware to harness the power of our GPU fleet with thousands of the latest GPU cards. The team also works closely with the open source community and has many open source committers (TensorFlow, Horovod, Ray, Hadoop, etc.) on the team. Additionally, this team focuses on technologies like LLMs, GNNs, Incremental Learning, Online Learning, and advanced LLM Agents work for Training infrastructure.

Responsibilities

Owning the technical strategy for broad or complex requirements with insightful and forward-looking approaches that go beyond the direct team and solve large open-ended problems.
Designing, implementing, and optimizing the performance of large-scale distributed training for personalized recommendation as well as large language models.
Improving the observability and understandability of various systems with a focus on improving developer productivity and system sustenance.
Mentoring other engineers, defining our challenging technical culture, and helping to build a fast-growing team.
Working closely with the open-source community to participate in and influence cutting-edge open-source projects (e.g., PyTorch, GNNs, DeepSpeed, Hugging Face, etc.).
Functioning as the tech lead for several concurrent key initiatives for the Training Infrastructure and defining the future of AI training platforms.

Basic Qualifications

BS/BA in Computer Science or related technical field or equivalent technical experience
7+ years of industry experience in software design, development, and algorithm-related solutions
7+ years of experience programming in object-oriented languages such as Python, C++, Java, Go, Rust, Scala
5+ years of experience as an architect or in a technical leadership position
5+ years of industry experience leading or building deep learning systems
Hands-on experience developing distributed systems or other large-scale systems

Preferred Qualifications

MS or PhD in Computer Science or related technical discipline.
12+ years of experience in software design, development, and algorithm-related solutions, with at least 5 years of experience in a technical leadership position
12+ years of experience in an object-oriented programming language such as Python, C++, Java, Go, Rust, Scala
5+ years of experience with large-scale distributed systems and client-server architectures
Co-author or maintainer of any open-source projects
Expertise in machine learning infrastructure, including technologies like MLflow, Kubeflow, and large-scale distributed systems
Familiarity with containers and container orchestration systems
Expertise in deep learning frameworks and tensor libraries like PyTorch, TensorFlow, JAX/Flax

Suggested Skills

ML Algorithm Development
Machine Learning / Deep Learning
Big Data
Stakeholder Management

LinkedIn is committed to fair and equitable compensation practices. The pay range for this role is $207,000 to $340,000. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to skill set, depth of experience, certifications, and specific work location. This may be different in other locations due to differences in the cost of labor.

The total compensation package for this position may also include annual performance bonus, stock, benefits and/or other applicable incentive compensation plans. For more information, visit

Equal Opportunity Statement

We seek candidates with a wide range of perspectives and backgrounds and we are proud to be an equal opportunity employer. LinkedIn considers qualified applicants without regard to race, color, religion, creed, gender, national origin, age, disability,…

