Sr. Software Engineer, AI Infra Job Mountain View Wyoming USA, Software Development

Position: Sr. Staff Software Engineer, AI Infra
Location: Mountain View

Company Description

LinkedIn is the world's largest professional network, built to create economic opportunity for every member of the global workforce. Our products help people make powerful connections, discover exciting opportunities, build necessary skills, and gain valuable insights every day. We're also committed to providing transformational opportunities for our own employees by investing in their growth. We aspire to create a culture that's built on trust, care, inclusion, and fun, where everyone can succeed.

Job Description

At LinkedIn, our approach to flexible work is centered on trust and optimized for culture, connection, clarity, and the evolving needs of our business. The work location of this role is hybrid, meaning it will be performed both from home and from a LinkedIn office on select days, as determined by the business needs of the team.

Join us to push the boundaries of scaling large models together. The team is responsible for scaling LinkedIn’s AI model training, feature engineering, and serving, supporting models with hundreds of billions of parameters and large-scale feature engineering infra for all AI use cases, from recommendation models and large language models to computer vision models. We optimize performance across algorithms, AI frameworks, data infra, compute software, and hardware to harness the power of our GPU fleet, which has thousands of the latest GPU cards.

The team also works closely with the open source community and includes many open source committers (TensorFlow, Horovod, Ray, vLLM, Hugging Face, DeepSpeed, etc.). Additionally, the team focuses on technologies like LLMs, GNNs, Incremental Learning, Online Learning, and serving performance optimizations across billions of user queries.

Model Training Infrastructure:
As an engineer on the AI Training Infra team, you will play a crucial role in building the next-gen training infrastructure to power AI use cases. You will design and implement high-performance data I/O, work with open source teams to identify and resolve issues in popular libraries like Hugging Face, Horovod, and PyTorch, enable distributed training of models with hundreds of billions of parameters, debug and optimize deep learning training, and provide advanced support for internal AI teams in areas like model parallelism, tensor parallelism, ZeRO, etc.

Finally, you will assist in and guide the development of containerized pipeline orchestration infrastructure, including developing and distributing stable base container images, providing advanced profiling and observability, and updating internally maintained versions of deep learning frameworks and their companion libraries like TensorFlow, PyTorch, DeepSpeed, GNNs, FlashAttention, PyTorch Lightning, and more.

Model Serving Infrastructure:
This team builds low-latency, high-performance applications serving very large and complex models across LLM and personalization use cases. As an engineer, you will build compute-efficient infra on top of native cloud, enable GPU-based inference for a large variety of use cases, implement CUDA-level optimizations for high performance, and enable on-device and online training. Challenges include scale (tens of thousands of QPS, multiple terabytes of data, billions of model parameters), agility (experimenting with hundreds of new ML models per quarter using thousands of features), and enabling GPU inference at scale.

As a Sr. Staff Software Engineer, you will have first‑hand opportunities to advance one of the most scalable AI platforms in the world. At the same time, you will work together with our talented teams of researchers and engineers to build your career and your personal brand in the AI industry.

Responsibilities

Owning the technical strategy for broad or complex requirements with insightful and forward‑looking approaches that go beyond the direct team and solve large open‑ended problems.
Designing, implementing, and optimizing the performance of large‑scale distributed serving or training for personalized recommendation as well as large language models.
Improving the observability and understandability of various systems with a focus on improving developer productivity and system…