AI Safety & Governance Full-time

Member of Technical Staff - ML Infra

Causal Labs

Location

Remote

Type

Full-time

Posted

Jul 27, 2025

Mission

What you will drive

  • Design, deploy, and maintain large distributed ML training and inference clusters
  • Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle
  • Research and test training approaches, including parallelization techniques and numerical precision trade-offs, across different model scales
  • Analyze, profile, and debug low-level GPU operations to optimize performance

Impact

The difference you'll make

This role advances machine learning infrastructure, making AI development more efficient and scalable in support of applications across different domains.

Profile

What makes you a great fit

  • Strong grasp of state-of-the-art techniques for optimizing training and inference workloads
  • Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) for training large foundation models
  • Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
  • Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)

Benefits

What's in it for you

The job description does not mention specific benefits, compensation, or perks.

About

Inside Causal Labs

The job description provides no information about the organization's mission or work.