AI Safety & Governance
Member of Technical Staff - ML Infra
Causal Labs
Location
Remote
Type
Full-time
Posted
Jul 27, 2025
Mission
What you will drive
- Design, deploy, and maintain large distributed ML training and inference clusters
- Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle
- Research and test training approaches, including parallelization techniques and numerical precision trade-offs, across different model scales
- Analyze, profile, and debug low-level GPU operations to optimize performance
Impact
The difference you'll make
This role advances the machine learning infrastructure behind large-scale training and inference, making AI development more efficient and scalable.
Profile
What makes you a great fit
- Strong grasp of state-of-the-art techniques for optimizing training and inference workloads
- Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) for training large foundation models
- Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings
- Familiarity with containerization and orchestration tools (e.g., Docker, Kubernetes)
Benefits
What's in it for you
The job description does not mention specific benefits, compensation, or perks.
About
Inside Causal Labs
The job description provides no information about the organization's mission or work.