Job Description
We are seeking a visionary Senior AI Systems Architect to lead the infrastructure strategy for our next-generation Generative AI platform. As we look toward 2026, the pace of technological evolution is accelerating, and we need a leader who can design systems that are not only scalable today but also adaptable to the quantum and edge computing eras.
In this pivotal role, you will bridge the gap between cutting-edge machine learning research and production-grade reliability. You will be responsible for architecting the backbone of our neural network training pipelines and real-time inference engines. If you are passionate about solving complex engineering challenges and shaping the future of AI, this is your opportunity to make an impact.
Responsibilities
- Architect Scalable Infrastructure: Design and implement high-throughput, low-latency distributed systems capable of handling petabyte-scale data processing for large language models.
- Model Deployment: Lead the deployment, optimization, and monitoring of AI models in production environments using Kubernetes and serverless architectures.
- Performance Tuning: Conduct deep-dive performance analysis of training and inference workloads to improve GPU cluster utilization and reduce inference costs.
- Security & Compliance: Implement robust security protocols to protect sensitive training data and ensure compliance with emerging AI ethics regulations.
- Technical Leadership: Mentor a team of DevOps and ML engineers, fostering a culture of innovation and continuous improvement.
- Future-Proofing: Research and evaluate emerging technologies (e.g., edge AI, federated learning) to integrate into our core architecture roadmap.
Qualifications
- Education: Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field; PhD preferred.
- Experience: 7+ years of experience in software engineering, DevOps, or systems architecture, with a strong focus on AI/ML workloads.
- Technical Skills: Proficiency in Python, PyTorch, TensorFlow, and modern cloud platforms (AWS/GCP/Azure).
- Containerization & Infrastructure as Code: Extensive experience with Docker and Kubernetes for containerized workloads, and with Terraform for infrastructure as code.
- Networking: Strong understanding of networking protocols, load balancing, and high-availability architecture.
- Problem Solving: Exceptional ability to troubleshoot complex, multi-layered system issues under pressure.