Job Description
What you will do:
- Deploy, manage, and optimize Kubernetes clusters tailored for AI/ML workloads, ensuring efficient resource allocation and scalability across different network configurations (a small sketch of this kind of automation follows this list).
- Develop and maintain CI/CD pipelines for continuous training and deployment of machine learning models, integrating tools such as Kubeflow, MLflow, Argo Workflows, or TensorFlow Extended (TFX).
- Collaborate with data scientists to oversee the deployment of machine learning models and set up monitoring systems to track their performance and health in production.
- Design and implement data pipelines for large-scale data ingestion, processing, and analytics essential for machine learning models, utilizing distributed storage and processing technologies such as Hadoop, Spark, and Kafka.
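To give a flavor of the hands-on work described above, here is a minimal sketch of scheduling a GPU-backed training workload with the official Kubernetes Python client. The image name, namespace, and GPU count are illustrative assumptions, not part of any specific stack for this role.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (inside a cluster,
# config.load_incluster_config() would be used instead).
config.load_kube_config()

# Hypothetical training image and namespace, for illustration only.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="model-trainer"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "model-trainer"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "model-trainer"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/ml/trainer:latest",
                        # Request one NVIDIA GPU via the device-plugin resource.
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"}
                        ),
                    )
                ]
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="ml", body=deployment)
```

In practice this logic would typically live in IaC or a pipeline step rather than a one-off script, but it shows the level of Kubernetes fluency the role expects.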
The skills you bring:
- Extensive experience with Kubernetes and cloud services (AWS, Azure, GCP, private cloud) with a focus on deploying and managing AI/ML environments.
- Strong proficiency in scripting and automation using languages and tools such as Python, Bash, Ansible, and HashiCorp Terraform.
- In-depth knowledge of data pipeline and workflow management tools, distributed data processing (Hadoop, Spark), and messaging systems (Kafka, RabbitMQ); see the sketch after this list for a small illustration.
- Expertise in implementing CI/CD pipelines, infrastructure as code (IaC), and configuration management tools.
- Familiarity with security standards and data protection regulations relevant to AI/ML projects.
- Proven ability to design and maintain reliable and scalable infrastructure tailored for AI/ML workloads.
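For illustration only, a minimal sketch of consuming events from Kafka with the kafka-python client, in the spirit of the data-pipeline skills listed above; the broker address, topic, and consumer group are hypothetical.

```python
import json

from kafka import KafkaConsumer

# Hypothetical broker address, topic, and consumer group for illustration.
consumer = KafkaConsumer(
    "ml-feature-events",
    bootstrap_servers="kafka.example.com:9092",
    group_id="feature-ingest",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # Downstream: validate, transform, and hand off to the processing layer.
    print(record)
```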