Description
Responsibilities:
• Design, deploy, and manage scalable and efficient infrastructure for machine learning development, testing, and production environments.
• Implement frameworks and best practices for automated machine learning pipelines that build, train, and deploy models consistently across environments.
• Establish monitoring and logging systems to track the performance, health, and usage of platform infrastructure and deployed machine learning models.
• Set up mechanisms for continuous monitoring of model performance and implement processes for automated retraining.
• Implement alerts and dashboards to proactively identify and address issues.
• Optimize and scale machine learning infrastructure to handle varying workloads, and collaborate with cross-functional teams to analyze and address performance bottlenecks.
• Implement security and governance measures to protect machine learning models, data, and infrastructure.
• Collaborate closely with data scientists to understand their requirements for building models and pipelines, and to facilitate the integration of these components into production systems.
• Define and enforce engineering best practices to ensure high-quality deliverables.
• Document processes and best practices for the team, and conduct knowledge-sharing sessions with data scientists on best practices and tools.
• Participate in the code review process to ensure our code quality standards are met.
• Stay up to date with the latest ML platform technologies and trends to identify opportunities for product innovation.
Requirements:
• BS or MS in Computer Science or equivalent.
• 8+ years of experience in commercial software development.
• Strong background in machine learning engineering or operations.
• Demonstrated excellence working in cross-functional teams in fast-paced environments, in both technical leadership and hands-on coding.
• Excellent ability to break down complex problems into simple solutions.
• Willingness and ability to learn, evaluate, and make recommendations for leveraging new technologies.
• Strong analytical skills and desire to write clean, correct, and efficient code.
• Sense of ownership, urgency, and pride in your work.
• Proven leadership: you prioritize, communicate clearly, and partner effectively with both technical and non-technical colleagues.
• Excellent command of tools and expertise for troubleshooting production issues.
• Experience with Python or Java.
• Experience with Docker.
• Experience following software engineering best practices, including source control, code reviews, CI/CD, and automated testing.
• Ability to write complex SQL queries.
• Experience building data pipelines with orchestration tools (e.g. Kubeflow, Argo, Jenkins).
• Experience deploying and maintaining models in cloud environments (e.g. AWS, Azure, GCP).
• Experience with model tracking and deployment tools (e.g. MLflow, Seldon, SageMaker).
Nice to have:
• Experience with Kubernetes.
• Experience with recommendation systems.
• Experience with online businesses.
• Experience with ML frameworks such as TensorFlow and PyTorch.
• Experience with streaming architectures (e.g. Kinesis, Apache Kafka).
• Experience with distributed computing (e.g. Snowflake, Apache Spark, Ray).