MLOps Engineer

Building end-to-end machine learning platforms that are reliable, observable, and ready for production.

MLOps Engineer and AI/ML practitioner with hands-on experience across AWS, Azure, and GCP, focused on model training, deployment, monitoring, retraining, CI/CD, infrastructure-as-code, and cloud-native delivery.

Sai Chandu Machavarapu

MLflow · Kubeflow · Docker · Kubernetes · Terraform

mlops-stack.yaml
stack:
  orchestrate: airflow + kubeflow
  track: mlflow + dvc
  deploy: fastapi + docker + kubernetes
  observe: prometheus + grafana
  cloud: aws / gcp / azure

Focus: MLOps & Production AI

Training, serving, retraining, and monitoring systems.

Cloud: AWS · Azure · GCP

Multi-cloud machine learning deployment experience.

Core Strength: CI/CD · IaC · Observability

Automation-first delivery with measurable reliability.

About

A production-minded engineer connecting machine learning, cloud infrastructure, and DevOps execution.

MLOps Engineer and AI/ML practitioner with hands-on experience building end-to-end machine learning pipelines, data pipelines, model serving infrastructure, and production monitoring systems.

Experienced in automating model training, deployment, versioning, and retraining using MLflow, Kubeflow, Apache Airflow, Docker, Kubernetes, and GitHub Actions.

Skilled in Python, PyTorch, TensorFlow, and scikit-learn, and in LLM and GenAI workflows including RAG, fine-tuning, prompt engineering, vector search, and API-based delivery.

Strong foundation in DevOps and cloud-native systems including Terraform, Kafka, Prometheus, Grafana, SageMaker, EC2, ECS, Lambda, Cloud Run, and Azure ML.

Skills

Tools used across model development, deployment, infrastructure, and observability.

Cloud Platforms

AWS · SageMaker · EC2 · ECS · ECR · S3 · Lambda · CloudFormation · CloudWatch · Vertex AI · Cloud Run · Azure ML · Azure AD

MLOps & Workflow

MLflow · Kubeflow Pipelines · Apache Airflow · Argo Workflows · DVC · Weights & Biases · Great Expectations · Evidently AI · Prefect · BentoML

ML, DL & LLMs

PyTorch · TensorFlow · scikit-learn · XGBoost · LightGBM · ONNX Runtime · LangChain · LlamaIndex · Hugging Face · vLLM · LoRA · QLoRA · RAG

Infrastructure

Docker · Docker Compose · Kubernetes · Helm · ArgoCD · Terraform · Ansible · Nginx · Linux · YAML · HCL

CI/CD & Serving

GitHub Actions · GitLab CI/CD · Jenkins · FastAPI · Flask · TorchServe · KServe · REST APIs · gRPC · Canary Deployments

Monitoring & Data

Prometheus · Grafana · AlertManager · OpenTelemetry · Datadog · PostgreSQL · MySQL · Redis · Kafka · Spark · dbt · Snowflake · BigQuery · Pinecone · pgvector

Experience

Professional work shaped around reproducibility, deployment, automation, and measurable system improvements.

Nov 2023 — Apr 2024

Codegnan

Internship

Machine Learning Intern

  • Engineered an end-to-end ML pipeline for house price prediction using Python, scikit-learn, and XGBoost; used MLflow for experiment tracking across 12+ configurations and improved RMSE by 18% through ensemble stacking and hyperparameter tuning.
  • Containerized the inference service with Docker and deployed a FastAPI REST endpoint with sub-100ms response time; integrated CI/CD with GitHub Actions for automated model validation on each push.
  • Automated data ingestion using web scraping and pandas pipelines to collect, clean, and validate 5,000+ property records; implemented schema checks with Great Expectations before training runs.
  • Versioned datasets and model artifacts using DVC with an S3 remote, enabling reproducible training and rollback to prior versions during evaluation cycles.
  • Documented model behavior, deployment steps, retraining triggers, and rollback procedures in a clear operational runbook.

May 2023 — Oct 2023

Wipro

Trainee

Java Full Stack Trainee

  • Developed backend REST APIs using Spring Boot, Hibernate/JPA, and MySQL with JWT authentication and role-based access control; integrated with a React frontend for a full-stack HR management application.
  • Implemented CI/CD with Jenkins and GitHub Actions to automate build, test, and Docker-based deployment, reducing manual deployment effort and surfacing integration issues earlier.
  • Established structured JSON logging, health check endpoints, and API versioning to improve observability and maintainability across development and staging environments.
  • Contributed to an HR management module covering employee records, attendance tracking, and leave management, used by 50+ internal users during user acceptance testing.

Mar 2022 — May 2022

Amazon Web Services

Training

AI/ML Training

  • Trained and deployed supervised ML models for classification and regression using AWS SageMaker managed training jobs with spot instances, reducing training compute cost.
  • Built ETL pipelines using AWS Lambda and Amazon S3 to ingest, preprocess, and standardize structured datasets from multiple sources.
  • Deployed an ML inference endpoint on SageMaker with auto-scaling and CloudWatch alarms for latency and error rate monitoring.
  • Implemented a model versioning and promotion workflow spanning training, evaluation, registry, and deployment stages.
  • Authored CloudFormation templates for reproducible SageMaker environments and integrated automated accuracy threshold checks for promotion to staging.

Projects

Selected projects focused on retraining workflows, streaming features, model monitoring, and production-ready AI systems.

01

Apache Airflow · MLflow · PostgreSQL · FastAPI · Prometheus

Automated ML Retraining Pipeline with Shadow Deployment

Built an Airflow-orchestrated retraining workflow that triggers on production metric drops, runs champion-challenger shadow deployment, validates data quality, and promotes only statistically better models.

02

Apache Kafka · Faust · Redis · FastAPI · Grafana

Real-Time ML Feature Engineering Pipeline

Designed a streaming feature pipeline processing 100+ events per second, serving online features from Redis to a FastAPI inference service with sub-50ms P95 latency and full Prometheus/Grafana observability.

03

FastAPI · PostgreSQL · Evidently AI · AlertManager · Docker

ML Model Monitoring & Observability Stack

Created a production model monitoring system with prediction logging, data drift detection, alerting for latency and error thresholds, and pre-provisioned dashboards for model health and confidence tracking.

04

scikit-learn · MLflow · FastAPI · Terraform · AWS

Hospital Readmission Prediction — End-to-End MLOps

Built an end-to-end workflow that takes a scikit-learn model to production, with experiment tracking, GitHub Actions CI/CD, Docker image deployment to AWS EC2, drift-triggered retraining, and Terraform-based infrastructure provisioning.

05

TypeScript · React · Redis · Sentence-Transformers · MySQL

LLM Cost Optimizer — Semantic Cache & Intelligent Router

Built a semantic caching proxy and intelligent router for LLM requests that reduces redundant API calls, routes prompts by complexity, tracks token and cost metrics, and exposes a full analytics dashboard.

Education

Academic background in computer science with an applied AI and systems focus.

Master of Science — Computer Science & Information Systems

University of Texas at Tyler

Aug 2024 — May 2026 · GPA 3.6

Bachelor of Technology — Computer Science & Engineering (AI Specialization)

Jawaharlal Nehru Technological University, Kakinada

Jan 2021 — May 2024 · GPA 3.3

Certifications

Certifications supporting cloud architecture, AI engineering, and MLOps delivery.

Microsoft Certified: Azure Administrator Associate
Microsoft Certified: Azure AI Fundamentals
Oracle Database@AWS Certified Architect Professional
Machine Learning Engineering for Production (MLOps) Specialization
Generative AI with Large Language Models
Deep Learning Specialization
Oracle AI Vector Search Certified Professional
GitHub Copilot Certified

Contact

Open to MLOps, AI/ML, platform, and DevOps engineering opportunities.

Feel free to reach out for full-time roles, internships, collaborations, or conversations around production ML systems and cloud-native AI platforms.