Training, serving, retraining, and monitoring systems.
MLOps Engineer
Building end-to-end machine learning platforms that are reliable, observable, and ready for production.
MLOps Engineer and AI/ML practitioner with hands-on experience across AWS, Azure, and GCP, focused on model training, deployment, monitoring, retraining, CI/CD, infrastructure-as-code, and cloud-native delivery.
Sai Chandu Machavarapu
MLflow · Kubeflow · Docker · Kubernetes · Terraform
stack:
  orchestrate: airflow + kubeflow
  track: mlflow + dvc
  deploy: fastapi + docker + kubernetes
  observe: prometheus + grafana
  cloud: aws / gcp / azure
Multi-cloud machine learning deployment experience.
Automation-first delivery with measurable reliability.
About
A production-minded engineer connecting machine learning, cloud infrastructure, and DevOps execution.
MLOps Engineer and AI/ML practitioner with hands-on experience building end-to-end machine learning pipelines, data pipelines, model serving infrastructure, and production monitoring systems.
Experienced in automating model training, deployment, versioning, and retraining using MLflow, Kubeflow, Apache Airflow, Docker, Kubernetes, and GitHub Actions.
Skilled in Python, PyTorch, TensorFlow, and scikit-learn, and in LLM and GenAI workflows including RAG, fine-tuning, prompt engineering, vector search, and API-based delivery.
Strong foundation in DevOps and cloud-native systems including Terraform, Kafka, Prometheus, Grafana, SageMaker, EC2, ECS, Lambda, Cloud Run, and Azure ML.
Skills
Tools used across model development, deployment, infrastructure, and observability.
Cloud Platforms
AWS · SageMaker · EC2 · ECS · ECR · S3 · Lambda · CloudFormation · CloudWatch · Vertex AI · Cloud Run · Azure ML · Azure AD
MLOps & Workflow
MLflow · Kubeflow Pipelines · Apache Airflow · Argo Workflows · DVC · Weights & Biases · Great Expectations · Evidently AI · Prefect · BentoML
ML, DL & LLMs
PyTorch · TensorFlow · scikit-learn · XGBoost · LightGBM · ONNX Runtime · LangChain · LlamaIndex · Hugging Face · vLLM · LoRA · QLoRA · RAG
Infrastructure
Docker · Docker Compose · Kubernetes · Helm · ArgoCD · Terraform · Ansible · Nginx · Linux · YAML · HCL
CI/CD & Serving
GitHub Actions · GitLab CI/CD · Jenkins · FastAPI · Flask · TorchServe · KServe · REST APIs · gRPC · Canary Deployments
Monitoring & Data
Prometheus · Grafana · Alertmanager · OpenTelemetry · Datadog · PostgreSQL · MySQL · Redis · Kafka · Spark · dbt · Snowflake · BigQuery · Pinecone · pgvector
Experience
Professional work shaped around reproducibility, deployment, automation, and measurable system improvements.
Machine Learning Intern
- Engineered an end-to-end ML pipeline for house price prediction using Python, scikit-learn, and XGBoost; used MLflow for experiment tracking across 12+ configurations and improved RMSE by 18% through ensemble stacking and hyperparameter tuning.
- Containerized the inference service with Docker and deployed a FastAPI REST endpoint with sub-100ms response time; integrated CI/CD with GitHub Actions for automated model validation on each push.
- Automated data ingestion using web scraping and pandas pipelines to collect, clean, and validate 5,000+ property records; implemented schema checks with Great Expectations before training runs.
- Versioned datasets and model artifacts using DVC with an S3 remote, enabling reproducible training and rollback to prior versions during evaluation cycles.
- Documented model behavior, deployment steps, retraining triggers, and rollback procedures in a clear operational runbook.
Java Full Stack Trainee
- Developed backend REST APIs using Spring Boot, Hibernate/JPA, and MySQL with JWT authentication and role-based access control; integrated with a React frontend for a full-stack HR management application.
- Implemented CI/CD with Jenkins and GitHub Actions to automate build, test, and Docker-based deployment, reducing manual deployment effort and surfacing integration issues earlier.
- Established structured JSON logging, health check endpoints, and API versioning to improve observability and maintainability across development and staging environments.
- Contributed to an HR management module covering employee records, attendance tracking, and leave management, used by 50+ internal users during user acceptance testing.
AI/ML Training
- Trained and deployed supervised ML models for classification and regression using AWS SageMaker managed training jobs with spot instances, reducing training compute cost.
- Built ETL pipelines using AWS Lambda and Amazon S3 to ingest, preprocess, and standardize structured datasets from multiple sources.
- Deployed an ML inference endpoint on SageMaker with auto-scaling and CloudWatch alarms for latency and error rate monitoring.
- Implemented a model versioning and promotion workflow spanning training, evaluation, registry, and deployment stages.
- Authored CloudFormation templates for reproducible SageMaker environments and integrated automated accuracy threshold checks for promotion to staging.
Projects
Selected projects focused on retraining workflows, streaming features, model monitoring, and production-ready AI systems.
Apache Airflow · MLflow · PostgreSQL · FastAPI · Prometheus
Automated ML Retraining Pipeline with Shadow Deployment
Built an Airflow-orchestrated retraining workflow that triggers when production metrics drop, runs a champion-challenger shadow deployment, validates data quality, and promotes a challenger only when it is statistically better than the champion.
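The promotion gate in a champion-challenger setup can be illustrated with a short sketch. This is not the project's actual code: it assumes both models were scored on the same shadow traffic so that per-request errors are paired, and it swaps in a paired bootstrap as the statistical test. The function name `should_promote` and its parameters are hypothetical.

```python
import random
from statistics import mean


def should_promote(champion_errors, challenger_errors,
                   n_boot=2000, alpha=0.05, seed=0):
    """Return True when the challenger's mean error is lower than the
    champion's with one-sided (1 - alpha) bootstrap confidence.

    Assumption: the two lists hold errors for the same requests, in the
    same order, as collected during a shadow deployment.
    """
    if len(champion_errors) != len(challenger_errors) or not champion_errors:
        raise ValueError("need paired, non-empty error lists")

    # Negative difference means the challenger did better on that request.
    diffs = [c - h for c, h in zip(challenger_errors, champion_errors)]

    rng = random.Random(seed)  # fixed seed keeps the decision reproducible
    n = len(diffs)
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))

    # Upper bound of the one-sided interval for the mean difference.
    upper = boot_means[int((1 - alpha) * n_boot) - 1]
    return upper < 0


champion = [1.0 + 0.01 * i for i in range(50)]
challenger = [e - 0.2 for e in champion]
print(should_promote(champion, challenger))  # challenger is consistently better
print(should_promote(champion, champion))    # identical models are not promoted
```

Requiring the whole interval to sit below zero means a challenger that is only marginally or noisily better stays in shadow mode rather than replacing the champion.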
Apache Kafka · Faust · Redis · FastAPI · Grafana
Real-Time ML Feature Engineering Pipeline
Designed a streaming feature pipeline processing 100+ events per second, serving online features from Redis to a FastAPI inference service with sub-50ms P95 latency and full Prometheus/Grafana observability.
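As a rough illustration of the kind of online feature such a pipeline computes, the sketch below keeps a per-key sliding window in memory. It is a stand-in only: the project serves features from Redis, whereas the class `SlidingWindowFeatures`, its method names, and the 60-second window here are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class SlidingWindowFeatures:
    """Per-key count, sum, and mean over the last `window_seconds`."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> deque of (timestamp, value)

    def add(self, key, timestamp, value):
        """Record one event and drop anything that fell out of the window."""
        queue = self.events[key]
        queue.append((timestamp, value))
        self._evict(queue, timestamp)

    def _evict(self, queue, now):
        # Events are assumed ordered, so expired ones sit at the left end.
        while queue and queue[0][0] <= now - self.window:
            queue.popleft()

    def features(self, key, now):
        """Return the current window aggregates for `key` at time `now`."""
        queue = self.events[key]
        self._evict(queue, now)
        count = len(queue)
        total = sum(value for _, value in queue)
        return {"count": count, "sum": total,
                "mean": total / count if count else 0.0}


window = SlidingWindowFeatures(window_seconds=60.0)
window.add("user-1", timestamp=0.0, value=10.0)
window.add("user-1", timestamp=30.0, value=20.0)
print(window.features("user-1", now=50.0))
```

In a deployed version the aggregates would be written to the online store after each update so the inference service reads them with a single key lookup.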
FastAPI · PostgreSQL · Evidently AI · Alertmanager · Docker
ML Model Monitoring & Observability Stack
Created a production model monitoring system with prediction logging, data drift detection, alerting for latency and error thresholds, and pre-provisioned dashboards for model health and confidence tracking.
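To make the drift-detection step concrete, here is a minimal sketch of one common drift score, the Population Stability Index (PSI), written in plain Python. It is a stand-in for illustration and does not use the Evidently AI API; the function name `psi`, the 10-bin default, and the 0.25 alert threshold (a widely used rule of thumb, not a value from this project) are assumptions.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample (for example
    training data) and a current sample (for example recent predictions)."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp outliers into edge bins
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(1000)]
shifted = [v + 5 for v in reference]

DRIFT_THRESHOLD = 0.25  # rule-of-thumb alert level, assumed here
print(psi(reference, reference) > DRIFT_THRESHOLD)  # no drift against itself
print(psi(reference, shifted) > DRIFT_THRESHOLD)    # large shift trips the alert
```

A score like this can be exported as a metric per feature, which lets the same alerting rules that cover latency and error rate also cover drift.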
scikit-learn · MLflow · FastAPI · Terraform · AWS
Hospital Readmission Prediction — End-to-End MLOps
Built a full scikit-learn-to-production workflow with experiment tracking, GitHub Actions CI/CD, Docker image deployment to AWS EC2, drift-triggered retraining, and Terraform-based infrastructure provisioning.
TypeScript · React · Redis · Sentence-Transformers · MySQL
LLM Cost Optimizer — Semantic Cache & Intelligent Router
Built a semantic caching proxy and intelligent router for LLM requests that reduces redundant API calls, routes prompts by complexity, tracks token and cost metrics, and exposes a full analytics dashboard.
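The core idea of a semantic cache can be sketched in a few lines: embed each prompt, and return a stored response when a new prompt is close enough to an earlier one. The sketch below is illustrative only and is not the project's implementation. The class `SemanticCache`, the 0.95 threshold, and `toy_embed` are hypothetical; `toy_embed` is a bag-of-letters placeholder for a real sentence-embedding model, and a linear scan over a list replaces a real vector index or Redis lookup.

```python
import math
from typing import Callable, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a cached response when a prompt is similar to a stored one."""

    def __init__(self, embed: Callable[[str], List[float]],
                 threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def get(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:  # linear scan, fine for a sketch
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))


def toy_embed(text: str) -> List[float]:
    """Placeholder embedding: letter counts. A real system would call an
    embedding model here instead."""
    vector = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vector[ord(ch) - ord("a")] += 1.0
    return vector


cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("What is MLOps?", "cached answer")
print(cache.get("what is mlops"))  # near-duplicate prompt hits the cache
print(cache.get("zzzz qqqq"))      # unrelated prompt misses
```

On a miss the proxy would forward the prompt to the model, then call `put` with the result; the threshold trades saved API calls against the risk of returning an answer to a subtly different question.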
Education
Academic background in computer science with an applied AI and systems focus.
Master of Science — Computer Science & Information Systems
University of Texas at Tyler
Aug 2024 — May 2026 · GPA 3.6
Bachelor of Technology — Computer Science & Engineering (AI Specialization)
Jawaharlal Nehru Technological University, Kakinada
Jan 2021 — May 2024 · GPA 3.3
Certifications
Certifications supporting cloud architecture, AI engineering, and MLOps delivery.
Contact
Open to MLOps, AI/ML, platform, and DevOps engineering opportunities.
Feel free to reach out for full-time roles, internships, collaborations, or conversations around production ML systems and cloud-native AI platforms.