Senior LLM Deployment & Inference Optimization Engineer

VIPKID

Software Engineering, Data Science

Singapore

Posted on Jun 18, 2026

We are looking for an experienced Senior LLM Deployment & Inference Optimization Engineer to build and operate self-hosted inference infrastructure for LLMs, multimodal models, ASR, and TTS systems in the cloud. Your mission is to deliver a stable, low-latency, and cost-efficient inference platform that powers real-time conversations and voice interactions in AI-driven English learning classrooms. This is a senior, cross-functional engineering role focused on deploying, optimizing, and operating open-source inference engines and GPU infrastructure at scale, rather than developing inference kernels from scratch.

Responsibilities

Design, deploy, and operate self-hosted cloud inference services for LLMs, multimodal models, ASR, and TTS systems, building highly available and elastically scalable inference infrastructure.
Optimize and productionize open-source inference frameworks such as vLLM, SGLang, TensorRT-LLM, Triton, and TGI, focusing on throughput, latency, time to first token (TTFT), continuous batching, KV cache optimization, quantization, and parallelization strategies.
Achieve the optimal balance between user experience and infrastructure cost.
Manage and optimize GPU resources and infrastructure costs, including instance selection, GPU utilization improvements, scheduling and workload co-location, spot and reserved instance strategies, and cost-per-inference optimization.
Build reliability, observability, and performance management systems for inference services, including monitoring and alerting, load testing, capacity planning, rate limiting, graceful degradation and disaster recovery, and GPU memory management and OOM mitigation.
Meet SLA targets for real-time production workloads.
Improve model-serving engineering capabilities, including multi-model routing, load balancing, auto-scaling, canary deployments, and rollback mechanisms.
Support rapid and reliable model iteration.
Collaborate closely with AI researchers, backend engineers, and application teams to establish an end-to-end path from model development to production deployment.

Requirements

Bachelor's degree or above in Computer Science or a related field.
5+ years of experience in backend engineering, infrastructure engineering, MLOps, or related domains.
Proven production experience with self-hosted model inference systems: you have independently deployed, or led the deployment of, LLM, multimodal, or speech models in production environments, and have been responsible for real-world reliability, scalability, and cost management, not just proof-of-concept or demo deployments.
Strong hands-on experience with one or more of vLLM, SGLang, TensorRT-LLM, Triton Inference Server, and Hugging Face TGI, including the ability to understand their internals and perform advanced service optimization.
Deep understanding of inference optimization techniques, including Transformer inference mechanisms, KV cache, continuous/dynamic batching, quantization (INT8, FP8, AWQ, GPTQ, etc.), tensor parallelism (TP), pipeline parallelism (PP), and PagedAttention, with proven experience tuning and deploying these techniques in production.
Strong knowledge of cloud-native infrastructure and GPU environments, including Docker, Kubernetes, and AWS, GCP, Alibaba Cloud, or similar platforms; GPU resource scheduling and utilization optimization; and infrastructure cost optimization.
Solid systems engineering and reliability background, including distributed systems, high-concurrency services, high-availability architectures, monitoring and observability, load testing, capacity planning, and production troubleshooting.
Strong data-driven mindset toward SLAs and infrastructure efficiency.

Preferred Qualifications

Experience optimizing real-time or streaming inference systems, including streaming generation and low TTFT workloads.
Experience deploying and accelerating ASR systems, TTS systems, speech models, and multimodal models.
Experience building or operating large-scale GPU clusters, inference scheduling platforms, and model serving platforms.
Familiarity with CUDA programming, GPU kernel optimization, and model compilation technologies such as TensorRT, TVM, and torch.compile.
Understanding of model fine-tuning, distillation, and compression techniques, with awareness of the interplay between training and inference.
Demonstrated success in significantly reducing LLM inference costs and in building inference infrastructure from 0 to 1.
