Senior LLM Deployment & Inference Optimization Engineer

VIPKID

Software Engineering, Data Science

Singapore

Posted on Jun 18, 2026

We are looking for an experienced Senior LLM Deployment & Inference Optimization Engineer to build and operate self-hosted inference infrastructure for LLMs, multimodal models, ASR, and TTS systems in the cloud. Your mission is to deliver a stable, low-latency, and cost-efficient inference platform that powers real-time conversations and voice interactions in AI-driven English learning classrooms. This is a senior, cross-functional engineering role focused on deploying, optimizing, and operating open-source inference engines and GPU infrastructure at scale, rather than developing inference kernels from scratch.

Responsibilities

  • Design, deploy, and operate self-hosted cloud inference services for LLMs, multimodal models, ASR, and TTS systems, building highly available and elastically scalable inference infrastructure.
  • Optimize and productionize open-source inference frameworks such as vLLM, SGLang, TensorRT-LLM, Triton, and TGI, focusing on throughput, latency, time-to-first-token (TTFT), continuous batching, KV cache optimization, quantization, and parallelization strategies.
  • Achieve the optimal balance between user experience and infrastructure cost.
  • Manage and optimize GPU resources and infrastructure costs, including instance selection, GPU utilization improvements, scheduling and workload co-location, spot and reserved instance strategies, and cost-per-inference optimization.
  • Build reliability, observability, and performance management systems for inference services, including monitoring and alerting, load testing, capacity planning, rate limiting, graceful degradation and disaster recovery, and GPU memory management and OOM mitigation.
  • Ensure high SLA performance for real-time production workloads.
  • Improve model-serving engineering capabilities, including multi-model routing, load balancing, auto-scaling, canary deployments, and rollback mechanisms.
  • Support rapid and reliable model iteration.
  • Collaborate closely with AI researchers, backend engineers, and application teams to establish an end-to-end path from model development to production deployment.

Requirements

  • Bachelor's degree or above in Computer Science or a related field.
  • 5+ years of experience in backend engineering, infrastructure engineering, MLOps, or related domains.
  • Proven production experience with self-hosted model inference systems: you have independently deployed, or led the deployment of, LLM, multimodal, or speech models in production environments, and you have been responsible for real-world reliability, scalability, and cost management, not just proof-of-concept or demo deployments.
  • Strong hands-on experience with one or more of vLLM, SGLang, TensorRT-LLM, Triton Inference Server, and Hugging Face TGI, including the ability to understand their internals and perform advanced service optimization.
  • Deep understanding of inference optimization techniques, including Transformer inference mechanisms, KV cache, continuous/dynamic batching, quantization (INT8, FP8, AWQ, GPTQ, etc.), tensor parallelism (TP), pipeline parallelism (PP), and PagedAttention, with proven experience tuning and deploying these techniques in production.
  • Strong knowledge of cloud-native infrastructure and GPU environments: Docker and Kubernetes; AWS, GCP, Alibaba Cloud, or similar platforms; GPU resource scheduling and utilization optimization; and infrastructure cost optimization.
  • Solid systems engineering and reliability background: distributed systems, high-concurrency services, high-availability architectures, monitoring and observability, load testing, capacity planning, and production troubleshooting.
  • Strong data-driven mindset toward SLAs and infrastructure efficiency.

Preferred Qualifications

  • Experience optimizing real-time or streaming inference systems, including streaming generation and low-TTFT workloads.
  • Experience deploying and accelerating ASR systems, TTS systems, speech models, and multimodal models.
  • Experience building or operating large-scale GPU clusters, inference scheduling platforms, and model serving platforms.
  • Familiarity with CUDA programming, GPU kernel optimization, and model compilation technologies such as TensorRT, TVM, and torch.compile.
  • Understanding of model fine-tuning, distillation, and compression techniques, with awareness of the interplay between training and inference.
  • Demonstrated success in significantly reducing LLM inference costs and in building inference infrastructure from 0 to 1.