HPC Engineer

Penta Consulting · Riyadh, Saudi Arabia

Penta Consulting are a technology service provider and leading outsourced partner helping to deliver professional and managed solutions across EMEA.

We are seeking an experienced Senior Infrastructure HPC Engineer who has personally designed, deployed, configured, and operated every component of a large-scale high-performance computing environment.

Key Responsibilities

• Design, deploy, and maintain HPC clusters end-to-end: compute nodes, storage tiers, high-speed networking (InfiniBand / RoCE), and management fabric.
• Personally provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster imaging, OS lifecycle, and GPU fleet health monitoring.
• Deploy and manage the full NVIDIA AI Enterprise Suite: install, license, update, and integrate with MLOps pipelines (NeMo, Triton, RAPIDS).
• Deploy and operate NVIDIA GPU Operator and Network Operator on Kubernetes to automate driver and CUDA lifecycle, DCGM exporter, and MIG configuration.
• Configure and serve NVIDIA NIM inference endpoints; implement NVIDIA Blueprint reference architectures for production AI workloads.
• Install, administer, and tune Slurm: partitions, QOS, fair-share policies, node accounting, MPI integration, and Slurm-on-Kubernetes hybrid scheduling.
• Bootstrap and operate Kubernetes clusters using kubeadm, including control plane HA, etcd backup, and zero-downtime upgrades.
• Administer RHEL / Canonical Ubuntu across all cluster nodes.
• Build and maintain CI/CD pipelines (GitLab CI / GitHub Actions) for infrastructure provisioning and HPC software delivery.
• Profile and tune GPU and CPU workload performance; resolve bottlenecks across hardware, drivers, MPI fabric, and application layers.
• Implement cluster monitoring with Prometheus, Grafana, and DCGM; define alerting and capacity planning thresholds.
• Enforce security best practices: node hardening, kernel patching, RBAC, and compliance audits across the HPC environment.