Senior Infrastructure Engineer (HPC)

CONNECT Professional Services · Riyadh, Riyadh, Saudi Arabia

Job Summary:
This role is responsible for deploying, configuring, and managing large-scale High-Performance Computing (HPC) environments. It requires practical expertise across Linux administration (RHEL and Ubuntu), NVIDIA GPU infrastructure, Slurm workload scheduling, Kubernetes, CI/CD automation, and the NVIDIA Enterprise software ecosystem.

Key Responsibilities:
Design, implement, and maintain end-to-end HPC clusters, including compute nodes, storage layers, high-speed networking (InfiniBand/RoCE), and management infrastructure.
Provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster deployment, operating system lifecycle management, and GPU fleet monitoring.
Deploy, maintain, and integrate the NVIDIA AI Enterprise Suite with MLOps frameworks, including NeMo, Triton, and RAPIDS.
Manage NVIDIA GPU Operator and Network Operator within Kubernetes environments to automate GPU driver and CUDA lifecycle management, DCGM exporter, and MIG configuration.
Configure and support NVIDIA NIM inference services and implement NVIDIA Blueprint reference architectures for production AI workloads.
Install, administer, and optimize Slurm environments, including partitions, QoS policies, fair-share scheduling, node accounting, MPI integration, and hybrid Slurm-on-Kubernetes scheduling.
Build and manage Kubernetes clusters using kubeadm, including high-availability control planes, etcd backup strategies, and zero-downtime upgrades.
Administer and maintain Red Hat Enterprise Linux (RHEL) and Canonical Ubuntu systems across all cluster nodes.
Develop and maintain CI/CD pipelines using GitLab CI and GitHub Actions to automate infrastructure provisioning and software delivery.
Analyze and optimize GPU and CPU performance, troubleshooting bottlenecks across hardware, drivers, MPI fabric, and application layers.
Implement monitoring and observability solutions using Prometheus, Grafana, and DCGM, and establish alerting and capacity-planning mechanisms.
Ensure adherence to security best practices through system hardening, kernel patching, RBAC implementation, and compliance monitoring across the HPC environment.

Requirements:
Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related field.
Minimum of 10 years of hands-on experience in High-Performance Computing (HPC) and infrastructure engineering.
Active Red Hat Certified Engineer (RHCE) certification.
Active Certified Kubernetes Administrator (CKA) certification.
Proven experience designing, deploying, and managing large-scale HPC environments.
Strong hands-on expertise with NVIDIA Base Command Manager (BCM) and the NVIDIA AI Enterprise ecosystem.
Experience with NVIDIA GPU Operator, Network Operator, NVIDIA NIMs, and NVIDIA Blueprints.
Extensive experience administering Slurm and managing workload scheduling in HPC environments.
Strong knowledge of Kubernetes cluster deployment and administration, including high availability and lifecycle management.
Solid experience with Red Hat Enterprise Linux (RHEL) and Canonical Ubuntu LTS administration.
Proficiency in CUDA, GPU drivers, and GPU infrastructure management.
Experience building and maintaining CI/CD pipelines using GitLab CI and/or GitHub Actions.
Familiarity with high-speed networking technologies, including InfiniBand and RoCE.
Experience with monitoring and observability tools such as Prometheus, Grafana, and NVIDIA DCGM.
Strong understanding of infrastructure security, system hardening, RBAC, and compliance best practices.
Excellent troubleshooting, performance optimization, and problem-solving skills.
Strong communication and collaboration skills with the ability to work effectively in cross-functional teams.