Infrastructure & DevOps Engineer | Italy (Remote/Hybrid)
A leading European innovator in high-performance computing and AI infrastructure is seeking an Infrastructure & DevOps Engineer to architect the foundation of its compute capabilities.
The company operates at the cutting edge of multi-GPU cluster management, bridging the gap between sophisticated hardware provisioning and seamless cloud-native orchestration.
The Role
The focus of this position is the end-to-end reliability of a heterogeneous compute environment.
The engineer will be responsible for making large-scale deployments reproducible and ensuring that developers have frictionless access to high-power resources.
Key Responsibilities:
* Provision and maintain high-performance CPU/GPU clusters across multiple physical locations.
* Implement dynamic compute and storage scaling to meet fluctuating workload demands.
* Design hardware and software-level storage solutions, including distributed filesystems and storage tiering.
* Manage container orchestration through Kubernetes and Docker for both production and R&D workloads.
* Develop infrastructure as code (IaC) utilising Terraform and Ansible.
* Optimise job scheduling and resource allocation via Slurm and Kubernetes.
* Establish robust observability using Prometheus, Grafana, and IPMI.
* Conduct system-level performance profiling, focusing on GPU utilisation and I/O throughput.
* Oversee secure networking, VPN management, and disaster recovery protocols.
Technical Profile
The ideal candidate brings a deep understanding of Linux system administration and the unique challenges of managing bare-metal and virtualised hardware.
Essential Experience:
* Advanced Linux administration and networking principles.
* Proven expertise with Docker and Kubernetes orchestration.
* Hands-on experience with IaC tools (Terraform or Ansible).
* Background in HPC environments and job scheduling via Slurm.
* Experience in hardware infrastructure management (IPMI, BMC) and server maintenance.
* Ability to design storage systems such as NFS, Ceph, or other distributed filesystems.
Preferred Skills:
* Familiarity with bare-metal provisioning tools like MaaS.
* Experience navigating cloud service environments (AWS or similar).
Why Join?
The company offers the opportunity to work on highly complex, large-scale infrastructure projects that directly power the next generation of AI development.
This is a chance to move beyond standard cloud DevOps and dive into the intricacies of hardware-level performance and global compute distribution.
If this role is of interest, please apply directly on LinkedIn or send a copy of your CV to .
By applying to this role you understand that we may collect your personal data and store and process it on our systems.
For more information please see our Privacy Notice ( ).