The Italian Institute of Artificial Intelligence (AI4I) is seeking a hands-on HPC / AI Specialist to support and optimize the compute infrastructure powering the AI Foundry and Deployment activities.

At AI4I, you will work on a state-of-the-art AI computing environment built with best-in-class technologies acquired over the past year, including next-generation GPU systems such as NVIDIA B200 accelerators and high-performance distributed storage solutions such as VAST. This infrastructure is designed to support AI training, fine-tuning, and inference workloads for both research and industrial deployment.

Leonardo hosts the physical hardware infrastructure and delivers agreed infrastructure services in partnership with AI4I. In this role, you will focus on optimizing performance and providing direct technical support to internal users and clients running AI workloads, while contributing to the continuous evolution and improvement of the system design. You will act as a key interface between the machine infrastructure and the teams executing AI workflows, ensuring efficient, stable, and predictable operations.

Location: AI4I, OGR – Turin, Italy
Hybrid work: Flexible arrangements may be negotiated

About the Role

As HPC / AI Specialist at AI4I, you will operate at the intersection of infrastructure operations and applied AI execution. You will ensure that engineers, researchers, and deployment teams can efficiently run training, fine-tuning, inference, and data-intensive pipelines on shared compute resources.

This is a cross-unit role shared 50% with the new Deployment Unit, working closely with both infrastructure and client-facing teams.

You will work closely with:
- AI engineers and ML / GenAI teams running training, fine-tuning, and inference workloads
- Cloud / DevOps engineers operating the private cloud
- The Deployment Unit supporting industrial AI clients
- Hardware vendors and infrastructure partners
This role is strongly operational and performance-oriented, with a focus on workload efficiency, system tuning, AI workload optimization, and user support.

Key Responsibilities
- Operate and maintain Linux-based HPC clusters supporting AI training, fine-tuning, and inference workloads
- Manage GPU and CPU compute environments, including workload scheduling, resource isolation, and performance tuning
- Support distributed and software-defined storage systems used for large-scale datasets
- Act as the primary technical interface between infrastructure operations and internal or external users running AI workflows
- Provide hands-on technical support for AI workload optimization, including distributed training and parameter-efficient fine-tuning of foundation models on HPC infrastructure
- Support foundation model fine-tuning workflows (including parameter-efficient approaches), covering configuration of data pipelines, checkpoints, runtime settings, and GPU memory optimization in HPC environments
- Optimize resource utilization and workload performance across multi-tenant environments
- Support containerised workloads running on shared compute infrastructure
- Monitor system health, performance, and capacity; troubleshoot user-facing production issues
- Contribute to the continuous improvement and evolution of system architecture in collaboration with infrastructure teams
- Support internal users and clients with debugging, environment setup, and best practices for scalable AI execution
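To give a flavour of the GPU memory reasoning these responsibilities involve, here is a back-of-envelope sketch of why full fine-tuning of a large model exceeds a single accelerator's memory. The byte counts assume FP32 weights and gradients with an Adam-style optimizer and deliberately ignore activations and framework overhead; real figures depend on precision, optimizer, and sharding strategy.

```python
def training_memory_gib(n_params: float,
                        bytes_weights: int = 4,     # FP32 weights
                        bytes_grads: int = 4,       # FP32 gradients
                        bytes_optimizer: int = 8):  # Adam: two FP32 moments
    """Rough lower bound on GPU memory (GiB) for full fine-tuning,
    excluding activations, buffers, and framework overhead."""
    total_bytes = n_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / 1024**3

# A 7B-parameter model needs on the order of 100+ GiB for training
# states alone, which is why sharded training (e.g. ZeRO/FSDP-style)
# or mixed precision is standard practice even on large accelerators.
print(f"{training_memory_gib(7e9):.0f} GiB")  # -> 104 GiB
```

Estimates like this are a starting point for the batch sizing and checkpointing decisions mentioned above, not a substitute for profiling on the actual hardware.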
Required Qualifications
- Solid background in CPU and GPU architectures and performance characteristics
- Experience operating HPC clusters or large-scale compute environments
- Hands-on experience with distributed and software-defined storage systems (e.g., VAST or equivalent)
- Experience with workload managers and job schedulers (e.g., Slurm or equivalent)
- Experience troubleshooting performance bottlenecks in compute or storage environments
- Practical understanding of AI training and fine-tuning workloads, including GPU memory management, batch sizing strategies, distributed execution constraints, checkpointing, and data pipeline performance in HPC or large-scale compute environments
- Scripting and automation skills (Bash, Python, or equivalent)
- Experience supporting shared infrastructure with uptime and operational responsibility
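As a minimal illustration of the scripting-and-troubleshooting side of these qualifications, the sketch below flags GPUs whose average utilisation is suspiciously low, a common symptom of an input-pipeline bottleneck. The sample data and threshold are made up; in practice the numbers would come from a tool such as nvidia-smi or a monitoring stack.

```python
# Hypothetical per-GPU utilisation samples (percent), e.g. collected
# periodically from a GPU monitoring tool on a compute node.
samples = {
    "gpu0": [96, 98, 94, 97],
    "gpu1": [12, 8, 15, 10],   # likely starved by the data pipeline
    "gpu2": [88, 91, 85, 90],
}

THRESHOLD = 50  # flag GPUs averaging below 50% utilisation

def underutilised(samples, threshold=THRESHOLD):
    """Return the ids of GPUs whose mean utilisation is below threshold."""
    return [gpu for gpu, vals in samples.items()
            if sum(vals) / len(vals) < threshold]

print(underutilised(samples))  # -> ['gpu1']
```

A low-utilisation GPU with a busy job attached usually points at storage or preprocessing, not the accelerator itself, which is exactly the kind of cross-layer diagnosis this role calls for.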
Additional Strengths
- Experience supporting AI / ML training workloads in production environments
- Experience with parameter-efficient fine-tuning workflows and runtime optimisation in shared HPC environments
- Familiarity with foundation model adaptation workflows and large-scale training constraints
- Familiarity with containerised execution environments
- Experience operating multi-tenant compute environments
- Experience with monitoring and observability systems
- Networking fundamentals for high-throughput environments
- Experience collaborating with engineering or deployment teams in production settings
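To illustrate why parameter-efficient fine-tuning matters on shared GPU infrastructure: LoRA-style adapters replace the update to a d_out × d_in weight matrix with two low-rank factors, shrinking the trainable-parameter count by orders of magnitude. The hidden size and rank below are illustrative, not tied to any specific model at AI4I.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter on one weight matrix:
    a (d_out x rank) down-projection and a (rank x d_in) up-projection."""
    return d_out * rank + rank * d_in

d = 4096          # illustrative hidden size
full = d * d      # full fine-tuning: every weight entry is trainable
lora = lora_trainable_params(d, d, rank=16)

print(full, lora, f"{lora / full:.2%}")  # -> 16777216 131072 0.78%
```

Fewer trainable parameters means smaller optimizer state and gradients, so more fine-tuning jobs fit per node, which is the multi-tenant efficiency concern these strengths speak to.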
Key Performance Metrics
- Cluster availability and operational stability
- GPU and CPU utilisation efficiency
- Workload performance and scheduling effectiveness
- Time required to debug and resolve user issues
- Time required to onboard new workloads and users
What We Offer
- A collaborative environment with engineers and researchers working on real industrial AI deployments
- Direct impact: your infrastructure will run daily AI workloads and production systems
- An office at the epicentre of tech: the OGR Torino technology hub
- Competitive compensation and access to advanced computing infrastructure
How to Apply

Submit your application exclusively through the online form, including:
- Cover letter (max. 1 page) describing how your profile fits this specific position
- CV and optional links to technical projects or operational experience
About AI4I

The Italian Institute for Artificial Intelligence (AI4I) was founded as a research institute to perform transformative, application-oriented research in Artificial Intelligence, driving innovation and industrial progress. The Institute is designed to engage and empower gifted, entrepreneurial, and ambitious researchers who are committed to generating real-world impact at the intersection of science, technology, and industrial transformation.

Competitive salaries, performance-based incentives, access to dedicated high-performance computing resources, state-of-the-art laboratories, and strong industrial collaborations are among the distinctive features that define AI4I. The Institute fosters a dynamic international environment and an ecosystem that supports the creation and growth of innovative startups.

AI4I's mission is to advance scientific research, technology transfer, and, more broadly, Italy's innovation capacity, promoting positive impact across industry, services, and public administration. To achieve this, the Institute contributes to building a research and innovation infrastructure that leverages AI methods, with a special focus on manufacturing processes and the broader Industry 4.0 value chain. AI4I also maintains strategic relationships with leading organisations in Italy and abroad, including Competence Centres and European Digital Innovation Hubs (EDIHs), positioning itself as an attractive destination for researchers, companies, and startups seeking collaboration and impact.

The Italian Institute of Artificial Intelligence (AI4I)
Corso Castelfidardo 22, 10129 Torino
Codice fiscale: 97904430010
Partita IVA: 13130030011