Infrastructure Engineer (L. 68/99)
Join our dynamic team as an Infrastructure Engineer and be at the forefront of innovation, working on cutting-edge technologies and collaborating on high-impact projects with major research institutes and universities in Italy.
High-Performance Computing (HPC) uses powerful parallel computers equipped with GPUs to solve complex problems quickly. It is essential for tasks requiring massive computations, such as weather forecasting, drug design, engineering simulations, and risk management, as well as for training complex Artificial Intelligence (AI) models.
In this exciting role, you'll have the opportunity to grow your expertise in cloud technologies, HPC systems, and datacenter operations. You will learn how to deploy computing systems using some of the largest open-source projects, such as OpenStack and Kubernetes, and work with an unmatched variety of hardware and computing solutions. Be part of a team that's pushing the boundaries of what's possible and making a real difference in the world of research and technology.
This job opportunity is reserved for candidates belonging to protected categories in accordance with Law 68/99.
KEY RESPONSIBILITIES:
* Work in the HPC Infrastructure Team in close collaboration with the HPC Applications Team and the Operational Data Analytics Team.
* Design and implement reliable and scalable infrastructure solutions based on OpenStack.
* Automate the lifecycle management of systems using Infrastructure as Code tools (Terraform/OpenTofu, Ansible).
* Develop custom deployment and management tools using scripting languages like Bash or Python.
* Manage telemetry and alerting tools such as Prometheus, Alertmanager, Grafana, and OpenSearch.
* Manage both TCP-based and low-latency (InfiniBand) networking appliances.
* Configure computational resources such as accelerators and shared GPUs in bare-metal or virtualized environments.
* Draft technical documentation and best practices for the use and management of hybrid HPC/cloud infrastructure.
* Provide technical support and troubleshooting for users and customers at already deployed sites.
* Serve as a point of escalation for colleagues and teams with less experience.
* Interact with providers to escalate issues and seek L3 support when necessary.
EDUCATION
- Master's degree in Computer Science, Information Technology, or a related STEM field, or a Bachelor's degree plus at least 2 years of experience in a similar role.
SKILLS AND COMPETENCES
CORE TECHNICAL SKILLS
Linux System Administration:
* Understanding of Linux OS fundamentals (kernel, Filesystem Hierarchy Standard, system utilities and other fundamental aspects).
* Experience with package management and update mechanisms for major distributions (Red Hat/CentOS, Ubuntu/Debian, SUSE).
* Familiarity with POSIX filesystems (ext4, XFS) and basic filesystem operations.
Networking:
* Knowledge of TCP/IP protocol suite, OSI model, and common networking hardware (switches, routers).
* Experience with Linux networking utilities (ip, ss, netstat, tcpdump, curl, wget).
* Understanding of VLANs, subnetting, and DNS.
Containerization & Orchestration:
- Experience with container runtimes (Docker, Podman, or Singularity).
- Basic understanding of Kubernetes (pods, deployments, services, ingress) and container orchestration concepts.
Infrastructure as Code & Automation:
* Proficiency in shell scripting (Bash, Zsh) for automation tasks.
* Basic knowledge of Python for scripting and automation.
* Familiarity with infrastructure-as-code tools like Terraform or Ansible.
* Understanding of configuration management principles.
Version Control:
- Experience with Git (cloning, committing, branching, merging, rebasing).
- Awareness of GitOps principles and workflows.
Problem-Solving & Collaboration:
* Strong analytical and troubleshooting skills.
* Ability to document processes and write clear technical documentation.
* Comfortable working both independently and in cross-functional teams.
Communication:
- Clear written and verbal communication skills for technical and non-technical audiences.
NICE TO HAVE SKILLS
Cloud Platforms:
- Basic understanding of OpenStack architecture (e.g., Nova, Cinder, Keystone, Neutron, Glance).
- Exposure to deployment tools like Kolla Ansible or Helm.
Programming & Build Systems:
- Knowledge of compiled languages (C, C++, Rust) and the build process (compilers, linkers, Makefiles).
- Familiarity with debugging tools (gdb, strace, valgrind) and performance profiling.
High Performance Computing (HPC):
- Awareness of HPC concepts, workload managers (e.g., Slurm, PBS), and parallel computing.
- Understanding of HPC storage solutions (Lustre, GPFS, Ceph) is a plus.
Monitoring & Observability:
- Basic knowledge of logging (rsyslog, journalctl) and metrics (Prometheus, Grafana) tools.
Security Basics:
- Awareness of Linux security best practices (SELinux, AppArmor, firewalls, user/permission management).
WORKING ENVIRONMENT
- Remote work: Enjoy the flexibility of working from anywhere in Italy, with no hybrid model or office constraints. In-person presence is required only about once a month, with expenses covered by the company.
- Objective-based work: Work towards clear, achievable objectives that are reviewed and updated throughout the year, giving you a sense of purpose and direction.