HPC Infrastructure Configuration Repository

Automated, version-controlled management of Jülich Supercomputing Centre's high-performance computing systems using Ansible.

Forschungszentrum Jülich · JSC · Division HPC, Cloud and Data Systems and Services

What Is This Repository?

This is an Ansible-based Infrastructure-as-Code (IaC) repository that defines and enforces the configuration of the supercomputers, cloud platforms, and storage systems operated by the High-Performance Computing, Cloud and Data Systems and Services (HPCCDSS) division of the Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich GmbH.

The HPCCDSS division integrates multiple computer architectures into a single production environment that supports scientific users in solving a wide range of computational problems. It comprises three teams: HPC and Cloud Systems, Exascale Supercomputer, and Storage Systems.

Rather than configuring hundreds of servers manually, the repository codifies every aspect of system setup—user accounts, network interfaces, job schedulers, storage mounts, monitoring stacks, security policies—into repeatable, auditable automation scripts called playbooks and roles. Changes are tracked in Git, reviewed via merge requests, and applied consistently across thousands of nodes.
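As a hedged illustration of what such a playbook looks like (the role and group names below are hypothetical, not taken from this repository), a minimal playbook applies a set of roles to a group of hosts:

```yaml
# Illustrative sketch only: role and group names are hypothetical.
- name: Configure login nodes
  hosts: login_nodes        # an inventory group
  become: true              # escalate privileges for system changes
  roles:
    - ssh_hardening         # hypothetical role
    - monitoring_agent      # hypothetical role
```

Running this playbook converges every host in `login_nodes` to the state the roles declare, whether that is four hosts or four hundred.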

[Image: supercomputer hall at the Jülich Supercomputing Centre]

Division Structure

The HPCCDSS division sits within JSC and is organised into three teams, each reflected in this repository's scope:

Forschungszentrum Jülich (FZJ)
└─ Jülich Supercomputing Centre (JSC)
   └─ HPCCDSS Division
      ├─ HPC and Cloud Systems
      ├─ Exascale Supercomputer
      └─ Storage Systems

At a Glance

6+ HPC Systems · 6000+ Managed Nodes · 150+ Ansible Roles · 36 Playbooks · 40+ Inventory Groups

High-Level Architecture

The diagram below shows how the Ansible control node applies configuration to the various HPC clusters and their subsystems.

Ansible Control Node (Git repo · Playbooks · Roles · Vault) applies configuration to the clusters:

JUWELS (Booster & Cluster) · JURECA-DC (768+ compute nodes) · JUSUF (GPU-accelerated) · JUPITER (Exascale system) · JUZEA & more (test & supporting)

Managed subsystems across all clusters:

Job Scheduling (SLURM · UNICORE) · Storage Systems (GPFS · Ceph · NFS) · Networking (InfiniBand · DNS · Firewalls) · Monitoring (Prometheus · Grafana) · Security (SSH · LDAP · Certificates) · Containers (Apptainer · Podman · K8s) · Accounts (Users · SSH keys · sudo) · High Availability (Pacemaker · HAProxy) · Accelerators (NVIDIA · AMD · OFED)

Managed Supercomputer Systems

The repository contains playbooks and inventory definitions for each of the following JSC production systems:

| System | Full Name | Description | Playbook |
| --- | --- | --- | --- |
| JUWELS | Jülich Wizard for European Leadership Science | Flagship hybrid CPU/GPU cluster with a booster module; used for large-scale parallel simulations. | juwels.yml |
| JURECA-DC | Jülich Research on Exascale Cluster Architectures | Data-centric system with 768+ compute nodes, including AI accelerator prototypes (MI200, H100). | jurecadc.yml |
| JUSUF | Jülich Support for Fenix | GPU-accelerated system for interactive and batch workloads; supports the Fenix research infrastructure. | jusuf.yml |
| JUPITER | Joint Undertaking Pioneer for Innovative and Transformative Exascale Research | Europe's first exascale-class supercomputer; EuroHPC JU flagship. | jupiter.yml |
| JUZEA | Jülich Zone of Energy Abstraction | Smaller test and development system with container-based workloads. | juzea.yml |
| JUST | Jülich Storage Cluster | Central GPFS-based storage providing home, project, scratch, and data filesystems for all HPC systems. | servers.yml |
| JSC Cloud | JSC Cloud Infrastructure | Cloud platform providing virtual machines and services alongside the HPC systems. | servers.yml |
| JUDAC | Jülich Data Access Server | Data access and transfer node for moving data between systems and external partners. | servers.yml |
| Supporting | HPSMC, DEEP, Gateways, … | Management clusters, SSH gateways, LDAP servers, CI runners, and monitoring infrastructure. | servers.yml |

Why Infrastructure as Code?

Auditability

Every change is committed to Git with an author, timestamp, and review trail. Auditors and compliance officers can trace who changed what, when, and why.

Consistency

Identical configurations are applied to hundreds of nodes simultaneously, eliminating "snowflake" servers and reducing human error.

Reproducibility

A new system or disaster recovery scenario can be bootstrapped from scratch by running the relevant playbook—no manual steps required.

Security

Secrets are encrypted with Ansible Vault. SSH keys, certificates, and access policies are managed centrally and rotated systematically.
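As an illustration of the Vault workflow (the file path below is hypothetical), secrets are encrypted and inspected with the standard `ansible-vault` CLI:

```shell
# Encrypt a secrets file in place (prompts for the vault password)
ansible-vault encrypt vault/slurm_db_password.yml   # hypothetical path

# Inspect or edit the file without leaving plaintext on disk
ansible-vault view vault/slurm_db_password.yml
ansible-vault edit vault/slurm_db_password.yml
```

Encrypted files can be committed to Git safely; the plaintext only exists in memory at deploy time.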

Core Concepts Illustrated

Declarative Control — define the desired state; Ansible converges each host to match.

System Inventory — every cluster and node is catalogued in structured, versioned inventory files.

Operational Security — secrets stay in Vault; deployments go through gated checks.

Key Capabilities

Job Scheduling & Resource Management

Configures SLURM controllers, daemons, partitions, accounting, and job submit filters across all clusters. Integrates with UNICORE for federated job submission.
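A hedged sketch of how such a role might render the SLURM controller configuration from a Jinja2 template and trigger a restart (the template, path, and handler names here are hypothetical, not this repository's actual role contents):

```yaml
# Hypothetical task from a SLURM role; file and handler names are illustrative.
- name: Deploy SLURM controller configuration
  ansible.builtin.template:
    src: slurm.conf.j2          # Jinja2 template rendered per cluster
    dest: /etc/slurm/slurm.conf
    owner: slurm
    group: slurm
    mode: "0644"
  notify: restart slurmctld     # hypothetical handler
```

Because the template is rendered from per-cluster variables, the same role produces a different, correct slurm.conf on each system.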

Parallel Filesystems

Manages IBM Spectrum Scale (GPFS) clusters providing home, project, scratch, and data filesystems. Also handles Ceph distributed storage and NFS exports.

Network Fabric

Provisions InfiniBand (Mellanox OFED, OpenSM subnet managers) and Ethernet interfaces. Configures DNS, DHCP, firewalls, and SSH gateways.

Monitoring & Observability

Deploys a full Prometheus + Grafana stack with alerting (Alertmanager) and centralized log aggregation (Loki + Promtail).

Containers & Virtualization

Supports HPC container runtimes (Apptainer), system containers (Podman), and Kubernetes clusters for service orchestration.
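For example, a user workload might run inside an Apptainer container on a GPU node (the image and command names below are hypothetical):

```shell
# --nv exposes the host's NVIDIA GPUs and driver libraries inside the container
apptainer exec --nv simulation.sif ./run_simulation
```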

Maintenance Operations

Dedicated maint.yml playbook orchestrates planned downtime: SLURM reservations, SSH banners, GPFS graceful shutdown, and status-page integration.
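One of these steps can be sketched as an Ansible task that creates a SLURM maintenance reservation via `scontrol` (reservation name, times, and task layout are hypothetical; the actual maint.yml may differ):

```yaml
# Hedged sketch of a maintenance task; all values are illustrative.
- name: Create SLURM maintenance reservation
  ansible.builtin.command: >
    scontrol create reservation reservationname=maint_window
    starttime=2024-01-15T06:00:00 duration=480
    nodes=ALL flags=maint,ignore_jobs users=root
  run_once: true    # the reservation only needs to be created once per cluster
```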

Repository Structure

The repository follows Ansible best practices with a clear separation of concerns:

| Directory | Purpose |
| --- | --- |
| `roles/` | 150+ reusable Ansible roles (the building blocks of configuration: one role per service or subsystem). |
| `group_vars/` | Variables scoped per system or host group (e.g., `group_vars/juwels/`), allowing per-cluster customization. |
| `host_vars/` | Variables specific to individual nodes, for host-level tuning. |
| `files/` | Static configuration files deployed as-is: SSH keys, certificates, pre-built configs. |
| `templates/` | Jinja2 templates rendered at deploy time (e.g., SLURM configs, kickstart files). |
| `vault/` | Encrypted secrets (passwords, API tokens) managed with Ansible Vault and GPG. |
| `juwels/`, `jureca/`, … | Per-system inventory directories defining which hosts belong to which groups. |
| `*.yml` (root) | Top-level playbooks, one per system, that orchestrate which roles run on which hosts. |
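To make the group_vars mechanism concrete, a per-cluster variables file might look like this (the file path, variable names, and hostname are hypothetical, shown only to illustrate the layering):

```yaml
# Hypothetical group_vars/juwels/slurm.yml; all names are illustrative.
slurm_cluster_name: juwels
slurm_controller_host: juwels-ctl01   # hypothetical hostname
slurm_partitions:
  - name: batch
    default: true
    max_time: "24:00:00"
```

Roles read these variables at deploy time, so the same role code yields cluster-specific configuration.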

How Changes Are Applied

1. Edit code: modify a role, variable, or playbook.
2. Merge request: peer review on GitLab.
3. CI lint & test: automated validation.
4. Merge to main: approved and merged.
5. Deploy: ansible-playbook run.

Every step is tracked, reversible, and auditable.
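A typical deployment invocation might look like the following (the group name in the second command is hypothetical; `--check --diff` previews changes without applying them):

```shell
# Dry run: show what would change on the JUWELS cluster, without changing it
ansible-playbook -i juwels/ juwels.yml --check --diff

# Apply for real, restricted to a subset of hosts (group name hypothetical)
ansible-playbook -i juwels/ juwels.yml --limit login_nodes
```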

Glossary

| Term | Definition |
| --- | --- |
| Ansible | An open-source IT automation engine that manages configuration, deployment, and orchestration over SSH; no agent software is required on managed nodes. |
| Playbook | A YAML file that declares what should be configured on which hosts. Think of it as a recipe that Ansible follows step by step. |
| Role | A self-contained, reusable unit of automation (e.g., "install and configure SLURM"). Roles are the building blocks assembled by playbooks. |
| Inventory | A listing of all servers (nodes) organised into groups. Each HPC system has its own inventory. |
| Infrastructure as Code | The practice of managing IT infrastructure through machine-readable definition files rather than manual configuration, enabling version control, review, and repeatable deployments. |
| HPC | High-Performance Computing: using supercomputers and parallel processing to solve large-scale computational problems in science, engineering, and AI. |
| SLURM | Simple Linux Utility for Resource Management: the industry-standard job scheduler that allocates compute resources to user workloads on HPC clusters. |
| GPFS | General Parallel File System (IBM Spectrum Scale): a high-performance clustered filesystem designed for large-scale data-intensive workloads. |
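To illustrate the inventory concept from the glossary, a YAML-format Ansible inventory might group a cluster's nodes like this (all host and group names are hypothetical):

```yaml
# Illustrative YAML inventory; host names and ranges are hypothetical.
juwels:
  children:
    login_nodes:
      hosts:
        juwels-login[01:04]:       # range syntax expands to four hosts
    compute_nodes:
      hosts:
        juwels-node[0001:0768]:    # expands to 768 hosts
```

Playbooks then target groups such as `login_nodes` or the parent group `juwels` rather than individual machines.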