Automated, version-controlled management of Jülich Supercomputing Centre's high-performance computing systems using Ansible.
Forschungszentrum Jülich · JSC · Division HPC, Cloud and Data Systems and Services

This is an Ansible-based Infrastructure-as-Code (IaC) repository that defines and enforces the configuration of the supercomputers, cloud platforms, and storage systems operated by the High-Performance Computing, Cloud and Data Systems and Services (HPCCDSS) division of the Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich GmbH.
The HPCCDSS division is tasked with integrating multiple computer architectures into a production environment that supports scientists in solving a wide range of computational problems. It comprises three teams: HPC and Cloud Systems, Exascale Supercomputer, and Storage Systems.
Rather than configuring hundreds of servers manually, the repository codifies every aspect of system setup—user accounts, network interfaces, job schedulers, storage mounts, monitoring stacks, security policies—into repeatable, auditable automation scripts called playbooks and roles. Changes are tracked in Git, reviewed via merge requests, and applied consistently across thousands of nodes.
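As an illustration, a minimal playbook in this style declares which roles run on which host group. The role and group names below are hypothetical, not taken from this repository:

```yaml
# example-playbook.yml (illustrative only; real playbooks, roles,
# and group names in the repository differ)
- name: Configure login nodes
  hosts: login_nodes          # a host group defined in the inventory
  become: true                # apply changes with elevated privileges
  roles:
    - base_packages           # hypothetical role: common OS packages
    - ssh_hardening           # hypothetical role: sshd policy
    - monitoring_agent        # hypothetical role: metrics exporter
```

Because each role is idempotent, re-running the playbook on an already-configured node makes no changes.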
The HPCCDSS division sits within JSC, and each of its three teams is reflected in this repository's scope.
The diagram below shows how the Ansible control node applies configuration to the various HPC clusters and their subsystems.
The repository contains playbooks and inventory definitions for each of the following JSC production systems:
| System | Full Name | Description | Playbook |
|---|---|---|---|
| JUWELS | Jülich Wizard for European Leadership Science | Flagship hybrid CPU/GPU cluster with booster module; used for large-scale parallel simulations. | juwels.yml |
| JURECA-DC | Jülich Research on Exascale Cluster Architectures | Data-centric system with 768+ compute nodes including AI accelerator prototypes (MI200, H100). | jurecadc.yml |
| JUSUF | Jülich Support for Fenix | GPU-accelerated system for interactive and batch workloads; supports the Fenix research infrastructure. | jusuf.yml |
| JUPITER | Joint Undertaking Pioneer for Innovative and Transformative Exascale Research | Europe's first exascale-class supercomputer; EuroHPC JU flagship. | jupiter.yml |
| JUZEA | Jülich Zone of Energy Abstraction | Smaller test and development system with container-based workloads. | juzea.yml |
| JUST | Jülich Storage Cluster | Central GPFS-based storage providing home, project, scratch, and data filesystems for all HPC systems. | servers.yml |
| JSC Cloud | JSC Cloud Infrastructure | Cloud platform providing virtual machines and services alongside the HPC systems. | servers.yml |
| JUDAC | Jülich Data Access Server | Data access and transfer node for moving data between systems and external partners. | servers.yml |
| Supporting | HPSMC, DEEP, Gateways, … | Management clusters, SSH gateways, LDAP servers, CI runners, and monitoring infrastructure. | servers.yml |
- **Auditability:** Every change is committed to Git with an author, timestamp, and review trail. Auditors and compliance officers can trace who changed what, when, and why.
- **Consistency:** Identical configurations are applied to hundreds of nodes simultaneously, eliminating "snowflake" servers and reducing human error.
- **Reproducibility:** A new system or disaster recovery scenario can be bootstrapped from scratch by running the relevant playbook—no manual steps required.
- **Security:** Secrets are encrypted with Ansible Vault. SSH keys, certificates, and access policies are managed centrally and rotated systematically.
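As a sketch of the Vault workflow: a play loads an encrypted variables file that is only ever edited through `ansible-vault`, so the secret never appears in plaintext in Git. File, group, and variable names here are hypothetical:

```yaml
# Hypothetical sketch: loading an Ansible Vault-encrypted variables file.
# The file is edited with `ansible-vault edit vault/secrets.yml` and
# stays encrypted on disk; Ansible decrypts it at run time.
- name: Deploy a service that needs a secret
  hosts: service_hosts          # hypothetical group
  become: true
  vars_files:
    - vault/secrets.yml         # encrypted with Ansible Vault
  tasks:
    - name: Write service credentials
      ansible.builtin.copy:
        dest: /etc/myservice/credentials
        content: "token={{ service_api_token }}\n"  # variable from the vault file
        mode: "0600"
```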
- **Declarative Control** — define the desired state; Ansible converges each host to match.
- **System Inventory** — every cluster and node is catalogued in structured, versioned inventory files.
- **Operational Security** — secrets stay in Vault; deployments go through gated checks.
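Declarative control in practice means a task describes the end state rather than commands to run, and Ansible only acts when a host deviates from it. A minimal sketch (package and service names chosen for illustration):

```yaml
# Illustrative tasks: desired state, not imperative steps.
# If chrony is already installed, enabled, and running, nothing changes.
- name: Ensure time synchronisation is configured
  hosts: all
  become: true
  tasks:
    - name: Install chrony
      ansible.builtin.package:
        name: chrony
        state: present

    - name: Enable and start chronyd
      ansible.builtin.service:
        name: chronyd
        state: started
        enabled: true
```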
- **Job Scheduling (SLURM):** Configures SLURM controllers, daemons, partitions, accounting, and job submit filters across all clusters. Integrates with UNICORE for federated job submission.
- **Parallel Storage:** Manages IBM Spectrum Scale (GPFS) clusters providing home, project, scratch, and data filesystems. Also handles Ceph distributed storage and NFS exports.
- **Networking:** Provisions InfiniBand (Mellanox OFED, OpenSM subnet managers) and Ethernet interfaces. Configures DNS, DHCP, firewalls, and SSH gateways.
- **Monitoring & Logging:** Deploys a full Prometheus + Grafana stack with alerting (Alertmanager) and centralized log aggregation (Loki + Promtail).
- **Containers:** Supports HPC container runtimes (Apptainer), system containers (Podman), and Kubernetes clusters for service orchestration.
- **Maintenance Windows:** A dedicated maint.yml playbook orchestrates planned downtime: SLURM reservations, SSH banners, graceful GPFS shutdown, and status-page integration.
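A maintenance workflow of this kind could be sketched as two plays, one placing a SLURM reservation and one posting a user-facing banner. Group names, task details, and the `scontrol` parameters below are illustrative, not the repository's actual maint.yml:

```yaml
# Illustrative sketch of a planned-downtime workflow; not the real maint.yml.
- name: Enter planned downtime
  hosts: slurm_controllers      # hypothetical group name
  become: true
  tasks:
    - name: Create a SLURM maintenance reservation
      ansible.builtin.command: >
        scontrol create reservation reservationname=maint
        starttime=now duration=240 flags=maint,ignore_jobs
        users=root nodes=ALL
      changed_when: true

- name: Announce the downtime to users
  hosts: login_nodes            # hypothetical group name
  become: true
  tasks:
    - name: Install a maintenance SSH banner
      ansible.builtin.copy:
        dest: /etc/motd
        content: "System maintenance in progress. Logins may be interrupted.\n"
```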
The repository follows Ansible best practices with a clear separation of concerns:
| Directory | Purpose |
|---|---|
| `roles/` | 150+ reusable Ansible roles (the building blocks of configuration: one role per service or subsystem). |
| `group_vars/` | Variables scoped per system or host group (e.g., `group_vars/juwels/`), allowing per-cluster customization. |
| `host_vars/` | Variables specific to individual nodes, for host-level tuning. |
| `files/` | Static configuration files deployed as-is: SSH keys, certificates, pre-built configs. |
| `templates/` | Jinja2 templates rendered at deploy time (e.g., SLURM configs, kickstart files). |
| `vault/` | Encrypted secrets (passwords, API tokens) managed with Ansible Vault and GPG. |
| `juwels/`, `jureca/`, … | Per-system inventory directories defining which hosts belong to which groups. |
| `*.yml` (root) | Top-level playbooks — one per system — that orchestrate which roles run on which hosts. |
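To show how these pieces interact: a per-cluster variable defined under group_vars can be consumed by a role task that renders a Jinja2 template. The file names, variable, and handler below are hypothetical:

```yaml
# Hypothetical example of the group_vars -> templates flow.
# group_vars/examplecluster/slurm.yml might define:
#   slurm_cluster_name: examplecluster
#
# A role task then renders a template that references the variable:
- name: Render slurm.conf from template
  ansible.builtin.template:
    src: slurm.conf.j2          # contains e.g. "ClusterName={{ slurm_cluster_name }}"
    dest: /etc/slurm/slurm.conf
    owner: root
    mode: "0644"
  notify: Restart slurmctld     # hypothetical handler
```

Because the variable lives in the cluster's group_vars directory, the same role produces a different slurm.conf on each system without any change to the role itself.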
| Term | Definition |
|---|---|
| Ansible | An open-source IT automation engine that manages configuration, deployment, and orchestration over SSH — no agent software is required on managed nodes. |
| Playbook | A YAML file that declares what should be configured on which hosts. Think of it as a recipe that Ansible follows step by step. |
| Role | A self-contained, reusable unit of automation (e.g., "install and configure SLURM"). Roles are the building blocks assembled by playbooks. |
| Inventory | A listing of all servers (nodes) organised into groups. Each HPC system has its own inventory. |
| Infrastructure as Code | The practice of managing IT infrastructure through machine-readable definition files rather than manual configuration, enabling version control, review, and repeatable deployments. |
| HPC | High-Performance Computing — using supercomputers and parallel processing to solve large-scale computational problems in science, engineering, and AI. |
| SLURM | Simple Linux Utility for Resource Management — the industry-standard job scheduler that allocates compute resources to user workloads on HPC clusters. |
| GPFS | General Parallel File System (IBM Spectrum Scale) — a high-performance clustered filesystem designed for large-scale data-intensive workloads. |
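For instance, a minimal YAML inventory groups nodes so that playbooks can target them by role. Host and group names here are made up; the repository's real inventories live in per-system directories such as `juwels/`:

```yaml
# Hypothetical inventory sketch; real inventories use the site's
# actual host names and group layout.
all:
  children:
    login_nodes:
      hosts:
        cluster-login01:
        cluster-login02:
    compute_nodes:
      hosts:
        cluster-node[001:004]:   # Ansible range syntax for node001..node004
```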