HPC Infrastructure Configuration Repository

Automated, version-controlled management of Jülich Supercomputing Centre's high-performance computing systems using Ansible.

Forschungszentrum Jülich · JSC · Division HPC, Cloud and Data Systems and Services

What Is This Repository?

This is an Ansible-based Infrastructure-as-Code (IaC) repository that defines and enforces the configuration of the supercomputers, cloud platforms, and storage systems operated by the High-Performance Computing, Cloud and Data Systems and Services (HPCCDSS) division of the Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich GmbH.

The HPCCDSS division integrates multiple computer architectures into a single production environment that supports scientific users in solving a wide range of computational problems. It comprises three teams: HPC and Cloud Systems, Exascale Supercomputer, and Storage Systems.

Rather than configuring hundreds of servers manually, the repository codifies every aspect of system setup—user accounts, network interfaces, job schedulers, storage mounts, monitoring stacks, security policies—into repeatable, auditable automation scripts called playbooks and roles. Changes are tracked in Git, reviewed via merge requests, and applied consistently across thousands of nodes.
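As a hedged illustration of what such a playbook looks like (the role and group names below are hypothetical, not taken from this repository), a minimal playbook applies a set of roles to a group of hosts:

```yaml
# Illustrative sketch only: role and group names are hypothetical.
- name: Configure login nodes
  hosts: login_nodes        # an inventory group
  become: true              # escalate privileges for system changes
  roles:
    - ssh_hardening         # hypothetical role
    - monitoring_agent      # hypothetical role
```

Running this playbook converges every host in `login_nodes` to the state the roles declare, whether that is four hosts or four hundred.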

[Image: supercomputer hall at the Jülich Supercomputing Centre]

Division Structure

The HPCCDSS division sits within JSC and is organised into three teams, each reflected in this repository's scope:

Forschungszentrum Jülich (FZJ)
└─ Jülich Supercomputing Centre (JSC)
   └─ HPCCDSS Division
      ├─ HPC and Cloud Systems
      ├─ Exascale Supercomputer
      └─ Storage Systems

At a Glance

6+ HPC Systems · 6000+ Managed Nodes · 150+ Ansible Roles · 36 Playbooks · 40+ Inventory Groups

High-Level Architecture

The diagram below shows how the Ansible control node applies configuration to the various HPC clusters and their subsystems.

Ansible Control Node (Git repo · Playbooks · Roles · Vault) applies configuration to the clusters:

JUWELS (Booster & Cluster) · JURECA-DC (768+ compute nodes) · JUSUF (GPU-accelerated) · JUPITER (Exascale system) · JUZEA & more (test & supporting)

Managed subsystems across all clusters:

Job Scheduling (SLURM · UNICORE) · Storage Systems (GPFS · Ceph · NFS) · Networking (InfiniBand · DNS · Firewalls) · Monitoring (Prometheus · Grafana) · Security (SSH · LDAP · Certificates) · Containers (Apptainer · Podman · K8s) · Accounts (Users · SSH keys · sudo) · High Availability (Pacemaker · HAProxy) · Accelerators (NVIDIA · AMD · OFED)

Managed Supercomputer Systems

The repository contains playbooks and inventory definitions for each of the following JSC production systems:

| System | Full Name | Description | Playbook |
| --- | --- | --- | --- |
| JUWELS | Jülich Wizard for European Leadership Science | Flagship hybrid CPU/GPU cluster with a booster module; used for large-scale parallel simulations. | juwels.yml |
| JURECA-DC | Jülich Research on Exascale Cluster Architectures | Data-centric system with 768+ compute nodes, including AI accelerator prototypes (MI200, H100). | jurecadc.yml |
| JUSUF | Jülich Support for Fenix | GPU-accelerated system for interactive and batch workloads; supports the Fenix research infrastructure. | jusuf.yml |
| JUPITER | Joint Undertaking Pioneer for Innovative and Transformative Exascale Research | Europe's first exascale-class supercomputer; EuroHPC JU flagship. | jupiter.yml |
| JUZEA | Jülich Zone of Energy Abstraction | Smaller test and development system with container-based workloads. | juzea.yml |
| JUST | Jülich Storage Cluster | Central GPFS-based storage providing home, project, scratch, and data filesystems for all HPC systems. | servers.yml |
| JSC Cloud | JSC Cloud Infrastructure | Cloud platform providing virtual machines and services alongside the HPC systems. | servers.yml |
| JUDAC | Jülich Data Access Server | Data access and transfer node for moving data between systems and external partners. | servers.yml |
| Supporting | HPSMC, DEEP, Gateways, … | Management clusters, SSH gateways, LDAP servers, CI runners, and monitoring infrastructure. | servers.yml |

Why Infrastructure as Code?

Auditability

Every change is committed to Git with an author, timestamp, and review trail. Auditors and compliance officers can trace who changed what, when, and why.

Consistency

Identical configurations are applied to hundreds of nodes simultaneously, eliminating "snowflake" servers and reducing human error.

Reproducibility

A new system or disaster recovery scenario can be bootstrapped from scratch by running the relevant playbook—no manual steps required.

Security

Secrets are encrypted with Ansible Vault. SSH keys, certificates, and access policies are managed centrally and rotated systematically.
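As an illustration of the Vault workflow (the file path below is hypothetical), secrets are encrypted and inspected with the standard `ansible-vault` CLI:

```shell
# Encrypt a secrets file in place (prompts for the vault password)
ansible-vault encrypt vault/slurm_db_password.yml   # hypothetical path

# Inspect or edit the file without leaving plaintext on disk
ansible-vault view vault/slurm_db_password.yml
ansible-vault edit vault/slurm_db_password.yml
```

Encrypted files can be committed to Git safely; the plaintext only exists in memory at deploy time.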

Core Concepts Illustrated

Declarative Control — define the desired state; Ansible converges each host to match.

System Inventory — every cluster and node is catalogued in structured, versioned inventory files.

Operational Security — secrets stay in Vault; deployments go through gated checks.

Key Capabilities

Job Scheduling & Resource Management

Configures SLURM controllers, daemons, partitions, accounting, and job submit filters across all clusters. Integrates with UNICORE for federated job submission.
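A hedged sketch of how such a role might render the SLURM controller configuration from a Jinja2 template and trigger a restart (the template, path, and handler names here are hypothetical, not this repository's actual role contents):

```yaml
# Hypothetical task from a SLURM role; file and handler names are illustrative.
- name: Deploy SLURM controller configuration
  ansible.builtin.template:
    src: slurm.conf.j2          # Jinja2 template rendered per cluster
    dest: /etc/slurm/slurm.conf
    owner: slurm
    group: slurm
    mode: "0644"
  notify: restart slurmctld     # hypothetical handler
```

Because the template is rendered from per-cluster variables, the same role produces a different, correct slurm.conf on each system.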

Parallel Filesystems

Manages IBM Spectrum Scale (GPFS) clusters providing home, project, scratch, and data filesystems. Also handles Ceph distributed storage and NFS exports.

Network Fabric

Provisions InfiniBand (Mellanox OFED, OpenSM subnet managers) and Ethernet interfaces. Configures DNS, DHCP, firewalls, and SSH gateways.

Monitoring & Observability

Deploys a full Prometheus + Grafana stack with alerting (Alertmanager) and centralized log aggregation (Loki + Promtail).

Containers & Virtualization

Supports HPC container runtimes (Apptainer), system containers (Podman), and Kubernetes clusters for service orchestration.
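For example, a user workload might run inside an Apptainer container on a GPU node (the image and command names below are hypothetical):

```shell
# --nv exposes the host's NVIDIA GPUs and driver libraries inside the container
apptainer exec --nv simulation.sif ./run_simulation
```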

Maintenance Operations

Dedicated maint.yml playbook orchestrates planned downtime: SLURM reservations, SSH banners, GPFS graceful shutdown, and status-page integration.
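One of these steps can be sketched as an Ansible task that creates a SLURM maintenance reservation via `scontrol` (reservation name, times, and task layout are hypothetical; the actual maint.yml may differ):

```yaml
# Hedged sketch of a maintenance task; all values are illustrative.
- name: Create SLURM maintenance reservation
  ansible.builtin.command: >
    scontrol create reservation reservationname=maint_window
    starttime=2024-01-15T06:00:00 duration=480
    nodes=ALL flags=maint,ignore_jobs users=root
  run_once: true    # the reservation only needs to be created once per cluster
```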

Repository Structure

The repository follows Ansible best practices with a clear separation of concerns:

| Directory | Purpose |
| --- | --- |
| `roles/` | 150+ reusable Ansible roles (the building blocks of configuration: one role per service or subsystem). |
| `group_vars/` | Variables scoped per system or host group (e.g., `group_vars/juwels/`), allowing per-cluster customization. |
| `host_vars/` | Variables specific to individual nodes, for host-level tuning. |
| `files/` | Static configuration files deployed as-is: SSH keys, certificates, pre-built configs. |
| `templates/` | Jinja2 templates rendered at deploy time (e.g., SLURM configs, kickstart files). |
| `vault/` | Encrypted secrets (passwords, API tokens) managed with Ansible Vault and GPG. |
| `juwels/`, `jureca/`, … | Per-system inventory directories defining which hosts belong to which groups. |
| `*.yml` (root) | Top-level playbooks, one per system, that orchestrate which roles run on which hosts. |
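To make the group_vars mechanism concrete, a per-cluster variables file might look like this (the file path, variable names, and hostname are hypothetical, shown only to illustrate the layering):

```yaml
# Hypothetical group_vars/juwels/slurm.yml; all names are illustrative.
slurm_cluster_name: juwels
slurm_controller_host: juwels-ctl01   # hypothetical hostname
slurm_partitions:
  - name: batch
    default: true
    max_time: "24:00:00"
```

Roles read these variables at deploy time, so the same role code yields cluster-specific configuration.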

How Changes Are Applied

1. Edit code: modify a role, variable, or playbook.
2. Merge request: peer review on GitLab.
3. CI lint & test: automated validation.
4. Merge to main: approved and merged.
5. Deploy: ansible-playbook run.

Every step is tracked, reversible, and auditable.
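A typical deployment invocation might look like the following (the group name in the second command is hypothetical; `--check --diff` previews changes without applying them):

```shell
# Dry run: show what would change on the JUWELS cluster, without changing it
ansible-playbook -i juwels/ juwels.yml --check --diff

# Apply for real, restricted to a subset of hosts (group name hypothetical)
ansible-playbook -i juwels/ juwels.yml --limit login_nodes
```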

Glossary

| Term | Definition |
| --- | --- |
| Ansible | An open-source IT automation engine that manages configuration, deployment, and orchestration over SSH; no agent software is required on managed nodes. |
| Playbook | A YAML file that declares what should be configured on which hosts. Think of it as a recipe that Ansible follows step by step. |
| Role | A self-contained, reusable unit of automation (e.g., "install and configure SLURM"). Roles are the building blocks assembled by playbooks. |
| Inventory | A listing of all servers (nodes) organised into groups. Each HPC system has its own inventory. |
| Infrastructure as Code | The practice of managing IT infrastructure through machine-readable definition files rather than manual configuration, enabling version control, review, and repeatable deployments. |
| HPC | High-Performance Computing: using supercomputers and parallel processing to solve large-scale computational problems in science, engineering, and AI. |
| SLURM | Simple Linux Utility for Resource Management: the industry-standard job scheduler that allocates compute resources to user workloads on HPC clusters. |
| GPFS | General Parallel File System (IBM Spectrum Scale): a high-performance clustered filesystem designed for large-scale data-intensive workloads. |
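To illustrate the inventory concept from the glossary, a YAML-format Ansible inventory might group a cluster's nodes like this (all host and group names are hypothetical):

```yaml
# Illustrative YAML inventory; host names and ranges are hypothetical.
juwels:
  children:
    login_nodes:
      hosts:
        juwels-login[01:04]:       # range syntax expands to four hosts
    compute_nodes:
      hosts:
        juwels-node[0001:0768]:    # expands to 768 hosts
```

Playbooks then target groups such as `login_nodes` or the parent group `juwels` rather than individual machines.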