NERSC Perlmutter
Perlmutter is an HPE Cray EX supercomputer at the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy facility. It has both GPU and CPU partitions; each GPU node has four NVIDIA A100 40GB GPUs. The software stack is the Cray Programming Environment (Cray PE), with Cray MPICH and Cray libfabric for the high-speed interconnect.
This guide covers running RHAPSODY's AI inference workloads using the DragonVllmInferenceBackend on Perlmutter's GPU partition.
Environment Setup
1. Create and activate a conda environment
Note
PYTHONNOUSERSITE=1 prevents user-site packages from leaking into the conda environment, which is important on shared systems like Perlmutter.
2. Install Dragon and RHAPSODY with Dragon-VLLM support
3. Set the HuggingFace cache directory
Perlmutter's $SCRATCH filesystem is the recommended location for large model caches:
Note
Add this to your ~/.bashrc or job script so it persists across sessions. See the NERSC docs for more on managing storage.
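When a driver script cannot rely on shell startup files, the cache location can also be set from Python, provided this happens before any Hugging Face library is imported. The following is a minimal sketch, not this guide's own command: it assumes the variable in question is HF_HOME (the standard Hugging Face cache root), and the helper name configure_hf_cache and the huggingface subdirectory are hypothetical.

```python
import os
from pathlib import Path


def configure_hf_cache(subdir: str = "huggingface") -> Path:
    """Point HF_HOME at a directory under $SCRATCH and create it.

    Hypothetical helper: falls back to the home directory when SCRATCH
    is not defined (for example, when running off Perlmutter).
    """
    root = Path(os.environ.get("SCRATCH", str(Path.home())))
    cache_dir = root / subdir
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Set this before importing transformers or huggingface_hub:
    # they read HF_HOME when they are first imported.
    os.environ["HF_HOME"] = str(cache_dir)
    return cache_dir
```

Calling configure_hf_cache() at the very top of the script, ahead of the other imports, keeps model downloads off the home filesystem.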
4. Configure Dragon's OFI runtime library
Dragon needs to be pointed to the Cray libfabric library for inter-node communication:
Warning
This step is required on Perlmutter. Without it, Dragon will fail to initialize its transport layer across nodes.
5. Download the vLLM config file
wget https://raw.githubusercontent.com/radical-cybertools/vllm-dragonhpc/main/config.sample -O config.yaml
Edit config.yaml and set at minimum:
6. Allocate a GPU node
Once the allocation is granted, launch with the dragon command:
Example
Simulation via Inference + Model Fine-Tuning
This example demonstrates a two-phase HPC-AI workflow on Perlmutter:
- Simulation phase: run a batch of scientific simulations using LLM inference (DragonVllmInferenceBackend). Each AITask sends a domain-specific prompt and collects the generated output as synthetic simulation data.
- Training phase: fine-tune a small model on the collected simulation outputs using a ComputeTask dispatched through DragonExecutionBackendV3.
Both phases run within the same RHAPSODY Session, sharing the same Dragon runtime across GPU and CPU resources.
import asyncio
import logging
import multiprocessing as mp

import rhapsody
from rhapsody.api import AITask, ComputeTask, Session
from rhapsody.backends import DragonExecutionBackendV3, DragonVllmInferenceBackend

rhapsody.enable_logging(level=logging.INFO)

# Scientific simulation prompts: each represents one simulation scenario
SIMULATION_PROMPTS = [
    "Simulate the energy minimization of a water molecule using DFT. Report bond lengths and angles.",
    "Describe the outcome of a molecular dynamics simulation of a lipid bilayer at 310K over 100ns.",
    "Run a Monte Carlo simulation of a 2D Ising model at the critical temperature. Report magnetization.",
    "Simulate heat diffusion in a 1D rod with fixed boundary conditions. Report steady-state profile.",
    "Describe the result of a lattice Boltzmann simulation of fluid flow through a porous medium.",
    "Simulate protein folding of a 50-residue helix using coarse-grained MD. Report RMSD over time.",
    "Run a finite element simulation of stress distribution in a titanium beam under axial load.",
    "Simulate a 2D incompressible Navier-Stokes flow around a cylinder at Re=100. Report drag coefficient.",
]


def fine_tune(simulation_outputs: list):
    """Fine-tune a small causal LM on simulation-generated data."""
    # Imports are local so they resolve inside the worker process that runs the task.
    import torch
    from datasets import Dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"
    print(f"Loading tokenizer and model: {model_name}", flush=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # Load the weights in float32. With fp16=True below, the Trainer applies
    # mixed precision itself; its gradient scaler raises an error if the
    # model parameters are already stored in float16.
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float32, device_map="auto"
    )

    dataset = Dataset.from_dict({"text": simulation_outputs})

    def tokenize(batch):
        tokens = tokenizer(
            batch["text"], truncation=True, max_length=256, padding="max_length"
        )
        # Causal LM objective: the labels are the input tokens themselves.
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    args = TrainingArguments(
        output_dir="./sim-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        fp16=True,
        save_strategy="no",
        logging_steps=5,
        report_to="none",
    )
    trainer = Trainer(model=model, args=args, train_dataset=tokenized)
    trainer.train()

    trainer.save_model("./sim-finetuned")
    tokenizer.save_pretrained("./sim-finetuned")
    print("Fine-tuning complete. Model saved to ./sim-finetuned", flush=True)
    return "./sim-finetuned"


async def main():
    mp.set_start_method("dragon")

    execution_backend = await DragonExecutionBackendV3()
    inference_backend = await DragonVllmInferenceBackend(
        config_file="config.yaml",
        model_name="Qwen/Qwen2.5-0.5B-Instruct",
        num_nodes=1,
        num_gpus=1,
        tp_size=1,
        port=8001,
        offset=0,
    )
    inference_backend = await inference_backend.initialize()

    session = Session([execution_backend, inference_backend])
    async with session:
        # --- Phase 1: Simulation via inference ---
        print(f"--- Phase 1: Running {len(SIMULATION_PROMPTS)} simulations ---", flush=True)
        sim_tasks = [
            AITask(prompt=prompt, backend=inference_backend.name)
            for prompt in SIMULATION_PROMPTS
        ]
        futures = await session.submit_tasks(sim_tasks)
        await asyncio.gather(*futures)

        simulation_outputs = [t.response for t in sim_tasks if t.response]
        print(f"Collected {len(simulation_outputs)} simulation outputs", flush=True)
        for i, out in enumerate(simulation_outputs):
            print(f"  Sim {i}: {out[:80]}...", flush=True)

        # --- Phase 2: Fine-tune a small model on the simulation data ---
        print("--- Phase 2: Fine-tuning model on simulation outputs ---", flush=True)
        train_task = ComputeTask(
            function=fine_tune,
            args=[simulation_outputs],
            backend=execution_backend.name,
        )
        train_futures = await session.submit_tasks([train_task])
        await asyncio.gather(*train_futures)
        print(f"Training result: {train_task.stdout.strip()}", flush=True)


if __name__ == "__main__":
    asyncio.run(main())
Run with:
Expected Output
--- Phase 1: Running 8 simulations ---
Collected 8 simulation outputs
Sim 0: The DFT energy minimization of a water molecule yields an O-H bond length of 0.96 Å...
Sim 1: The lipid bilayer simulation at 310K shows stable membrane structure with an area per...
Sim 2: At the critical temperature (T_c ≈ 2.269), the magnetization fluctuates around zero...
Sim 3: The steady-state heat profile follows a linear gradient between the two fixed boundar...
Sim 4: The porous medium simulation shows channeling effects with an effective permeability...
Sim 5: The coarse-grained MD simulation shows the helix stabilizes around 2ns with RMSD < 2...
Sim 6: Stress concentrations appear at the fixed end with a maximum von Mises stress of 320...
Sim 7: The drag coefficient Cd ≈ 1.38, consistent with experimental values at Re=100. Karman...
--- Phase 2: Fine-tuning model on simulation outputs ---
Loading tokenizer and model: Qwen/Qwen2.5-0.5B-Instruct
{'loss': 2.4231, 'epoch': 1.0}
{'loss': 1.8902, 'epoch': 2.0}
{'loss': 1.5214, 'epoch': 3.0}
Fine-tuning complete. Model saved to ./sim-finetuned
Training result: ./sim-finetuned
Workflow pattern
This pattern — inference-driven simulation followed by fine-tuning — is a common AI-for-science loop: the LLM generates domain-specific synthetic data, which is then used to specialize a smaller model for downstream tasks such as surrogate modeling or experiment recommendation.
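One detail of this loop is worth a sketch: the example drops empty responses before training, which also loses the link between each prompt and its output. A possible way to keep that link is shown below. It assumes only that the task list is in the same order as the prompt list and that each task exposes the response attribute used in the example; the function name and the text template are illustrative, and SimpleNamespace objects stand in for real tasks so the snippet runs without a Dragon allocation.

```python
from types import SimpleNamespace


def build_training_texts(prompts, tasks):
    """Pair each prompt with its task's response, skipping empty responses."""
    texts = []
    for prompt, task in zip(prompts, tasks):
        response = getattr(task, "response", None)
        if not response:
            continue  # failed or empty generation: leave it out of the corpus
        texts.append(f"### Prompt\n{prompt}\n\n### Response\n{response}")
    return texts


# Stand-in data (hypothetical), mimicking one failed generation out of three.
demo_prompts = ["Simulate A.", "Simulate B.", "Simulate C."]
demo_tasks = [
    SimpleNamespace(response="Result A"),
    SimpleNamespace(response=None),
    SimpleNamespace(response="Result C"),
]
demo_texts = build_training_texts(demo_prompts, demo_tasks)
print(len(demo_texts))
```

In the example script, the returned list could be passed to fine_tune in place of simulation_outputs, so the fine-tuned model sees each scenario together with its generated result.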
Scaling up
- Increase SIMULATION_PROMPTS to hundreds or thousands of scenarios to generate a larger training corpus
- Use num_gpus=4 and tp_size=4 on Perlmutter's A100 nodes for larger inference models
- Replace Qwen/Qwen2.5-0.5B-Instruct in fine_tune() with any HuggingFace model that fits in GPU memory
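The first two points can be sketched in plain Python. The snippet below expands a prompt template over a parameter grid to produce many scenarios, and builds keyword arguments for a four-GPU backend. The template, parameter values, and helper names are hypothetical. The keyword names mirror those passed to DragonVllmInferenceBackend in the example, and the divisibility check is an assumption about a sensible configuration rather than a documented requirement.

```python
from itertools import product

# Hypothetical template and parameter grid for one family of scenarios.
TEMPLATE = (
    "Run a Monte Carlo simulation of a 2D Ising model on a {size}x{size} "
    "lattice at T={temp}. Report magnetization."
)
LATTICE_SIZES = [16, 32, 64, 128]
TEMPERATURES = [1.5, 2.0, 2.269, 2.5, 3.0]


def expand_prompts(template, sizes, temps):
    """Return one prompt per (size, temperature) combination."""
    return [template.format(size=s, temp=t) for s, t in product(sizes, temps)]


def backend_kwargs(num_gpus=4, tp_size=4):
    """Build keyword arguments using the same names as the example script."""
    if num_gpus % tp_size != 0:
        # Assumption: GPUs are split evenly across tensor-parallel groups.
        raise ValueError("num_gpus should be a multiple of tp_size")
    return {
        "config_file": "config.yaml",
        "num_nodes": 1,
        "num_gpus": num_gpus,
        "tp_size": tp_size,
        "port": 8001,
        "offset": 0,
    }


prompts = expand_prompts(TEMPLATE, LATTICE_SIZES, TEMPERATURES)
print(len(prompts))  # 4 sizes x 5 temperatures
print(prompts[0])
```

The generated list would take the place of SIMULATION_PROMPTS, and the dictionary would be unpacked into the backend constructor together with a model_name of your choice.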