parallelooza

a survey of parallel computing approaches

about me

  • solutions engineer at Posit, PBC
  • retired policy wonk
  • not an HPC expert

this talk is a map

how to accelerate code

  • make it faster
  • add cores
  • add machines

make it faster

add cores

futureverse

{future} overview

  • handles the tricky parts of parallelization (globals, random numbers, error relaying)
  • supplies primitives for re-use in higher-level packages
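
a minimal sketch of the core building block (assuming a local multisession plan):

library(future)

plan(multisession)        # resolve futures in background R sessions

f <- future({             # evaluated asynchronously under the plan
  Sys.sleep(1)
  sqrt(1:10)
})

value(f)                  # blocks until the result is ready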

purrr::

map(.x, .f, ...) 

for every element of .x,
do .f
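
for example (a toy call, not from the demo):

library(purrr)

map(1:3, ~ .x^2)   # returns list(1, 4, 9)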

furrr::

future_map(.x, .f, ...)

for every element of .x,
do .f,
according to the plan()
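
the same toy call, parallelized by declaring a plan first (a sketch assuming a local multisession plan; on a cluster only the plan changes, not the code):

library(future)
library(furrr)

plan(multisession, workers = 2)   # two background R sessions

future_map(1:3, ~ .x^2)           # same result as map(), computed under the plan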

demo

add machines

“scheduler”:

software that matches tasks to available resources:

orchestrators

  • kubernetes
  • nomad

(not a scheduler)

  • apache spark

hpc

  • slurm
  • lsf
  • pbs
  • grid engine

hpc

graph LR;
A(/apps)
L(login node)
M(Resource Manager)
S(Shared Storage)
H(/home)

L -->|Submit| M;
M --> C(Compute Node)
M --> D(Compute Node)
M --> E(Compute Node)
S --- L
S --- M
S --- C
S --- D
S --- E
A --- S
H --- S

templates

#!/bin/bash -l

# File: slurm.tmpl
# Template for using clustermq against a SLURM backend

#SBATCH --job-name={{ job_name }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 1024 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --cpus-per-task={{ cores | 1 }}


export OMP_NUM_THREADS={{ cores | 1 }}
CMQ_AUTH={{ auth }} ${R_HOME}/bin/R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

future.batchtools

library(future.batchtools)
library(furrr)
library(magrittr)  # for the %$% exposition pipe

plan(
  list(
    batchtools_slurm,  # outer layer: each top-level future is a SLURM job
    multisession       # inner layer: parallel R sessions within each job
  )
)

targets %$%
  future_map2(report, files, ~ iggi::parse_pdf(.x, .y))

clustermq

options(
  clustermq.scheduler = "slurm",
  clustermq.template = here::here("inst", "clustermq.slurm.tmpl")  # scheduler template file
)

library(clustermq)
library(iggi)

res <- Q(
  iggi::parse_pdf,
  report_id = targets$report, file = targets$files,  # iterated arguments
  n_jobs = 2,                                        # size of the SLURM job array
  pkgs = c("magrittr", "pdftools", "tesseract", "purrr", "dplyr", "stringr", "iggi")
)

crew.cluster

library(crew.cluster)

controller <- crew_controller_slurm(
  name = "parse_pdf",
  workers = 3L,                   # launch up to three SLURM workers
  seconds_idle = 300,             # workers exit after five idle minutes
  slurm_memory_gigabytes_per_cpu = 1,
  script_lines = paste0("export PATH=", Sys.getenv("R_HOME"), "/bin:$PATH"),
  verbose = TRUE
)

controller$start()

results <- controller$map(
  command = iggi::parse_pdf(.x, .y),
  iterate = list(
    .x = targets$report,
    .y = targets$files
  ),
  verbose = TRUE
)

finaldata <- results$result      # list column of per-task results

controller$terminate()

python

dask-jobqueue

https://docs.dask.org/en/stable/deploying-hpc.html

ray

🚧 https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html

questions? feedback?

david@posit.co