parallelooza

a survey of parallel computing approaches

about me

  • solutions engineer at Posit, PBC
  • retired policy wonk
  • not an HPC expert

this talk is a map

how to accelerate code

  • make it faster
  • add cores
  • add machines

make it faster

add cores

futureverse

{future} overview

  • handles the tricky parts of parallelization (globals, random numbers, error relaying)
  • supplies primitives for re-use in higher-level packages
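
a minimal sketch of the core building block (assuming a local multisession plan):

library(future)

plan(multisession)        # resolve futures in background R sessions

f <- future({             # evaluated asynchronously under the plan
  Sys.sleep(1)
  sqrt(1:10)
})

value(f)                  # blocks until the result is ready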

purrr::

map(.x, .f, ...) 

for every element of .x,
do .f
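
for example (a toy call, not from the demo):

library(purrr)

map(1:3, ~ .x^2)   # returns list(1, 4, 9)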

furrr::

future_map(.x, .f, ...)

for every element of .x,
do .f,
according to the plan()
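
the same toy call, parallelized by declaring a plan first (a sketch assuming a local multisession plan; on a cluster only the plan changes, not the code):

library(future)
library(furrr)

plan(multisession, workers = 2)   # two background R sessions

future_map(1:3, ~ .x^2)           # same result as map(), computed under the plan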

demo

add machines

“scheduler”:

software that matches tasks to available resources:

orchestrators

  • kubernetes
  • nomad

(not a scheduler)

  • apache spark

hpc

  • slurm
  • lsf
  • pbs
  • grid engine

hpc

graph LR;
A(/apps)
L(login node)
M(Resource Manager)
S(Shared Storage)
H(/home)

L -->|Submit| M;
M --> C(Compute Node)
M --> D(Compute Node)
M --> E(Compute Node)
S --- L
S --- M
S --- C
S --- D
S --- E
A --- S
H --- S

templates

#!/bin/bash -l

# File: slurm.tmpl
# Template for using clustermq against a SLURM backend

#SBATCH --job-name={{ job_name }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 1024 }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --cpus-per-task={{ cores | 1 }}


export OMP_NUM_THREADS={{ cores | 1 }}
CMQ_AUTH={{ auth }} ${R_HOME}/bin/R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

future.batchtools

library(future.batchtools)
library(furrr)
library(magrittr)  # for the %$% exposition pipe

plan(
  list(
    batchtools_slurm,  # outer layer: each top-level future is a SLURM job
    multisession       # inner layer: parallel R sessions within each job
  )
)

targets %$%
  future_map2(report, files, ~ iggi::parse_pdf(.x, .y))

clustermq

options(
  clustermq.scheduler = "slurm",
  clustermq.template = here::here("inst", "clustermq.slurm.tmpl")  # scheduler template file
)

library(clustermq)
library(iggi)

res <- Q(
  iggi::parse_pdf,
  report_id = targets$report, file = targets$files,  # iterated arguments
  n_jobs = 2,                                        # size of the SLURM job array
  pkgs = c("magrittr", "pdftools", "tesseract", "purrr", "dplyr", "stringr", "iggi")
)

crew.cluster

library(crew.cluster)

controller <- crew_controller_slurm(
  name = "parse_pdf",
  workers = 3L,                   # launch up to three SLURM workers
  seconds_idle = 300,             # workers exit after five idle minutes
  slurm_memory_gigabytes_per_cpu = 1,
  script_lines = paste0("export PATH=", Sys.getenv("R_HOME"), "/bin:$PATH"),
  verbose = TRUE
)

controller$start()

results <- controller$map(
  command = iggi::parse_pdf(.x, .y),
  iterate = list(
    .x = targets$report,
    .y = targets$files
  ),
  verbose = TRUE
)

finaldata <- results$result      # list column of per-task results

controller$terminate()

python

dask-jobqueue

https://docs.dask.org/en/stable/deploying-hpc.html

ray

🚧 https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html

questions? feedback?

david@posit.co