# Rayon Integration Guide
Miniextendr provides seamless integration with Rayon for parallel computation in R packages. This enables writing high-performance parallel code while maintaining R's safety guarantees.
## Quick Start

### Enable the Feature
```toml
[dependencies]
miniextendr-api = { path = "../miniextendr-api", features = ["rayon"] }
```

### Simplest Example
```rust
use miniextendr_api::prelude::*;
use miniextendr_api::rayon_bridge::{rayon::prelude::*, ParCollectR};

#[miniextendr]
fn parallel_sqrt(x: &[f64]) -> SEXP {
    // .collect_r() writes directly into R memory: zero intermediate allocation
    x.par_iter().map(|&val| val.sqrt()).collect_r()
}
```

### Alternative Styles
```rust
use miniextendr_api::rayon_bridge::*;

// par_map: same zero-copy performance, closure style
#[miniextendr]
fn parallel_sqrt_map(x: &[f64]) -> SEXP {
    par_map(x, |&v| v.sqrt())
}

// Vec<T> return: simplest (one extra allocation; miniextendr converts to R)
#[miniextendr]
fn parallel_sqrt_vec(x: &[f64]) -> Vec<f64> {
    x.par_iter().map(|&val| val.sqrt()).collect()
}
```

## Architecture
### Design Philosophy

- Rust computation: parallel on Rayon threads (normal 2 MB stacks)
- R API calls: serial on the main thread

The framework handles all parallelism internally. User closures receive chunks of data and never need to call `par_iter()` or manage thread dispatch.

### Thread Model
```text
+------------------------------------------+
|     Rayon Thread Pool (2MB stacks)       |
|                                          |
|   Thread 1     Thread 2     Thread 3     |
|      |            |            |         |
|   chunk[0]     chunk[1]     chunk[2]     |  <- parallel computation
|      |            |            |         |
| f(chunk,0)   f(chunk,n)   f(chunk,2n)    |  <- user closure per chunk
+------+------------+------------+---------+
       |            |            |
       +------------+------------+
                    |
         +----------------------+
         |    Main R Thread     |
         |  (allocates before,  |  <- R memory alloc/dealloc
         |   returns after)     |
         +----------------------+
```

### Key Insights
- Framework-managed parallelism: `with_r_vec`, `par_map`, etc. split data into chunks internally
- Deterministic chunks: the same `(length, thread_count)` always produces the same chunk boundaries
- Zero copy: pre-allocation writes directly into R memory
- No R calls in closures: pure Rust only inside parallel sections
## API Overview

### Chunk-Based Fill

#### `with_r_vec(len, f)` – chunk-based parallel fill

Allocates an R vector of `len` elements, splits it into chunks, and calls `f(chunk, offset)` in parallel. The closure receives a mutable slice and its starting index in the output.
```rust
pub fn with_r_vec<T, F>(len: usize, f: F) -> SEXP
where
    T: RNativeType + Send + Sync,
    F: Fn(&mut [T], usize) + Send + Sync,
```

```rust
with_r_vec(1000, |chunk: &mut [f64], offset: usize| {
    for (i, slot) in chunk.iter_mut().enumerate() {
        *slot = ((offset + i) as f64).sqrt();
    }
});
```

#### `with_r_vec_map(len, f)` – index-based fill (sugar)
Each element depends only on its index. Sugar over `with_r_vec`.

```rust
with_r_vec_map(1000, |i: usize| (i as f64).sqrt());
```

### Parallel Map
#### `par_map(input, f)` – transform slice → R vector

Element-wise parallel transform. Writes directly into R memory (zero intermediate allocation).
```rust
pub fn par_map<T, U, F>(input: &[T], f: F) -> SEXP
where
    T: Send + Sync,
    U: RNativeType + Send + Sync,
    F: Fn(&T) -> U + Send + Sync,
```

```rust
// Parallel sqrt
par_map(x, |&v| v.sqrt())

// Type conversion: i32 -> f64
par_map(ints, |&v| v as f64)
```

#### `par_map2(a, b, f)` – two-input parallel map

```rust
// Element-wise addition
par_map2(a, b, |&x, &y| x + y)
```

#### `par_map3(a, b, c, f)` – three-input parallel map

```rust
// Fused multiply-add: a * b + c
par_map3(a, b, c, |&x, &y, &z| x * y + z)
```

### Matrix and Array
#### `with_r_matrix(nrow, ncol, f)` – column-wise parallel fill

Each column is a natural chunk (contiguous in R's column-major layout). The closure receives `(column_slice, col_idx)`.
```rust
with_r_matrix(100, 50, |col: &mut [f64], col_idx: usize| {
    for (row, slot) in col.iter_mut().enumerate() {
        *slot = (row + col_idx * 1000) as f64;
    }
});
```

#### `with_r_array(dims, f)` – slab-wise parallel fill

For dims `[d0, d1, ..., dN]`, each slab has `d0 * d1 * ... * d(N-1)` elements. The closure receives `(slab_slice, slab_idx)`, where `slab_idx` is the index along the last dimension.
```rust
// 2x3x4 array: 4 slabs of 6 elements each
with_r_array([2, 3, 4], |slab: &mut [f64], slab_idx: usize| {
    for (i, val) in slab.iter_mut().enumerate() {
        *val = (slab_idx * 100 + i) as f64;
    }
});
```

#### Typed wrappers: `new_r_matrix`, `new_r_array`

Same as above, but return `RMatrix<T>` / `RArray<T, NDIM>` instead of a raw `SEXP`.
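The column-major layout arithmetic behind these fills can be checked in plain Rust. This is a standalone sketch: `linear_index` and `slab_size` are illustrative helpers, not part of the miniextendr API.

```rust
/// Linear position of (row, col) in R's column-major layout (illustrative helper).
fn linear_index(row: usize, col: usize, nrow: usize) -> usize {
    row + col * nrow
}

/// Elements per slab for dims [d0, ..., dN]: product of all but the last dim.
fn slab_size(dims: &[usize]) -> usize {
    dims[..dims.len() - 1].iter().product()
}

fn main() {
    // In a 100x50 matrix, element (row=3, col=2) sits at 3 + 2*100 = 203
    assert_eq!(linear_index(3, 2, 100), 203);
    // A 2x3x4 array has 4 slabs of 2*3 = 6 elements each
    assert_eq!(slab_size(&[2, 3, 4]), 6);
}
```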
### Reduction

```rust
use miniextendr_api::rayon_bridge::reduce;

let sum = reduce::sum(&data);          // Parallel sum -> R scalar
let min = reduce::min(&data);          // Parallel min
let max = reduce::max(&data);          // Parallel max
let mean = reduce::mean(&data);        // Parallel mean
let sum_int = reduce::sum_int(&ints);  // Integer sum
```

### Performance Utilities
```rust
use miniextendr_api::rayon_bridge::perf;

perf::num_threads()      // Number of Rayon threads
perf::in_rayon_thread()  // Are we in a Rayon thread?
perf::thread_index()     // Current thread index (if in the pool)
```

### `.collect_r()` – parallel iterator → R vector

The `ParCollectR` extension trait adds `.collect_r()` to every indexed parallel iterator. It allocates an R vector on the main thread, then fills it in parallel using Rayon's zip, with zero intermediate allocation.
```rust
use miniextendr_api::rayon_bridge::{rayon::prelude::*, ParCollectR};

#[miniextendr]
fn parallel_sqrt(x: &[f64]) -> SEXP {
    x.par_iter().map(|&v| v.sqrt()).collect_r()
}

#[miniextendr]
fn index_fill(n: i32) -> SEXP {
    (0..n).into_par_iter().map(|i| (i as f64).sin()).collect_r()
}
```
Which iterators are indexed? Most of them:

| Pattern | Indexed? | `.collect_r()` works? |
|---|---|---|
| `slice.par_iter().map(...)` | Yes | Yes |
| `vec.into_par_iter().map(...)` | Yes | Yes |
| `(0..n).into_par_iter()` | Yes | Yes |
| `.enumerate()`, `.zip()`, `.take()`, `.skip()` | Yes | Yes |
| `.filter(...)` | No | No – use `par_collect_sexp` |
| `.flat_map(...)` | No | No – use `par_collect_sexp` |
### `par_collect_sexp()` – non-indexed fallback

For iterators that lose their index (`.filter()`, `.flat_map()`, `.par_bridge()`), use this function. It collects into an intermediate `Vec<T>`, then converts to R.
```rust
use miniextendr_api::rayon_bridge;

#[miniextendr]
fn parallel_filter(x: &[f64]) -> SEXP {
    // .filter() loses the index, so .collect_r() is unavailable
    rayon_bridge::par_collect_sexp(
        x.par_iter().filter(|&&v| v > 0.0).copied()
    )
}
```

### `.collect()` with a `Sendable<SEXP>` Return Type

If you prefer standard `.collect()` syntax, return `Sendable<SEXP>` from your function. The return type drives type inference, so no turbofish is needed:
```rust
use miniextendr_api::worker::Sendable;

#[miniextendr]
fn parallel_sqrt(x: &[f64]) -> Sendable<SEXP> {
    x.par_iter().map(|&v| v.sqrt()).collect()
}
```

`Sendable<SEXP>` implements both `IntoR` (so it works as a `#[miniextendr]` return type) and `From<Sendable<SEXP>> for SEXP` (for manual unwrapping with `.into()`). This path collects into an intermediate `Vec<T>` before converting to R; for zero-copy performance, use `.collect_r()`.
### Choosing a Collection Strategy

| Strategy | Allocation | Requires | Best For |
|---|---|---|---|
| `.collect_r()` | Zero-copy into R | Indexed iterator | Max performance |
| `-> Sendable<SEXP>` + `.collect()` | Vec + copy to R | Any par iterator | Standard Rayon idiom |
| `par_map(x, f)` | Zero-copy into R | Input slice | Simple 1:1 transforms |
| `par_collect_sexp(iter)` | Vec + copy to R | Any par iterator | Filters, flat maps |
| `-> Vec<T>` return | Vec + copy to R | Nothing special | Simplest code |
### Parallel Collection (`Vec<T>`)

For operations that don't fit the chunk model (filtering, variable-length output), you can also just return `Vec<T>` and let miniextendr convert to R:

```rust
#[miniextendr]
fn parallel_pipeline(x: &[f64]) -> Vec<f64> {
    x.par_iter()
        .filter(|&&v| v > 0.0)
        .map(|&v| v.log2())
        .collect()
}
```

## Patterns
### Pattern 1: `.collect_r()` (Best Default)

Use when: any parallel transform where output length equals input length.
Performance: best; zero intermediate allocation, writes directly into R memory.
```rust
use miniextendr_api::rayon_bridge::{rayon::prelude::*, ParCollectR};

#[miniextendr]
fn parallel_sqrt(x: &[f64]) -> SEXP {
    x.par_iter().map(|&v| v.sqrt()).collect_r()
}

#[miniextendr]
fn generate_sequence(n: i32) -> SEXP {
    (0..n).into_par_iter().map(|i| (i as f64).sin()).collect_r()
}
```

### Pattern 2: `par_map` / `with_r_vec` (Closure Style)

Use when: you prefer a closure-based API, or need per-chunk state (e.g., RNG).
Performance: same as `.collect_r()`; zero copy, deterministic chunks.
```rust
// par_map: one-liner for element-wise transforms
#[miniextendr]
fn parallel_sqrt(x: &[f64]) -> SEXP {
    par_map(x, |&v| v.sqrt())
}

// with_r_vec: access to the chunk offset for per-chunk state
#[miniextendr]
fn generate_random(n: i32, seed: i64) -> SEXP {
    with_r_vec(n as usize, |chunk: &mut [f64], offset| {
        let mut rng = ChaChaRng::seed_from_u64(seed as u64 + offset as u64);
        for slot in chunk.iter_mut() { *slot = rng.gen(); }
    })
}
```

### Pattern 3: `par_collect_sexp` / `Vec<T>` (Flexible)

Use when: variable-length output (filtering, flat-mapping).
Performance: moderate; one extra allocation (intermediate `Vec<T>`).
```rust
// par_collect_sexp: returns SEXP directly
#[miniextendr]
fn parallel_filter(x: &[f64]) -> SEXP {
    rayon_bridge::par_collect_sexp(
        x.par_iter().filter(|&&v| v > 0.0).copied()
    )
}

// Or just return Vec<T>; miniextendr converts automatically
#[miniextendr]
fn parallel_filter_vec(x: &[f64]) -> Vec<f64> {
    x.par_iter().filter(|&&v| v > 0.0).map(|&v| v.log2()).collect()
}
```

### Pattern 4: Parallel Reduction

Use when: computing aggregates (sum, mean, min, max).
Performance: best for aggregations.
```rust
use miniextendr_api::rayon_bridge::reduce;

#[miniextendr]
fn fast_sum(x: &[f64]) -> SEXP {
    reduce::sum(x)
}
```

## RNG Reproducibility
Chunk boundaries are deterministic for a given (length, thread_count). This means
you can seed per-chunk RNG from the offset parameter and get reproducible results
regardless of thread scheduling.
### How It Works

`with_r_vec` splits the output into chunks of `len / (num_threads * 4)` elements using Rayon's `par_chunks_mut`. The chunk boundaries are fixed: chunk 0 always starts at index 0, chunk 1 always starts at `chunk_size`, and so on. Only the scheduling order varies between runs, not the boundaries.
This means that if you derive each chunk's RNG seed from its offset, the same chunk always gets the same seed, producing the same random values in the same positions.
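Assuming the `len / (num_threads * 4)` rule described above, the fixed boundaries can be sketched in plain Rust (`chunk_offsets` is an illustrative helper, not part of the API):

```rust
/// Starting offsets of each chunk under the chunking scheme described above
/// (illustrative model, not the actual miniextendr implementation).
fn chunk_offsets(len: usize, num_threads: usize) -> Vec<usize> {
    let chunk_size = (len / (num_threads * 4)).max(1);
    (0..len).step_by(chunk_size).collect()
}

fn main() {
    // 1000 elements on 4 threads: chunk_size = 1000 / 16 = 62
    let offsets = chunk_offsets(1000, 4);
    assert_eq!(offsets[..4].to_vec(), vec![0, 62, 124, 186]);
    // Boundaries depend only on (length, thread_count), never on scheduling
    assert_eq!(offsets, chunk_offsets(1000, 4));
}
```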
### Pattern: Seed-per-chunk

```rust
use rand::SeedableRng;
use rand_chacha::ChaChaRng;
use rand::Rng;

#[miniextendr]
fn reproducible_random(len: i32, seed: i64) -> SEXP {
    with_r_vec(len as usize, |chunk: &mut [f64], offset| {
        // Deterministic seed derived from base seed + chunk offset
        let mut rng = ChaChaRng::seed_from_u64(seed as u64 + offset as u64);
        for slot in chunk.iter_mut() {
            *slot = rng.gen();
        }
    })
}
```

### Why This Works
```text
Thread count = 4, length = 1000, chunk_size = 62

Run 1 (scheduling order: T2, T0, T3, T1):
  T2 fills chunk[2] (offset=124) with seed=42+124
  T0 fills chunk[0] (offset=0)   with seed=42+0
  T3 fills chunk[3] (offset=186) with seed=42+186
  T1 fills chunk[1] (offset=62)  with seed=42+62

Run 2 (scheduling order: T0, T1, T2, T3):
  T0 fills chunk[0] (offset=0)   with seed=42+0   -> same values
  T1 fills chunk[1] (offset=62)  with seed=42+62  -> same values
  T2 fills chunk[2] (offset=124) with seed=42+124 -> same values
  T3 fills chunk[3] (offset=186) with seed=42+186 -> same values
```

The output vector is bitwise identical between runs (given the same length and thread count).
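The trace above can be reproduced in miniature with plain Rust. This is a standalone sketch: `toy_rng` is a stand-in for a real seedable PRNG such as `ChaChaRng`, and the chunk orders simulate two different thread schedules:

```rust
/// Tiny LCG used as a stand-in for a seedable PRNG (illustrative only).
fn toy_rng(seed: u64, n: usize) -> Vec<u64> {
    let mut s = seed;
    (0..n)
        .map(|_| {
            s = s.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            s
        })
        .collect()
}

/// Fill `len` slots chunk-by-chunk, seeding each chunk from its offset,
/// visiting chunks in the given (scheduling) order.
fn fill(len: usize, chunk_size: usize, seed: u64, order: &[usize]) -> Vec<u64> {
    let mut out = vec![0u64; len];
    for &c in order {
        let offset = c * chunk_size;
        let end = (offset + chunk_size).min(len);
        out[offset..end].copy_from_slice(&toy_rng(seed + offset as u64, end - offset));
    }
    out
}

fn main() {
    // Two different scheduling orders over the same 4 chunks
    let run1 = fill(10, 3, 42, &[2, 0, 3, 1]);
    let run2 = fill(10, 3, 42, &[0, 1, 2, 3]);
    assert_eq!(run1, run2); // bitwise identical output
}
```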
### Important: Thread Count Sensitivity

Chunk boundaries depend on `rayon::current_num_threads()`. Machines with different core counts will produce different chunk boundaries and therefore different random sequences. This is fine for most use cases (simulation, Monte Carlo), but if you need cross-machine reproducibility, pin the thread pool size:

```rust
// Pin to exactly 4 threads regardless of machine
rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .build_global()
    .unwrap();
```

### Do NOT Use R's RNG in Parallel
R's RNG (`RRng`, `Rf_runif`, etc.) calls R APIs, which panic on Rayon threads. Use a Rust RNG crate (`rand`, `rand_chacha`) instead:

```rust
// WRONG: R's RNG calls R APIs and panics in parallel
with_r_vec(len, |chunk, _| {
    let mut rng = RRng::new(); // PANICS!
    for slot in chunk { *slot = rng.uniform_f64(); }
});

// CORRECT: a Rust RNG is thread-safe
with_r_vec(len, |chunk, offset| {
    let mut rng = ChaChaRng::seed_from_u64(seed + offset as u64);
    for slot in chunk { *slot = rng.gen(); }
});
```

## Performance
### When to Use Rayon

Good use cases:

- CPU-intensive transformations (sqrt, log, trig functions)
- Large datasets (>10,000 elements)
- Embarrassingly parallel problems
- Reductions (sum, min, max, mean)
- Operations with minimal R API interaction

Be careful with:

- Frequent R API calls (each call has channel overhead)
- Small datasets (<1,000 elements, where overhead outweighs gains)
- Operations R can already vectorize efficiently

Avoid:

- Calling R for every element in a tight loop
- Parallel evaluation of R code (R is single-threaded!)
- Simple operations R handles well (addition, multiplication)
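A quick way to act on "profile first" is to time the sequential version of a transform before reaching for Rayon. This standalone sketch uses only the standard library; `time_sequential_sqrt` is an illustrative helper, not part of the miniextendr API:

```rust
use std::time::{Duration, Instant};

/// Time a sequential sqrt over n elements (illustrative profiling helper).
fn time_sequential_sqrt(n: usize) -> (usize, Duration) {
    let x: Vec<f64> = (0..n).map(|i| i as f64).collect();
    let start = Instant::now();
    let y: Vec<f64> = x.iter().map(|&v| v.sqrt()).collect();
    (y.len(), start.elapsed())
}

fn main() {
    // If the sequential pass is already sub-millisecond, thread dispatch
    // overhead will likely dominate any parallel speedup.
    let (len, elapsed) = time_sequential_sqrt(1_000_000);
    println!("sequential sqrt over {len} elements: {elapsed:?}");
}
```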
### Optimization Tips

- Use `.collect_r()` for parallel iterator chains: zero-copy into R memory
- Use `par_map` for simple element-wise transforms (same performance, closure style)
- Use `with_r_vec` when you need per-chunk state (RNG, accumulators)
- Use `par_collect_sexp` only for non-indexed iterators (`.filter()`, `.flat_map()`)
- Profile first: measure before assuming parallelism helps
- Consider R alternatives: vectorized R operations are fast
## Safety

### Thread Safety Invariants

Safe patterns:
```rust
// par_map: framework handles everything
par_map(x, |&v| v.sqrt());

// with_r_vec: chunk-based, pure Rust in the closure
with_r_vec(n, |chunk, offset| {
    for (i, slot) in chunk.iter_mut().enumerate() {
        *slot = ((offset + i) as f64).sqrt();
    }
});
```

Unsafe patterns:

```rust
// WRONG: R API in the closure -> PANICS
with_r_vec(n, |chunk, _| {
    let sexp = unsafe { ffi::Rf_ScalarReal(1.0) }; // CRASH!
});

// WRONG: with_r_thread inside the closure -> PANICS
with_r_vec(n, |chunk, _| {
    with_r_thread(|| { /* ... */ }); // PANICS! Not on the worker thread
});
```

### Memory Safety

- GC protection: pre-allocated SEXPs are `Rf_protect`ed during parallel writes
- No concurrent R access: all R operations happen before/after the parallel section
- RAII guards: `WorkerUnprotectGuard` ensures cleanup even on panic
## Examples

### Example 1: Parallel Normalization

```rust
use miniextendr_api::rayon_bridge::*;
use rayon::prelude::*;

#[miniextendr]
fn parallel_normalize(x: &[f64]) -> SEXP {
    let (sum, sum_sq, count) = x.par_iter().fold(
        || (0.0, 0.0, 0usize),
        |(s, sq, c), &val| (s + val, sq + val * val, c + 1)
    ).reduce(
        || (0.0, 0.0, 0),
        |(s1, sq1, c1), (s2, sq2, c2)| (s1 + s2, sq1 + sq2, c1 + c2)
    );
    let mean = sum / count as f64;
    let sd = ((sum_sq / count as f64) - mean * mean).sqrt();
    par_map(x, |&v| (v - mean) / sd)
}
```

### Example 2: Matrix Fill by Column
```rust
#[miniextendr]
fn identity_matrix(n: i32) -> SEXP {
    let n = n as usize;
    with_r_matrix(n, n, |col: &mut [f64], col_idx| {
        // Each column gets a 1.0 at its diagonal position
        col[col_idx] = 1.0;
    })
}
```

### Example 3: Parallel Filtering
```rust
#[miniextendr]
fn parallel_filter_positive(x: &[f64]) -> Vec<f64> {
    x.par_iter()
        .copied()
        .filter(|&v| v > 0.0)
        .collect()
}
```

### Example 4: Two-Input Operations
```rust
// Element-wise squared differences; sum and take the square root on the
// R side to obtain the Euclidean distance.
#[miniextendr]
fn squared_differences(x: &[f64], y: &[f64]) -> SEXP {
    par_map2(x, y, |&a, &b| (a - b).powi(2))
}
```

## Troubleshooting
### Error: "with_r_thread called outside of run_on_worker context"

Solution: with the worker-thread feature, Rayon integration only works inside `#[miniextendr]` functions (which use `run_on_worker`). Without the feature, `with_r_thread` is an inline stub that requires `miniextendr_runtime_init()` to have been called.
### Slow Performance

Check:

- Dataset size (<10,000 elements may not benefit)
- Number of threads (`perf::num_threads()`)
- Computation cost per element (it must justify the thread overhead)
### Debugging

```rust
use miniextendr_api::rayon_bridge::perf;

eprintln!("Rayon threads: {}", perf::num_threads());
eprintln!("In Rayon thread: {}", perf::in_rayon_thread());
eprintln!("Thread index: {:?}", perf::thread_index());
```

## See Also

- SAFETY.md – thread safety invariants
- ENTRYPOINT.md – worker initialization requirements
- `miniextendr-bench/benches/rayon.rs` – performance benchmarks