Zero-copy conversions between R vectors and Apache Arrow arrays.

Quick Reference

use miniextendr_api::{miniextendr, ffi::SEXP};
use miniextendr_api::optionals::arrow_impl::*;

// R numeric → Arrow Float64Array → back to R: zero-copy both directions
#[miniextendr]
pub fn passthrough_numeric(x: Float64Array) -> Float64Array {
    x
}

// R integer → Arrow Int32Array → back to R: zero-copy both directions
#[miniextendr]
pub fn passthrough_integer(x: Int32Array) -> Int32Array {
    x
}

// Compute on Arrow, return to R (copies on return: the data is new)
#[miniextendr]
pub fn doubled(x: Float64Array) -> Float64Array {
    x.iter().map(|v| v.map(|f| f * 2.0)).collect()
}

// RecordBatch round-trip: primitive columns zero-copy per-column
#[miniextendr]
pub fn passthrough_df(df: RecordBatch) -> RecordBatch {
    df
}

Zero-Copy String Vectors

R stores strings as STRSXP (array of CHARSXP pointers). Each CHARSXP is interned, GC-managed, and has a known LENGTH. Instead of copying into String, borrow directly.

Cow<'static, str>: scalar

use std::borrow::Cow;

#[miniextendr]
pub fn greet(name: Cow<'static, str>) -> String {
    // name is Cow::Borrowed: it points directly into R's CHARSXP data.
    // No allocation unless you call .to_mut()
    format!("Hello, {}!", name)
}

Vec<Cow<'static, str>>: vector, zero-copy per element

#[miniextendr]
pub fn upper_first(words: Vec<Cow<'static, str>>) -> Vec<String> {
    // Each element is Cow::Borrowed (zero-copy from R's CHARSXP pool)
    words.iter().map(|w| {
        let mut s = w.to_string();
        if let Some(c) = s.get_mut(0..1) {
            c.make_ascii_uppercase();
        }
        s
    }).collect()
}

// NA-aware variant: None for NA_character_
#[miniextendr]
pub fn count_non_na(words: Vec<Option<Cow<'static, str>>>) -> i32 {
    words.iter().filter(|w| w.is_some()).count() as i32
}

Cow<'static, [T]>: numeric slices

#[miniextendr]
pub fn sum_cow(x: Cow<'static, [f64]>) -> f64 {
    // Cow::Borrowed: x points directly into R's REALSXP data
    x.iter().sum()
}

// Round-trip: if x was borrowed from R, IntoR returns the original SEXP
#[miniextendr]
pub fn passthrough_cow(x: Cow<'static, [i32]>) -> Cow<'static, [i32]> {
    x  // zero-copy: SEXP pointer recovery finds the original R vector
}
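
The copy-on-write behavior these signatures rely on comes from the standard library's Cow, so it can be sketched without R at all. The function below is a hypothetical illustration (clamp_negatives is not part of miniextendr): it stays borrowed when nothing needs to change and allocates only when .to_mut() is actually called.

```rust
use std::borrow::Cow;

/// Hypothetical helper: replace negative values with zero.
/// The input is copied only if at least one element must change.
fn clamp_negatives(mut x: Cow<'_, [i32]>) -> Cow<'_, [i32]> {
    if x.iter().any(|&v| v < 0) {
        // First call to .to_mut() on a Cow::Borrowed clones the slice
        // into an owned Vec; later writes go to that Vec.
        for v in x.to_mut().iter_mut() {
            if *v < 0 {
                *v = 0;
            }
        }
    }
    // Still Cow::Borrowed if the branch above was never taken.
    x
}
```

In a #[miniextendr] function the borrowed case is the one that can round-trip back to the original R vector; once the Cow becomes owned, the data lives in Rust memory and has to be copied on return.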

ProtectedStrVec vs StrVec: safety vs speed

ProtectedStrVec and StrVec both wrap an R STRSXP and provide zero-copy &str access to its elements. They differ in GC safety:

|                 | StrVec                            | ProtectedStrVec                           |
|-----------------|-----------------------------------|-------------------------------------------|
| Size            | 1 word (just the SEXP)            | 3 words (SEXP + len + OwnedProtect)       |
| Copy            | Copy                              | !Copy (owns protection guard)             |
| GC protection   | None (caller’s responsibility)    | OwnedProtect keeps STRSXP alive           |
| Borrow lifetime | &'static str (lie)                | &'a str tied to &'a self                  |
| Iterator        | StrVecIter (Option<&'static str>) | ProtectedStrVecIter<'a> (Option<&'a str>) |

The key difference is lifetime safety. ProtectedStrVec ties all borrows to the struct’s lifetime. The compiler catches use-after-free:

let dangling: &str;
{
    let sv = unsafe { ProtectedStrVec::new(sexp) };
    dangling = sv.get_str(0).unwrap(); // borrows &sv
} // sv dropped → SEXP unprotected
// dangling is now invalid. COMPILER ERROR: sv doesn't live long enough

With StrVec or Vec<&'static str>, the same code compiles silently and produces a dangling pointer: the 'static lifetime is a lie (the data is only valid while R protects the SEXP).

When to use which:

  • StrVec / Vec<&'static str>: inside a #[miniextendr] function where R protects the .Call argument. Lightweight and sufficient, because the SEXP won’t be GC’d during the call.
  • ProtectedStrVec: when you store the string vector beyond the immediate scope, pass it to a closure, or want the compiler to catch lifetime bugs. The OwnedProtect guard keeps the STRSXP alive until the struct is dropped.

Usage examples:

use miniextendr_api::ProtectedStrVec;
use std::collections::HashSet;

#[miniextendr]
pub fn count_unique(strings: ProtectedStrVec) -> i32 {
    // Lifetimes tied to &self: the compiler enforces GC safety
    let unique: HashSet<&str> = strings.iter()
        .filter_map(|s| s)  // skip NA
        .collect();
    unique.len() as i32
}

// Can't return &str: ProtectedStrVec is consumed by IntoR, so there's
// nothing to borrow from. Return String or the whole ProtectedStrVec.
#[miniextendr]
pub fn first_non_na(strings: ProtectedStrVec) -> String {
    strings.iter()
        .find_map(|s| s)
        .unwrap_or("")
        .to_owned()
}

use miniextendr_api::StrVec;

#[miniextendr]
pub fn has_empty(strings: StrVec) -> bool {
    // StrVec is Copy: just a SEXP wrapper. R protects .Call arguments,
    // so this is safe within the function body.
    strings.iter().any(|opt| opt == Some(""))
}

Arrow Arrays

R → Arrow (already zero-copy for primitives)

use miniextendr_api::optionals::arrow_impl::*;

#[miniextendr]
pub fn arrow_mean(x: Float64Array) -> f64 {
    // x.values() points directly into R's REALSXP data (zero-copy)
    // NA values are tracked in Arrow's null bitmap, not in the data
    let sum: f64 = x.iter().flatten().sum();
    let count = x.len() - x.null_count();
    sum / count as f64
}

#[miniextendr]
pub fn arrow_filter_positive(x: Int32Array) -> Int32Array {
    // Arrow compute: the result is a new array (Rust-allocated)
    x.iter()
        .map(|v| v.filter(|&i| i > 0))
        .collect()
}

Arrow → R (automatic SEXP recovery)

When an Arrow array’s data buffer came from R (via sexp_to_arrow_buffer), IntoR automatically recovers the original SEXP using pointer arithmetic. No wrapper types needed.

// This is zero-copy BOTH directions:
#[miniextendr]
pub fn identity(x: Float64Array) -> Float64Array {
    x  // R→Arrow (zero-copy) → Arrow→R (pointer recovery, zero-copy)
}

// This copies on return (new data, not from R):
#[miniextendr]
pub fn squares(x: Float64Array) -> Float64Array {
    x.iter().map(|v| v.map(|f| f * f)).collect()
}

RecordBatch (data.frame)

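use std::sync::Arc;

use arrow_array::cast::AsArray;
// Assumes arrow_schema is a direct dependency, as arrow_array is above
use arrow_schema::{DataType, Field, Schema};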

#[miniextendr]
pub fn df_add_column(df: RecordBatch) -> RecordBatch {
    let col0: &Float64Array = df.column(0).as_primitive();

    // Compute new column
    let new_col: Float64Array = col0.iter()
        .map(|v| v.map(|f| f * 2.0))
        .collect();

    // Build new batch: original columns return to R zero-copy,
    // new column copies (it's Rust-allocated)
    let mut fields = df.schema().fields().to_vec();
    fields.push(Arc::new(Field::new("doubled", DataType::Float64, true)));
    let schema = Arc::new(Schema::new(fields));

    let mut columns = df.columns().to_vec();
    columns.push(Arc::new(new_col));

    RecordBatch::try_new(schema, columns).unwrap()
}

alloc_r_backed_buffer: Rust → Arrow → R zero-copy

Allocate an Arrow buffer backed by R memory from the start. Write through the raw SEXP pointer, then wrap in Arrow types. When the array is later converted to R, pointer recovery finds the original SEXP.

use miniextendr_api::optionals::arrow_impl::alloc_r_backed_buffer;

#[miniextendr]
pub fn generate_sequence(n: i32) -> SEXP {
    use miniextendr_api::IntoR;
    let n = n as usize;

    // Allocate buffer as R REALSXP: the data lives in R's heap
    let (buffer, sexp) = unsafe { alloc_r_backed_buffer::<f64>(n) };

    // Fill through the SEXP's raw pointer (before wrapping in Arrow)
    unsafe {
        let ptr = miniextendr_api::ffi::REAL(sexp);
        for i in 0..n {
            *ptr.add(i) = i as f64;
        }
    }

    // Wrap as Arrow array
    let values = arrow_buffer::ScalarBuffer::<f64>::from(buffer);
    let array = Float64Array::new(values, None);

    // IntoR → pointer recovery → returns the same REALSXP (zero-copy)
    array.into_sexp()
}

RStringArray: string round-trip tracking

Arrow’s StringArray and R’s STRSXP have incompatible layouts (contiguous data+offsets vs per-element CHARSXPs). Automatic pointer recovery can’t work for strings. RStringArray explicitly tracks the source STRSXP.

use miniextendr_api::optionals::arrow_impl::RStringArray;

#[miniextendr]
pub fn string_passthrough(x: RStringArray) -> RStringArray {
    // x.source is Some(strsxp): IntoR returns the original STRSXP
    x
}

#[miniextendr]
pub fn string_lengths(x: RStringArray) -> Vec<i32> {
    // Deref to StringArray β€” all Arrow APIs work
    x.iter().map(|opt| opt.map(|s| s.len() as i32).unwrap_or(-1)).collect()
}
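
To see why the two layouts cannot share memory, it helps to build the Arrow-style representation by hand. The sketch below uses only the standard library; PackedStrings is a hypothetical stand-in for the data+offsets layout described above, not Arrow's StringArray or anything in miniextendr.

```rust
/// Hypothetical Arrow-style string layout: one contiguous byte buffer
/// plus n + 1 offsets marking where each element starts and ends.
struct PackedStrings {
    offsets: Vec<i32>,
    data: Vec<u8>,
}

impl PackedStrings {
    /// Packing always copies: every element's bytes are appended to `data`.
    fn from_strs(strings: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut data = Vec::new();
        for s in strings {
            data.extend_from_slice(s.as_bytes());
            offsets.push(data.len() as i32);
        }
        PackedStrings { offsets, data }
    }

    /// Element i is the byte range offsets[i]..offsets[i + 1].
    fn get(&self, i: usize) -> Option<&str> {
        let start = *self.offsets.get(i)? as usize;
        let end = *self.offsets.get(i + 1)? as usize;
        std::str::from_utf8(&self.data[start..end]).ok()
    }
}
```

An STRSXP, by contrast, holds one pointer per element, each to a separately allocated CHARSXP. No single base address corresponds to the packed buffer, so there is nothing to subtract an offset from, which is why the source STRSXP has to be carried along explicitly.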

ALTREP for Cow string vectors

Vec<Cow<'static, str>> supports ALTREP with seamless serialization:

use miniextendr_api::IntoRAltrep;
use std::borrow::Cow;

#[miniextendr]
pub fn lazy_strings(prefix: &str, n: i32) -> SEXP {
    let strings: Vec<Cow<'static, str>> = (0..n)
        .map(|i| Cow::Owned(format!("{}_{}", prefix, i)))
        .collect();
    strings.into_sexp_altrep()
    // R sees a character vector; elements computed on demand via ALTREP Elt
    // saveRDS/readRDS works: serializes to STRSXP, deserializes back to Vec<Cow>
}

How It Works

SEXP Pointer Recovery (r_memory module)

R stores vector data at a fixed offset from the SEXP header:

[VECTOR_SEXPREC header (48 bytes on 64-bit)] [data...]
 ^                                            ^
 SEXP                                         DATAPTR_RO(sexp)

All R vector types (REALSXP, INTSXP, RAWSXP, STRSXP, VECSXP) use the same VECTOR_SEXPREC header. Non-vector types use larger SEXPREC but don’t have data pointers.

At package init, we measure the offset on a real R vector. Then in IntoR:

candidate_sexp = data_ptr - offset
verify: TYPEOF(candidate) == expected AND LENGTH(candidate) == expected AND DATAPTR_RO(candidate) == data_ptr

Safety consideration: For Rust-allocated buffers, data_ptr - offset points to arbitrary heap memory. The 4-byte type-tag read at that address is technically undefined behavior in Rust’s abstract model (the pointer wasn’t derived from an R allocation). In practice, this is safe: the address is in mapped heap memory, and the read is immediately validated by the triple check (type + length + DATAPTR_RO round-trip), which makes false positives impossible. ALTREP vectors also fail safely: the DATAPTR_RO round-trip check catches them, since ALTREP data isn’t at a fixed offset.
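
The recovery check can be sketched with a mock heap. This is a minimal illustration, not the r_memory implementation: the Heap type, its two-word header, and every function name here are hypothetical, and indices stand in for raw pointers so the sketch stays in safe Rust.

```rust
// Mock heap of 64-bit words. A "vector" is two header words (type tag,
// length) followed by its data. This layout is invented for illustration
// and is much simpler than R's real VECTOR_SEXPREC header.
const HEADER_WORDS: usize = 2;
const TAG_REAL: u64 = 14; // R's type code for REALSXP
const TAG_INT: u64 = 13; // R's type code for INTSXP

struct Heap {
    words: Vec<u64>,
}

impl Heap {
    /// Allocate a headered vector; returns the header index (the mock "SEXP").
    fn alloc_vector(&mut self, tag: u64, data: &[u64]) -> usize {
        let header = self.words.len();
        self.words.push(tag);
        self.words.push(data.len() as u64);
        self.words.extend_from_slice(data);
        header
    }

    /// Allocate data with no header, standing in for a Rust-allocated buffer.
    fn alloc_raw(&mut self, data: &[u64]) -> usize {
        let start = self.words.len();
        self.words.extend_from_slice(data);
        start
    }

    /// Mock DATAPTR_RO: data sits at a fixed offset from the header.
    fn data_ptr(&self, header: usize) -> usize {
        header + HEADER_WORDS
    }

    /// candidate = data_ptr - offset, accepted only if all three checks pass.
    fn recover(&self, data_ptr: usize, tag: u64, len: usize) -> Option<usize> {
        let candidate = data_ptr.checked_sub(HEADER_WORDS)?;
        let tag_ok = *self.words.get(candidate)? == tag;
        let len_ok = *self.words.get(candidate + 1)? == len as u64;
        // In this mock the round trip holds by construction; in R it is the
        // check that rejects ALTREP vectors, whose data is not at a fixed offset.
        let ptr_ok = self.data_ptr(candidate) == data_ptr;
        (tag_ok && len_ok && ptr_ok).then_some(candidate)
    }
}
```

Recovering from the data index of a headered vector returns its header, while the same call on a raw buffer lands on unrelated words and is rejected by the tag or length check.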

String conversion (charsxp_to_str)

charsxp_to_str() uses R_CHAR + LENGTH (O(1)) with from_utf8_unchecked. There is no per-string UTF-8 validation: miniextendr_assert_utf8_locale() at package init guarantees all CHARSXPs in the session are valid UTF-8. charsxp_to_cow() wraps the result in Cow::Borrowed (always borrowed, never owned).
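
The same pointer-plus-length pattern can be shown with the standard library alone. The function below is a hypothetical sketch (bytes_to_cow is not a miniextendr function): it assumes the caller already guarantees valid UTF-8, which is the role the package-init locale check plays above.

```rust
use std::borrow::Cow;

/// Hypothetical sketch: build a borrowed string from a pointer and a
/// known byte length, skipping UTF-8 validation.
///
/// # Safety
/// `ptr` must point to `len` bytes of valid UTF-8 that stay alive and
/// unmodified for the whole lifetime `'a`.
unsafe fn bytes_to_cow<'a>(ptr: *const u8, len: usize) -> Cow<'a, str> {
    // O(1): no scan for a terminator, no validation pass, no allocation.
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    Cow::Borrowed(unsafe { std::str::from_utf8_unchecked(bytes) })
}
```

The result always points at the caller's bytes, so its validity depends entirely on the safety contract being upheld.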

Type Decision Tree

Need strings from R?
├── Scalar → Cow<'static, str>          (zero-copy)
├── Vector, need ownership → Vec<String> (copies, lossy NA→"")
├── Vector, read-only → Vec<Cow<'static, str>>  (zero-copy per element)
├── Vector, NA-aware → Vec<Option<Cow<'static, str>>>
├── View with GC safety → ProtectedStrVec
└── Lightweight view β†’ StrVec           (Copy, caller manages GC)

Need numerics from R?
├── As Rust slice → &[f64] / &[i32]    (zero-copy, 'static lifetime)
├── Copy-on-write → Cow<'static, [f64]> (zero-copy, copies on .to_mut())
├── As Arrow array → Float64Array       (zero-copy both directions)
└── Owned copy β†’ Vec<f64>              (copies)

Need data frames?
├── As Arrow → RecordBatch             (primitive cols zero-copy both ways)
└── As Arrow (string cols too) β†’ use RStringArray per column