Parallel Processing

When processing large datasets, parallel generation can significantly reduce execution time. This tutorial covers efficient batch processing strategies with localLLM.

Why Parallel Processing?

A sequential for-loop handles one prompt at a time. Parallel processing batches multiple prompts together so that they share computation, which reduces per-prompt overhead.

In the benchmarks reported later in this tutorial, generate_parallel() completes in roughly 58–76% of the time taken by sequential generate() calls (a 1.3×–1.7× speedup, depending on the model).
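
To make the idea of batching concrete, here is a minimal base-R sketch that groups a prompt vector into chunks of at most n_seq_max items. It only illustrates the grouping; generate_parallel() handles scheduling itself, and nothing here describes the library's internals. The toy prompts and the names batch_id and batches are made up for illustration.

```r
# Toy prompts; in practice these would be chat-formatted strings.
prompts <- paste("Prompt", 1:25)
n_seq_max <- 10  # assumed to match the n_seq_max used for the context

# Assign each prompt to a batch: 1,1,...,1,2,2,...
batch_id <- ceiling(seq_along(prompts) / n_seq_max)
batches <- split(prompts, batch_id)

# Number of prompts in each batch
batch_sizes <- unname(lengths(batches))
print(batch_sizes)
#> [1] 10 10  5
```

A chunked layout like this can also be handy for saving intermediate results between chunks on very large datasets.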

Using generate_parallel()

Basic Usage

library(localLLM)

# Load model
model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999)

# Create context with batch support
ctx <- context_create(
  model,
  n_ctx = 2048,
  n_seq_max = 10  # Allow up to 10 parallel sequences
)

# Define prompts
prompts <- c(
  "What is the capital of France?",
  "What is the capital of Germany?",
  "What is the capital of Italy?"
)

# Format prompts
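# vapply() guarantees a character vector; USE.NAMES = FALSE avoids
# carrying the raw prompts along as element names
formatted_prompts <- vapply(prompts, function(p) {
  messages <- list(
    list(role = "system", content = "Answer concisely."),
    list(role = "user", content = p)
  )
  apply_chat_template(model, messages)
}, character(1), USE.NAMES = FALSE)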

# Process in parallel
results <- generate_parallel(ctx, formatted_prompts, max_tokens = 50)
print(results)

#> [1] "The capital of France is Paris."
#> [2] "The capital of Germany is Berlin."
#> [3] "The capital of Italy is Rome."

Progress Tracking

Progress reporting is enabled by default in interactive sessions (progress = interactive()). To force it in non-interactive scripts, set progress = TRUE explicitly:

results <- generate_parallel(
  ctx,
  formatted_prompts,
  max_tokens = 50,
  progress = TRUE  # force progress bar even in non-interactive mode
)

#> Processing 100 prompts...
#> [##########----------] 50%
#> [####################] 100%
#> Done!

Text Classification Example

Here’s a complete example classifying news articles:

library(localLLM)

# Load sample dataset
data("ag_news_sample", package = "localLLM")

# Load model
model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999)

# Create context (n_seq_max determines max parallel prompts)
ctx <- context_create(model, n_ctx = 1048, n_seq_max = 10)

# Prepare all prompts
all_prompts <- character(nrow(ag_news_sample))

for (i in seq_len(nrow(ag_news_sample))) {
  messages <- list(
    list(role = "system", content = "You are a helpful assistant."),
    list(role = "user", content = paste0(
      "Classify this news article into exactly one category: ",
      "World, Sports, Business, or Sci/Tech. ",
      "Respond with only the category name.\n\n",
      "Title: ", ag_news_sample$title[i], "\n",
      "Description: ", substr(ag_news_sample$description[i], 1, 100), "\n\n",
      "Category:"
    ))
  )
  all_prompts[i] <- apply_chat_template(model, messages)
}

# Process all samples in parallel
results <- generate_parallel(
  context = ctx,
  prompts = all_prompts,
  max_tokens = 5,
  seed = 92092,
  progress = TRUE,
  clean = TRUE
)

# Extract predictions
ag_news_sample$LLM_result <- sapply(results, function(x) {
  trimws(gsub("\\n.*$", "", x))
})

# Calculate accuracy
accuracy <- mean(ag_news_sample$LLM_result == ag_news_sample$class)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")

#> Accuracy: 87.0 %
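
Overall accuracy does not show which categories get confused with each other. The following sketch, using only base R and a handful of invented labels (not real model output), shows one way to build a confusion table and to flag predictions that fall outside the expected category set. The variable names are illustrative.

```r
# Toy labels, invented for illustration only (not real model output).
categories <- c("World", "Sports", "Business", "Sci/Tech")
truth <- c("World", "Sports", "Business", "Sci/Tech", "Sports", "World")
pred  <- c("World", "Sports", "Sci/Tech", "Sci/Tech", "Sports", "Business")

# Overall accuracy, as in the example above
accuracy <- mean(pred == truth)

# Rows are true classes, columns are predicted classes
confusion <- table(truth = truth, predicted = pred)

# Predictions that are not one of the expected category names
invalid <- !(pred %in% categories)

print(confusion)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")
cat("Invalid predictions:", sum(invalid), "\n")
```

Checking for invalid predictions first is useful because a model that answers with extra words counts as wrong under an exact string comparison.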

Sequential vs Parallel Comparison

Sequential (For Loop)

# Sequential approach
ag_news_sample$LLM_result <- NA_character_  # character column, filled in below
ctx <- context_create(model, n_ctx = 512)

system.time({
  for (i in seq_len(nrow(ag_news_sample))) {
    formatted_prompt <- all_prompts[i]
    output <- generate(ctx, formatted_prompt, max_tokens = 5, seed = 92092)
    ag_news_sample$LLM_result[i] <- trimws(output)
  }
})

#>    user  system elapsed
#>   0.62    0.08   41.55

Parallel

# Parallel approach
ctx <- context_create(model, n_ctx = 1048, n_seq_max = 10)

system.time({
  results <- generate_parallel(
    ctx, all_prompts,
    max_tokens = 5,
    seed = 92092,
    progress = TRUE
  )
})

#>    user  system elapsed
#>   0.38    0.04   24.08

Result: the parallel run takes about 42% less time (24.08 s vs 41.55 s elapsed), a 1.73× speedup.

Benchmark: Multiple Models

Tested on Apple M3 Pro (18 GB unified memory), 100 AG News classification prompts, n_ctx = 512, max_tokens = 50, n_seq_max = 10:

Model	Sequential	Parallel (10×)	Speedup
Llama-3.2-3B-Instruct-Q5_K_M	41.6 sec	24.1 sec	1.73×
Gemma-3-4B-it-QAT-Q5_K_M	41.3 sec	30.0 sec	1.38×
OLMo-3-7B-Instruct-Q5_K_M	61.5 sec	43.3 sec	1.42×
Gemma-4-26B-A4B-it-IQ2_XXS	69.2 sec	52.9 sec	1.31×

On Apple Silicon (M3 Pro), smaller models tend to show higher parallel speedup than larger ones. The GPU is underutilised during single-sequence inference for small models, so batching provides more headroom. Larger models approach GPU saturation even at n_seq_max = 1, leaving less room for parallel gains.
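
The speedup column can be reproduced from the timings in the table. The sketch below uses base R only; the shortened model labels and the column names are chosen here for illustration.

```r
# Timings copied from the benchmark table (seconds)
bench <- data.frame(
  model = c("Llama-3.2-3B", "Gemma-3-4B", "OLMo-3-7B", "Gemma-4-26B-A4B"),
  sequential_sec = c(41.6, 41.3, 61.5, 69.2),
  parallel_sec = c(24.1, 30.0, 43.3, 52.9)
)

# Speedup = sequential time / parallel time
bench$speedup <- round(bench$sequential_sec / bench$parallel_sec, 2)

# Fraction of the sequential time that the parallel run needs
bench$time_fraction <- round(bench$parallel_sec / bench$sequential_sec, 2)

print(bench)
```

The two derived columns describe the same measurement from different angles: a 1.73× speedup corresponds to the parallel run needing 0.58 of the sequential time.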

Note on reasoning models: DeepSeek-R1 and similar reasoning models (QwQ, Gemma 4) output a thinking block before the final answer (e.g. <think>...</think>answer). For classification tasks, strip the thinking section before evaluating predictions:

clean_pred <- function(x) {
  # Remove the thinking block; (?s) lets "." match the newlines inside it
  x <- gsub("(?s)<think>.*?</think>", "", x, perl = TRUE)
  # Keep only the first line of the remaining text
  trimws(gsub("\n.*", "", trimws(x)))
}
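
Thinking blocks usually span several lines, so it is worth checking a cleaning step against multi-line input. The sketch below defines a stand-alone helper, strip_thinking (a hypothetical name, not part of localLLM), and runs it on two invented outputs.

```r
# Hypothetical helper for illustration; not a localLLM function.
strip_thinking <- function(x) {
  # (?s) makes "." match newlines, so multi-line blocks are removed
  x <- gsub("(?s)<think>.*?</think>", "", x, perl = TRUE)
  # Keep only the first line of what remains
  trimws(sub("(?s)\n.*", "", trimws(x), perl = TRUE))
}

# Invented model outputs: one with a thinking block, one with trailing text
raw <- c(
  "<think>\nThe article is about a football match.\n</think>\nSports",
  "Business\nBecause it mentions stocks."
)

cleaned <- strip_thinking(raw)
print(cleaned)
#> [1] "Sports"   "Business"
```

Without the (?s) modifier, a Perl-style "." stops at line breaks and a multi-line thinking block would be left in place.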

Using quick_llama() for Batches

The simplest approach for parallel processing is passing a vector to quick_llama():

# quick_llama automatically uses parallel mode for vectors
prompts <- c(
  "Summarize: Climate change is affecting global weather patterns...",
  "Summarize: The stock market reached new highs today...",
  "Summarize: Scientists discovered a new species of deep-sea fish..."
)

results <- quick_llama(prompts, max_tokens = 50)
print(results)

Performance Considerations

Context Size and n_seq_max

The context window is shared across parallel sequences:

# If n_ctx = 2048 and n_seq_max = 8
# Each sequence gets approximately 2048/8 = 256 tokens

# For longer prompts, increase n_ctx proportionally
ctx <- context_create(
  model,
  n_ctx = 4096,   # Larger context
  n_seq_max = 8   # 8 parallel sequences
)
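
The arithmetic above can be wrapped in two small helpers. This is a rough sketch that assumes the context is split evenly across sequences, as the approximation in the comments suggests; the backend's actual allocation may differ. The function names and the token estimates are hypothetical.

```r
# Approximate tokens available to each sequence (even split assumed)
per_seq_budget <- function(n_ctx, n_seq_max) {
  floor(n_ctx / n_seq_max)
}

# Smallest n_ctx that gives every sequence room for its prompt
# plus the tokens it is allowed to generate
suggest_n_ctx <- function(prompt_tokens, max_tokens, n_seq_max) {
  n_seq_max * (prompt_tokens + max_tokens)
}

budget <- per_seq_budget(2048, 8)
needed <- suggest_n_ctx(prompt_tokens = 400, max_tokens = 50, n_seq_max = 8)

cat("Tokens per sequence:", budget, "\n")
cat("Suggested n_ctx:", needed, "\n")
#> Tokens per sequence: 256
#> Suggested n_ctx: 3600
```

Here prompt_tokens stands for an estimate of the longest formatted prompt; rounding the suggested value up leaves some headroom.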

Memory Usage

Parallel processing uses more memory. Check what the machine has before raising n_ctx or n_seq_max:

hw <- hardware_profile()
cat("Total RAM:", round(hw$ram_total / 1e9, 1), "GB\n")
cat("GPU:", hw$gpu$name, "\n")

Batch Size Recommendations

Dataset Size	Recommended n_seq_max
< 100	4-8
100-1000	8-16
> 1000	16-32 (memory permitting)
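
The table can be expressed as a small lookup function. This is only a sketch of the recommendations above; recommend_n_seq_max is a hypothetical helper, and the returned range is a starting point rather than a guarantee that the setting fits in memory.

```r
# Map a dataset size to the recommended n_seq_max range from the table
recommend_n_seq_max <- function(n_prompts) {
  if (n_prompts < 100) {
    c(min = 4, max = 8)
  } else if (n_prompts <= 1000) {
    c(min = 8, max = 16)
  } else {
    c(min = 16, max = 32)  # memory permitting
  }
}

small <- recommend_n_seq_max(50)
large <- recommend_n_seq_max(5000)
print(small)
print(large)
```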

Error Handling

If a prompt fails, the result will contain an error message:

results <- generate_parallel(ctx, prompts, max_tokens = 50)

# Check for errors
for (i in seq_along(results)) {
  if (grepl("^Error:", results[i])) {
    cat("Prompt", i, "failed:", results[i], "\n")
  }
}
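
The same check can be written without a loop. The sketch below uses a mock results vector (the error text is invented) and relies only on the "Error:" prefix described above; it marks failed entries as NA so they are not mistaken for predictions.

```r
# Mock results; the second entry imitates a failed prompt
results <- c("Paris", "Error: example failure message", "Rome")

# Logical vector: TRUE where the result starts with "Error:"
failed <- grepl("^Error:", results)
failed_idx <- which(failed)

# Replace failures with NA so downstream comparisons skip them
clean_results <- results
clean_results[failed] <- NA_character_

cat("Failed prompts:", failed_idx, "\n")
#> Failed prompts: 2
```

The indices in failed_idx can then be used to inspect or resubmit only the affected prompts.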

Complete Workflow

library(localLLM)

# 1. Setup
model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999)
ctx <- context_create(model, n_ctx = 2048, n_seq_max = 10)

# 2. Prepare prompts
data("ag_news_sample", package = "localLLM")

prompts <- sapply(seq_len(nrow(ag_news_sample)), function(i) {
  messages <- list(
    list(role = "system", content = "Classify news articles."),
    list(role = "user", content = paste0(
      "Category (World, Sports, Business, or Sci/Tech): ",
      ag_news_sample$title[i]
    ))
  )
  apply_chat_template(model, messages)
})

# 3. Process in batches with progress
results <- generate_parallel(
  ctx, prompts,
  max_tokens = 10,
  seed = 42,
  progress = TRUE,
  clean = TRUE
)

# 4. Extract and evaluate
predictions <- sapply(results, function(x) trimws(gsub("\\n.*", "", x)))
accuracy <- mean(predictions == ag_news_sample$class)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")

Summary

Function	Use Case
`generate()`	Single prompts, interactive use
`generate_parallel()`	Batch processing, large datasets
`quick_llama(vector)`	Quick batch processing
`explore()`	Multi-model comparison with batching

Tips

Set n_seq_max when creating the context for parallel use
Scale n_ctx with n_seq_max to give each sequence enough space
Progress is shown automatically in interactive sessions; set progress = TRUE to force it in scripts
Use clean = TRUE to strip control tokens automatically
Set a consistent seed for reproducibility across batches
Set verbosity = 0 in scripts and automated pipelines to keep backend log lines out of output files and R CMD check output. generate_parallel() already defaults to verbosity = 0; to silence loading output as well, pass verbosity = 0 to model_load() and context_create():

# Fully silent batch pipeline
model   <- model_load("model.gguf",          verbosity = 0)
ctx     <- context_create(model, n_seq_max = 8, verbosity = 0)
results <- generate_parallel(ctx, prompts, max_tokens = 50, progress = FALSE)
