Parallel Processing

When processing large datasets, parallel generation can significantly reduce execution time. This tutorial covers efficient batch processing strategies with localLLM.

Why Parallel Processing?

A sequential for-loop handles one prompt at a time. Parallel processing batches multiple prompts together, so they share computation and per-call overhead is reduced.

In benchmarks, generate_parallel() typically completes in roughly 58–76% of the time taken by sequential generate() calls (a 1.3×–1.7× speedup, depending on the model).

Using generate_parallel()

Basic Usage

library(localLLM)

# Load model
model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999)

# Create context with batch support
ctx <- context_create(
  model,
  n_ctx = 2048,
  n_seq_max = 10  # Allow up to 10 parallel sequences
)

# Define prompts
prompts <- c(
  "What is the capital of France?",
  "What is the capital of Germany?",
  "What is the capital of Italy?"
)

# Format prompts
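# vapply() guarantees a character vector; USE.NAMES = FALSE avoids
# naming each element after its (long) raw prompt
formatted_prompts <- vapply(prompts, function(p) {
  messages <- list(
    list(role = "system", content = "Answer concisely."),
    list(role = "user", content = p)
  )
  apply_chat_template(model, messages)
}, character(1), USE.NAMES = FALSE)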

# Process in parallel
results <- generate_parallel(ctx, formatted_prompts, max_tokens = 50)
print(results)
#> [1] "The capital of France is Paris."
#> [2] "The capital of Germany is Berlin."
#> [3] "The capital of Italy is Rome."

Progress Tracking

Progress reporting is enabled by default in interactive sessions (progress = interactive()). To force it in non-interactive scripts, set progress = TRUE explicitly:

results <- generate_parallel(
  ctx,
  formatted_prompts,
  max_tokens = 50,
  progress = TRUE  # force progress bar even in non-interactive mode
)
#> Processing 100 prompts...
#> [##########----------] 50%
#> [####################] 100%
#> Done!

Text Classification Example

Here’s a complete example classifying news articles:

library(localLLM)

# Load sample dataset
data("ag_news_sample", package = "localLLM")

# Load model
model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999)

# Create context (n_seq_max determines max parallel prompts)
ctx <- context_create(model, n_ctx = 1048, n_seq_max = 10)

# Prepare all prompts
all_prompts <- character(nrow(ag_news_sample))

for (i in seq_len(nrow(ag_news_sample))) {
  messages <- list(
    list(role = "system", content = "You are a helpful assistant."),
    list(role = "user", content = paste0(
      "Classify this news article into exactly one category: ",
      "World, Sports, Business, or Sci/Tech. ",
      "Respond with only the category name.\n\n",
      "Title: ", ag_news_sample$title[i], "\n",
      "Description: ", substr(ag_news_sample$description[i], 1, 100), "\n\n",
      "Category:"
    ))
  )
  all_prompts[i] <- apply_chat_template(model, messages)
}

# Process all samples in parallel
results <- generate_parallel(
  context = ctx,
  prompts = all_prompts,
  max_tokens = 5,
  seed = 92092,
  progress = TRUE,
  clean = TRUE
)

# Extract predictions
ag_news_sample$LLM_result <- sapply(results, function(x) {
  trimws(gsub("\\n.*$", "", x))
})

# Calculate accuracy
accuracy <- mean(ag_news_sample$LLM_result == ag_news_sample$class)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")
#> Accuracy: 87.0 %
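
Overall accuracy hides which categories get confused with each other. As a minimal sketch in base R, the made-up vectors below (hypothetical truth and pred, not taken from ag_news_sample) show one way to tabulate predictions against labels and derive per-class accuracy. With real data you would presumably substitute ag_news_sample$class and ag_news_sample$LLM_result.

```r
# Made-up labels and predictions, for illustration only
categories <- c("World", "Sports", "Business", "Sci/Tech")
truth <- c("World", "Sports", "Business", "Sci/Tech", "World", "Business")
pred  <- c("World", "Sports", "Sci/Tech", "Sci/Tech", "World", "Business")

# Fix the factor levels so every category appears, even with zero counts
confusion <- table(
  predicted = factor(pred,  levels = categories),
  actual    = factor(truth, levels = categories)
)
print(confusion)

# Per-class accuracy: correct predictions / number of true items per class
per_class <- diag(confusion) / colSums(confusion)
print(round(per_class, 2))

# Overall accuracy, same definition as in the example above
overall <- mean(pred == truth)
```

Off-diagonal cells show where the model's answers drift, here one Business article labelled Sci/Tech.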

Sequential vs Parallel Comparison

Sequential (For Loop)

# Sequential approach
ag_news_sample$LLM_result <- NA_character_  # character column from the start
ctx <- context_create(model, n_ctx = 512)

system.time({
  for (i in seq_len(nrow(ag_news_sample))) {
    formatted_prompt <- all_prompts[i]
    output <- generate(ctx, formatted_prompt, max_tokens = 5, seed = 92092)
    ag_news_sample$LLM_result[i] <- trimws(output)
  }
})
#>    user  system elapsed
#>   0.62    0.08   41.55

Parallel

# Parallel approach
ctx <- context_create(model, n_ctx = 1048, n_seq_max = 10)

system.time({
  results <- generate_parallel(
    ctx, all_prompts,
    max_tokens = 5,
    seed = 92092,
    progress = TRUE
  )
})
#>    user  system elapsed
#>   0.38    0.04   24.08

Result: the parallel run takes ~42% less time (41.55 s vs. 24.08 s elapsed), a 1.73× speedup.

Benchmark: Multiple Models

Tested on Apple M3 Pro (18 GB unified memory), 100 AG News classification prompts, ctx_size = 512, max_tokens = 50, n_seq_max = 10:

Model                         Sequential  Parallel (10×)  Speedup
Llama-3.2-3B-Instruct-Q5_K_M  41.6 sec    24.1 sec        1.73×
Gemma-3-4B-it-QAT-Q5_K_M      41.3 sec    30.0 sec        1.38×
OLMo-3-7B-Instruct-Q5_K_M     61.5 sec    43.3 sec        1.42×
Gemma-4-26B-A4B-it-IQ2_XXS    69.2 sec    52.9 sec        1.31×

On Apple Silicon (M3 Pro), smaller models tend to show higher parallel speedup than larger ones. The GPU is underutilised during single-sequence inference for small models, so batching provides more headroom. Larger models approach GPU saturation even at n_seq_max = 1, leaving less room for parallel gains.
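
For reference, the speedup column can be recomputed from the timings. The sketch below copies the benchmark figures into a data frame (model names shortened for display) and derives both the speedup and its reciprocal, the fraction of sequential time that the parallel run needs.

```r
# Timings in seconds, copied from the benchmark table
bench <- data.frame(
  model      = c("Llama-3.2-3B", "Gemma-3-4B", "OLMo-3-7B", "Gemma-4-26B-A4B"),
  sequential = c(41.6, 41.3, 61.5, 69.2),
  parallel   = c(24.1, 30.0, 43.3, 52.9)
)

# Speedup is how many times faster; time_fraction is its reciprocal
bench$speedup       <- round(bench$sequential / bench$parallel, 2)
bench$time_fraction <- round(bench$parallel / bench$sequential, 2)
print(bench)
```

A 1.73× speedup corresponds to about 58% of the sequential time, and 1.31× to about 76%.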

Note on reasoning models: DeepSeek-R1 and similar reasoning models (QwQ, Gemma 4) output a thinking block before the final answer (e.g. <think>...</think>answer). For classification tasks, strip the thinking section before evaluating predictions:

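clean_pred <- function(x) {
  # Remove the thinking block, keep only the text after the closing tag.
  # (?s) makes "." match newlines; without it, perl = TRUE would leave
  # multi-line thinking blocks in place.
  x <- gsub("(?s)<think>.*?</think>", "", x, perl = TRUE)
  # Keep only the first line of the remaining answer
  trimws(gsub("\n.*", "", trimws(x)))
}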
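
A quick way to sanity-check this kind of cleanup is to run it on made-up outputs. The sketch below defines a hypothetical helper, strip_thinking(), that uses the (?s) flag so the pattern also matches thinking blocks spanning several lines (with perl = TRUE, "." does not match a newline by default). The sample strings are invented for illustration and are not real model output.

```r
# Hypothetical helper: drop <think>...</think>, keep the first line of the rest
strip_thinking <- function(x) {
  # (?s) lets "." match newlines, so multi-line thinking blocks are removed
  x <- gsub("(?s)<think>.*?</think>", "", x, perl = TRUE)
  # Keep only the first line of what remains
  trimws(sub("(?s)\n.*", "", trimws(x), perl = TRUE))
}

# Invented examples: single-line block, multi-line block, no block at all
samples <- c(
  "<think>Short reasoning.</think>Sports",
  "<think>\nLine one.\nLine two.\n</think>\n\nBusiness\nExtra text",
  "World"
)
print(strip_thinking(samples))
#> [1] "Sports"   "Business" "World"
```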

Using quick_llama() for Batches

The simplest approach for parallel processing is passing a vector to quick_llama():

# quick_llama automatically uses parallel mode for vectors
prompts <- c(
  "Summarize: Climate change is affecting global weather patterns...",
  "Summarize: The stock market reached new highs today...",
  "Summarize: Scientists discovered a new species of deep-sea fish..."
)

results <- quick_llama(prompts, max_tokens = 50)
print(results)

Performance Considerations

Context Size and n_seq_max

The context window is shared across parallel sequences:

# If n_ctx = 2048 and n_seq_max = 8
# Each sequence gets approximately 2048/8 = 256 tokens

# For longer prompts, increase n_ctx proportionally
ctx <- context_create(
  model,
  n_ctx = 4096,   # Larger context
  n_seq_max = 8   # 8 parallel sequences
)
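
The arithmetic above can be wrapped in two small helpers. This is a sketch that assumes the even split described in this section; the backend's actual allocation may differ, and both function names and the 400-token prompt estimate are hypothetical.

```r
# Tokens available to each sequence when n_ctx is split evenly
per_sequence_budget <- function(n_ctx, n_seq_max) {
  n_ctx %/% n_seq_max
}

# Smallest n_ctx that gives every sequence room for prompt plus output
required_n_ctx <- function(prompt_tokens, max_tokens, n_seq_max) {
  (prompt_tokens + max_tokens) * n_seq_max
}

per_sequence_budget(2048, 8)
#> [1] 256

# Assumed worst-case prompt of 400 tokens, 50 generated tokens, 8 sequences
required_n_ctx(prompt_tokens = 400, max_tokens = 50, n_seq_max = 8)
#> [1] 3600
```

Rounding the result up (for example to 4096) leaves some slack for chat-template tokens.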

Memory Usage

Parallel processing uses more memory. Check what the machine has to offer with:

hw <- hardware_profile()
cat("Total RAM:", round(hw$ram_total / 1e9, 1), "GB\n")
cat("GPU:", hw$gpu$name, "\n")

Batch Size Recommendations

Dataset Size  Recommended n_seq_max
< 100         4-8
100-1000      8-16
> 1000        16-32 (memory permitting)

Error Handling

If a prompt fails, the result will contain an error message:

results <- generate_parallel(ctx, prompts, max_tokens = 50)

# Check for errors
for (i in seq_along(results)) {
  if (grepl("^Error:", results[i])) {
    cat("Prompt", i, "failed:", results[i], "\n")
  }
}
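
The same check can be written without a loop. The sketch below runs on a made-up results vector (the error text is invented; real messages may read differently) and assumes, as stated above, that failures start with "Error:".

```r
# Made-up results vector; the second entry mimics a failed prompt
mock_results <- c("Paris", "Error: something went wrong", "Rome")

# Logical mask of failed prompts, and their positions
failed <- grepl("^Error:", mock_results)
which(failed)
#> [1] 2

# Replace failures with NA so downstream code does not treat them as answers
clean_results <- ifelse(failed, NA_character_, mock_results)
```

The indices from which(failed) can then be used to select the prompts to resubmit.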

Complete Workflow

library(localLLM)

# 1. Setup
model <- model_load("Llama-3.2-3B-Instruct-Q5_K_M.gguf", n_gpu_layers = 999)
ctx <- context_create(model, n_ctx = 2048, n_seq_max = 10)

# 2. Prepare prompts
data("ag_news_sample", package = "localLLM")

prompts <- sapply(seq_len(nrow(ag_news_sample)), function(i) {
  messages <- list(
    list(role = "system", content = "Classify news articles."),
    list(role = "user", content = paste0(
      "Category (World, Sports, Business, or Sci/Tech): ",
      ag_news_sample$title[i]
    ))
  )
  apply_chat_template(model, messages)
})

# 3. Process in batches with progress
results <- generate_parallel(
  ctx, prompts,
  max_tokens = 10,
  seed = 42,
  progress = TRUE,
  clean = TRUE
)

# 4. Extract and evaluate
predictions <- sapply(results, function(x) trimws(gsub("\\n.*", "", x)))
accuracy <- mean(predictions == ag_news_sample$class)
cat("Accuracy:", round(accuracy * 100, 1), "%\n")

Summary

Function             Use Case
generate()           Single prompts, interactive use
generate_parallel()  Batch processing, large datasets
quick_llama(vector)  Quick batch processing
explore()            Multi-model comparison with batching

Tips

  1. Set n_seq_max when creating the context for parallel use
  2. Scale n_ctx with n_seq_max to give each sequence enough space
  3. Progress is shown automatically in interactive sessions; set progress = TRUE to force it in scripts
  4. Use clean = TRUE to automatically strip control tokens
  5. Set a consistent seed for reproducibility across batches
  6. Set verbosity = 0 in scripts and automated pipelines to prevent backend log lines from appearing in output files or R CMD check output. generate_parallel() already defaults to verbosity = 0; suppress loading output by passing verbosity = 0 to model_load() and context_create() as well:
# Fully silent batch pipeline
model   <- model_load("model.gguf", verbosity = 0)
ctx     <- context_create(model, n_seq_max = 8, verbosity = 0)
results <- generate_parallel(ctx, prompts, max_tokens = 50, progress = FALSE)
