# AI-Flux Model Documentation

AI-Flux is an LLM batch processing pipeline for HPC systems. This document lists all models supported by AI-Flux, along with their hardware requirements and configuration details.
Models in AI-Flux follow the naming convention `family:size`, for example:

- `llama3.2:3b` - Llama 3.2 model with 3 billion parameters
- `gemma3:27b` - Gemma 3 model with 27 billion parameters
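The `family:size` name is what you pass as the `model` argument throughout this document. As a rough sketch of how it relates to the per-size config files listed below, the snippet splits the name into the `model_type` and `model_size` arguments of `Config.load_model_config` (shown at the end of this document). Splitting on the colon yourself, and calling `load_model_config` without a `custom_config_path`, are assumptions made here for illustration only.

```python
from aiflux.core.config import Config

# Illustrative only: split a "family:size" name into the arguments used by
# Config.load_model_config. Calling it without custom_config_path is an
# assumption; the documented example at the end of this page passes one.
model_name = "llama3.2:3b"
model_type, model_size = model_name.split(":")

config = Config()
model_config = config.load_model_config(
    model_type=model_type,   # "llama3.2"
    model_size=model_size,   # "3b" -> llama3.2/3b.yaml
)
```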
- `phi3:small` - Phi-3 Small model

## Llama 3.2

Advanced general-purpose model from Meta.
Size | Config File | Min GPU Memory | Recommended | Notes |
---|---|---|---|---|
3b | llama3.2/3b.yaml | 8GB | Any CUDA GPU | Best balance of performance/resource usage |
7b | llama3.2/7b.yaml | 13GB | A40/A100 | Good for general use |
70b | llama3.2/70b.yaml | 35GB | A100 80GB | High-performance, requires high-end GPU |
## Llama 3.2 Vision

Vision-capable variant of Llama 3.2.
Size | Config File | Min GPU Memory | Recommended | Notes |
---|---|---|---|---|
8b | llama3.2-vision/8b.yaml | 16GB | A40/A100 | Vision capabilities require more memory |
70b | llama3.2-vision/70b.yaml | 40GB | A100 80GB | Handles complex images and reasoning |
## Llama 3.3

Latest generation of Llama, optimized for reasoning.
Size | Config File | Min GPU Memory | Recommended | Notes |
---|---|---|---|---|
8b | llama3.3/8b.yaml | 14GB | A40/A100 | Good general-purpose reasoner |
70b | llama3.3/70b.yaml | 38GB | A100 80GB | State-of-the-art reasoning capabilities |
## Gemma 3

Google’s efficient and high-quality models.
Size | Config File | Min GPU Memory | Recommended | Notes |
---|---|---|---|---|
1b | gemma3/1b.yaml | 6GB | Any CUDA GPU | Extremely efficient, good for basic tasks |
4b | gemma3/4b.yaml | 10GB | Any CUDA GPU | Good performance/resource balance |
12b | gemma3/12b.yaml | 16GB | A40/A100 | High quality mid-range option |
27b | gemma3/27b.yaml | 24GB | A100 | High performance, vision-capable |
## Qwen 2.5

Production-quality models from Alibaba.
Size | Config File | Min GPU Memory | Recommended | Notes |
---|---|---|---|---|
7b | qwen2.5/7b.yaml | 15GB | A40/A100 | Default setup, good general model |
72b | qwen2.5/72b.yaml | 35GB | A100 80GB | High performance, high resource usage |
## Phi-3

Microsoft’s efficient models with strong reasoning capabilities.
Size | Config File | Min GPU Memory | Recommended | Notes |
---|---|---|---|---|
mini | phi3/mini.yaml | 6GB | Any CUDA GPU | Extremely efficient 3.8B model |
small | phi3/small.yaml | 12GB | Any CUDA GPU | 7B parameters, good performance |
medium | phi3/medium.yaml | 16GB | A40/A100 | 14B parameters, balanced option |
vision | phi3/vision.yaml | 18GB | A40/A100 | Vision-capable 14B parameter model |
## Mistral

Family of high-quality open-source models.
Model | Size | Config File | Min GPU Memory | Notes |
---|---|---|---|---|
Mistral | 7b | mistral/7b.yaml | 13GB | Original Mistral model |
Mistral | 8x7b | mistral/8x7b.yaml | 36GB | Mixture of experts model |
Mistral-Small | 7b | mistral-small/7b.yaml | 12GB | Optimized for inference speed |
Mistral-Large | 33b | mistral-large/33b.yaml | 28GB | Large capacity model |
Mistral-Lite | 3b | mistral-lite/3b.yaml | 8GB | Small footprint model |
Mistral-NeMo | 12b | mistral-nemo/12b.yaml | 16GB | NVIDIA optimized model |
Mistral-OpenOrca | 7b | mistral-openorca/7b.yaml | 14GB | Research tuned version |
## Mixtral

Mixture-of-experts models with strong performance.
Size | Config File | Min GPU Memory | Recommended | Notes |
---|---|---|---|---|
8x7b | mixtral/8x7b.yaml | 24GB | A100 | Original MoE model |
8x22b | mixtral/8x22b.yaml | 40GB | A100 80GB | Higher parameter version |
## Memory Requirements and Batch Size

Each model specifies a minimum GPU memory requirement based on practical use cases. For optimal performance, adjust the batch size to fit your memory constraints: higher batch sizes require more memory but enable much faster throughput, while lower batch sizes use less memory but process requests more slowly. As a rule of thumb, each unit increase in batch size requires additional GPU memory, with larger models and longer contexts needing proportionally more per request.

When using a smaller GPU or a large model, reduce the batch size:
```python
# Using code arguments
# (runner is a SlurmRunner instance; see the SLURM examples below)
runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="llama3.2:7b",
    batch_size=2  # Reduced from default 4 to use less memory
)

# Or using environment variables
# In .env file:
# MODEL_NAME=llama3.2:7b
# BATCH_SIZE=2
```
When using a high-end GPU such as an A100 (80GB), you can significantly increase the batch size for better throughput:
```python
# Using code arguments
# First, configure SLURM to use the right resources
config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"  # A100 partition
slurm_config.mem = "80G"         # Request full node memory
slurm_config.gpus_per_node = 1   # 1 GPU (A100 80GB)

# Then run with a high batch size
runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="llama3.2:7b",
    batch_size=16  # 4x default for much higher throughput
)

# Or using environment variables
# In .env file:
# MODEL_NAME=llama3.2:7b
# BATCH_SIZE=16
# SLURM_MEM=80G
# SLURM_PARTITION=a100
```
### Recommended Batch Sizes

Model Size | GPU Type | Recommended Batch Size | SLURM Memory Setting |
---|---|---|---|
3-7B | A100 (40/80GB) | 16-24 | 40G-80G |
3-7B | A40 (48GB) | 12-16 | 48G |
3-7B | Mid-range (24GB) | 6-8 | 24G |
7-20B | A100 (80GB) | 8-12 | 80G |
7-20B | A100 (40GB) | 4-6 | 40G |
20-40B | A100 (80GB) | 4-6 | 80G |
40-70B | A100 (80GB) | 2-3 | 80G |
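If you prefer to pick these values in code, the table can be restated as a plain lookup. The dictionary below is only an illustrative sketch (its name and structure are not part of the AI-Flux API) and simply mirrors the rows above.

```python
# Illustrative lookup mirroring the table above (not part of AI-Flux).
# Keyed by (model size band, GPU type); values are
# (min batch size, max batch size, SLURM memory setting).
RECOMMENDED_BATCH_SIZES = {
    ("3-7B",   "A100 40/80GB"):   (16, 24, "40G-80G"),
    ("3-7B",   "A40 48GB"):       (12, 16, "48G"),
    ("3-7B",   "Mid-range 24GB"): (6, 8, "24G"),
    ("7-20B",  "A100 80GB"):      (8, 12, "80G"),
    ("7-20B",  "A100 40GB"):      (4, 6, "40G"),
    ("20-40B", "A100 80GB"):      (4, 6, "80G"),
    ("40-70B", "A100 80GB"):      (2, 3, "80G"),
}

# Example: conservative settings for a 7B model on an A40.
low, high, slurm_mem = RECOMMENDED_BATCH_SIZES[("3-7B", "A40 48GB")]
print(f"BATCH_SIZE={low}, SLURM_MEM={slurm_mem}")
```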
When running large models (70B+) on a single A100 (80GB) GPU:
```python
# Using code arguments
config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"   # A100 partition
slurm_config.mem = "80G"          # Request full node memory
slurm_config.gpus_per_node = 1    # 1 A100 80GB GPU
slurm_config.cpus_per_task = 16   # Increase CPU cores for preprocessing

runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="llama3.3:70b",
    batch_size=2  # Even a batch size of 2 is significant for 70B models
)

# Or using environment variables
# In .env file:
# MODEL_NAME=llama3.3:70b
# BATCH_SIZE=2
# SLURM_MEM=80G
# SLURM_CPUS_PER_TASK=16
# SLURM_PARTITION=a100
```
For higher throughput with large models, you can leverage multiple GPUs:
```python
# Multi-GPU setup for maximum performance
config = Config()
slurm_config = config.get_slurm_config()
slurm_config.account = "myaccount"
slurm_config.partition = "a100"   # A100 partition
slurm_config.mem = "160G"         # Request memory for multiple GPUs
slurm_config.gpus_per_node = 2    # Request 2 A100 GPUs
slurm_config.cpus_per_task = 32   # Increase CPU cores for multi-GPU processing

runner = SlurmRunner(config=slurm_config)
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.jsonl",
    model="llama3.3:70b",
    batch_size=4  # Higher batch size possible with multiple GPUs
)

# Or using environment variables
# In .env file:
# MODEL_NAME=llama3.3:70b
# BATCH_SIZE=4
# SLURM_MEM=160G
# SLURM_CPUS_PER_TASK=32
# SLURM_PARTITION=a100
# SLURM_GPUS_PER_NODE=2
```
## Custom Model Configuration

You can customize model parameters by creating your own YAML configuration files:
```yaml
# custom/my-model.yaml
name: "custom:my-model"

resources:
  gpu_layers: 32
  gpu_memory: "16GB"
  batch_size: 8
  max_concurrent: 2

parameters:
  temperature: 0.5
  top_p: 0.9
  max_tokens: 4096
```
Then load it in your code:

```python
from aiflux.core.config import Config

config = Config()
model_config = config.load_model_config(
    model_type="custom",
    model_size="my-model",
    custom_config_path="path/to/custom/my-model.yaml"
)
```
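Once the custom config is loaded, you would run it like any other model. The sketch below assumes that the `name` field from the YAML file (`custom:my-model`) can be passed to `runner.run` just like the built-in `family:size` names, and that `runner` is a `SlurmRunner` configured as in the earlier examples; neither point is stated explicitly in this document.

```python
# Assumes `runner` is a SlurmRunner configured as in the SLURM examples above,
# and that custom model names are accepted by runner.run the same way as
# built-in ones (an assumption, not documented behavior).
job_id = runner.run(
    input_path="prompts.jsonl",
    output_path="results.json",
    model="custom:my-model",  # the `name` field from custom/my-model.yaml
    batch_size=8,             # matches batch_size in the custom config
)
```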