Note: This repository is included as a git submodule in OpenBMB/sglang (minicpm_sala branch). For the latest setup instructions and usage guide, please refer to the main repository.
Requirements:

- CUDA 12.x or higher
- `gcc`/`g++` compiler
- `uv` package manager (the script will check for it)
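Before running the installer, you can confirm these prerequisites are on your `PATH` (a minimal sketch; the install script performs its own checks):

```shell
# Report each required tool without aborting on the first missing one.
for tool in nvcc g++ uv; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found ($(command -v "$tool"))"
  else
    echo "$tool: NOT found"
  fi
done
```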
```shell
# Clone the repository
git clone -b minicpm_sala https://github.com/OpenBMB/sglang.git
cd sglang

# One-click installation (creates a venv and compiles all dependencies)
bash install_minicpm_sala.sh

# Or specify a PyPI mirror
bash install_minicpm_sala.sh https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```

The installation script performs the following steps:
- Creates the `sglang_minicpm_sala_env` virtual environment (Python 3.12)
- Clones dependencies to `3rdparty/` (infllmv2) and initializes submodules (sparse_kernel)
- Installs MiniCPM-SALA (this repository)
- Compiles and installs `infllmv2_cuda_impl`
- Compiles and installs `sparse_kernel`
- Installs `tilelang` & `flash-linear-attention`
```shell
# Activate the environment
source sglang_minicpm_sala_env/bin/activate

# Launch the inference server (replace MODEL_PATH with the actual path)
MODEL_PATH=/path/to/your/model
python3 -m sglang.launch_server \
    --model ${MODEL_PATH} \
    --trust-remote-code \
    --disable-radix-cache \
    --attention-backend minicpm_flashinfer \
    --chunked-prefill-size 8192 \
    --max-running-requests 32 \
    --skip-server-warmup \
    --port 31111 \
    --dense-as-sparse
```

| Parameter | Description |
|---|---|
| `--trust-remote-code` | Allow custom code in the model |
| `--disable-radix-cache` | Disable the RadixAttention prefix cache |
| `--attention-backend minicpm_flashinfer` | Use the MiniCPM FlashInfer backend |
| `--chunked-prefill-size 8192` | Chunked prefill size |
| `--max-running-requests 32` | Maximum number of concurrent requests |
| `--skip-server-warmup` | Skip server warmup |
| `--port 31111` | Server port |
| `--dense-as-sparse` | Use dense-as-sparse mode |
Tip: For best generation quality, we recommend setting `temperature=0.9` when sending requests to the server.
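As a sketch, such a request could be sent to the server's OpenAI-compatible endpoint like this (the model name `minicpm-sala` is a placeholder; adjust the host and port to match your launch flags):

```shell
# Request body; temperature=0.9 as recommended above.
PAYLOAD='{
  "model": "minicpm-sala",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.9,
  "max_tokens": 128
}'

# Send it to the server launched on port 31111; prints the JSON response.
curl -s http://localhost:31111/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "request failed (is the server running?)"
```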
If the script doesn't work for you, follow these steps:
```shell
# 0. Ensure uv is installed
pip install uv

# 1. Create the venv
uv venv --python 3.12 sglang_minicpm_sala_env
source sglang_minicpm_sala_env/bin/activate

# 2. Install SGLang
uv pip install --upgrade pip setuptools wheel
uv pip install -e "./python[all]"

# 3. Compile the CUDA extensions
# (ensure dependencies are cloned into 3rdparty/)
cd 3rdparty/infllmv2_cuda_impl && python setup.py install && cd ../..
cd 3rdparty/sparse_kernel && python setup.py install && cd ../..

# 4. Install extra deps
uv pip install tilelang flash-linear-attention
```

Q: CUDA extension compilation failed?
- Ensure CUDA 12+ is installed (`nvcc --version`).
- Ensure `gcc`/`g++` are available.
- If `CXX` is set to `clang++ -pthread`, manually `export CXX=g++`.
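The `CXX` fix above can be made conditional so it only overrides a clang toolchain before you recompile (a small sketch):

```shell
# Show the current compiler setting.
echo "CXX=${CXX:-unset}"

# Override only when CXX points at clang++ (e.g. "clang++ -pthread").
case "${CXX:-}" in
  *clang++*) export CXX=g++ ;;
esac
echo "CXX is now: ${CXX:-unset}"
```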