Getting started¶
This document contains quick-start instructions for running the repository with uv and with Docker.
Run with uv¶
Install uv¶
curl -LsSf https://astral.sh/uv/install.sh | sh
For more information on uv, please refer to https://github.com/astral-sh/uv
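To confirm the installation succeeded, check the version (any recent release should be fine):
uv --version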
Install Virtual Environment with uv¶
cd light-splade
uv venv --seed .venv
source .venv/bin/activate
uv sync
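As a quick sanity check that the environment is active and the project dependencies were installed (the exact package list depends on the lock file):
uv run python -V
uv pip list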
Run lint via pre-commit under uv¶
uv run pre-commit run --all-files
Pre-commit is configured in the light-splade/.pre-commit-config.yaml file.
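Optionally, you can install the hooks into git so they run automatically on every commit:
uv run pre-commit install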
Run Unit Tests under uv¶
uv run pytest
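To iterate on a single test file or test name, the usual pytest selectors work; the tests/ path below is an assumption about the repository layout:
uv run pytest tests/ -k "some_test_name" -x -q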
Train with mMARCO-ja (triplet-only)¶
Full pipeline¶
export SPLADE_CONFIG_NAME=splade_mmarco_ja_triplet
nohup examples/train_splade_triplet_pipeline.sh &
Note: This pipeline includes data building and several model training steps, so a GPU is required.
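To monitor the background job, standard tools are enough (nohup writes to nohup.out in the current directory when stdout is not redirected):
jobs -l            # confirm the pipeline is still running
tail -f nohup.out  # follow the pipeline output
nvidia-smi         # check GPU utilisation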
Step-by-step¶
STEP 1 — convert the mMARCO-ja dataset into light-splade triplet format:
nohup uv run examples/run_convert_mmarco_ja_triplet.py > logs/1_run_convert_mmarco_ja_triplet.txt 2>&1 &
STEP 2 — train SPLADE from triplet dataset:
nohup uv run examples/run_train_splade_triplet.py --config-name splade_mmarco_ja_triplet > logs/2_run_train_splade_triplet.txt 2>&1 &
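Both steps redirect their output into logs/, which is assumed to exist, and STEP 2 consumes the output of STEP 1, so wait for the first job to finish before starting the second. Create the directory and follow the training log as needed:
mkdir -p logs
tail -f logs/2_run_train_splade_triplet.txt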
Train with mMARCO-ja (distillation)¶
Full pipeline¶
export SPLADE_CONFIG_NAME=splade_mmarco_ja_distil
nohup examples/train_splade_distil_pipeline.sh &
Note: This pipeline includes several model training steps, so a GPU is required.
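Before launching, it can be worth confirming that a GPU is visible to the training code. Assuming the trainers are PyTorch-based (an assumption, not something this document states), a one-liner such as:
uv run python -c "import torch; print(torch.cuda.is_available())"  # expect True on a GPU machine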
Step-by-step¶
STEP 1 — convert the mMARCO-ja dataset into light-splade distil format:
nohup uv run examples/run_convert_mmarco_ja_distil.py > logs/1_run_convert_mmarco_ja_distil.txt 2>&1 &
STEP 2 — train a Cross-Encoder (teacher):
nohup uv run examples/run_train_cross_encoder.py --config-file config/cross_encoder_train.yaml > logs/2_run_train_cross_encoder.txt 2>&1 &
STEP 3 — infer similarity scores with Cross-Encoder:
nohup uv run examples/run_predict_cross_encoder.py --config-file config/cross_encoder_predict.yaml > logs/3_run_predict_cross_encoder.txt 2>&1 &
STEP 4 — train SPLADE using predicted similarity scores:
nohup uv run examples/run_train_splade_distil.py --config-name splade_mmarco_ja_distil > logs/4_run_train_splade_distil.txt 2>&1 &
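The four steps depend on each other's outputs, so each must finish before the next one starts. If you prefer to chain them yourself rather than use the pipeline script, a minimal sketch (the log file name is arbitrary):
mkdir -p logs
nohup bash -c '
  uv run examples/run_convert_mmarco_ja_distil.py &&
  uv run examples/run_train_cross_encoder.py --config-file config/cross_encoder_train.yaml &&
  uv run examples/run_predict_cross_encoder.py --config-file config/cross_encoder_predict.yaml &&
  uv run examples/run_train_splade_distil.py --config-name splade_mmarco_ja_distil
' > logs/distil_steps.txt 2>&1 &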
Run with Docker¶
Build Docker image¶
cd light-splade
docker compose build
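You can confirm the build produced an image before starting any service (the image name and tag depend on the compose file):
docker image ls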
Train with Docker — triplet (no distillation)¶
Full pipeline¶
docker compose up all-triplet -d
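The service runs detached, so follow its logs to track progress:
docker compose logs -f all-triplet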
Step-by-step¶
STEP 1 — convert mMARCO-ja collection:
docker compose up convert-mmarco-ja-triplet -d
STEP 2 — train SPLADE from triplets:
docker compose up train-splade-triplet -d
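Because each service is started with -d, check that the STEP 1 container has finished before launching STEP 2:
docker compose ps -a                            # the STEP 1 container should show as exited
docker compose logs convert-mmarco-ja-triplet   # inspect its output if needed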
Train with Docker — distillation¶
Full pipeline¶
docker compose up all-distil -d
Step-by-step¶
STEP 1 — convert mMARCO-ja collection:
docker compose up convert-mmarco-ja-distil -d
STEP 2 — train Cross-Encoder (teacher):
docker compose up train-cross-encoder -d
STEP 3 — predict similarity with Cross-Encoder:
docker compose up predict-cross-encoder -d
STEP 4 — train SPLADE using predicted similarity scores:
docker compose up train-splade-distil -d
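As with the triplet setup, the distillation services run detached; follow the final training step and shut everything down once it completes:
docker compose logs -f train-splade-distil
docker compose down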