SPLADE Triplet-based Input Data Specification

This document describes the triplet-based input data format required for training SPLADE v2 (Sparse Lexical and Expansion Model) models using the light-splade framework.

Overview

A SPLADE triplet-based dataset consists of four types of data files, all in NDJSON (Newline Delimited JSON) format:

  1. Query Master - Contains query texts and their IDs
  2. Document Master - Contains document texts and their IDs
  3. Positive Lists - Maps queries to their relevant documents
  4. Triplets - Contains training triplets (query, positive doc, negative doc)

File Format: NDJSON

All data files must be in NDJSON format:

  • Each line is a valid JSON object
  • Lines are separated by newline characters (\n)
  • Each file can contain multiple JSON objects, one per line
  • The file extension is .ndjson for raw text
  • Files may be gzip-compressed for space efficiency (.ndjson.gz)
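
As an illustration only (the helper name read_ndjson is hypothetical, not part of light-splade), these files can be read with a few lines of Python:

import gzip
import json
from pathlib import Path

def read_ndjson(path):
    """Yield one JSON object per line from a .ndjson or .ndjson.gz file."""
    path = Path(path)
    # Open with gzip if the file is compressed, otherwise as plain text
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)

# Example: build a qid -> text lookup from the query master
queries = {row["qid"]: row["text"] for row in read_ndjson("query_master.ndjson")}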

Data File Specifications

1. Query Master File

File naming convention: query_master.ndjson

Purpose: Stores all query texts with their unique identifiers.

Schema

{
  "qid": <integer>,
  "text": <string>
}

Fields

  • qid (int): Unique query identifier
  • text (str): The query text

Example

{"qid": 3, "text": "Gitでコミットを取り消すにはどうすればよい?"}
{"qid": 4, "text": "睡眠の質を改善するための習慣を教えて。"}
{"qid": 5, "text": "Dockerコンテナのリソース使用量を制限したい。"}

2. Document Master File

File naming convention: doc_master.ndjson

Purpose: Stores all document texts with their unique identifiers.

Schema

{
  "doc_id": <integer>,
  "text": <string>
}

Fields

  • doc_id (int): Unique document identifier
  • text (str): The document text

Example

{"doc_id": 11, "text": "直前のコミットを取り消すにはgit reset --softやgit revertを状況に応じて使い分けます。"}
{"doc_id": 12, "text": "雨天時は運動会が体育館で実施される予定です。"}
{"doc_id": 13, "text": "紅葉の名所は朝の光でより美しく見えます。"}

3. Positive Lists File

File naming convention: positive_lists.ndjson

Purpose: Maps each query to its list of relevant (positive) documents.

Schema

{
  "qid": <integer>,
  "positive_doc_ids": [<integer>, ...]
}

Fields

  • qid (int): Query identifier (must exist in query master)
  • positive_doc_ids (list[int]): List of document IDs that are relevant to this query (must exist in document master)

Example

{"qid": 3, "positive_doc_ids": [11, 100]}
{"qid": 4, "positive_doc_ids": [16]}
{"qid": 5, "positive_doc_ids": [21]}

Requirements

  • Each query must have at least one positive document
  • All document IDs must exist in the document master

4. Triplets File

File naming convention: triplets.ndjson

Purpose: Contains training triplets for contrastive learning. Each triplet consists of a query, a positive (relevant) document, and a negative (non-relevant) document.

Schema

{
  "qid": <integer>,
  "pos_doc_id": <integer>,
  "neg_doc_id": <integer>
}

Fields

  • qid (int): Query identifier (must exist in query master)
  • pos_doc_id (int): Positive document ID (must exist in document master and in the positive list for this query)
  • neg_doc_id (int): Negative document ID (must exist in document master)

Example

{"qid": 3, "pos_doc_id": 11, "neg_doc_id": 12}
{"qid": 4, "pos_doc_id": 16, "neg_doc_id": 17}
{"qid": 4, "pos_doc_id": 16, "neg_doc_id": 18}
{"qid": 5, "pos_doc_id": 21, "neg_doc_id": 22}

Notes

  • The number of triplets determines the dataset size for training
  • Multiple triplets can share the same query ID
  • The positive document must be in the query's positive list
  • The triplets file is required for the training set only; the validation set does not need one
  • In contrastive learning, the relative distance between a query and its positive document versus its negative documents is what drives learning, so the way positive and negative documents are paired into triplets can significantly affect model quality (see the sketch after this list)
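
As an illustration only (not part of light-splade; function and variable names are hypothetical), a simple strategy is to pair each positive document with negatives sampled at random from documents outside the query's positive list. Stronger strategies, such as mining hard negatives from a first-stage retriever, typically produce better models:

import random

def build_random_triplets(positives, all_doc_ids, negatives_per_positive=1, seed=0):
    """Build (qid, pos_doc_id, neg_doc_id) triplets using random negatives.

    positives:    rows loaded from positive_lists.ndjson
    all_doc_ids:  set of every doc_id in the document master
    """
    rng = random.Random(seed)
    triplets = []
    for row in positives:
        pos_ids = set(row["positive_doc_ids"])
        # Candidate negatives: any document that is not a positive for this query
        candidates = list(all_doc_ids - pos_ids)
        for pos_id in pos_ids:
            for neg_id in rng.sample(candidates, k=negatives_per_positive):
                triplets.append({"qid": row["qid"], "pos_doc_id": pos_id, "neg_doc_id": neg_id})
    return triplets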

Data Organization

Directory Structure

For a complete training setup, organize your data as follows:

data/
├── train/
│   ├── query_master.ndjson
│   ├── doc_master.ndjson
│   ├── positive_lists.ndjson
│   └── triplets.ndjson
└── validation/
    ├── query_master.ndjson
    ├── doc_master.ndjson
    └── positive_lists.ndjson

Configuration Example

Reference your data files in the YAML configuration:

DATA_PATH: data/mmarco_ja_4_splade_triplet

train_doc_master: ${.DATA_PATH}/train/doc_master.ndjson
train_query_master: ${.DATA_PATH}/train/query_master.ndjson
train_positives: ${.DATA_PATH}/train/positive_lists.ndjson
train_triplets: ${.DATA_PATH}/train/triplets.ndjson

validation_doc_master: ${.DATA_PATH}/validation/doc_master.ndjson
validation_query_master: ${.DATA_PATH}/validation/query_master.ndjson
validation_positives: ${.DATA_PATH}/validation/positive_lists.ndjson

We use Hydra configuration, which supports interpolation in YAML files. For example, ${.DATA_PATH} is dynamically replaced with the value of the DATA_PATH key defined in the same configuration node.
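
A minimal sketch of how this resolves, loading the snippet above with OmegaConf (the file name config.yaml is an assumption):

from omegaconf import OmegaConf

cfg = OmegaConf.load("config.yaml")
# Interpolations such as ${.DATA_PATH} are resolved on access
print(cfg.train_triplets)  # data/mmarco_ja_4_splade_triplet/train/triplets.ndjson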

Data Validation Rules

The TripletDataset class enforces the following validation rules:

  1. Query Coverage: Every query ID in the positive list must exist in the query master
  2. Query Completeness: Every query ID in the query master must have entries in the positive list
  3. Document Existence: All document IDs referenced in the positive lists must exist in the document master
  4. Positive Requirement: Every query must have at least one positive document

If any validation rule fails, the system will raise a ValueError with a specific error message including the problematic ID for easy debugging.
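
The exact implementation lives in TripletDataset; the following is only a rough sketch of the same four rules (function and variable names are hypothetical):

def validate(query_master, doc_master, positive_lists):
    """Sketch of the four validation rules described above."""
    qids = {row["qid"] for row in query_master}
    doc_ids = {row["doc_id"] for row in doc_master}
    positives = {row["qid"]: row["positive_doc_ids"] for row in positive_lists}

    for qid in positives:                              # 1. Query Coverage
        if qid not in qids:
            raise ValueError(f"qid {qid} in positive lists is missing from query master")
    for qid in qids:                                   # 2. Query Completeness
        if qid not in positives:
            raise ValueError(f"qid {qid} in query master has no positive list entry")
    for qid, doc_list in positives.items():
        for doc_id in doc_list:                        # 3. Document Existence
            if doc_id not in doc_ids:
                raise ValueError(f"doc_id {doc_id} for qid {qid} is missing from doc master")
        if not doc_list:                               # 4. Positive Requirement
            raise ValueError(f"qid {qid} has no positive documents")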

Dataset Output

When loading a sample from the TripletDataset, you receive a 3-tuple of strings:

(query_text, positive_doc_text, negative_doc_text)

Example

(
  "Gitでコミットを取り消すにはどうすればよい?",
  "直前のコミットを取り消すにはgit reset --softやgit revertを状況に応じて使い分けます。",
  "雨天時は運動会が体育館で実施される予定です。"
)
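
In a typical PyTorch setup, these string triplets are tokenized in a collate function before batching. The following is an illustrative sketch (the tokenizer checkpoint and collate function are assumptions, not the framework's actual code):

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # assumed encoder

def collate(batch):
    """Tokenize (query, positive, negative) string triplets into three tensor batches."""
    queries, positives, negatives = zip(*batch)
    encode = lambda texts: tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    return encode(queries), encode(positives), encode(negatives)

# `dataset` is assumed to be a TripletDataset instance yielding the 3-tuples above
loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)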

Use Case

This data format is designed for SPLADE v2 training without distillation, using in_batch_negatives or pairwise_contrastive for contrastive learning. The model learns to:

  • Score queries and documents in a sparse lexical space
  • Rank positive documents higher than negative documents
  • Learn sparse, interpretable representations
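
As a rough illustration of the pairwise_contrastive idea (a sketch, not the framework's actual loss), the objective pushes the query-positive score above the query-negative score via a softmax cross-entropy over the two dot-product scores:

import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(q_vec, pos_vec, neg_vec):
    """Softmax cross-entropy over (positive, negative) dot-product scores.

    q_vec, pos_vec, neg_vec: (batch, vocab_size) sparse lexical representations.
    """
    pos_scores = (q_vec * pos_vec).sum(dim=-1)               # score(query, positive doc)
    neg_scores = (q_vec * neg_vec).sum(dim=-1)               # score(query, negative doc)
    logits = torch.stack([pos_scores, neg_scores], dim=1)
    targets = torch.zeros(q_vec.size(0), dtype=torch.long)   # the positive is class 0
    return F.cross_entropy(logits, targets)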

Reference