SPLADE Triplet-based Input Data Specification¶
This document describes the triplet-based input data format required for training SPLADE v2 (Sparse Lexical and Expansion) models using the light-splade framework.
Overview¶
The SPLADE triplet-based dataset consists of four types of data files, all in NDJSON (Newline Delimited JSON) format:
- Query Master - Contains query texts and their IDs
- Document Master - Contains document texts and their IDs
- Positive Lists - Maps queries to their relevant documents
- Triplets - Contains training triplets (query, positive doc, negative doc)
File Format: NDJSON¶
All data files must be in NDJSON format, where:
- Each line is a valid JSON object
- Lines are separated by newline characters (\n)
- Each file can contain multiple JSON objects, one per line
- The file extension is .ndjson for raw text
- Files can be gzip-compressed for space efficiency (.ndjson.gz)
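As an illustration only, a minimal reader for this format could look like the sketch below; the load_ndjson helper and the example path are hypothetical and not part of the light-splade API.

import gzip
import json

def load_ndjson(path):
    """Yield one JSON object per line; transparently handles gzip-compressed files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical usage:
# for record in load_ndjson("data/train/query_master.ndjson.gz"):
#     print(record)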
Data File Specifications¶
1. Query Master File¶
File naming convention: query_master.ndjson
Purpose: Stores all query texts with their unique identifiers.
Schema
{
"qid": <integer>,
"text": <string>
}
Fields
- qid (int): Unique query identifier
- text (str): The query text
Example
{"qid": 3, "text": "Gitでコミットを取り消すにはどうすればよい?"}
{"qid": 4, "text": "睡眠の質を改善するための習慣を教えて。"}
{"qid": 5, "text": "Dockerコンテナのリソース使用量を制限したい。"}
2. Document Master File¶
File naming convention: doc_master.ndjson
Purpose: Stores all document texts with their unique identifiers.
Schema
{
"doc_id": <integer>,
"text": <string>
}
Fields
- doc_id (int): Unique document identifier
- text (str): The document text
Example
{"doc_id": 11, "text": "直前のコミットを取り消すにはgit reset --softやgit revertを状況に応じて使い分けます。"}
{"doc_id": 12, "text": "雨天時は運動会が体育館で実施される予定です。"}
{"doc_id": 13, "text": "紅葉の名所は朝の光でより美しく見えます。"}
3. Positive Lists File¶
File naming convention: positive_lists.ndjson
Purpose: Maps each query to its list of relevant (positive) documents.
Schema
{
"qid": <integer>,
"positive_doc_ids": [<integer>, ...]
}
Fields
- qid (int): Query identifier (must exist in the query master)
- positive_doc_ids (list[int]): List of document IDs that are relevant to this query (must exist in the document master)
Example:
{"qid": 3, "positive_doc_ids": [11, 100]}
{"qid": 4, "positive_doc_ids": [16]}
{"qid": 5, "positive_doc_ids": [21]}
Requirements
- Each query must have at least one positive document
- All document IDs must exist in the document master
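As an illustrative sketch (paths and variable names are hypothetical), the positive lists can be held as a mapping from query ID to a set of document IDs, which also makes the two requirements above easy to check:

import json

# qid -> set of relevant doc_ids, loaded from the positive lists file (hypothetical path).
positives = {}
with open("data/train/positive_lists.ndjson", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        positives[record["qid"]] = set(record["positive_doc_ids"])

# e.g. positives[3] == {11, 100}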
4. Triplets File¶
File naming convention: triplets.ndjson
Purpose: Contains training triplets for contrastive learning. Each triplet consists of a query, a positive (relevant) document, and a negative (non-relevant) document.
Schema
{
"qid": <integer>,
"pos_doc_id": <integer>,
"neg_doc_id": <integer>
}
Fields
- qid (int): Query identifier (must exist in the query master)
- pos_doc_id (int): Positive document ID (must exist in the document master and in the positive list for this query)
- neg_doc_id (int): Negative document ID (must exist in the document master)
Example
{"qid": 3, "pos_doc_id": 11, "neg_doc_id": 12}
{"qid": 4, "pos_doc_id": 16, "neg_doc_id": 17}
{"qid": 4, "pos_doc_id": 16, "neg_doc_id": 18}
{"qid": 5, "pos_doc_id": 21, "neg_doc_id": 22}
Notes
- The number of triplets determines the dataset size for training
- Multiple triplets can share the same query ID
- The positive document must be in the query's positive list
- The triplets file is used for the training split only; the validation split omits it (see the directory structure below)
- In contrastive learning, what matters is the relative distance between a query and its positive document versus its negative documents, so the way positive and negative documents are paired into triplets can significantly affect model quality (a minimal negative-sampling sketch follows this list)
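One common way to form triplets is to pair each positive document with negatives drawn from documents outside the query's positive list. The sketch below uses uniform random negatives purely for illustration; the framework does not prescribe this strategy, and harder negatives (e.g. from a first-stage retriever) often work better.

import json
import random

def sample_triplets(positives, all_doc_ids, negatives_per_positive=2, seed=0):
    """positives: qid -> set of positive doc_ids; all_doc_ids: list of every doc_id."""
    rng = random.Random(seed)
    triplets = []
    for qid, pos_ids in positives.items():
        for pos_id in pos_ids:
            for _ in range(negatives_per_positive):
                neg_id = rng.choice(all_doc_ids)
                while neg_id in pos_ids:  # never use a relevant document as a negative
                    neg_id = rng.choice(all_doc_ids)
                triplets.append({"qid": qid, "pos_doc_id": pos_id, "neg_doc_id": neg_id})
    return triplets

def write_ndjson(records, path):
    """Write the sampled triplets in the triplets.ndjson format described above."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")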
Data Organization¶
Directory Structure¶
For a complete training setup, organize your data as follows:
data/
├── train/
│ ├── query_master.ndjson
│ ├── doc_master.ndjson
│ ├── positive_lists.ndjson
│ └── triplets.ndjson
└── validation/
├── query_master.ndjson
├── doc_master.ndjson
└── positive_lists.ndjson
Configuration Example¶
Reference your data files in the YAML configuration:
DATA_PATH: data/mmarco_ja_4_splade_triplet
train_doc_master: ${.DATA_PATH}/train/doc_master.ndjson
train_query_master: ${.DATA_PATH}/train/query_master.ndjson
train_positives: ${.DATA_PATH}/train/positive_lists.ndjson
train_triplets: ${.DATA_PATH}/train/triplets.ndjson
validation_doc_master: ${.DATA_PATH}/validation/doc_master.ndjson
validation_query_master: ${.DATA_PATH}/validation/query_master.ndjson
validation_positives: ${.DATA_PATH}/validation/positive_lists.ndjson
We use Hydra configuration, which supports interpolation in YAML files. For example, ${.DATA_PATH} is dynamically replaced with the value of the key DATA_PATH (the leading dot denotes a reference relative to the current config node).
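For instance, the resolved paths can be inspected with OmegaConf (the configuration library underlying Hydra). Loading the example config directly like this is only a sketch and assumes the keys sit at the top level of the file, as in the snippet above.

from omegaconf import OmegaConf

# Load the data config and resolve ${.DATA_PATH}-style interpolations.
cfg = OmegaConf.load("splade_data_mmarco_ja_triplet.yaml")
OmegaConf.resolve(cfg)

print(cfg.train_triplets)
# data/mmarco_ja_4_splade_triplet/train/triplets.ndjson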
Data Validation Rules¶
The TripletDataset class enforces the following validation rules:
- Query Coverage: Every query ID in the positive list must exist in the query master
- Query Completeness: Every query ID in the query master must have entries in the positive list
- Document Existence: All document IDs referenced in the positive lists must exist in the document master
- Positive Requirement: Every query must have at least one positive document
If any validation rule fails, the system will raise a ValueError with a specific error message including the problematic ID for easy debugging.
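The checks are roughly equivalent to the following sketch; the function and variable names are illustrative and not the actual TripletDataset implementation.

def validate(query_texts, doc_texts, positives):
    """query_texts: qid -> text, doc_texts: doc_id -> text, positives: qid -> list of doc_ids."""
    for qid, doc_ids in positives.items():
        if qid not in query_texts:  # Query Coverage
            raise ValueError(f"qid {qid} in positive lists is missing from the query master")
        if not doc_ids:  # Positive Requirement
            raise ValueError(f"qid {qid} has no positive documents")
        for doc_id in doc_ids:  # Document Existence
            if doc_id not in doc_texts:
                raise ValueError(f"doc_id {doc_id} for qid {qid} is missing from the document master")
    for qid in query_texts:  # Query Completeness
        if qid not in positives:
            raise ValueError(f"qid {qid} in the query master has no entry in the positive lists")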
Dataset Output¶
When loading a sample from the TripletDataset, you receive a 3-tuple of strings:
(query_text, positive_doc_text, negative_doc_text)
Example:
(
"Gitでコミットを取り消すにはどうすればよい?",
"直前のコミットを取り消すにはgit reset --softやgit revertを状況に応じて使い分けます。",
"雨天時は運動会が体育館で実施される予定です。"
)
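Downstream, these 3-tuples are typically tokenized in batches before being fed to the model. The collate function below is only an illustrative sketch using Hugging Face tokenizers, not the framework's own collator, and the model name is an arbitrary placeholder.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # placeholder checkpoint

def collate(batch):
    """batch: list of (query_text, positive_doc_text, negative_doc_text) tuples."""
    queries, pos_docs, neg_docs = zip(*batch)
    encode = lambda texts: tokenizer(list(texts), padding=True, truncation=True,
                                     max_length=256, return_tensors="pt")
    return encode(queries), encode(pos_docs), encode(neg_docs)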
Use Case¶
This data format is designed for SPLADE v2 training without distillation, using in_batch_negatives or pairwise_contrastive for contrastive learning. The model learns to:
- Score queries and documents in a sparse lexical space
- Rank positive documents higher than negative documents
- Learn sparse, interpretable representations
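As a rough sketch of the pairwise contrastive objective (a 2-way softmax cross-entropy over the positive and negative scores for each query; the framework's exact loss implementation may differ):

import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(pos_scores, neg_scores):
    """pos_scores, neg_scores: (batch,) query-document scores from the sparse representations."""
    # Each (positive, negative) pair is a 2-way classification in which
    # the positive document should receive the higher score.
    logits = torch.stack([pos_scores, neg_scores], dim=1)       # (batch, 2)
    labels = torch.zeros(pos_scores.size(0), dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, labels)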
Reference¶
- Implementation: triplet_dataset.py
- Data schemas: schemas/config/data_training.py and schemas/data/__init__.py
- Example config: splade_data_mmarco_ja_triplet.yaml