Piccolo2: Task-Aware Loss Selection Lets One Model Excel Across All Embedding Tasks


SenseTime Tech Report 2024

Tags: Text Embedding · Foundation Model · Contrastive Learning · Multi-task Learning · Matryoshka Representation · Chinese NLP · Information Retrieval · Open Source · HuggingFace

Situation: Text embedding models are trained with a single loss regardless of task type, but retrieval, sentence similarity, and classification tasks have fundamentally different data structures — making any one objective a poor fit for all of them simultaneously. Task: Build a general-purpose text embedding model that achieves top performance across all six CMTEB task categories at once. Action: Designed a training framework that routes each data batch to the loss function best suited to its task type, combined with expanded embedding dimensions and multi-resolution training for flexible deployment. Result: Achieves a CMTEB average of 70.95 — surpassing all prior models at the 300M parameter scale and outperforming models 20× larger.


Background & Challenge

Text embedding benchmarks span six heterogeneous task families, yet existing models train with a single loss:

1. Loss-task mismatch. The dominant contrastive loss works well for retrieval but is a poor fit for sentence similarity tasks, where labels are continuous scores rather than binary positives. Forcing all task types through the same objective creates a consistent performance ceiling across non-retrieval categories.

2. Representation capacity bottleneck. Prior BERT-based embedding models fix output size at 768 dimensions. Larger dimensions improve performance, but production systems often need smaller vectors for speed and storage — with no principled way to cover the full deployment range from a single set of weights.


Methodology

Task-Matched Loss Routing

Instead of applying one loss to all data, each training batch is routed to the objective that matches its task type:

  • Retrieval and Reranking — standard in-batch contrastive loss; pulls query and positive passage together while treating other in-batch samples as negatives.
  • Sentence Similarity — a ranking loss that directly optimizes cosine similarity against fine-grained numerical labels, avoiding the information loss of binarizing continuous scores into triplets.
  • Classification and Clustering — each text is paired with its target label as the positive and other labels as negatives, converting supervised classification data into contrastive training without requiring cross-device negatives.
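The routing above can be sketched in a few lines. This is a minimal illustration, not the Piccolo2 codebase: function names, batch fields, and the temperature value are assumptions, and the STS objective is written in the CoSENT style of ranking loss the description implies.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.05):
    """Retrieval/reranking: in-batch contrastive loss; diagonal pairs
    are positives, all other in-batch passages act as negatives."""
    sim = query_emb @ passage_emb.T / temperature          # (B, B)
    labels = torch.arange(sim.size(0), device=sim.device)  # diagonal indices
    return F.cross_entropy(sim, labels)

def cosent_loss(emb_a, emb_b, scores, temperature=0.05):
    """STS: CoSENT-style ranking loss; for every pair (i, j) with
    score_i > score_j, push cos_i above cos_j."""
    cos = F.cosine_similarity(emb_a, emb_b) / temperature  # (B,)
    diff = cos[None, :] - cos[:, None]                     # diff[i, j] = cos_j - cos_i
    mask = scores[:, None] > scores[None, :]               # ordered pairs only
    zero = torch.zeros(1, device=cos.device)               # the "+1" inside the log
    return torch.logsumexp(torch.cat([zero, diff[mask]]), dim=0)

def label_contrastive(text_emb, label_emb, label_ids, temperature=0.05):
    """Classification/clustering: the target label's embedding is the
    positive; every other label embedding acts as a negative."""
    sim = text_emb @ label_emb.T / temperature             # (B, num_labels)
    return F.cross_entropy(sim, label_ids)

def route_loss(batch):
    """Dispatch a batch to the objective matching its task type."""
    if batch["task"] in ("retrieval", "reranking"):
        return info_nce(batch["query"], batch["passage"])
    if batch["task"] == "sts":
        return cosent_loss(batch["a"], batch["b"], batch["scores"])
    return label_contrastive(batch["text"], batch["label_emb"], batch["label_ids"])
```

Because every objective consumes a homogeneous batch, the router needs no gradient tricks: each step simply calls the loss that fits the batch's data structure.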

Ablations confirm each routing decision matters: adding the sentence similarity loss raises the average from 68.75 to 69.87, and adding the classification loss further raises it to 70.95.

Dimension Scaling and Flexible Inference

A learnable projection expands the backbone's 768-dim output to 1792 dimensions, improving capacity without modifying the pretrained weights. The model is simultaneously trained at multiple nested dimensions (256 to 1792) in the Matryoshka style, so any prefix of the full embedding vector is itself a valid, high-quality embedding; no retraining is needed to match different deployment constraints.
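The multi-dimension training can be sketched as a Matryoshka-style loss that re-applies the contrastive objective to each embedding prefix. The dimension list matches the 256-to-1792 range in the report; the function name and averaging scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

MRL_DIMS = [256, 512, 768, 1280, 1792]  # nested evaluation dimensions

def mrl_contrastive_loss(query_emb, passage_emb, temperature=0.05):
    """Apply the in-batch contrastive loss to every prefix of the
    projected embedding, so each truncation stays a valid embedding."""
    total = 0.0
    for d in MRL_DIMS:
        q = F.normalize(query_emb[:, :d], dim=-1)   # renormalize each prefix
        p = F.normalize(passage_emb[:, :d], dim=-1)
        sim = q @ p.T / temperature                 # (B, B)
        labels = torch.arange(sim.size(0), device=sim.device)
        total = total + F.cross_entropy(sim, labels)
    return total / len(MRL_DIMS)
```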

MRL stability: reducing eval dimension from 1792 to 256 degrades average performance by under 1 point (70.95 → 69.99).

Data Pipeline

Piccolo2 trains on approximately 3.7M samples across all six task categories:

| Task | Format | Volume |
|---|---|---|
| STS | text, text pair, score | 730k |
| Pair Classification | text, text pair, score | 440k |
| Retrieval (real) | query, pos, hard neg | 1.1M |
| Retrieval (synthetic) | query, pos, hard neg | 200k |
| Clustering | text, pos label, neg label | 1M |
| Classification | text, pos label, neg label | 220k |

Hard negatives for retrieval are mined from rank 50–100 to avoid annotation noise at rank 1.
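The rank-window mining can be sketched as follows. The 50–100 window comes from the report; the function, its arguments, and the assumption of pre-computed unit-normalized embeddings are illustrative.

```python
import numpy as np

def mine_hard_negatives(query_emb, corpus_emb, pos_idx, lo=50, hi=100, k=8):
    """Rank the corpus for each query, then draw negatives from ranks
    [lo, hi): deep enough to skip likely unlabeled positives near the
    top, shallow enough that the negatives remain hard."""
    sims = query_emb @ corpus_emb.T        # (Q, N); cosine if rows are normalized
    order = np.argsort(-sims, axis=1)      # indices sorted by descending similarity
    negatives = []
    for qi, ranked in enumerate(order):
        window = [int(i) for i in ranked[lo:hi] if i != pos_idx[qi]]
        negatives.append(window[:k])       # keep at most k per query
    return negatives
```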


Results

Empirical Performance

Evaluated on CMTEB (31 datasets, 6 task categories):

| Task | Piccolo2 | Previous SOTA | Delta |
|---|---|---|---|
| Classification | 74.59 | 73.35 | +1.24 |
| Clustering | 62.17 | 67.08 | −4.91 |
| Pair Classification | 90.24 | 88.52 | +1.72 |
| Reranking | 70.00 | 69.67 | +0.33 |
| Retrieval | 74.36 | 74.05 | +0.31 |
| STS | 63.50 | 62.46 | +1.04 |
| Average | 70.95 | 69.07 | +1.88 |

Piccolo2 (300M parameters) beats gte-Qwen1.5-7B-instruct (7B parameters) by 1.39 points on the CMTEB average, demonstrating that task-matched training objectives can outperform raw model scale on embedding benchmarks.

Flexible Dimension Inference

| Eval Dim | Average Score |
|---|---|
| 1792 | 70.95 |
| 1280 | 70.87 |
| 768 | 70.69 |
| 512 | 70.41 |
| 256 | 69.99 |

A single trained model covers the full production range with under 1-point degradation end-to-end.

Field Contribution

Piccolo2 establishes that task-type-aware loss routing — matching each batch to the objective that fits its data structure — is a simple, reproducible win that requires no architectural changes. A single training run produces embeddings valid across the full deployment range, from high-dimension retrieval systems to storage-constrained applications.


Open-Source Access

Piccolo2 is fully open-sourced through SenseTime's HuggingFace organization.

Quick Start

pip install -r requirements.txt

Load Model for Embedding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("selmisskilig/piccolo-gpt-zh")
embeddings = model.encode(["你好世界", "深度学习很有趣"])
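To deploy at a smaller dimension, truncate each returned vector to a prefix and re-normalize before computing cosine similarity. A minimal helper sketch (`truncate_embedding` is a hypothetical name, not part of the released package):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL-trained embedding and
    re-normalize to unit length so cosine similarity stays meaningful."""
    prefix = emb[..., :dim]
    return prefix / np.linalg.norm(prefix, axis=-1, keepdims=True)
```

For example, `truncate_embedding(model.encode(texts), 256)` trades a small accuracy drop (per the table above, under 1 point on CMTEB) for 7× smaller vectors.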

Run Finetuning

bash scripts/ft.sh

For training configuration, dataset layout, and DeepSpeed distributed training setup, see the GitHub README →


Citation

@misc{2405.06932,
  Author = {Junqin Huang and Zhongjie Hu and Zihao Jing and Mengya Gao and Yichao Wu},
  Title  = {Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training},
  Year   = {2024},
  Eprint = {arXiv:2405.06932},
}

Contact

Zihao Jing (co-author) — zjing29@uwo.ca

Questions about training recipes, dataset preparation, or using Piccolo2 as a baseline? Feel free to reach out.