---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:120000
- multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
- source_sentence: Who is filming along?
  sentences:
  - Wién filmt mat?
  - >-
    Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
    krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
  - Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
  sentences:
  - >-
    Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai,
    do gëtt jo een ganz neie Wunnquartier gebaut.
  - >-
    D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor
    re eso'gucr me' we' 90 prozent.
  - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
    Non-profit organisation Passerell, which provides legal counsel to
    refugees in Luxembourg, announced that it has to make four employees
    redundant in August due to a lack of funding.
  sentences:
  - Oetringen nach Remich....8.20» 215»
  - >-
    D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
    këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
  - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
  sentences:
  - Six Jours vu New-York si fir d’équipe Girgetti — Debacco
  - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
  - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
  sentences:
  - D'grenzarbechetr missten och me' lo'n kre'en.
  - >-
    De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
    gemâcht!
  - >-
    D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
    verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
  results:
  - task:
      type: contemporary-lb
      name: Contemporary-lb
    dataset:
      name: Contemporary-lb
      type: contemporary-lb
    metrics:
    - type: accuracy
      value: 0.6216
      name: SIB-200(LB) accuracy
    - type: accuracy
      value: 0.6282
      name: ParaLUX accuracy
  - task:
      type: bitext-mining
      name: LBHistoricalBitextMining
    dataset:
      name: LBHistoricalBitextMining
      type: lb-en
    metrics:
    - type: accuracy
      value: 0.9683
      name: LB<->FR accuracy
    - type: accuracy
      value: 0.9715
      name: LB<->EN accuracy
    - type: mean_accuracy
      value: 0.9793
      name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---

# THIS IS A PREVIEW MODEL for the IMPRESSO HALLOWEEN WORKSHOP

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base), further adapted to support Historical and Contemporary Luxembourgish. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It would be particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections. It is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).
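As a concrete illustration, the minimal sketch below ranks a few historical Luxembourgish sentences (taken from the widget examples above) against an English query by cosine similarity; exact scores will vary:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "impresso-project/halloween_workshop_ocr_robust_with_lux_preview",
    trust_remote_code=True,
)

# An English query against a small historical Luxembourgish collection
query = "The cross-border workers should also receive more wages."
collection = [
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.",
]

# Encode both sides and rank by cosine similarity
query_emb = model.encode(query)
doc_embs = model.encode(collection)
scores = util.cos_sim(query_emb, doc_embs)[0].tolist()

for sentence, score in sorted(zip(collection, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:.3f}  {sentence}")
```

The top-ranked sentence should be the historical Luxembourgish counterpart of the English query, despite its non-standard spelling.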
## Limitations

We also release a model that performs substantially better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use [histlux-paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2).

### Model Description

- **Model Type:** GTE-Multilingual-Base
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:** See below

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('impresso-project/halloween_workshop_ocr_robust_with_lux_preview', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```

## Training Details

### Training Dataset

The parallel sentence data mix is the following:

impresso-project/HistLuxAlign:
- LB-FR (x20,000)
- LB-EN (x20,000)
- LB-DE (x20,000)

fredxlpy/LuxAlign:
- LB-FR (x40,000)
- LB-EN (x20,000)

Total: 120,000 sentence pairs, presented in mixed batches of size 8.

### Contrastive Training

The model was trained with the following parameters:

**Loss**: `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the fit()-Method:

```
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
}
```

A minimal sketch of how this configuration maps onto the `sentence-transformers` API is given after the citation section below.

## Citation

### BibTeX

#### Adapting Multilingual Embedding Models to Historical Luxembourgish (paper introducing this model)

```bibtex
@inproceedings{michail-etal-2025-adapting,
    title = "Adapting Multilingual Embedding Models to Historical {L}uxembourgish",
    author = "Michail, Andrianos and
      Racl{\'e}, Corina and
      Opitz, Juri and
      Clematide, Simon",
    editor = "Kazantseva, Anna and
      Szpakowicz, Stan and
      Degaetano-Ortlieb, Stefania and
      Bizzoni, Yuri and
      Pagel, Janis",
    booktitle = "Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.latechclfl-1.26/",
    doi = "10.18653/v1/2025.latechclfl-1.26",
    pages = "291--298",
    ISBN = "979-8-89176-241-1"
}
```

#### Original Multilingual GTE Model

```bibtex
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}
```
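As referenced above, here is a minimal sketch of how the contrastive setup described under Training Details maps onto the legacy `fit()` API of `sentence-transformers`. The data loading is elided and the example pair is illustrative; this is not the exact training script used for this model.

```python
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses

# Start from the multilingual base model
# (trust_remote_code is needed for the GTE architecture)
model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

# Placeholder pair; the real mix is 120,000 parallel pairs drawn from
# impresso-project/HistLuxAlign and fredxlpy/LuxAlign (see above)
train_examples = [
    InputExample(texts=[
        "The cross-border workers should also receive more wages.",
        "D'grenzarbechetr missten och me' lo'n kre'en.",
    ]),
    # ... one InputExample per parallel sentence pair
]

# Mixed batches of size 8, as described in the training details
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch negatives with cosine similarity and scale 20.0
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=520,
    max_grad_norm=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
)
```

With `MultipleNegativesRankingLoss`, every other pair in a batch serves as a negative example, which is why parallel sentence pairs alone are sufficient for training.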
## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719), and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2025 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

*Impresso Project Logo*