---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:120000
- multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
- source_sentence: Who is filming along?
  sentences:
  - Wién filmt mat?
  - >-
    Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
    krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
  - Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
  sentences:
  - >-
    Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai,
    do gëtt jo een ganz neie Wunnquartier gebaut.
  - >-
    D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor
    re eso'gucr me' we' 90 prozent.
  - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
    Non-profit organisation Passerell, which provides legal counsel to
    refugees in Luxembourg, announced that it has to make four employees
    redundant in August due to a lack of funding.
  sentences:
  - Oetringen nach Remich....8.20» 215»
  - >-
    D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
    këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
  - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
  sentences:
  - Six Jours vu New-York si fir d’équipe Girgetti — Debacco
  - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
  - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
  sentences:
  - D'grenzarbechetr missten och me' lo'n kre'en.
  - >-
    De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
    gemâcht!
  - >-
    D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
    verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
  results:
  - task:
      type: contemporary-lb
      name: Contemporary-lb
    dataset:
      name: Contemporary-lb
      type: contemporary-lb
    metrics:
    - type: accuracy
      value: 0.6216
      name: SIB-200(LB) accuracy
    - type: accuracy
      value: 0.6282
      name: ParaLUX accuracy
  - task:
      type: bitext-mining
      name: LBHistoricalBitextMining
    dataset:
      name: LBHistoricalBitextMining
      type: lb-en
    metrics:
    - type: accuracy
      value: 0.9683
      name: LB<->FR accuracy
    - type: accuracy
      value: 0.9715
      name: LB<->EN accuracy
    - type: mean_accuracy
      value: 0.9793
      name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---

# THIS IS A PREVIEW MODEL for the IMPRESSO HALLOWEEN WORKSHOP

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base), further adapted to support Historical and Contemporary Luxembourgish. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It would be particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections. It is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).
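As a concrete illustration, the minimal sketch below ranks a few historical Luxembourgish sentences (taken from the widget examples above) against an English query by cosine similarity; exact scores will vary:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "impresso-project/halloween_workshop_ocr_robust_with_lux_preview",
    trust_remote_code=True,
)

# An English query against a small historical Luxembourgish collection
query = "The cross-border workers should also receive more wages."
collection = [
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.",
]

# Encode both sides and rank by cosine similarity
query_emb = model.encode(query)
doc_embs = model.encode(collection)
scores = util.cos_sim(query_emb, doc_embs)[0].tolist()

for sentence, score in sorted(zip(collection, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:.3f}  {sentence}")
```

The top-ranked sentence should be the historical Luxembourgish counterpart of the English query, despite its non-standard spelling.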
## Limitations

We also release a model that performs substantially better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use [histlux-paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2).

### Model Description

- **Model Type:** GTE-Multilingual-Base
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:** See below

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('impresso-project/halloween_workshop_ocr_robust_with_lux_preview', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```

## Training Details

### Training Dataset

The parallel sentence data mix is the following:

impresso-project/HistLuxAlign:
- LB-FR (x20,000)
- LB-EN (x20,000)
- LB-DE (x20,000)

fredxlpy/LuxAlign:
- LB-FR (x40,000)
- LB-EN (x20,000)

Total: 120,000 sentence pairs, presented in mixed batches of size 8.

### Contrastive Training

The model was trained with the following parameters:

**Loss**: `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the fit()-Method:

```
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
}
```

A minimal sketch of how this configuration maps onto the `sentence-transformers` API is given after the citation section below.

## Citation

### BibTeX

#### Adapting Multilingual Embedding Models to Historical Luxembourgish (paper introducing this model)

```bibtex
@inproceedings{michail-etal-2025-adapting,
    title = "Adapting Multilingual Embedding Models to Historical {L}uxembourgish",
    author = "Michail, Andrianos and
      Racl{\'e}, Corina and
      Opitz, Juri and
      Clematide, Simon",
    editor = "Kazantseva, Anna and
      Szpakowicz, Stan and
      Degaetano-Ortlieb, Stefania and
      Bizzoni, Yuri and
      Pagel, Janis",
    booktitle = "Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)",
    month = may,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.latechclfl-1.26/",
    doi = "10.18653/v1/2025.latechclfl-1.26",
    pages = "291--298",
    ISBN = "979-8-89176-241-1"
}
```

#### Original Multilingual GTE Model

```bibtex
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}
```
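As referenced above, here is a minimal sketch of how the contrastive setup described under Training Details maps onto the legacy `fit()` API of `sentence-transformers`. The data loading is elided and the example pair is illustrative; this is not the exact training script used for this model.

```python
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses

# Start from the multilingual base model
# (trust_remote_code is needed for the GTE architecture)
model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

# Placeholder pair; the real mix is 120,000 parallel pairs drawn from
# impresso-project/HistLuxAlign and fredxlpy/LuxAlign (see above)
train_examples = [
    InputExample(texts=[
        "The cross-border workers should also receive more wages.",
        "D'grenzarbechetr missten och me' lo'n kre'en.",
    ]),
    # ... one InputExample per parallel sentence pair
]

# Mixed batches of size 8, as described in the training details
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch negatives with cosine similarity and scale 20.0
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=520,
    max_grad_norm=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
)
```

With `MultipleNegativesRankingLoss`, every other pair in a batch serves as a negative example, which is why parallel sentence pairs alone are sufficient for training.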
## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719), and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2025 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

*Impresso Project Logo*