SimilEx | Italian Natural Language Processing Lab

SimilEx is the first Italian dataset for sentence similarity with natural language explanations. It comprises 2,112 sentence pairs manually annotated for semantic similarity, 907 of which are further enriched with free-form, human-written explanations that justify the assigned similarity scores. The sentence pairs in SimilEx are derived from a collection of novels translated into Italian from the late 19th century. The dataset also includes the results of a stylistic analysis of the paired sentences and their corresponding explanations.

Download

Click here to download the corpus. (Note: after filling in the request form, the download link will appear at the bottom of the page.)

References

Alzetta C., Dell’Orletta F., Fazzone C., Venturi G. (2024) SimilEx: the First Italian Dataset for Sentence Similarity with Natural Language Explanations. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it), 4-6 December 2024, Pisa, Italy.

(Please cite the paper above if you use this corpus in your research.)