PaCCSS-IT is a corpus of Complex-Simple Aligned Sentences for ITalian. It contains about 63,000 sentence pairs extracted from the ItWaC corpus, the largest copyright-free corpus of contemporary Italian web texts. To build the resource, we developed a new approach for automatically acquiring large corpora of paired sentences that capture structural transformations (such as deletion and reordering) and are particularly suitable for text simplification.
Click here to download the corpus.
Brunato D., Cimino A., Dell’Orletta F., Venturi G. (2016) “PaCCSS-IT: A Parallel Corpus of Complex-Simple Sentences for Automatic Text Simplification”. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 1-5 November, Austin, Texas, USA, pp. 351-361.
(Please cite the paper above if you make use of this corpus in your research)