CItA (Corpus Italiano di Apprendenti L1) | Italian Natural Language Processing Lab

CItA (Corpus Italiano di Apprendenti L1) is the first freely available, digitized corpus of essays written by Italian L1 learners. It was collected in 7 lower secondary schools located in different areas of Rome: 3 schools in the historical center and 4 in the suburbs. The current version of the corpus contains 1,353 unscored essays (369,456 tokens in total), manually annotated for errors and corrections; the corpus is updated continuously. It is accompanied by a questionnaire of 34 questions about the students' biographical, socio-cultural and sociolinguistic background.
The resource was jointly compiled by the ItaliaNLP Lab and the experimental pedagogists of the Dipartimento di Psicologia dei processi di Sviluppo e socializzazione, Università di Roma “La Sapienza”.

Download

Click here to download the corpus. (Note: after filling in the request form, the download link will appear at the bottom of the page.)

References

Barbagli A., Lucisano P., Dell’Orletta F., Montemagni S., Venturi G. (2016). CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence. In Proceedings of the 10th Edition of the International Conference on Language Resources and Evaluation (LREC 2016), 23-28 May, Portorož, Slovenia.

(Please cite the paper above if you use this corpus in your research.)