“First Shared Task on Dependency Parsing of Legal Texts” at SPLeT 2012

We organised the shared task in the framework of the 4th Workshop on “Semantic Processing of Legal Texts” (SPLeT 2012). It aimed, on the one hand, at identifying the specific challenges posed by the analysis of legal texts in two languages, Italian and English, and, on the other hand, at obtaining a clearer idea of the current performance of state-of-the-art dependency parsing systems. The task was also an opportunity to develop and share multilingual domain-specific resources. It comprised two subtasks:

  1. dependency parsing: this represents the basic and mandatory subtask, focusing on dependency parsing of legal texts;
  2. domain adaptation: this is a more challenging (and optional) subtask, focusing on the adaptation of general purpose dependency parsers to the legal domain.

For detailed documentation about the task, see the First Shared Task on Dependency Parsing of Legal Texts home page.

Download

Click here to download the following Italian datasets:

  • Source domain training and development data
    • Data for training and testing base parsing systems; includes articles from newspapers exemplifying general language.
  • Target domain development data
    • Data for testing during system development, gathering laws enacted by the Italian State and Regions. It contains:
      • a large target corpus of legislative texts with automatically generated sentence splitting, tokenization, morpho-syntactic tagging and lemmatization; it does not contain labeled dependency relations;
      • a manually annotated test set, also including labeled dependency relations.
  • Target domain test data
    • Official test data, gathering laws enacted by the European Commission. It contains a large target corpus of legislative texts with automatically generated sentence splitting, tokenization, morpho-syntactic tagging and lemmatization. It does not contain labeled dependency relations.
  • Target domain gold data
    • Test data with gold standard annotation.

Click here to download the following English datasets:

  • Target domain test data
    • Official test data, gathering laws enacted by the European Commission. It contains a large target corpus of legislative texts with automatically generated sentence splitting, tokenization, morpho-syntactic tagging and lemmatization. It does not contain labeled dependency relations.
  • Target domain gold data
    • Test data with gold standard annotation.

(Note: For all datasets, after filling in the request form the download link will appear at the bottom of the page.)

Send an e-mail to ldc@ldc.upenn.edu to obtain the following English datasets, which are distributed by the Linguistic Data Consortium:

  • Source domain training and development data:
    • Data for training and testing base parsing systems; includes data extracted from the Penn Treebank (PTB) and distributed as training and test data in the CoNLL 2007 Shared Task.
  • Target domain development data:
    • Data for testing during system development, consisting of the files used for the final testing of the systems in the “Domain Adaptation Track” of the CoNLL 2007 Shared Task, namely:
      • a large target corpus of chemical abstracts (CHEM corpus) with automatically generated sentence splitting, tokenization, morpho-syntactic tagging and lemmatization; it does not contain labeled dependency relations;
      • a manually annotated test set, also including labeled dependency relations.
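
The manually annotated and gold files listed above can be compared with a parser's output to obtain attachment scores. The following is a minimal sketch, not the official scorer: it assumes both files use the tab-separated 10-column format of the CoNLL 2007 Shared Task (HEAD in column 7, DEPREL in column 8), that sentences are separated by blank lines, and that every token, punctuation included, is scored. The function names and the three-token sample are hypothetical and serve only as illustration.

```python
from typing import List, Tuple

Arc = Tuple[str, str]  # (HEAD, DEPREL) of one token


def read_conll(text: str) -> List[List[Arc]]:
    """Split CoNLL-style text into sentences of (HEAD, DEPREL) pairs."""
    sentences: List[List[Arc]] = []
    current: List[Arc] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # Assumption: columns 7 and 8 (1-based) hold HEAD and DEPREL.
        current.append((cols[6], cols[7]))
    if current:
        sentences.append(current)
    return sentences


def attachment_scores(gold_text: str, pred_text: str) -> Tuple[float, float]:
    """Return (UAS, LAS) computed over all tokens."""
    gold, pred = read_conll(gold_text), read_conll(pred_text)
    if len(gold) != len(pred):
        raise ValueError("gold and predicted files differ in sentence count")
    total = unlabeled = labeled = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("sentence lengths differ; tokenization must match")
        for (g_head, g_rel), (p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:
                unlabeled += 1  # correct head
                if g_rel == p_rel:
                    labeled += 1  # correct head and correct label
    if total == 0:
        raise ValueError("no tokens to score")
    return unlabeled / total, labeled / total


def _rows(rows: List[List[str]]) -> str:
    """Join hypothetical sample rows into tab-separated CoNLL text."""
    return "\n".join("\t".join(r) for r in rows) + "\n"


# Made-up sample: the prediction has all heads right but one label wrong.
GOLD = _rows([
    ["1", "The", "the", "D", "D", "_", "2", "det", "_", "_"],
    ["2", "law", "law", "N", "N", "_", "3", "subj", "_", "_"],
    ["3", "applies", "apply", "V", "V", "_", "0", "ROOT", "_", "_"],
])
PRED = _rows([
    ["1", "The", "the", "D", "D", "_", "2", "det", "_", "_"],
    ["2", "law", "law", "N", "N", "_", "3", "obj", "_", "_"],
    ["3", "applies", "apply", "V", "V", "_", "0", "ROOT", "_", "_"],
])

uas, las = attachment_scores(GOLD, PRED)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints "UAS=1.00 LAS=0.67"
```

Because the comparison is token by token, the predicted file must keep the tokenization and sentence splitting supplied with the test data.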

Publication

Dell’Orletta F., Marchi S., Montemagni S., Plank B., Venturi G. (2012) The SPLeT-2012 Shared Task on Dependency Parsing of Legal Texts. In Proceedings of the 4th Workshop on Semantic Processing of Legal Texts (SPLeT 2012), held in conjunction with LREC 2012, Istanbul, Turkey, 27th May, pp. 42-51.