The goal

Design and development of:
- linguistic technologies aimed at the extraction and the dynamic structuring of content (either linguistic or domain knowledge) embedded in texts;
- linguistic models for investigating linguistic varieties and language variation.

Main lines of work (in a nutshell)

Automatic linguistic analysis of texts

  • design and development of multi-lingual tools for multi-level linguistic analysis of texts
  • design and construction of corpora to be used by machine learning algorithms and as benchmark resources
  • development of techniques for adapting NLP tools to domain-specific and non-canonical varieties of language

Domain knowledge extraction

  • extraction of domain-relevant entities, single nominal terminology and complex nominal structures
  • semantic annotation of named and domain-relevant entities
  • identification of domain-specific properties and events, as well as extraction of inter-entity and inter-event relational information
  • knowledge organization and knowledge graph construction
  • sentiment analysis

Linguistic knowledge extraction

  • extraction of linguistic features spanning different levels of linguistic description
  • modeling and monitoring of linguistic variation across domains, textual genres and registers; models of dialectal and sociolinguistic variation
  • linguistic profiling of texts for textual genre assessment, readability assessment and native language identification
  • design and development of linguistic complexity models and text simplification algorithms

Design and development of application prototypes

  • LinguA (Linguistic Annotation pipeline): a state-of-the-art linguistic annotation pipeline for Italian and English texts which combines rule-based and machine learning algorithms
  • T2K (Text-To-Knowledge): a system offering a battery of tools for NLP, statistical text analysis and machine learning, which are dynamically integrated to provide an accurate representation of the content of large repositories of unstructured documents in Italian and English
  • MELT (Metadata Extraction from Legal Texts): a component integrated in the xmLeges-Editor for the automatic consolidation of legal texts
  • READ–IT (Assessing Readability of Italian Texts): the first advanced readability assessment tool for the Italian language, which supports the simplification of texts according to the characteristics of the target audience