It allows the extraction of more than 130 features spanning different levels of linguistic description, and it has been specifically devised to be multilingual, since it is based on the Universal Dependencies framework.
The tool implements a two-stage process: linguistic annotation and linguistic profiling. The annotation of the text(s) is performed by UDPipe (Straka et al. 2016) using the available UD model(s), version 2.5, for the input language. When more than one model is available for a language, the one trained on the largest treebank is used by default.
The automatically annotated text(s) are then passed to the second stage, the linguistic profiling component, which defines the rules used to extract and quantify the formal properties of the text.
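To make the second stage more concrete, here is a minimal sketch of what profiling a CoNLL-U annotated text can look like. It is not the Profiling-UD implementation: the sample annotation is hand-written, the function names `parse_conllu` and `profile` are hypothetical, and the four features computed (sentence count, token count, tokens per sentence, UPOS distribution, average dependency link length) are simple illustrations whose names and exact definitions are assumptions.

```python
from collections import Counter

# Hand-written CoNLL-U sample (10 tab-separated columns per token line).
# In the real tool this annotation would come from UDPipe.
CONLLU = """\
# text = The cat sleeps.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_

# text = Dogs bark loudly.
1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_
3\tloudly\tloudly\tADV\t_\t_\t2\tadvmod\t_\t_
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append({
            "id": int(cols[0]),
            "upos": cols[3],
            "head": int(cols[6]),
        })
    if current:
        sentences.append(current)
    return sentences


def profile(sentences):
    """Compute a few illustrative features (names are assumptions)."""
    tokens = [tok for sent in sentences for tok in sent]
    upos_counts = Counter(tok["upos"] for tok in tokens)
    # Linear distance between each token and its head; the root is skipped.
    link_lengths = [abs(tok["id"] - tok["head"]) for tok in tokens if tok["head"] != 0]
    return {
        "n_sentences": len(sentences),
        "n_tokens": len(tokens),
        "tokens_per_sent": len(tokens) / len(sentences),
        "upos_dist": {tag: 100 * n / len(tokens) for tag, n in upos_counts.items()},
        "avg_link_length": sum(link_lengths) / len(link_lengths),
    }


features = profile(parse_conllu(CONLLU))
print(features["n_sentences"], features["n_tokens"], features["tokens_per_sent"])
```

The sketch keeps annotation and profiling separate, mirroring the two-stage design: the profiling step only needs the CoNLL-U columns, so it does not depend on which parser produced them.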
In the web-based interface, you can either upload a plain text file (or a collection of files as a zipped folder) or paste the text to be analyzed. Before running the analysis, you must specify the language of the input text and the unit of analysis on which linguistic profiling is carried out (i.e. document or sentence). If you want to keep the existing sentence segmentation of your text(s), select the ‘presegmented’ option.
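If you need to prepare a collection for upload, the following is one possible way to bundle plain text files into a zip archive with the Python standard library. The helper name `zip_text_files` is hypothetical, and placing the files at the root of the archive (rather than inside a subfolder) is an assumption about what the interface accepts.

```python
import zipfile
from pathlib import Path


def zip_text_files(folder, archive_path):
    """Bundle every .txt file in `folder` into a zip archive.

    Files are stored at the archive root under their own names
    (an assumption; adjust `arcname` if a subfolder is expected).
    Returns the list of file names that were added.
    """
    txt_files = sorted(Path(folder).glob("*.txt"))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in txt_files:
            archive.write(path, arcname=path.name)
    return [path.name for path in txt_files]
```

For example, `zip_text_files("my_corpus", "my_corpus.zip")` would create `my_corpus.zip` from the `.txt` files found directly in `my_corpus`.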
For each uploaded text (or collection), Profiling-UD outputs three downloadable files:
- a file in CoNLL-U (tab-separated) format containing the results of the automatic annotation stage;
- a file in csv format containing the results of the linguistic profiling with each monitored feature in a separate column;
- a file in txt format containing the legend of the features (explained in Section 2.1 of the paper).
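Because the profiling results come as a table with one feature per column, they are easy to load for further analysis. The sketch below reads such a table with Python's `csv` module. The sample data, the column names (`Filename`, `n_sentences`, `n_tokens`, `tokens_per_sent`), the comma delimiter, and the helper `load_profiles` are all assumptions made for illustration; check the downloaded file and the legend for the actual header and separator.

```python
import csv
import io

# Made-up sample in the assumed layout: one row per unit of analysis,
# one column per feature, plus an identifier column.
SAMPLE_CSV = """\
Filename,n_sentences,n_tokens,tokens_per_sent
doc1.txt,2,8,4.0
doc2.txt,3,30,10.0
"""


def load_profiles(handle, id_column="Filename", delimiter=","):
    """Return {identifier: {feature_name: float_value}} from a CSV handle."""
    profiles = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        doc_id = row.pop(id_column)
        profiles[doc_id] = {name: float(value) for name, value in row.items()}
    return profiles


profiles = load_profiles(io.StringIO(SAMPLE_CSV))

# Example query: which document has the longest sentences on average?
longest = max(profiles, key=lambda doc: profiles[doc]["tokens_per_sent"])
print(longest)
```

To read a real download, replace the `io.StringIO(...)` handle with `open(path, newline="", encoding="utf-8")` and pass the delimiter the file actually uses.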
Click here to try the demo.
Brunato D., Cimino A., Dell’Orletta F., Montemagni S., Venturi G. (2020) “Profiling-UD: a Tool for Linguistic Profiling of Texts”. In Proceedings of the 12th Edition of the International Conference on Language Resources and Evaluation (LREC 2020), 11-16 May, 2020, Marseille, France.
(Please cite the paper above if you use Profiling-UD in your research)