Multilingual data in ELTeC: enacting European literary traditions
The workshop "Multilingual data in ELTeC: enacting European literary traditions" introduces participants to the Distant Reading for European Literary History project (COST Action CA16204) by exploring the linguistic and computational challenges that arise in the creation of the multilingual ELTeC (European Literary Text Collection) corpus.
This free workshop will consist of two parts:
A presentation of ELTeC that will focus on: the 15 collections in ELTeC, sampling principles, encoding principles (level 0, level 1), challenges in the comparison of various collections belonging to different linguistic typologies and literary traditions;
A case study on how to deal with a multilingual literary corpus: presentation of the project on ELTeC titles, presentation of the collections chosen for annotation, discussion of annotation guidelines and a few examples.
The workshop will take place on Wednesday 17th June, 12.00-13.30 (BST).
This workshop presents challenges that arose during the implementation of COST Action CA16204: Distant Reading for European Literary History, together with several solutions for fostering a culturally informed and linguistically aware use of data. The project members, who represent 30 participating countries, identified these solutions through close collaboration within the 4 working groups. The first part discusses the pros and cons of applying strict sampling principles to heterogeneous literary traditions that did not develop synchronously, many of them defined as “emergent” and thus developing intermittently between 1840 and 1920. For the relatively young literary traditions of Central and South-Eastern Europe, the metadata-based approach, “the distance as a condition of knowledge,” the forced estrangement from a novel’s content, and the ELTeC selection criteria (e.g. at least 10/15 % female-authored novels, 20 % long novels) might look like “a bed of Procrustes.” Even so, the encoding schema (Level 1 and 2 in particular) leaves enough room to represent linguistic and cultural specificity. In fact, ELTeC presently accommodates 14 typologically diverse languages from families such as Romance, Balto-Slavic, and Germanic (for the current status of ELTeC, check https://distantreading.github.io/ELTeC/), and it is used as a benchmark corpus to test the performance of tools on lesser-resourced languages. Moreover, new entries (Ukrainian, Belarusian), extensions to collections that contain texts published before or after the indicated time span, and multilingual collections (e.g. Swiss) are encouraged.
The second part is a case study on the ELTeC paratext (titles as “thresholds” to “the great unread”). It aims to illustrate how the multilingual diversity of ELTeC can be employed to combat language indifference and to devise more comprehensive research questions that challenge theoretical assumptions and integrate both linguistic and literary concepts. While it includes first editions as well as later ones, ELTeC reflects literary and cultural conventions (e.g. genres, periods) rather than linguistic features. Nevertheless, tokenization and POS tagging tests carried out on the titles from the Romance language collections (French, Italian, Portuguese, Spanish, and Romanian) brought to the fore several situations that should challenge current tokenization practices.
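As a rough illustration of the kind of tokenization problem that titles in Romance languages can raise, the sketch below compares plain whitespace splitting with a simple rule that separates elided forms at an apostrophe. It is only a minimal sketch under stated assumptions: the titles are illustrative strings rather than entries taken from the ELTeC metadata, the function names are hypothetical, and the rule is not the procedure used in the project's tests.

```python
import re

# Illustrative titles only; they are not drawn from the ELTeC metadata.
TITLES = {
    "fra": "L'Assommoir",
    "ita": "L'innocente",
    "ron": "Într-o noapte",
}

# Assumption: an elided form ends at an apostrophe (straight or typographic)
# and is immediately followed by another word character.
ELISION = re.compile(r"(\w+['’])(?=\w)")


def whitespace_tokens(title):
    """Baseline: split on whitespace only."""
    return title.split()


def elision_aware_tokens(title):
    """Insert a space after each elided form, then split on whitespace."""
    return ELISION.sub(r"\1 ", title).split()


if __name__ == "__main__":
    for lang, title in TITLES.items():
        print(lang, whitespace_tokens(title), elision_aware_tokens(title))
```

The rule separates the article in the French and Italian examples, but it leaves the hyphenated Romanian form untouched and would wrongly split a lexicalised word such as "aujourd'hui". A single language-independent rule is therefore unlikely to be enough for a multilingual collection.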
Ioana Alexandra Lionte is currently an assistant professor at the “Grigore T. Popa” University of Medicine and Pharmacy, where she teaches English and French. In 2018, she received a grant from the COST Action "Distant Reading for European Literary History" (CA16204) and attended the "Optical Character Recognition and Text Encoding for the production of ELTeC contributions" Training School in Würzburg, Germany. In 2019 she became a member of the research team conducting the ongoing HAI-RO project (PN-III-P3-3.1-PM-RO-FR-2019-0063), entitled Hajduk Novels in Romania During the Long Nineteenth Century: Digital Edition and Corpus Analysis Assisted by Computational Tools, which seeks to remedy the shortcomings of the digital resources and instruments dedicated to literary research. She is also a member of the Digital Humanities Laboratory (“Alexandru Ioan Cuza” University of Iași).
Roxana Patraș, PhD in Philology (2012), is a Senior Researcher (Cercetător Științific gr. II) at the Institute of Interdisciplinary Research, “Alexandru Ioan Cuza” University of Iași.
She has been a visiting scholar at the Trier Center for Digital Humanities, the Antwerp Center for Digital Humanities and Literary Criticism, and Université Sorbonne Nouvelle-LATTICE, and a member of COST Action CA16204: Distant Reading for European Literary History (2017-2021). In collaboration with University “Sorbonne Nouvelle”-Paris 3 and the LATTICE Laboratory, Roxana has started a project on Romanian popular fiction entitled Hajduk Novels in Romania During the Long Nineteenth Century: Digital Edition and Corpus Analysis Assisted by Computational Tools.
Roxana does research in History of Romanian Literature, 19th-century European Literature, Literary Theory, Rhetoric, and recently DLS. Her current research interest is in 19th-century Romanian Popular Fiction (emerging novel genres) and in the theory of paratext.
Books: Cântece dinaintea Decadenţei. A.C. Swinburne şi declinul Occidentului (2013); Spații eminesciene. Studii de poetică și stilistică (2017); The Remains of the Day: political oratory and literature in 19th-century Romania (2018).
Scholarly editions: G. Ibrăileanu, Scrieri alese (2010); Oratorie politică românească (1847-1899), 3 vol. (2016).