Bunty Avieson - University of Sydney. Wikipedia as pharmakon: poison and cure for minority languages
Wikipedia launched in 2001 as a free online platform seeking to amass all human knowledge in digital form. The intellectual wealth of the world would be collated and freely shared, dissolving knowledge hierarchies of education, economics, culture and geography. In many ways it has been an extraordinary success. In November 2019 English Wikipedia had nearly six million articles, written by its 122,000 volunteer editors, and the site was visited more than 12 billion times. It is also a triumph of collaboration: the only not-for-profit website in the top 50, and the biggest example in history of infrastructure created and maintained by volunteer labour. However, such popularity has created new gatekeepers of knowledge with the power to determine not only what constitutes knowledge, but whose knowledge is privileged.
English Wikipedia is dominated by editors from the Global North, who are most likely white males (85-90%), technically skilled, under 30, white-collar and Christian. Research shows the multiplicity of ways this skews content towards their world view as well as to first-world topics. Less represented are women, minority groups and countries in the Global South. These biases result in western contributors overwriting local knowledge, perpetuating western narratives and presenting non-western cultures through a western lens. The structural inequalities of the offline world are being transferred online, creating new sites for colonialism.
But Wikipedia can be viewed as a pharmakon, offering both poison and cure. While English Wikipedia dominates, there are more than 300 other language Wikipedias, including Japanese, Vietnamese, Korean, Urdu, Tamil, Sinhalese, and Dzongkha, the national language of Bhutan. Each offers potential for cultural determination and resilience, in ways not possible before the Internet. The platform has multi-media capabilities, which can be utilized by oral cultures, and online meeting places, where communities can collaborate on content creation and continue their national imagining, not beholden to outside influences. They can become sites of inclusion for refugees, diasporas and communities that exist across geo-political borders.
This paper presents preliminary work on a three-year, Australian Research Council-funded project that takes the kingdom of Bhutan’s experiences with Wikipedia, both the English site and the Dzongkha site, as a case study to consider 1) some of the impacts of English Wikipedia on global knowledge diversity and 2) the potential for minority language Wikipedias to act as a cultural bulwark. Its underlying thesis is that English Wikipedia’s size and dominance pose a threat to global diversity and cognitive justice, while the free online platform offers other opportunities for cultural renewal and resilience. As the world increasingly meets online, this project seeks to identify the ways we risk replicating the colonialism of previous eras, and how we might instead take advantage of online digital affordances, such as the Wikipedia platform, to shape a more inclusive future.
Andiswa Bukula - Decolonizing the Internet's Languages and some questions of epistemic (in)justice
There are many definitions of multilingualism. According to Li (2008) a multilingual individual is anyone who can communicate in more than one language, be it active, through speaking and writing, or passive, through listening and reading. Another well-known definition of multilingualism is given by the European Commission (2007) which describes multilingualism as the ability of societies, institutions, groups, and individuals to engage, on a regular basis, with more than one language in their day-to-day lives.
The talk aims to focus on the value of multilingualism to advance inclusiveness in the education sector, especially in countries that have more than one official language. Most countries around the world have at most two or three official languages, unlike South Africa. South Africa is a very diverse country, with 11 officially recognized languages. Two of these languages, English and Afrikaans, are of European origin, whilst the other nine languages are indigenous and hold a lesser status. Due to this status, the government has put in place practical systems that could elevate the standard of these languages. The Department of Higher Education and Training (DHET) is working to decolonize these languages in order to make knowledge more available to South African people whose mother tongue is not English or Afrikaans.
The lightning talk will illustrate the key emphasis on decolonizing language use in higher education. Controversy remains over whether South African universities should adhere to English as the sole academic language, rather than providing lessons in all official languages and making all university websites available in all official languages.
On a more general note, one initiative to cater for all citizens of the country is the translation of medical, traffic and road-sign pamphlets circulated to the public into all official languages, so that the information is readily accessible. This matters most notably because for many South Africans English is their third or fourth language. In this digital era, one way to provide information in all languages is to have as many of these documents as possible translated into all these languages. Given the shortage of translators, machine translation systems could aid the process, bearing in mind the variable quality of these systems. This would not only speed up the process but also supply multilingual data for additional research purposes. It is acknowledged that languages are an important part of the intangible culture and heritage of humanity; this paper therefore seeks ideas on how we can preserve indigenous languages through multilingual research and projects.
Keywords: Heritage and cultural preservation, multilingualism, indigenous languages, South Africa
Matteo Dutto Monash University - School of Languages, Literatures, Cultures and Linguistics. #YouthintheCity: Re-mapping Transcultural Spaces through the Voices of Multilingual Migrant Youth
Prato (Italy), 5 October 2019: The room is buzzing with excitement as the Youth in the City: One Place, Many Cultures exhibition nears its launch. Over 200 people navigate across pictures, maps, interactive artworks and a multi-screen video installation, guided by the forty-eight young students who collectively produced these artworks over the space of five days. Suddenly the space is filled with voices in different languages, as a group of youth takes centre stage to welcome everyone into their own re-imagined city. This powerful moment is the culmination of an intensive week of creative workshops, in which forty-eight high-school students of nine different cultural and linguistic backgrounds joined a team of researchers from Monash University, Aalborg University and Human Ecosystems Relazioni to reclaim Prato, Italy’s most multicultural city, and re-envision it through their own unique voices and perspectives.
This presentation reflects on the methodology developed for this pilot study to map the possibilities and challenges that digital storytelling offers when working with multilingual youth. It argues that the linguistic and cultural diversity of participants allowed them to critically approach digital mapping and storytelling techniques, developing their own ways of sharing new perspectives on transcultural places. Showcasing a selection of the digital artworks produced by students and the interactive storytelling experience developed by the research team, this presentation aims to stimulate a transdisciplinary and creative dialogue on how participatory action and digital storytelling techniques can open up new approaches to research on the translingual and transcultural dynamics experienced by migrant youth.
Sarah McMonagle - University of Hamburg. Which (mis)perceptions matter in minority language media research? Reflections on/from an enquiry of digital language practices among Sorbian adolescents
This talk will present recent research on the online language practices of Sorbian-speaking adolescents (McMonagle 2019). Upper Sorbian is a west Slavic language that is recognised in the German state of Saxony and is classified as ‘definitely endangered’ (UNESCO n.d.). Minority language researchers (including this one) often hold the normative position of the potential of digital spaces to be multilingual spaces – after all, we must be concerned with how new media can help such endangered languages (Cormack 2013). In the context of this study, this position was challenged by participants who perceive the ‘multilingual’ internet to be a space for just ‘larger’ languages, such as German. While this perception is substantiated by an actual lack of Sorbian-language software and digital content, some participants also questioned the value of (hypothetically) developing digital domains for Sorbian. Confronted with such a mismatch in perceptions, this talk will include a reflection on the question: What, then, should be the role of minority language scholarship in disrupting digital monolingualism?
Pascal Belouin and Sean Wang - Max Planck Institute for the History of Science. RISE and SHINE: An API-based Infrastructure for Multilingual Textual Resources
Digital humanities (DH) as a field has been grappling with the significant issue of textual interoperability. DH tools and resources are often quite specialized and developed in relative isolation from each other. This results in the multiplication of “silos” where one particular tool will only be compatible with a limited number of textual resources. In this context, many have proposed that DH needs basic infrastructures behind research projects to ensure its long-term success: in Europe, for instance, CLARIN and DARIAH are two such large-scale research infrastructures for the humanities. While they have done a tremendous job in centralizing available digital resources, their generic coverage across the entire humanities means that their utility for smaller disciplines and non-European languages is limited.
Furthermore, while their focus on open-access resources should be lauded, many textual resources—notably in the Asian context—remain licensed and protected. In this lightning talk, we present our technical answers to this challenge. RISE (https://rise.mpiwg-berlin.mpg.de/) is a pioneering approach for resource dissemination and emerging data analytics, such as text mining and other fair-use but consumptive research techniques, in the humanities.
We developed RISE and its related API exchange format SHINE to facilitate the secure linkage of third-party research tools to various third-party textual collections. SHINE is a set of standardized APIs for exchanging textual resources, both open-access ones and protected (or licensed) ones that require authentication and authorization. RISE is a middleware that protects resource exchanges via SHINE. It authenticates and authorizes these exchanges, especially for protected (or licensed) resources. It is worth noting that SHINE, as an exchange format, could be adopted independently to facilitate interoperability without RISE. By designing a set of standardized APIs to link texts to digital research tools, we allow scholars to apply digital research tools to texts, regardless of their locations or formats.
In this lightning talk, we focus on how the SHINE exchange format could facilitate multilingual resource interoperability and multilingual DH research. First, we discuss SHINE’s API capabilities, which are designed to maximize the ease of implementation by resource providers and research tool developers. This provides end users (i.e., DH scholars and researchers) with easy mechanisms to select and combine textual resources in different languages and from multiple providers for analyses. Second, we discuss SHINE’s generic metadata schema, which is designed to ensure accommodation for multilingual resources, including functions like metadata inheritance, versioning for translations, an intra-resource language field, etc. Finally, we introduce a suite of free open-source software modules we developed that others could freely adopt and adapt for their own purposes to encourage third-party development of SHINE-compatible technical solutions.
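By way of illustration only (SHINE’s actual schema is not reproduced here, and every field and identifier below is hypothetical), a multilingual metadata record combining inheritance, versioning for translations and an intra-resource language field might be sketched in Python as:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SectionMeta:
    """Hypothetical intra-resource language field: one section, one language."""
    section_id: str
    language: str  # e.g. a BCP 47 tag such as "zh-Hant" or "en"

@dataclass
class ResourceMeta:
    """Hypothetical SHINE-style record (illustrative field names only)."""
    resource_id: str
    title: str
    language: str                         # primary language of the resource
    parent_id: Optional[str] = None       # metadata inheritance from a collection
    translation_of: Optional[str] = None  # versioning for translations
    sections: List[SectionMeta] = field(default_factory=list)

# A Chinese source text and an English translation, linked as versions:
source = ResourceMeta("res-001", "論語", "zh-Hant")
translation = ResourceMeta("res-001-en", "The Analects", "en",
                           translation_of="res-001")
print(translation.translation_of == source.resource_id)  # True
```

The point of such a schema is that a research tool can follow the `translation_of` and `parent_id` links to assemble parallel corpora across providers without knowing anything provider-specific in advance.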
Michael Castelle - University of Warwick. Multilingual Transformers: Linguistic Relativity for the 21st Century
Recent developments in neural architectures for natural language processing—specifically, the so-called Transformer architecture (Vaswani et al., 2017)—have inspired new models which, instead of being focused on a task for a single language (as in, e.g., sentiment classification) or a pair of languages (as in machine translation between, e.g., English and German), instead provide the ability to classify for, or translate between, multiple languages or pairs of languages. Examples include the 110 million-parameter mBERT (Devlin et al., 2019; Wu & Dredze, 2019) and Google’s 470 million-parameter “massively multilingual” M4 translation model (Aharoni et al., 2019; Bapna & Firat, 2019). This means that, unlike past approaches to multilingual classification or machine translation which might have transformed various languages into an intermediary ‘interlingua’ form (Richens, 1958) or approaches to classification which depended on an overt alignment to and from English (Upadhyay et al., 2016), these models incorporate knowledge about many dozens of languages in a shared, large, and somewhat opaque set of neural parameter weights.
However, contemporary NLP practitioners are well-known for their laser-like focus on achieving state-of-the-art (SOTA) results on specific standardized tasks (Church, 2017; Lipton & Steinhardt, 2018) and less well-known for reflecting on the potential implications of their models for extant theories of human language. In part, this reflects a disjunction between language as actually used and the subset of language represented in the data used by NLP, which historically takes the form of what linguistic anthropologist Michael Silverstein describes as the “referring and modally-predicating sentence-type” (Silverstein, 2014) (e.g., consider the first datum in the Penn Treebank: “Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29”). It is, of course, the case that much human language-in-action transcends the arena of referential and predicational text-sentences by virtue of any communicative act’s deep embeddedness within social and interactional context and interpretation (Jakobson, 1960), and hence even the most apparently spectacular and human-like achievements of Transformer-based NLP (such as OpenAI’s GPT-2 language generation model (OpenAI, 2019)) can still be denigrated as “[having] no idea what it is talking about” (Marcus, 2020).
But because these Transformer-based models dispense with the entire mainstream apparatus of 20th-century linguistics (namely, the assumptions of tree-structured syntax derived from Chomsky (1957)), they open up the question of which theory of grammar they might remotely correspond to. I argue that, by virtue of projecting sentences into a distinctively high-dimensional space, and by inducing different trajectories through that high-dimensional space depending on contextually-sensitive word embedding vectors, these models provide a new perspective on the question of linguistic relativity as depicted in the original formulations of Sapir (1921) and Whorf (1940). Sapir conceived of “linguistic relativity” in reference to Einstein’s four-dimensional special relativity, and for him it meant that the linguistic processes of thinking and speaking were analogous to paths along “thought-grooves” which differed for speakers of different languages; while Whorf analogized certain Hopi grammatical categories as “tensors” with uniquely transformative effects on thought. I propose to explore this intriguing correspondence as a way to bridge novel conceptions of multilingualism a century apart.
Leonore Lukschy - SOAS University of London. Making endangered languages archives linguistically accessible
Digital endangered languages archives hold vast amounts of data on spoken, signed and whistled endangered languages. Paradoxically, these archives, filled with hundreds of languages, are usually packaged in just one or sometimes two languages. That is, their interfaces, and the metadata that make it possible to access the data, are in English or another major language, constituting a linguistic barrier.
The communities whose languages and cultural practices are represented in such archives are therefore often unable to access the materials held therein. In an ideal world an archive’s interface would be available in every language represented in its collections. Even a watered-down version of this, namely an interface available in all major contact languages of the languages represented, is currently too expensive and not a feasible solution for most archives.
This talk will propose archive guide templates incorporating screenshots, which can easily be translated and made available in PDF form on the archives’ homepages. Users will thus be able to navigate an archive even if they are unfamiliar with the language or even the writing system of the interface.
In addition, this talk will propose best practices regarding making collections within an archive more linguistically accessible, by providing metadata in more than one language, and collection guides similar to the archive guides mentioned above. In order to overcome the written bias, as a next step, these guides should also be available in video form.
Ernesto Priani Saisó - Universidad Nacional Autónoma de México. Challenges of not using English as the dominant language in DH international projects
For many multilingual projects in Digital Humanities, the way to solve communication problems, to share results and to write papers is to use English as lingua franca, without considering the opportunities and the benefits of other ways.
In this lightning talk, I want to propose an alternative to the widespread use of English as a working language, drawing on my experience in the Oceanic Exchanges project.
The main goal of Oceanic Exchanges was to study the flow of news in nineteenth-century newspapers using and creating Digital Humanities tools. A central aspect of the project was its multilingualism, since the research team came from six countries: the United States, England, the Netherlands, Germany, Finland and Mexico. The corpus was extracted from eight national libraries and spanned five different languages: English, Spanish, German, Dutch and Finnish. In some specific cases, other languages were added, such as Russian, Swedish and French. Likewise, one of its objectives was to create a multilingual tool for data mining in a multilingual corpus.
Notwithstanding these characteristics, the problem of multilingualism, and the possibility of working in any language other than English, was never discussed within the team. How could we do this differently?
In one sentence, the alternative is to avoid the easy path.
If you are in a multilingual Project, consider this:
A) Be conscious. A multilingual project is not multilingual if you decide to work monolingually.
B) Enrich your work: Many academics are multilingual. Identify subgroups with a common language and promote the communication in those languages too.
C) Yes, translate. If you are planning to translate anything into English, consider translating it into as many languages as you are working in. Documents, results, tools…
D) And finally, a very important one: Publish in as many languages as possible.
Multilingual projects require more work. But this work is at the same time a recognition of the importance of the languages you are working in and a way to produce significant knowledge in them.
Carlos Yebra Lopez - New York University. How to Use Digital-Homelands in order to Revitalise Diasporic Languages
By virtue of their nature, diasporic languages often lack a unified geographical territory, which prevents their intergenerational transmission. However, in recent decades there has been a significant trend in the emergence of digital homelands (Held, 2010), i.e., virtual communities where the diasporic language in question is used as the only means of communication between users.
According to Held, virtual communities are not just mere spaces for communication, but they have the potential to become “a territory where a culture may be revitalized after having faced a state of severe decline”, which has also been referred to as a “kingdom of the word” (Shandler, 2004), or “a national language of nowhere” (idem).
This contemporary trend is crucial for the purpose of reversing language shift, as it fulfills the two key conditions envisaged by sociolinguist Joshua Fishman (1991) for the purpose of effectively countering language attrition: intergenerational transmission and diglossic, functional isolation.
By way of illustration, I will be discussing three case studies of Ladino-only online communities: Ladinoforever (Halphie, 2015-), Ladino 21 (Yebra, Acero, Aguado, 2017-) and uTalk's recent course in Ladino (2019-).
Isabelle Zaugg - Columbia University's Data Science Institute. Let's Talk About Scripts
When digital support for a language is lacking, speakers of that language are often forced to communicate digitally using a globally dominant language, like English, if they know one. A particular challenge awaits users of languages written in non-Latin scripts. While some, like Chinese, Japanese, and Korean, are well supported, with tools like predictive typing making tech use easy, even well-supported languages like Arabic still see a common trend: users switching not to another language, but to writing their language in Latin characters (a, b, c...), called transliteration. This lightning talk presents my research on script-switching in Amharic, the national language of Ethiopia. While Amharic’s native script is Ethiopic, more than 50% of Facebook users transliterate Amharic into Latin characters (Zaugg 2017). These trends, happening globally, have implications for fluency in native writing traditions, for the usability of tools for digitally-disadvantaged language speakers, and for ongoing “Romanization” movements that advocate for the Latin/Roman script as the most “modern,” “convenient,” and “neutral” script available, using its high level of digital support as evidence for this claim. This talk seeks to inspire the audience to think more deeply not just about the language constraints within the digital sphere, but about the script constraints, and how those are shaping our future towards homogeneity not only under English and other European colonial languages, but also under the Latin script. I will conclude with a brief exploration of the three-level meaning the Ethiopic script contains, and how these meanings cannot be fully captured through transliteration into Latin. This is an example of how much we have to lose by allowing the limitations of our tech to limit human communication, rather than actively designing tech to serve the diversity of linguistic and script needs.
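As a rough computational illustration (a sketch of my own, not the measurement method used in the study cited above), script-switching of this kind can be quantified by classifying characters by Unicode block; the core Ethiopic block occupies U+1200 to U+137F:

```python
def ethiopic_ratio(text: str) -> float:
    """Fraction of letter characters drawn from the core Ethiopic Unicode
    block (U+1200 to U+137F). 1.0 means fully Ethiopic script; 0.0 means
    fully transliterated (e.g. Latin)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    ethiopic = [c for c in letters if 0x1200 <= ord(c) <= 0x137F]
    return len(ethiopic) / len(letters)

print(ethiopic_ratio("ሰላም"))    # Amharic "selam" in Ethiopic script -> 1.0
print(ethiopic_ratio("selam"))  # the same word transliterated -> 0.0
```

Applied over a corpus of posts, a ratio like this would let one chart the prevalence of transliteration over time or across platforms, though a production study would also need to handle the Ethiopic Supplement and Extended blocks and mixed-script posts.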
Peter Chonka – King’s College London. Search as research in African indigenous languages: potentials and problematics of enquiry into auto-complete predictions/suggestions for Af Soomaaliga
This lightning talk will demonstrate my ongoing practical attempts to research and conceptualize the algorithmic power of Google search engine auto-complete 'predictions' (or suggestions) for search terms in African indigenous languages. This talk aims to respond to the workshop's General Theme, and specifically the question of 'how digital research practices reflect/enact linguistic and geocultural diversity'. I will present a particular phenomenon whereby the use of Somali orthography for names in search engines results in auto-complete predictions/suggestions relating to controversial markers of clan or 'tribal' identity. I will briefly explain why this is important in the context of the Somali digital media ecology, and outline my attempts to document and further test these processes. The presentation will demonstrate how digital research practices can productively leverage linguistic diversity to explore the ways in which non-English input languages are both potentially subject to the algorithmic power of dominant global tech companies, whilst also being somewhat 'free' of certain aspects of data-harvesting or censorship due to currently varying degrees of machine readability. The presentation will invite feedback and knowledge exchange on the practical aspects of research into search engine operation in diverse linguistic contexts, with a focus on Africa/the 'Global South'.
Anna Jørgensen - University of Amsterdam. Newswork on Wikipedia – the Case of the Coronavirus
Wikipedia is frequently visited for information about ongoing events, such as the COVID-19 pandemic. The different language versions of the encyclopedia are specific to a community of speakers rather than to a particular nation or location. In traditional news cultures, the proximity of an event to the consumers of the news increases the likelihood of the event being reported. For language versions with several centres of speakers, such as French and Spanish, proximity to the event becomes complicated. In this talk, I discuss to what extent traditional journalistic practices, and specifically event proximity, influence the reporting of current events on Wikipedia in the context of a global event, namely the COVID-19 pandemic. Furthermore, I explore whether Wikipedia’s emphasis on language community rather than geographical place influences which information is created and shared on the online platform. I present a quantitative, diachronic analysis of the revision histories of this event in diverse language versions as well as a close reading of the content on the pages.
Pedro Nilsson-Fernàndez - University College Cork, Ireland. Digital Peripheries: A Postcolonial Digital Humanities Approach to Catalan 20th Century Literary Spaces
The early Digital Humanities manifestos (2008, 2010) brought with them the promise of interdisciplinarity, transdisciplinarity, and multidisciplinarity – and most importantly, multilingualism. The reality a decade later is that in general terms, and despite the encouraging growing presence of Latin American and Iberian DH scholarship (Ortega 2014, Galina 2019), the field is still characterised by a heavy dominance of monolingual approaches.
In this lightning talk I will discuss how postcolonial approaches fostered by scholars such as Roopika Risam (2019) – who has highlighted the colonial violence embedded in current digital scholarship practices – can be used to remediate existing disruptions in the digital record. By extrapolating Michelle Lee Brown’s (Re)Mapping exercise of Never Alone to the GIS mapping of 20th Century Catalan writer Manuel de Pedrolo’s oeuvre, I will examine the ways in which digital literary cartography can effectively function not only as an exercise of (re)construction of national and literary spaces but also as a remediating and decolonising practice. The particular examples used in this talk will be GIS visualisations produced by geolocating sixteen Catalan novels – Manuel de Pedrolo’s crime fiction corpus (1953-1972) and his eleven-novel realist cycle Temps obert (1963-69) – in order to illustrate how the city of Barcelona can be examined as a digital palimpsest permeated by notions of collective cultural and linguistic trauma.
Elizabeth Marie Thaut - SOAS, University of London. Language documentation and description with(in) a digital diaspora: The Sylheti Project - SOAS in Camden
It is estimated that between 50% and 90% of the world’s 7,000+ languages will ‘die out’ before 2100 (Hale et al. 1992). Most of the world’s languages are insufficiently documented. The practice of language documentation and description (LDD) has traditionally involved a ‘lone-wolf’ linguist venturing into the field, in a far-off land, to spend several months at a time within an isolated community of speakers of the target language and ‘bring back’ records to be analysed in academia (a model that still dominates funding schemes today). However, there are few truly isolated minority language communities these days that don’t have internet connections through which text, images, and video can be shared on various social media platforms, which act as community language archives. Most speakers of so-called unwritten languages actually do write, using conventions learned from more dominant language(s). Since research is based on asking the right questions, trained linguists and amateur language enthusiasts can equally participate in the documentation and description of minority languages using social media, as per Himmelmann’s main language documentation features (2006:15).
The SOAS Sylheti Project (SSP) is an extracurricular, student-led language documentation project that began work with the Sylheti-origin users of the local Surma Community Centre in Camden, at the invitation of the Centre’s director, who noticed that the older immigrants weren’t speaking the ‘same language’ as the newer immigrants from Sylhet. Face-to-face linguistic documentation work with the local diaspora community in London, complemented with social media participants’ input and feedback, took on a connected international scope, with speakers in other diaspora communities, in South Asia and further around the globe, as well as in the homeland of Sylhet, participating ‘virtually’. Encouraging interdisciplinary collaboration, the SSP’s online component has helped to increase awareness of the status of Sylheti as the language that it is (and not mere ‘slang’ or a dismissible ‘dialect of Bangla/Bengali’), has gained a greater picture of dialectal variation within Sylheti (with the borderless mixing of social groups in diaspora and online), has played a role in the revitalisation of the endangered Sylheti/Siloti/Syloti Nagri script, has influenced spelling conventions, and more.
Cosima Wagner - Freie Universität Berlin. Challenging research infrastructures from a multilingual DH point of view – impulses from two workshops on non-Latin scripts
This lightning talk aims to bring in impulses from two past workshops on multilingual DH, held in 2018/2019 in Berlin and Utrecht with a special focus on non-Latin scripts, and proposes a next step: setting up a “requirement profile for multilingually enabled digital knowledge infrastructures”.
Within the discussion on digital monolingualism, non-Latin scripts (NLS) pose a particular challenge for all digital knowledge infrastructure environments, as software, information systems and infrastructure allow for the use of NLS only to a limited extent or often not at all: limitations in reproducing different scripts (writing directions left-to-right, right-to-left, top-to-bottom); discovery and retrieval problems due to search algorithms that are not optimized for non-Anglophone languages; missing mapping routines between different character systems, transcriptions, recognition of variants, tokenization; missing Unicode definitions for rare/complex characters; missing NLS-compatible tools for multilingual DH projects; and so on. Furthermore, sustainable expertise on digital tools for multilingual DH is often only developed in projects with a limited time frame, and research infrastructure providers like libraries, data centers or research IT departments are often equipped neither staff-wise nor technology-wise to take over and support all languages and disciplines.
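One concrete instance of the missing mapping routines named above, sketched here with nothing beyond the Python standard library, is that canonically equivalent Unicode strings can fail a naive exact-match search unless both sides are normalized first. Korean Hangul makes the point with a non-Latin script:

```python
import unicodedata

# The Korean syllable GAK (각) in two canonically equivalent encodings:
precomposed = "\uAC01"       # a single precomposed Hangul syllable codepoint
jamo = "\u1100\u1161\u11A8"  # the same syllable as three conjoining jamo

# A naive byte-for-byte comparison misses the match...
print(precomposed == jamo)  # False

# ...while normalizing both sides to NFC (or NFD) recovers it,
# which is exactly the kind of routine retrieval systems need:
print(unicodedata.normalize("NFC", jamo) == precomposed)  # True
```

A search index that normalizes neither queries nor documents will silently split such a corpus into invisible halves, one per encoding convention, which is one reason retrieval quality degrades for scripts whose input methods produce mixed forms.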
The talk summarizes the findings and networking activities of the NLS DH researchers brought together so far, and aims to build on these results, as well as on participants’ further input, by initiating the establishment of a “requirement profile” for multilingually enabled digital knowledge infrastructures with a special focus on non-Latin scripts.
In order to facilitate the set-up of the “requirement profile”, all workshop participants are invited to share their wish lists of dos and don’ts for multilingually – and especially NLS – enabled digital knowledge infrastructures in an Etherpad document, which will be prepared in advance and remain open for comments after the workshop.