How to find what isn’t (apparently) there: the secret life of large corpora

Authors: Christopher J. Pountain, Queen Mary, University of London; Isabel García Ortiz, Queen Mary, University of London.
Session: Explorations in the Stories of Words (10:00-12:00, Wednesday 13 June 2018)

A well-known limitation of corpus linguistic studies is that a corpus ‘can show nothing more than its own contents’ (Hunston 2002:22; see also Cheng 2012:175). (Indeed, we should also note that, since the decision about the contents of a corpus rests with its designer, a corpus shows us no more than its designer wants us to see.) Exploitation of a corpus therefore requires a considerable degree of what may be termed philological mediation, some of the dimensions of which are identified in Pountain (2011). In this paper we shared some of the methodological challenges encountered in using the Corpus del español (CDE) and the Corpus diacrónico del español (CORDE) for a variety of research projects on the history of Spanish. We first looked at the task of ‘finding what isn’t there’: the omission of the complementiser que, which enjoyed a certain fashion in the 15th and 16th centuries (reported in Pountain 2014). We then turned our attention to deducing information about the usage of particular words in different linguistic registers in the 19th and 20th centuries, which may give crucial insights into the downwards diffusion of ‘learnèd words’ (cultismos) in recent times. Finally, we briefly explored the possibility of discriminating between speakers’ passive and active knowledge of their language in the 20th century. We conclude that critical philological mediation of the data available from corpora does allow interesting conclusions to be drawn from what at first seems rather unpromising material.

