Corpus description

The APU Writing and Reading Corpus 1979-1988

The main output of this project is a diachronic corpus of schoolchildren’s writings. The methodology builds on a cross-disciplinary approach to literacy development and corpus linguistics. The corpus has been developed in three steps. It currently consists of 522 individual files with school scripts (c. 93,000 words) and 21 files with basal readers (c. 79,000 words). We aim to make the corpus available shortly at the link below.

       https://apucorpus.liverpool.ac.uk (coming soon, with the APU User Manual)

(I) Data selection

The materials are based on the Assessment of Performance Unit (APU) surveys of language performance, carried out by the National Foundation for Educational Research (NFER). The APU writing surveys aimed at assessing pupils’ performance in different communicative situations, such as editing, describing, reporting, etc. (Gorman et al. 1991: 29). There are scripts by/for primary schoolchildren (Year-6 level, 11-year-olds) as well as by/for secondary schoolchildren (Year-11 level, 15-year-olds). This project focuses on the former age group (Year-6) and on two types of text with a long-standing tradition in UK schools, namely narration and argumentation. The importance of genres and the influence of the task on writing performance have been widely acknowledged in previous studies (e.g. Reppen 1994: 23–32, Gorman et al. 1991: 30–5), to the extent that “an essential knowledge of forms of texts is [considered] a prerequisite to full competence in writing” (Kress 1994: xiv). The rationale behind the selection of the younger age group and these two tasks lies in the observation that children start to be aware of genre differences in writing as early as age 8, “using linguistic features to distinguish between narrative tasks and expository tasks”, and that at age 11–12 they are “able to control a number of different types of writing tasks”, including “a distinct linguistic style for argumentative/persuasive writing” (see Biber et al. 2002: 460, Reppen 1994: 7). In addition, primary education has recently undergone important changes in the curriculum of English grammar teaching (see the National Curriculum statutory programmes). On the practical side, the narration and argumentation tasks can be compared with other children's corpora (e.g. Oxford Children's Corpus; Reppen 1994).

The APU Writing and Reading Corpus consists of two major components: writings by children ("school scripts") and writings for children ("basal readers"), in line with work by Biber and associates (Reppen 1994, Biber et al. 2002). The former will help us to identify the range of lexical and grammatical features that are (fully or partially) mastered at Year-6 level; the latter will signal what linguistic features this age group tends to be exposed to and/or presented with as linguistic models. The "school scripts" component has been collected from a representative sample of the APU writing surveys from 1979-1988. The "basal readers" component has been compiled from a parallel sample of the APU reading surveys and some supplementary materials. The school scripts have been stratified by (i) time period (1979, 1988), (ii) communicative function of the writing task (argumentative/persuasive, narrative/descriptive), and (iii) pupil’s sex (male, female).
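
The three stratification variables above define eight sampling cells (2 periods x 2 task types x 2 sexes). The following Python lines are only a minimal sketch of how such cells could be enumerated and how scripts could be counted per cell. The function names and the dictionary keys 'period', 'task' and 'sex' are hypothetical and are not taken from the project's own tools.

```python
from itertools import product

# Stratification variables as named in the corpus description.
PERIODS = (1979, 1988)
TASKS = ("argumentative/persuasive", "narrative/descriptive")
SEXES = ("male", "female")


def strata():
    """Return every (period, task, sex) cell of the sampling design."""
    return list(product(PERIODS, TASKS, SEXES))


def count_by_stratum(scripts):
    """Count scripts per cell.

    `scripts` is an iterable of dicts with the hypothetical keys
    'period', 'task' and 'sex'; cells with no scripts are kept at 0.
    """
    counts = {cell: 0 for cell in strata()}
    for script in scripts:
        counts[(script["period"], script["task"], script["sex"])] += 1
    return counts
```

A table of this kind makes it easy to check whether any cell of the design is empty or under-represented before comparisons across periods are made.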

(II) Data transcription

The original materials are currently stored in hard copy at the University of Liverpool. The first stage involved scanning the scripts to PDF format. The scanned scripts have been transcribed manually in raw text format with a text editor such as Notepad++. An XML-TEI tagged version has been derived from the raw texts following standard corpus-transcription procedures, to support accurate analysis and statistical comparisons. This involves:

  1. conversion to XML according to the standard Text Encoding Initiative guidelines (with oXygen editor, TEI Lite);
  2. creation of a parallel text normalised into modern spelling, as a pre-tagging stage;
  3. morphological tagging by part-of-speech (CLAWS7 tagset and manual spot-checks);
  4. semantic tagging (USAS).

Steps 3 and 4 have been performed with the web-corpus tool W‑Matrix.
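
As an illustration of step 1 only, the sketch below wraps a raw transcription in a minimal TEI-style XML skeleton using Python's standard library. It is not the project's actual workflow, which relies on the oXygen editor. The function name `to_tei`, the identifier `script-001` in the test data, the placeholder header texts and the choice to treat blank lines as paragraph breaks are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace
ET.register_namespace("", TEI_NS)       # serialise it as the default namespace


def _el(parent, tag, text=None):
    """Append a TEI-namespaced child element and return it."""
    child = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
    if text is not None:
        child.text = text
    return child


def to_tei(script_id, raw_text):
    """Wrap a raw transcription in a minimal TEI skeleton (hypothetical helper).

    Blank lines in `raw_text` are treated as paragraph breaks; this is an
    assumption for the sketch, not a documented transcription convention.
    """
    root = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = _el(_el(root, "teiHeader"), "fileDesc")
    _el(_el(file_desc, "titleStmt"), "title", script_id)
    _el(_el(file_desc, "publicationStmt"), "p", "Placeholder publication note")
    _el(_el(file_desc, "sourceDesc"), "p", "Transcribed from a scanned script")
    body = _el(_el(root, "text"), "body")
    for para in raw_text.split("\n\n"):
        if para.strip():
            # Collapse internal line breaks and runs of spaces.
            _el(body, "p", " ".join(para.split()))
    return ET.tostring(root, encoding="unicode")
```

The original spelling is left untouched here; in line with step 2, a normalised version would be kept as a separate parallel text.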

The corpus metadata has been fully documented in a purpose-designed MS Access relational database, including fields related to the script, pupil, school, and transcription stage process. The scripts' metadata will be displayed in the online corpus.
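
To give a rough idea of what a relational layout for script, pupil and school metadata might look like, the sketch below uses SQLite (through Python's built-in `sqlite3` module) as a stand-in for MS Access. All table names, column names and the example query are hypothetical; the actual database fields are not listed in this description.

```python
import sqlite3

# Hypothetical schema: one school has many pupils, one pupil has many scripts.
SCHEMA = """
CREATE TABLE school (
    school_id INTEGER PRIMARY KEY,
    region    TEXT
);
CREATE TABLE pupil (
    pupil_id  INTEGER PRIMARY KEY,
    school_id INTEGER NOT NULL REFERENCES school(school_id),
    sex       TEXT CHECK (sex IN ('male', 'female'))
);
CREATE TABLE script (
    script_id           INTEGER PRIMARY KEY,
    pupil_id            INTEGER NOT NULL REFERENCES pupil(pupil_id),
    survey_year         INTEGER,
    task                TEXT,
    transcription_stage TEXT
);
"""


def build_db():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def scripts_by_year_and_sex(conn):
    """Return (survey_year, sex, number_of_scripts) rows, sorted."""
    return conn.execute(
        """
        SELECT s.survey_year, p.sex, COUNT(*)
        FROM script AS s
        JOIN pupil AS p ON p.pupil_id = s.pupil_id
        GROUP BY s.survey_year, p.sex
        ORDER BY s.survey_year, p.sex
        """
    ).fetchall()
```

Keeping pupil and school details in their own tables means that each script record only stores a reference to them, so anonymised identifiers can be used throughout.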

(III) Web-based application/interface

The data has been migrated to a web-based application that displays the digitised images and the XML text side by side in browse and search layouts, for both the "school scripts" and the "basal readers" components. The interface has a user-friendly design, suitable for academic and non-academic audiences. It offers browse, search and download functions.

The web application will be accessed via the project’s webpage (see link above). The corpus will be licence-free, but users will be asked to register by signing a ‘user agreement form’, whereby they will comply with copyright, anonymity and confidentiality restrictions (see Ethical considerations below).

The APU Writing Corpus has a number of advantages: (a) it is freely available, grammatically and semantically tagged; (b) it includes a comparable set of ‘basal readers’, which are often seen as indicators of the 'standards' of writing performance to which children are exposed (Reppen 2001); (c) its historical dimension allows for comparisons of writing performance before and after the National Curriculum.


Project timeline

2014, October-November: Selection of scripts for digitisation.

2015: (a) Transcription of 522 school scripts (c. 93,000 words): untagged TXT version, normalised TXT version, TEI-XML version. (b) Transcription of 21 basal readers (c. 79,000 words): untagged TXT version, TEI-XML version. (c) Online interface setup.

2016, January-March: (a) Pupil scripts: part-of-speech and semantic tagged versions. (b) Basal readers: part-of-speech and semantic tagged versions. (c) Online interface development.

2016, April-September: Online interface testing and enhancement.

2017, February: Multi-dimensional analyses of the APU school scripts based on the Multidimensional Analysis Tagger (MAT, Nini 2015); in particular, the APU story task in relation to Biber's Dimension 2 Narrative vs. Non-Narrative Concerns, and the APU rule task in relation to Dimension 4 Overt Expression of Persuasion.

Ethical considerations

We attend to ethical considerations in compliance with general practice in applied linguistic studies (see, for instance, the British Educational Research Association ethical recommendations). For the purposes of our project, these concern copyright ownership of the APU materials and data storage.

  1. Copyright ownership of all the APU materials rests with the University of Liverpool, where they have been stored since the 1990s. González-Díaz has already signed an agreement with Prof. Greg Brooks, Emeritus Professor at the University of Sheffield and former member of the APU team, and with Dr. Chris Whetton, Deputy Director of NFER at the time of the APU surveys and Head of the Department of Assessment and Measurement there until 2013. This agreement ensures that all team members will maintain the original privacy undertaking: no participating child will be identified, and participants’ anonymity and confidentiality will be preserved at all times in reports and publications, in compliance with the Data Protection Act.
  2. All printed materials of the APU surveys will be safely filed at the University of Liverpool, and all electronic data will be safely archived on encrypted university computers.

