Long-term Access and Usage of Deeply Annotated Information
The management and archiving of digital research data is a field where linguistics, library and information science (LIS), and computer science overlap. These disciplines cooperate in the LAUDATIO project, whose name is an abbreviation for Long-term Access and Usage of Deeply Annotated Information. The project is funded by the German Research Foundation from 2011 to 2018. The project partners are the departments of Corpus Linguistics and Historical Linguistics and the Computer and Media Service (CMS) at Humboldt-Universität zu Berlin, together with the National Institute for Research in Computer Science and Control (INRIA France); they cooperate with the Berlin School of Library and Information Science (BSLIS).
LAUDATIO aims to build an open access research data repository for historical linguistic data that meets the requirements of historical corpus linguistics. To support access to and (re-)use of historical linguistic data, the LAUDATIO repository uses a flexible documentation schema based on a subset of TEI, customized with TEI ODD. The extensive metadata schema records the preparation and checking methods applied to the data; the tools, formats and annotation guidelines used in the project; bibliographic metadata; and information on the research context (e.g. the research project). To provide complex and comprehensive search over the linguistic annotation data, the linguistic search and visualization tool ANNIS will be integrated into the LAUDATIO repository infrastructure.
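As a rough illustration of the kinds of information listed above, the following Python sketch assembles a minimal TEI header. It is not the LAUDATIO schema: the function name build_header, its parameters, and the choice of which TEI elements carry which information are assumptions made for this example, and the result is not validated against any ODD customization.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_header(title, distributor, licence, source_note, project_note, guidelines_note):
    """Build a minimal TEI header (hypothetical layout, not the LAUDATIO schema)."""
    # Serialize TEI elements without a namespace prefix.
    ET.register_namespace("", TEI_NS)

    def tei(tag):
        return f"{{{TEI_NS}}}{tag}"

    header = ET.Element(tei("teiHeader"))

    # Bibliographic metadata: title, distribution and licence, source texts.
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("distributor")).text = distributor
    availability = ET.SubElement(publication, tei("availability"))
    ET.SubElement(availability, tei("licence")).text = licence
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = source_note

    # Research context and annotation guidelines.
    encoding_desc = ET.SubElement(header, tei("encodingDesc"))
    project_desc = ET.SubElement(encoding_desc, tei("projectDesc"))
    ET.SubElement(project_desc, tei("p")).text = project_note
    editorial_decl = ET.SubElement(encoding_desc, tei("editorialDecl"))
    ET.SubElement(editorial_decl, tei("p")).text = guidelines_note

    return header


if __name__ == "__main__":
    # All values below are placeholders, not real corpus metadata.
    example = build_header(
        title="Example historical corpus",
        distributor="Example repository",
        licence="Creative Commons licence (placeholder)",
        source_note="Digitized from printed editions (placeholder).",
        project_note="Built to study diachronic change (placeholder).",
        guidelines_note="Annotated according to project guidelines (placeholder).",
    )
    print(ET.tostring(example, encoding="unicode"))
```

A real customization would constrain these elements through the ODD specification and add fields for tools, formats and checking methods, which this sketch leaves out.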
All corpora in LAUDATIO are available open access under a Creative Commons license. All researchers from the academic disciplines of Linguistics and Historical Linguistics can use and re-use the corpora. For example, they can:
- display corpora,
- search corpora,
- download corpora,
- upload new annotations to an existing corpus,
- and upload new corpora.
What is meant by research data in historical corpus linguistics?
Research data in historical corpus linguistics consists of digitized text material, transcriptions of historical texts and linguistic annotation. Together these form a linguistic corpus. Such data is varied and often idiosyncratic, depending on the research question and the language investigated, e.g. standard versus non-standard varieties of a language. Linguistic annotations of all kinds, such as morpho-syntactic, syntactic, pragmatic and semantic annotation, are regarded as an essential part of a corpus in their own right. However, a corpus built for one research question may be used for another one as well. A collection of historical corpora from different periods may be even more useful, for example in order to trace diachronic language change over a long period. For this academic discipline, it would be a big advantage if such corpora, however they were prepared, were accessible and (re-)usable for all corpus linguists dealing with similar historical language data.
Metadata and Research Data Documentation
Detailed and complete documentation of the preparation and annotation process, including a bibliographic list of the digitized texts, is an important step towards improving the quality and thus the usability of historical linguistic corpus data. Such documentation includes a description of the corpus design, proper references to the authors and editors of the corpus and of the annotation, and the formats and methods of quality assurance used, such as measuring Inter-Coder Agreement. Thus, the documentation covers document description as well as object description. This information is also essential for (re-)use of the linguistic digital research data, whether to add annotations or new texts to existing corpora or to use them as a template when building a new corpus. The required information will be provided by customizing a TEI Header with an ODD specification along with a RELAX NG schema. Metadata must capture the entire life cycle of historical linguistic data.
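The text names Inter-Coder Agreement as a quality assurance method but does not say which measure is used. As one common possibility, the sketch below computes Cohen's kappa for two annotators who labelled the same tokens. The function name cohens_kappa and the part-of-speech labels in the usage example are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical part-of-speech tags assigned by two annotators.
    annotator_a = ["NN", "VV", "NN", "ADJ"]
    annotator_b = ["NN", "VV", "ADJ", "ADJ"]
    print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. Whatever measure a project chooses, recording it in the metadata lets later users judge how reliable an annotation layer is.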