Texts of all kinds are fundamental in the study of many disciplines, and much effort has been devoted to creating reliable, accurate and readable versions of texts, both in print and in the digital world. Editions of texts that are published online can have much greater interactivity, and can often be searched and displayed in new ways, opening up the study of text through computer analysis, or showing features of the text not easily displayed in traditional print editions.
To discover what text analysis tools can do to reveal information from groups of texts, you should begin by looking at Voyant Tools. Loading one of the predefined sets of texts (corpora) will produce basic statistics on word counts, word contexts (collocations), word trends across documents, and other useful metrics, giving you a starting point for exploring texts in new ways.
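The kinds of statistics such tools report can be illustrated with a short sketch using only the Python standard library; the two tiny sample documents are invented for illustration, and a real analysis would of course load full texts.

```python
from collections import Counter

# Two tiny invented documents standing in for a real corpus.
documents = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the lazy dog sleeps while the quick fox runs",
}

# Word counts per document, one of the basic statistics such tools report.
counts = {name: Counter(text.split()) for name, text in documents.items()}

# A very simple notion of collocation: words appearing immediately
# after a target word.
def collocates(text, target):
    words = text.split()
    return Counter(w2 for w1, w2 in zip(words, words[1:]) if w1 == target)

print(counts["doc1"].most_common(1))          # most frequent word in doc1
print(collocates(documents["doc1"], "the"))   # words following "the"
```

Real tools add normalisation (case-folding, punctuation stripping, stop-word lists) on top of this basic counting, but the underlying metrics are of this kind.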
A key tool in the use of digital text is the encoding standard TEI (the Text Encoding Initiative guidelines). Using it ensures that texts you create for analysis are standardised and can be exchanged readily with other researchers; there are a number of repositories of freely available TEI texts that can be downloaded for analysis.
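To show why standardised markup matters in practice, the following sketch parses a minimal TEI document with Python's standard library. The TEI namespace URI is the real one, but the sample document itself is invented and far simpler than a genuine edition; because the tags are standardised, the same queries work on any TEI-conformant file.

```python
import xml.etree.ElementTree as ET

# A minimal invented TEI document; real editions use a much richer tag set.
tei_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Edition</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Born digital.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The first paragraph of the <hi rend="italic">transcribed</hi> text.</p>
    </body>
  </text>
</TEI>"""

# TEI elements live in this namespace, so queries must declare it.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei_xml)

# Standardised markup means the title and body text can be extracted
# the same way from any conformant file.
title = root.find(".//tei:titleStmt/tei:title", NS).text
body_text = "".join(root.find(".//tei:body/tei:p", NS).itertext())
print(title)
print(body_text)
```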
Creating digital texts, whether of historical documents, literary works, or even documentation and manuals, benefits from standardised text formats, which allow texts to be processed into multiple outputs, editions or visualisations. The practice of digital editing and the production of digital editions are well documented.
Note that many individual research projects have their own encoding guidelines; these often contain customisations or use a reduced tag set, and so may not give a complete understanding of TEI.
Digital editions vary widely in scope and approach, and if you are learning digital editing or creating your own edition, it is worthwhile examining other editions outside your specific field of interest. The links below give some examples of innovative and well-designed editions, as well as catalogues of editions available to the public.
Analysing language to understand its structure, features, style and authorship can be achieved using a number of tools, and was one of the earliest examples of the use of computers in humanities research, with the work of Fr. Roberto Busa in the late 1940s. Modern techniques used in corpus linguistics, text mining, document categorisation and cluster analysis can give insight into the content and structure of texts.
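One long-established technique in this area is stylometry: comparing texts by the relative frequencies of common function words, which tend to reflect authorial habit rather than subject matter. The sketch below, with invented sample texts and a deliberately short word list, computes such profiles and measures their similarity with the cosine measure; serious work would use much longer word lists and proper tokenisation.

```python
import math
from collections import Counter

# A deliberately short illustrative list; real stylometric studies
# use dozens or hundreds of function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two frequency vectors (0.0 to 1.0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented sample texts: a and b share everyday English function words,
# c uses none of them.
text_a = "the cat sat on the mat and the dog barked at the cat"
text_b = "the dog and the cat ran to the house in the rain"
text_c = "quantum fields propagate through spacetime without pause"

sim_ab = cosine(profile(text_a), profile(text_b))
sim_ac = cosine(profile(text_a), profile(text_c))
print(sim_ab)   # similar function-word profiles score near 1.0
print(sim_ac)   # no shared function words scores 0.0
```

Clustering documents by such profiles, or by full word-frequency vectors, is the basis of the document-categorisation and cluster-analysis techniques mentioned above.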
Digital Publishing is, in essence, the replacement of print formats with digital variants, and it has no formal boundaries: as long as it exists in code, everything from a text message to a weekly blog can be considered an example of Digital Publishing. It offers methods and scales of distribution that are logistically impossible within the limitations of traditional print media, and it carries the potential for exponential growth in consumption because of the inherently collaborative ways in which digital media are accessed. A single person can share material at no extra cost with their friends and followers, who may in turn share it again, each act producing further reproductions of the original.
Digital Publishing is also unique in that, in most of its variants, it is endlessly malleable. A text can be altered at the click of a button, by writer and reader alike; alternatively, formats such as PDF offer a degree of fixity for those concerned to preserve the authorial integrity associated with print. Beyond the act of publishing itself, digitised versions of published material go beyond the constraints of their physical counterparts: the underlying code supports a much richer reading experience, through hyperlinking, embedded video, instantaneous translation and real-time updates. The benefits of digital over physical publishing are substantial and constantly evolving.
There are a wide range of corpora (collections of texts) available for search, download or reuse as the subject of textual analysis. They vary from simple plain-text archives, such as Project Gutenberg, to complex, heavily marked-up corpora such as the British National Corpus.
For many applications in textual analysis, it is essential to have texts that are marked up systematically and consistently, and that have been reliably edited, usually with the involvement of an editorial team so that errors missed by one editor are caught by another.