Texts of all kinds are fundamental in the study of many disciplines, and much effort has been devoted to creating reliable, accurate and readable versions of texts, both in print and in the digital world. Editions of texts that are published online can have much greater interactivity, and can often be searched and displayed in new ways, opening up the study of text through computer analysis, or showing features of the text not easily displayed in traditional print editions.
To discover what text analysis tools can do to reveal information from groups of texts, you should begin by looking at Voyant Tools. Loading one of the predefined sets of texts (corpora) will produce basic statistics on word counts, word contexts (collocations), word trends across documents, and other useful metrics, giving you a starting point for exploring texts in new ways.
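The kinds of statistics such tools report can be illustrated with a short sketch using only the Python standard library; the two tiny sample documents are invented for illustration, and a real analysis would of course load full texts.

```python
from collections import Counter

# Two tiny invented documents standing in for a real corpus.
documents = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the lazy dog sleeps while the quick fox runs",
}

# Word counts per document, one of the basic statistics such tools report.
counts = {name: Counter(text.split()) for name, text in documents.items()}

# A very simple notion of collocation: words appearing immediately
# after a target word.
def collocates(text, target):
    words = text.split()
    return Counter(w2 for w1, w2 in zip(words, words[1:]) if w1 == target)

print(counts["doc1"].most_common(1))          # most frequent word in doc1
print(collocates(documents["doc1"], "the"))   # words following "the"
```

Real tools add normalisation (case-folding, punctuation stripping, stop-word lists) on top of this basic counting, but the underlying metrics are of this kind.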
A key tool in the use of digital text is the encoding standard TEI (the Text Encoding Initiative guidelines). Using it ensures that texts you create for analysis are standardised and can be exchanged readily with other researchers; there are a number of repositories of freely available TEI texts that can be downloaded for analysis.
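To show why standardised markup matters in practice, the following sketch parses a minimal TEI document with Python's standard library. The TEI namespace URI is the real one, but the sample document itself is invented and far simpler than a genuine edition; because the tags are standardised, the same queries work on any TEI-conformant file.

```python
import xml.etree.ElementTree as ET

# A minimal invented TEI document; real editions use a much richer tag set.
tei_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Edition</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Born digital.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The first paragraph of the <hi rend="italic">transcribed</hi> text.</p>
    </body>
  </text>
</TEI>"""

# TEI elements live in this namespace, so queries must declare it.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei_xml)

# Standardised markup means the title and body text can be extracted
# the same way from any conformant file.
title = root.find(".//tei:titleStmt/tei:title", NS).text
body_text = "".join(root.find(".//tei:body/tei:p", NS).itertext())
print(title)
print(body_text)
```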
Creating digital texts, whether of historical documents, literary works, or even documentation and manuals, benefits from standardised text formats, which allow texts to be processed into multiple outputs, editions or visualisations. The practice of digital editing and the production of digital editions are well documented.
Note that many individual research projects have their own encoding guidelines; these often contain customisations or use a reduced tag set, and so may not give a complete understanding of TEI.
Digital editions vary widely in scope and approach, and if you are learning digital editing or creating your own edition, it is worthwhile examining other editions outside your specific field of interest. The links below give some examples of innovative and well-designed editions, as well as catalogues of editions available to the public.
Analysing language to understand its structure, features, style and authorship can be achieved using a number of tools, and was one of the earliest examples of the use of computers in humanities research, with the work of Fr. Roberto Busa in the late 1940s. Modern techniques used in corpus linguistics, text mining, document categorisation and cluster analysis can give insight into the content and structure of texts.
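One long-established technique in this area is stylometry: comparing texts by the relative frequencies of common function words, which tend to reflect authorial habit rather than subject matter. The sketch below, with invented sample texts and a deliberately short word list, computes such profiles and measures their similarity with the cosine measure; serious work would use much longer word lists and proper tokenisation.

```python
import math
from collections import Counter

# A deliberately short illustrative list; real stylometric studies
# use dozens or hundreds of function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two frequency vectors (0.0 to 1.0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented sample texts: a and b share everyday English function words,
# c uses none of them.
text_a = "the cat sat on the mat and the dog barked at the cat"
text_b = "the dog and the cat ran to the house in the rain"
text_c = "quantum fields propagate through spacetime without pause"

sim_ab = cosine(profile(text_a), profile(text_b))
sim_ac = cosine(profile(text_a), profile(text_c))
print(sim_ab)   # similar function-word profiles score near 1.0
print(sim_ac)   # no shared function words scores 0.0
```

Clustering documents by such profiles, or by full word-frequency vectors, is the basis of the document-categorisation and cluster-analysis techniques mentioned above.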
Digital Publishing is, in essence, the replacement of print formats with digital variants, and it has no formal boundaries: as long as it exists in code, everything from a text message to a weekly blog can be considered an example of Digital Publishing. It offers methods and scales of distribution that are logistically impossible within the limitations of traditional print media, and it carries the potential for exponential growth in consumption because of the inherently collaborative ways in which digital media are accessed. A single person can share material at no extra cost with their friends and followers, who may in turn share it again, each act producing further reproductions of the original.
Digital Publishing is also unique in that, in most of its variants, it is endlessly malleable. A text can be altered at the click of a button, by writer and reader alike; alternatively, formats such as PDF offer a degree of fixity for those concerned to preserve the authorial integrity associated with print. Beyond the act of publishing itself, digitised versions of published material go beyond the constraints of their physical counterparts: the underlying code supports a much richer reading experience, through hyperlinking, embedded video, instantaneous translation and real-time updates. The benefits of digital over physical publishing are substantial and constantly evolving.
There are a wide range of corpora (collections of texts) available for search, download or reuse as the subject of textual analysis. They vary from simple plain-text archives, such as Project Gutenberg, to complex, heavily marked-up corpora such as the British National Corpus.
For many applications in textual analysis, it is essential to have texts that are marked up systematically and consistently, and that have been reliably edited, usually with the involvement of an editorial team so that errors missed by one editor are caught by another.