HMT utilities, version 0.3.0 > Tokenization of Greek texts >

Components shared by all tokenization systems

Input and output

The tokenization process applies to passages of texts identified by a CTS URN. The tokenizers support two forms of input:

  1. a string of text to parse, accompanied by the passage's URN, and optionally by a CITE URN identifying the kind of context
  2. a delimited text file representing the full OHCO2 model of a text as defined in the hocuspocus library (http://cite-architecture.github.io/hocuspocus/)

In both cases, the output is an ordered list of token analyses. Each token analysis has two parts: a CTS URN identifying the token as a substring of the passage, and a CITE URN classifying the type of the token.
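As a rough illustration of this output shape, the following Python sketch models a token analysis as a pair of URN strings. The class name `TokenAnalysis` and its field names are hypothetical, chosen only for this example; they are not taken from the HMT utilities themselves.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TokenAnalysis:
    """Hypothetical model of one token analysis (names are assumptions)."""
    token_urn: str  # CTS URN with a substring subreference
    type_urn: str   # CITE URN classifying the token


# An ordered list of analyses, using URN values that appear in this document.
analyses: List[TokenAnalysis] = [
    TokenAnalysis("urn:cts:greekLit:tlg0012.tlg001.msA:11.3@προΐαλλε",
                  "urn:cite:hmt:tokentypes.lexical"),
    TokenAnalysis("urn:cts:greekLit:tlg0012.tlg001.msA:11.3@θοὰς",
                  "urn:cite:hmt:tokentypes.lexical"),
]
```

Keeping the list ordered matters because the sequence of analyses mirrors the sequence of tokens in the passage.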

Parsing strings of texts

At a minimum, a request to tokenize a string of text must supply the CTS URN of the passage and the text content to analyze. Optionally, the request can also identify the type of context being analyzed with a CITE URN. This is either one of the values urn:cite:hmt:tokentypes.lexical, urn:cite:hmt:tokentypes.numeric, urn:cite:hmt:tokentypes.waw, or urn:cite:hmt:tokentypes.sic, or a CITE URN in either of the two collections urn:cite:hmt:place or urn:cite:hmt:pers. If the context is not identified with one of these URNs, the content of XML text nodes is treated by default as a GreekString object, which the greeklang library can split into lexical tokens.
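The rule for recognized context URNs can be sketched as a simple membership check. This is a minimal sketch, not the library's implementation: the function name `is_recognized_context` is hypothetical, and it assumes that the personal-name collection is urn:cite:hmt:pers (the form used in the examples in this document) and that objects in a collection are written as `collection.object`.

```python
# The four token-type values listed above.
TOKEN_TYPES = frozenset({
    "urn:cite:hmt:tokentypes.lexical",
    "urn:cite:hmt:tokentypes.numeric",
    "urn:cite:hmt:tokentypes.waw",
    "urn:cite:hmt:tokentypes.sic",
})

# Assumed collection prefixes; the trailing "." separates collection and object.
COLLECTION_PREFIXES = ("urn:cite:hmt:place.", "urn:cite:hmt:pers.")


def is_recognized_context(urn: str) -> bool:
    """Return True if the URN is a listed token type or belongs to one of
    the two collections (hypothetical helper, for illustration only)."""
    return urn in TOKEN_TYPES or urn.startswith(COLLECTION_PREFIXES)
```

A caller could use such a check to decide whether to fall back to the default treatment of the text as a GreekString.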

Examples

If we parse the string of characters προΐαλλε θοὰς ἐπι νῆας from the text passage urn:cts:greekLit:tlg0012.tlg001.msA:11.3, we get an ordered list of 4 tokens.

Token string
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@προΐαλλε
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@θοὰς
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@ἐπι
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@νῆας

Their types are all the same:

Type
urn:cite:hmt:tokentypes.lexical
urn:cite:hmt:tokentypes.lexical
urn:cite:hmt:tokentypes.lexical
urn:cite:hmt:tokentypes.lexical
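The example above can be approximated in a few lines of Python. This is a sketch under stated assumptions: `tokenize_string` is a hypothetical function, plain whitespace splitting stands in for the greeklang GreekString tokenization, and the sketch assumes that a supplied context URN is used directly as the type of every resulting token.

```python
from typing import List, Optional, Tuple

LEXICAL = "urn:cite:hmt:tokentypes.lexical"


def tokenize_string(passage_urn: str, text: str,
                    context_urn: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return (token URN, type URN) pairs in text order.

    Hypothetical helper: whitespace splitting is a simplification of
    the real tokenizer, and no subreference index is added.
    """
    # Assumption: without an explicit context, tokens are lexical.
    token_type = context_urn if context_urn is not None else LEXICAL
    return [(f"{passage_urn}@{token}", token_type) for token in text.split()]


pairs = tokenize_string("urn:cts:greekLit:tlg0012.tlg001.msA:11.3",
                        "προΐαλλε θοὰς ἐπι νῆας")
for token_urn, type_urn in pairs:
    print(token_urn, type_urn)
```

Passing a third argument such as urn:cite:hmt:pers.pers8 would, under the same assumption, assign that URN as the type instead of the lexical default.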

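If we parse the string Ζεὺς using the same CTS URN, and specify the context with the CITE URN urn:cite:hmt:pers.pers8, this is analyzed as a single token, urn:cts:greekLit:tlg0012.tlg001.msA:11.3@Ζεὺς, with the type urn:cite:hmt:pers.pers8.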

By default, requests to tokenize a string do not include explicit index values on the CTS URN substrings, because the string being analyzed is often not an entire citable node of text. Explicit subreference indexing can optionally be included.

tis t'ar sfwe qewn eridi cunehke maxesqai;

Parsing delimited text files

The tabular representation of the HMT project editions preserves the full XML markup of the archival TEI documents, so the HMT tokenizers can identify the appropriate context from the HMT project's markup conventions. They can therefore formulate requests with both the CTS URN for the text passage, and a CITE URN classifying the context. Analyses of delimited text files always include explicit indexes on CTS URN subreferences.
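One way to picture explicit indexing is to count how many times each token string has already occurred in the passage. The sketch below is an illustration only: `index_tokens` is a hypothetical function, and counting repeated tokens is a simplification, since a CTS subreference index counts occurrences of a substring in the passage text, which may also appear inside longer words.

```python
from collections import Counter
from typing import List


def index_tokens(passage_urn: str, tokens: List[str]) -> List[str]:
    """Append an explicit [n] index to each token's subreference,
    where n is the running count of that token string so far
    (hypothetical helper; a token-level approximation)."""
    seen: Counter = Counter()
    indexed = []
    for token in tokens:
        seen[token] += 1
        indexed.append(f"{passage_urn}@{token}[{seen[token]}]")
    return indexed


urns = index_tokens("urn:cts:greekLit:tlg0012.tlg001.msA:11.3",
                    ["Ζεὺς", "δ'", "Ἔριδα"])
```

With this approach a token that occurs twice in the same passage would receive the indexes [1] and [2], which keeps the two URNs distinct.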

Examples

Tokenizing this data file parses its XML text node into a list of 8 tokens.

Token string
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@Ζεὺς[1]
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@δ'[1]
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@Ἔριδα[1]
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@προΐαλλε[1]
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@θοὰς[1]
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@ἐπι[1]
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@νῆας[1]
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@Ἀχαιῶν[1]

Their types are:

Types
urn:cite:hmt:pers.pers8
urn:cite:hmt:tokentypes.lexical
urn:cite:hmt:pers.pers156
urn:cite:hmt:tokentypes.lexical
urn:cite:hmt:tokentypes.lexical
urn:cite:hmt:tokentypes.lexical
urn:cite:hmt:tokentypes.lexical
urn:cite:hmt:peoples.place96

Universally allowed elements and their mapping to token types

Elements to test:

Illustrated above:

Elements allowed in "secondary" texts, but not in Iliad editions

Splitting strings

Both the editorial and diplomatic tokenization systems include utility methods to split strings of text on white space.

Examples

The string

Ζεὺς  δ' Ἔριδα  προΐαλλε θοὰς ἐπι νῆας Ἀχαιῶν

yields the following ordered set of tokens:

Token string
Ζεὺς
δ'
Ἔριδα
προΐαλλε
θοὰς
ἐπι
νῆας
Ἀχαιῶν
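The same split can be reproduced with Python's built-in `str.split`, which treats any run of whitespace as a single separator when called with no arguments. This is offered only as an equivalent illustration; the function name `split_on_whitespace` is hypothetical and is not the name of the utility method in the tokenization systems.

```python
from typing import List


def split_on_whitespace(text: str) -> List[str]:
    """Split on runs of whitespace, discarding empty strings
    (hypothetical stand-in for the utility methods described above)."""
    return text.split()


# The input deliberately contains doubled spaces, as in the example above.
tokens = split_on_whitespace("Ζεὺς  δ' Ἔριδα  προΐαλλε θοὰς ἐπι νῆας Ἀχαιῶν")
```

Because consecutive spaces collapse into one separator, the doubled spaces in the input do not produce empty tokens.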