The tokenization process applies to passages of text identified by a CTS URN. The tokenizers support two forms of input: a string of text, and a delimited-text file in the tabular format of the hocuspocus library (http://cite-architecture.github.io/hocuspocus/).
In both cases, the output is an ordered list of token analyses. Each token analysis has two parts: a CTS URN for the substring, and a CITE URN classifying the type of the token.
At a minimum, a request to tokenize a string of text must identify the CTS URN as well as the text content to analyze. In addition, it is possible to identify the type of context being analyzed, either with one of the CITE URNs urn:cite:hmt:tokentypes.lexical, urn:cite:hmt:tokentypes.numeric, urn:cite:hmt:tokentypes.waw, or urn:cite:hmt:tokentypes.sic, or with a CITE URN from either of the two collections urn:cite:hmt:place or urn:cite:hmt:pers. If the context is not identified with one of these URNs, the content of XML text nodes is treated by default as a GreekString object, which the greeklang library can split into lexical tokens.
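The rule for choosing a token type from an optional context URN can be sketched as follows. This is a minimal illustration in Python, not the project's own code: the function name `resolve_context` is hypothetical, the personal-names collection is written as urn:cite:hmt:pers on the assumption that it follows the same pattern as urn:cite:hmt:place, and the fallback to the lexical type is an assumption based on the default treatment described above.

```python
# The four token-type URNs named in the text.
TOKEN_TYPES = {
    "urn:cite:hmt:tokentypes.lexical",
    "urn:cite:hmt:tokentypes.numeric",
    "urn:cite:hmt:tokentypes.waw",
    "urn:cite:hmt:tokentypes.sic",
}

# Collections of named entities (assumed spelling for the second URN).
NAMED_ENTITY_COLLECTIONS = ("urn:cite:hmt:place", "urn:cite:hmt:pers")

# Assumed default: unidentified contexts are tokenized as lexical text.
DEFAULT_TYPE = "urn:cite:hmt:tokentypes.lexical"


def resolve_context(context_urn=None):
    """Return the CITE URN used to classify tokens for a given context."""
    if context_urn in TOKEN_TYPES:
        return context_urn
    if context_urn is not None and any(
        context_urn.startswith(collection + ".")
        for collection in NAMED_ENTITY_COLLECTIONS
    ):
        # An object in one of the two collections, e.g. urn:cite:hmt:pers.pers8
        return context_urn
    return DEFAULT_TYPE
```

A context URN that is neither a token type nor an object in one of the two collections falls through to the default.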
If we parse the string of characters προΐαλλε θοὰς ἐπι νῆας from the text passage urn:cts:greekLit:tlg0012.tlg001.msA:11.3, we get an ordered list of 4 tokens.
Token string |
---|
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@προΐαλλε |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@θοὰς |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@ἐπι |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@νῆας |
Their types are all the same:
Type |
---|
urn:cite:hmt:tokentypes.lexical |
urn:cite:hmt:tokentypes.lexical |
urn:cite:hmt:tokentypes.lexical |
urn:cite:hmt:tokentypes.lexical |
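The pairing of token strings and types shown in the two tables above can be sketched in Python. This is an illustration under stated assumptions, not the tokenizer's actual implementation: `TokenAnalysis` and `tokenize_string` are hypothetical names, and plain white-space splitting stands in for the lexical tokenization that the greeklang library performs.

```python
from typing import List, NamedTuple

LEXICAL = "urn:cite:hmt:tokentypes.lexical"


class TokenAnalysis(NamedTuple):
    """One token analysis: a CTS URN for the substring plus a CITE URN type."""
    token_urn: str
    type_urn: str


def tokenize_string(passage_urn: str, text: str,
                    context_urn: str = LEXICAL) -> List[TokenAnalysis]:
    # White-space splitting is a stand-in for the real lexical tokenizer.
    return [
        TokenAnalysis(f"{passage_urn}@{token}", context_urn)
        for token in text.split()
    ]


sample_text = "προΐαλλε θοὰς ἐπι νῆας"
analyses = tokenize_string(
    "urn:cts:greekLit:tlg0012.tlg001.msA:11.3", sample_text
)
for analysis in analyses:
    print(analysis.token_urn, analysis.type_urn)
```

Because no explicit context is given, every token in the sketch receives the lexical type, and the subreferences carry no explicit index.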
If we parse the string Ζεὺς using the same CTS URN, and specify the context with the CITE URN urn:cite:hmt:pers.pers8, the result is a single token, urn:cts:greekLit:tlg0012.tlg001.msA:11.3@Ζεὺς, with the type urn:cite:hmt:pers.pers8.
By default, requests to tokenize a string do not include explicit index values on the CTS URN substrings, because a string submitted for tokenization is not necessarily an entire citable node of text. Explicit subreference indexing can optionally be included.
tis t'ar sfwe qewn eridi cunehke maxesqai;
The tabular representation of the HMT project editions preserves the full XML markup of the archival TEI documents, so the HMT tokenizers can identify the appropriate context from the HMT project's markup conventions. They can therefore formulate requests with both the CTS URN for the text passage, and a CITE URN classifying the context. Analyses of delimited text files always include explicit indexes on CTS URN subreferences.
Tokenizing this data file parses its XML text node into a list of 8 tokens.
Token string |
---|
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@Ζεὺς[1] |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@δ'[1] |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@Ἔριδα[1] |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@προΐαλλε[1] |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@θοὰς[1] |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@ἐπι[1] |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@νῆας[1] |
urn:cts:greekLit:tlg0012.tlg001.msA:11.3@Ἀχαιῶν[1] |
Their types are:
Types |
---|
urn:cite:hmt:pers.pers8 |
urn:cite:hmt:tokentypes.lexical |
urn:cite:hmt:pers.pers156 |
urn:cite:hmt:tokentypes.lexical |
urn:cite:hmt:tokentypes.lexical |
urn:cite:hmt:tokentypes.lexical |
urn:cite:hmt:tokentypes.lexical |
urn:cite:hmt:peoples.place96 |
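The explicit indexes on the token strings above can be derived mechanically. The following Python sketch assumes that the index in square brackets counts occurrences of the substring within the passage, so that the first occurrence is [1], the second [2], and so on. The function name `indexed_subreferences` is hypothetical, tokens are found by white-space splitting, and earlier occurrences are counted without overlap.

```python
import re
from typing import List


def indexed_subreferences(passage_urn: str, text: str) -> List[str]:
    """Build CTS URNs with explicit occurrence indexes for each token."""
    urns = []
    for match in re.finditer(r"\S+", text):
        token = match.group()
        # Count earlier (non-overlapping) occurrences of this substring
        # in the text that precedes the token, then add one.
        occurrence = text.count(token, 0, match.start()) + 1
        urns.append(f"{passage_urn}@{token}[{occurrence}]")
    return urns


line = "Ζεὺς δ' Ἔριδα προΐαλλε θοὰς ἐπι νῆας Ἀχαιῶν"
indexed = indexed_subreferences(
    "urn:cts:greekLit:tlg0012.tlg001.msA:11.3", line
)
for urn in indexed:
    print(urn)
```

No token in this line repeats earlier text, so every index is 1. A repeated token would receive a higher index, which is what makes each subreference unambiguous within its citable node.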
Elements to test:
w
Illustrated above:
persName
placeName
rs (@type = 'ethnic')
num: treated as MilesianString objects in the greeklang library
ref (@type = "urn" and @n = urn value)
q
cit
rs (@type = 'waw')
figDesc
note
Both the editorial and diplomatic tokenization systems include utility methods to split strings of text on white space.
The string
Ζεὺς δ' Ἔριδα προΐαλλε θοὰς ἐπι νῆας Ἀχαιῶν
yields the following ordered set of tokens:
Token string |
---|
Ζεὺς |
δ' |
Ἔριδα |
προΐαλλε |
θοὰς |
ἐπι |
νῆας |
Ἀχαιῶν |