cjklib.reading.operator — Operation on character readings

Architecture

A ReadingOperator supports basic operations on strings written in a character reading:

  • decompose() breaks down a text into the basic entities of that reading (additional non-reading substrings are also accepted).
  • compose() joins these entities together and might apply formatting rules needed by the reading.
  • isReadingEntity() and isFormattingEntity() are provided to check which of the strings returned by decompose() are supported entities for the given reading. While a reading entity expresses an entity of the language (in most cases a syllable), a formatting entity merely exists for the convenience of the written form, e.g. punctuation marks or syllable separators.
  • getDefaultOptions() will return the default reading dialect.
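
The interplay of these methods can be sketched with a toy class. This is a minimal illustration and not cjklib code: ToyOperator, its regular expressions and its list of formatting characters are assumptions, and only the method names mirror the ones listed above. It handles a reading whose syllables end in a tone digit.

```python
import re


class ToyOperator(object):
    """Hypothetical operator for syllables that end in a tone digit."""

    # One syllable: lower case letters followed by a tone digit 1-6.
    _ENTITY = re.compile(r"[a-z]+[1-6]\Z")
    # The capturing group makes re.split keep the matched syllables.
    _SPLIT = re.compile(r"([a-z]+[1-6])")

    def decompose(self, text):
        # Non-reading substrings (e.g. spaces) are kept as separate items.
        return [part for part in self._SPLIT.split(text) if part]

    def compose(self, entities):
        # This toy reading needs no extra formatting when joining.
        return "".join(entities)

    def isReadingEntity(self, entity):
        return self._ENTITY.match(entity) is not None

    def isFormattingEntity(self, entity):
        return entity in (" ", ",", ".")


op = ToyOperator()
entities = op.decompose("gwong2zau1 waa2")
print(entities)  # ['gwong2', 'zau1', ' ', 'waa2']
print([e for e in entities if op.isReadingEntity(e)])  # ['gwong2', 'zau1', 'waa2']
print(op.compose(entities))  # gwong2zau1 waa2
```

Here compose() is the exact inverse of decompose(); a real operator may additionally insert separators or apply other formatting rules.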

Many child classes add further reading-specific methods.

Romanisation

In addition to decompose() provided by the class ReadingOperator, a RomanisationOperator offers a method getDecompositions() that returns several possible decompositions in an ambiguous case. Also, as romanisations have a fixed set of entities, a method getReadingEntities() offers access to a list of all accepted reading entities.

Decomposition

Transcriptions into the Latin (or Cyrillic) alphabet create the problem that syllable boundaries, or boundaries of entities belonging to single Chinese characters, are no longer clear once entities are grouped together.

Methods are therefore needed to separate such strings and split them into single entities. This cannot always be done unambiguously, as several different decompositions might be possible, which leads to the general case of ambiguous decompositions.

Many romanisations provide a way to tackle this problem. Pinyin, for example, requires the use of an apostrophe (') when the reverse process of splitting the string into syllables becomes ambiguous. The Wade-Giles romanisation in its strict implementation asks for a hyphen between all syllables. The LSHK’s Jyutping, when written with tone marks, is always clearly decomposable, as the tone digits mark syllable borders.

The method isStrictDecomposition() can be implemented to check if one possible decomposition is the strict decomposition offered by the romanisation’s protocol. This method should guarantee that under all circumstances only one decomposed version will be regarded as strict.

If no strict version is yielded and different decompositions exist, an unambiguous decomposition cannot be made. All decompositions can be accessed through the method getDecompositions(), even in cases where a strict decomposition exists.
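
The ambiguity itself can be shown without the library. The following is a minimal sketch, not cjklib's implementation: the helper all_decompositions() and the tiny syllable set are assumptions made for illustration. It enumerates every way to split a string into known syllables.

```python
def all_decompositions(text, syllables):
    """Return every split of text into members of syllables (hypothetical helper)."""
    if not text:
        return [[]]  # exactly one way to decompose the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Combine this prefix with every decomposition of the remainder.
            for tail in all_decompositions(text[end:], syllables):
                results.append([head] + tail)
    return results


# Small illustrative subset of Pinyin syllables, not the full table.
SYLLABLES = {"xi", "an", "xian", "fan", "fang", "gan"}

print(all_decompositions("xian", SYLLABLES))    # [['xi', 'an'], ['xian']]
print(all_decompositions("fangan", SYLLABLES))  # [['fan', 'gan'], ['fang', 'an']]
```

Both inputs have two valid readings, which is why a romanisation's own rules, such as the Pinyin apostrophe, are needed to single out one of them as strict.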

Letter case

Romanisations differ from other readings in that their entities can be written in upper case, lower case, or a mix of both. By default operators recognise both; this behaviour can be changed with the option 'case', which can alternatively be set to 'lower'. Upper case is not explicitly supported. If such a writing is needed, it can be implemented by choosing lower case and converting strings to and from the operator manually. The method getReadingEntities() returns lower case entities by default.
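
The manual conversion for upper case text could look roughly as follows. This is a sketch under stated assumptions: decompose_lower() is a hypothetical stand-in for an operator restricted to lower case and knows only two syllables; it is not part of cjklib.

```python
def decompose_lower(text):
    """Stand-in for an operator that accepts lower case only (hypothetical)."""
    known = ("jong", "gwo")
    entities = []
    while text:
        for syllable in known:
            if text.startswith(syllable):
                entities.append(syllable)
                text = text[len(syllable):]
                break
        else:
            # No known syllable matched at the current position.
            raise ValueError("cannot decompose %r" % text)
    return entities


def decompose_upper(text):
    """Handle upper case input by converting to and from lower case."""
    return [entity.upper() for entity in decompose_lower(text.lower())]


print(decompose_upper("JONGGWO"))  # ['JONG', 'GWO']
```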

Tonal readings

Tonal readings are supported by the class TonalFixedEntityOperator. It provides two methods, getTonalEntity() and splitEntityTone(), to cope with tonal information in text.

Tones

Operators are free to handle tones according to their needs. No data type constraint is given, so some will handle tones as integers while others will handle them as strings. Even the number of tones may vary between different operators for the same language, as one system might be more specific about tonal features than another.

Plain entities

While some operators have a fixed set of accepted entities, the more specific subgroup for tonal languages has a set of basic entities, here called plain entities, which can be annotated with tonal information to yield regular reading entities. Some plain entities might themselves be valid reading entities, while others might not. No requirement is made that the cross product of the set of plain entities and the set of tones fully spans the set of reading entities.
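
The relation between plain entities, tones and reading entities can be sketched for a reading that marks tones with a trailing digit. The functions below are hypothetical and written in snake_case to keep them apart from the library's getTonalEntity() and splitEntityTone(); representing tones as integers, with None for a missing tone, is an assumption of this sketch.

```python
import re

# A plain entity of lower case letters, optionally followed by a tone digit.
_TONE_PATTERN = re.compile(r"([a-z]+)([1-6])?\Z")


def split_entity_tone(entity):
    """Split an entity into (plain entity, tone); tone is None if absent."""
    match = _TONE_PATTERN.match(entity)
    if match is None:
        raise ValueError("not a valid entity: %r" % entity)
    plain, tone = match.groups()
    return plain, (int(tone) if tone else None)


def get_tonal_entity(plain, tone):
    """Annotate a plain entity with a tone; None leaves it unchanged."""
    return plain if tone is None else "%s%d" % (plain, tone)


print(split_entity_tone("gwong2"))  # ('gwong', 2)
print(split_entity_tone("gwong"))   # ('gwong', None)
print(get_tonal_entity("zau", 1))   # zau1
```

In this sketch an entity without a tone digit is accepted as is, which corresponds to a plain entity that is itself a reading entity.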

Examples

Decompose a reading string in Gwoyeu Romatzyh into single entities:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.decompose('"Hannshyue" .de mingcheng duey Jonggwo [...]', 'GR')
['"', 'Hann', 'shyue', '" ', '.de', ' ', 'ming', 'cheng', ' ', 'duey', ' ', 'Jong', 'gwo', ' [...]']

The same can be done by directly using the operator’s instance:

>>> from cjklib.reading import operator
>>> cy = operator.CantoneseYaleOperator()
>>> cy.decompose(u'gwóngjàuwá')
[u'gwóng', u'jàu', u'wá']

Composing will reverse the process, using a Pinyin string:

>>> f.compose([u'xī', u'ān'], 'Pinyin')
u"xī'ān"

For more complex operators, see PinyinOperator or MandarinIPAOperator.

Base classes