cjklib.reading.operator
— Operation on character readings¶
Architecture¶
A ReadingOperator supports basic operations on strings written in a character reading:
- decompose() breaks down a text into the basic entities of that reading (additional non-reading substrings are also accepted).
- compose() joins these entities together and might apply formatting rules needed by the reading.
- isReadingEntity() and isFormattingEntity() are provided to check which of the strings returned by decompose() are supported entities for the given reading. While a reading entity expresses an entity of the language (in most cases a syllable), a formatting entity merely exists for the convenience of the written form, e.g. punctuation marks or syllable separators.
- getDefaultOptions() will return the default reading dialect.
Many child classes add further reading-specific methods.
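The division of labour between these methods can be illustrated with a minimal, self-contained sketch. It does not use cjklib: ToyOperator, its snake_case method names and its entity pattern (lower-case letters followed by a tone digit from 1 to 6) are assumptions made only for this illustration.

```python
import re


class ToyOperator:
    """Hypothetical operator for a toy reading such as 'nei5hou2'."""

    # Assumed entity shape: lower-case letters followed by one tone digit.
    ENTITY_RE = re.compile(r"[a-z]+[1-6]")

    def decompose(self, text):
        """Split text into entities, keeping non-reading substrings."""
        pieces, pos = [], 0
        for match in self.ENTITY_RE.finditer(text):
            if match.start() > pos:
                # Anything between two entities is passed through unchanged.
                pieces.append(text[pos:match.start()])
            pieces.append(match.group())
            pos = match.end()
        if pos < len(text):
            pieces.append(text[pos:])
        return pieces

    def compose(self, entities):
        """Join entities; this toy reading needs no extra formatting."""
        return "".join(entities)

    def is_reading_entity(self, entity):
        return self.ENTITY_RE.fullmatch(entity) is not None

    def is_formatting_entity(self, entity):
        # Assumed separators of the toy reading.
        return entity in (" ", "-")


op = ToyOperator()
parts = op.decompose("nei5hou2, sai3gaai3!")
print(parts)  # ['nei5', 'hou2', ', ', 'sai3', 'gaai3', '!']
print([p for p in parts if op.is_reading_entity(p)])
print(op.compose(parts))
```

Note that ', ' and '!' are returned by decompose() although they are neither reading nor formatting entities in this sketch, which mirrors the remark that non-reading substrings are accepted.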
Romanisation¶
In addition to decompose(), provided by the class ReadingOperator, a
RomanisationOperator offers a method getDecompositions() that returns
several possible decompositions in an ambiguous case. Also, as romanisations
have a fixed set of entities, a method getReadingEntities() offers access to
a list of all accepted reading entities.
Decomposition¶
Transcriptions into the Latin (or Cyrillic) alphabet create the problem that syllable boundaries, or boundaries of entities belonging to single Chinese characters, are no longer clear once entities are grouped together.
It is therefore important to have methods at hand to separate strings and split them into single entities. This cannot always be done unambiguously, though, as several different decompositions might be possible, which leads to the general case of ambiguous decompositions.
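A small sketch shows how such ambiguity arises. It enumerates every way to split an unseparated string into known syllables. The syllable inventory and the function name all_decompositions are assumptions for illustration and not part of cjklib.

```python
# Toy syllable inventory (an assumption, far smaller than a real romanisation).
SYLLABLES = {"xi", "an", "xian", "fan", "fang", "gan"}


def all_decompositions(text, syllables=SYLLABLES):
    """Return every split of text into syllables from the inventory."""
    if not text:
        return [[]]  # exactly one way to decompose the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Combine this first syllable with every split of the remainder.
            for tail in all_decompositions(text[end:], syllables):
                results.append([head] + tail)
    return results


print(all_decompositions("xian"))    # [['xi', 'an'], ['xian']]
print(all_decompositions("fangan"))  # [['fan', 'gan'], ['fang', 'an']]
```

Both strings have two valid readings, so without further rules neither can be split with certainty.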
Many romanisations do provide a way to tackle this problem. Pinyin, for
example, requires the use of an apostrophe (') when the reverse process
of splitting the string into syllables becomes ambiguous. The Wade-Giles
romanisation in its strict implementation asks for a hyphen between all
syllables. The LSHK’s Jyutping when written with tone marks will always be
clearly decomposable as the digits mark syllable borders.
The method isStrictDecomposition() can be implemented to check whether a
possible decomposition is the strict decomposition offered by the
romanisation’s protocol. This method should guarantee that under all
circumstances only one decomposed version will be regarded as strict.
If no strict version is yielded and different decompositions exist, an
unambiguous decomposition cannot be made. All decompositions can be
accessed through the method getDecompositions(), even in cases where a
strict decomposition exists.
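The idea of a strict decomposition can be sketched for Pinyin written without tone marks. The check below relies on the Pinyin apostrophe rule, under which a syllable beginning with a, e or o must be preceded by an apostrophe when it follows another syllable. The function names are hypothetical and this is a simplified illustration, not cjklib's implementation.

```python
def is_strict(decomposition):
    """Toy strictness check for a string that contains no apostrophe.

    A split is rejected if any non-initial syllable starts with a, e or o,
    because Pinyin would have required an apostrophe at that position.
    """
    return not any(s.startswith(("a", "e", "o")) for s in decomposition[1:])


def strict_decomposition(candidates):
    """Return the single strict candidate, or None if there is not exactly one."""
    strict = [d for d in candidates if is_strict(d)]
    return strict[0] if len(strict) == 1 else None


# Candidate splits are given literally here to keep the sketch self-contained.
print(strict_decomposition([["xi", "an"], ["xian"]]))        # ['xian']
print(strict_decomposition([["fan", "gan"], ["fang", "an"]]))  # ['fan', 'gan']
```

The rejected candidates remain valid decompositions. They are simply not the form the romanisation's rules prescribe for the unmarked string.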
Letter case¶
Romanisations differ from other readings in that their entities can be
written in upper or lower case, or in a mix of both. By default operators
will recognise both; this behaviour can be changed with the option 'case',
which can be set to 'lower'. Upper case is not explicitly supported. If such
a writing is needed, it can be implemented by choosing lower case and
manually converting strings passed to and returned from the operator. The
method getReadingEntities() will by default return lower case entities.
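The manual conversion mentioned above could look roughly like the following sketch. It wraps a decomposer that only accepts lower case and restores the original casing afterwards. Both functions are hypothetical stand-ins rather than cjklib calls, and the approach assumes that lower-casing does not change the length of the string, which holds for plain ASCII romanisations but not for every Unicode character.

```python
def toy_decompose_lower(text):
    """Stand-in for an operator restricted to lower case (assumption)."""
    if text != text.lower():
        raise ValueError("this toy decomposer only accepts lower case")
    known = {"beijing": ["bei", "jing"]}
    return known[text]


def decompose_any_case(text, decompose_lower):
    """Decompose via a lower-case-only function, then restore the casing."""
    pieces = decompose_lower(text.lower())
    restored, pos = [], 0
    for piece in pieces:
        # Cut the same span out of the original, differently cased string.
        restored.append(text[pos:pos + len(piece)])
        pos += len(piece)
    return restored


print(decompose_any_case("BEIJING", toy_decompose_lower))  # ['BEI', 'JING']
print(decompose_any_case("BeiJing", toy_decompose_lower))  # ['Bei', 'Jing']
```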
Tonal readings¶
Tonal readings are supported with class TonalFixedEntityOperator.
It provides two methods, getTonalEntity() and splitEntityTone(), to cope
with tonal information in text.
Tones¶
Operators are free to handle tones according to their needs. No data type constraint is given, so some will handle tones as integers while others will handle them as strings. Even the number of tones may vary between different operators for the same language, as one system might be more specific about tonal features than another.
Plain entities¶
While some operators have a fixed set of accepted entities, the more specific subgroup for tonal languages has a set of basic entities, here called plain entities, which can be annotated with tonal information to yield a regular reading entity. Some plain entities might themselves be normal reading entities, while others might not. There is no requirement that the cross product of the set of plain entities and the set of tones fully spans the set of reading entities.
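The relation between plain entities, tones and reading entities can be sketched for a toy reading that marks tones with a trailing digit. The function names, the integer tone type and the range of six tones are assumptions chosen for this illustration. As noted above, real operators are free to represent tones differently.

```python
import re

# Assumed entity shape: a lower-case plain entity followed by a digit 1-6.
_TONAL_ENTITY_RE = re.compile(r"^([a-z]+)([1-6])$")


def get_tonal_entity(plain_entity, tone):
    """Annotate a plain entity with an integer tone."""
    return "%s%d" % (plain_entity, tone)


def split_entity_tone(entity):
    """Split a tonal entity into its plain entity and integer tone."""
    match = _TONAL_ENTITY_RE.match(entity)
    if match is None:
        raise ValueError("not a tonal entity: %r" % entity)
    return match.group(1), int(match.group(2))


print(get_tonal_entity("gwong", 2))  # gwong2
print(split_entity_tone("gwong2"))   # ('gwong', 2)
```

In this sketch a plain entity such as 'gwong' is not itself accepted by split_entity_tone(), which illustrates that a plain entity need not be a reading entity.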
Examples¶
Decompose a reading string in Gwoyeu Romatzyh into single entities:
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.decompose('"Hannshyue" .de mingcheng duey Jonggwo [...]', 'GR')
['"', 'Hann', 'shyue', '" ', '.de', ' ', 'ming', 'cheng', ' ', 'duey', ' ', 'Jong', 'gwo', ' [...]']
The same can be done by directly using the operator’s instance:
>>> from cjklib.reading import operator
>>> cy = operator.CantoneseYaleOperator()
>>> cy.decompose(u'gwóngjàuwá')
[u'gwóng', u'jàu', u'wá']
Composing will reverse the process, using a Pinyin string:
>>> f.compose([u'xī', u'ān'], 'Pinyin')
u"xī'ān"
For more complex operators, see PinyinOperator or MandarinIPAOperator.
Readings¶
- Mandarin Chinese
- Cantonese
- Shanghainese
- Korean
- Japanese