cjklib.reading.converter — Conversion between character readings

Architecture

The basic method is convert() which converts one input string from one reading to another.

The method getDefaultOptions() returns the default conversion settings.

What gets converted

The conversion process uses the ReadingOperator of the source reading to decompose the given string into individual entities. The decomposition contains reading entities and entities that don’t represent any pronunciation. While the goal is to convert the included reading entities to the target reading, some converters might decide to also convert non-reading entities. Examples are delimiters such as apostrophes that differ between romanisations, or punctuation marks that have a defined representation in the target system, e.g. Braille.
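The decompose-then-convert idea can be sketched with a toy example. This is only an illustration under simplifying assumptions, not cjklib's implementation: the names TOY_SYLLABLES, toy_decompose and toy_convert are hypothetical, the target notation is made up, and a regular expression stands in for a real ReadingOperator.

```python
import re

# Hypothetical mapping from tone-numbered syllables to a made-up target
# notation; a real converter relies on the operators of both readings.
TOY_SYLLABLES = {'xiao3': 'XIAO-3', 'tou1': 'TOU-1'}


def toy_decompose(string):
    """Split a string into syllable-like runs and everything in between."""
    return [e for e in re.split(r'([a-z]+[1-5]?)', string) if e]


def toy_convert(string):
    """Convert known reading entities and pass all other entities through."""
    return ''.join(TOY_SYLLABLES.get(e, e) for e in toy_decompose(string))
```

Here toy_decompose('xiao3 tou1!') gives ['xiao3', ' ', 'tou1', '!'], and toy_convert('xiao3 tou1!') gives 'XIAO-3 TOU-1!': the two syllables are converted, while the space and the exclamation mark are non-reading entities that stay unchanged.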

Errors

By default, conversion won’t stop on entities that closely resemble reading entities but are not themselves valid. Those turn up unchanged in the result and can cause a CompositionError when the target operator decides that a converted entity cannot be joined with a non-converted one, because the entity boundaries could no longer be determined afterwards. Most of those errors probably result from bad input. This can be solved by telling the source operator to be strict on decomposition (where supported), so that the error is reported beforehand. The following example tries to convert xiǎo tōu (“thief”), misspelled as *xiǎo tō:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert(u'xiao3to1', 'Pinyin', 'GR',
...     sourceOptions={'toneMarkType': 'numbers'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
...
cjklib.exception.CompositionError: Unable to delimit non-reading entity 'to1'
>>> print f.convert(u'xiao3to1', 'Pinyin', 'GR',
...     sourceOptions={'toneMarkType': 'numbers',
...         'strictSegmentation': True})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
...
cjklib.exception.DecompositionError: Segmentation of 'to1' not possible or invalid syllable

Not being strict results in a lazy conversion, which might fail in some cases, as shown above. The string u'xiao3 to1' (with a space in between), however, passes the lazy conversion with 'to1' left unconverted, while the strict version still reports the invalid *to1.
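The difference between lazy and strict decomposition can be sketched with a toy decomposer. This is a simplified illustration, not how cjklib segments Pinyin: VALID, ToyDecompositionError and toy_decompose are hypothetical names, and the set of valid syllables is reduced to the two needed here.

```python
import re

# Hypothetical set of valid syllables; stands in for a real ReadingOperator.
VALID = {'xiao3', 'tou1'}


class ToyDecompositionError(Exception):
    """Illustrative stand-in, not cjklib.exception.DecompositionError."""


def toy_decompose(string, strict=False):
    """Split into entities; in strict mode reject unknown syllables."""
    entities = [e for e in re.split(r'([a-z]+[1-5])', string) if e]
    if strict:
        for entity in entities:
            # Anything shaped like a syllable must be a known one.
            if re.match(r'[a-z]+[1-5]$', entity) and entity not in VALID:
                raise ToyDecompositionError(
                    'invalid syllable %r' % entity)
    return entities
```

In this sketch the lazy call toy_decompose('xiao3 to1') returns ['xiao3', ' ', 'to1'] and leaves it to later steps to deal with 'to1', while toy_decompose('xiao3 to1', strict=True) raises ToyDecompositionError right away.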

Other errors that can arise:

  • AmbiguousDecompositionError, if the source string cannot be decomposed unambiguously,
  • ConversionError, e.g. if the target system doesn’t support a feature given in the source string, and
  • AmbiguousConversionError, if a given entity can be mapped to more than one entity in the target reading.

Bridge

Conversions between two readings can be made using a third reading if no direct conversion is defined. This third reading is called a bridge reading, and the mechanism is implemented in BridgeConverter. The routines of the ReadingFactory automatically employ bridges if needed.
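The bridge idea can be sketched independently of cjklib. The following toy example is an assumption-laden illustration and does not reflect the actual BridgeConverter code: the readings A, B and C, the DIRECT table and toy_convert are all hypothetical.

```python
# Hypothetical direct conversions between three made-up readings A, B and C.
# Only A -> B and B -> C are defined, so A -> C has to go through B.
DIRECT = {
    ('A', 'B'): lambda s: s.upper(),
    ('B', 'C'): lambda s: s.replace('1', '-one'),
}


def toy_convert(string, source, target, bridges=('B',)):
    """Use a direct conversion if defined, otherwise try each bridge."""
    if (source, target) in DIRECT:
        return DIRECT[(source, target)](string)
    for bridge in bridges:
        if (source, bridge) in DIRECT and (bridge, target) in DIRECT:
            intermediate = DIRECT[(source, bridge)](string)
            return DIRECT[(bridge, target)](intermediate)
    raise ValueError('no conversion from %s to %s' % (source, target))
```

With these definitions toy_convert('tou1', 'A', 'C') yields 'TOU-one' by going through B, even though no direct A -> C conversion exists; the caller does not need to know that a bridge was involved.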

Examples

Convert a string from Jyutping to Cantonese Yale:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('gwong2jau1waa2', 'Jyutping', 'CantoneseYale')
u'gwóngyāuwá'

A conversion can also be done by explicitly creating a converter instance through the factory, here from Gwoyeu Romatzyh (GR) to Pinyin:

>>> jyc = f.createReadingConverter('GR', 'Pinyin')
>>> jyc.convert('Woo.men tingshuo yeou "Yinnduhshyue", "Aijyishyue"')
u'Wǒmen tīngshuō yǒu "Yìndùxué", "Āijíxué"'

Convert between different dialects of the same reading, here Wade-Giles:

>>> f.convert(u'kuo3-yü2', 'WadeGiles', 'WadeGiles',
...     sourceOptions={'toneMarkType': 'numbers'},
...     targetOptions={'toneMarkType': 'superscriptNumbers'})
u'kuo³-yü²'

See PinyinDialectConverter for more examples.

Base classes