cjklib.reading.operator.PinyinOperator is a complete implementation of the standard Chinese Pinyin romanisation (Hanyu Pinyin Fang’an, 汉语拼音方案, standardised in ISO 7098).
Features:
Pinyin syllables need to be separated by an apostrophe when their decomposition would otherwise be ambiguous. A famous example is the city Xi’an, which if written xian would be read as one syllable, meaning e.g. ‘fresh’. Another example is Chang’an, which without a delimiter could be read as either chang’an or chan’gan.
Several rules exist for where to place apostrophes. A simple yet sufficient rule is implemented in aeoApostropheRule(), which is the default in this class: an apostrophe is placed before any syllable starting with one of the three vowels a, e, o. Remember that the vowels [i], [u], [y] are written yi, wu, yu respectively at the start of a syllable, which makes those syllable boundaries clear. compose() places apostrophes where required when composing the reading string.
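The a/e/o rule described above can be sketched in plain Python. This is a minimal stand-alone illustration, not cjklib's own code: the helper names starts_with_aeo and compose_with_apostrophes are hypothetical, and the sketch ignores tone numbers and non-syllable entities.

```python
import unicodedata


def starts_with_aeo(syllable):
    """Return True if the syllable starts with a, e or o (tone marks ignored)."""
    # NFD splits a letter such as "ā" into "a" plus a combining macron,
    # so the first code point is the bare vowel.
    first = unicodedata.normalize("NFD", syllable)[:1].lower()
    return first in ("a", "e", "o")


def compose_with_apostrophes(syllables):
    """Join syllables, inserting an apostrophe before a/e/o-initial ones.

    Hypothetical helper for illustration; not cjklib's compose().
    """
    parts = []
    for index, syllable in enumerate(syllables):
        if index > 0 and starts_with_aeo(syllable):
            parts.append("'")
        parts.append(syllable)
    return "".join(parts)


print(compose_with_apostrophes(["xi", "an"]))     # xi'an
print(compose_with_apostrophes(["chan", "gan"]))  # changan
```

Under this rule chang + an gives chang'an while chan + gan stays changan, so the two readings remain distinguishable in writing.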
An alternative rule can be supplied by passing a function to the constructor as the option pinyinApostropheFunction. One possible function is a rule that separates all syllables by an apostrophe, which simplifies reading for beginners.
When decomposing a string it is important to determine which of possibly several segmentations is the one actually meant. For example, xian as given above should always be segmented into one syllable; xi’an is not a valid solution in this case. An alternative to aeoApostropheRule() should therefore guarantee proper decomposition, which is tested through isStrictDecomposition().
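To make the ambiguity concrete, the following sketch enumerates every segmentation of a string and then keeps only those consistent with the a/e/o rule. It only mirrors the idea behind isStrictDecomposition(), not its implementation; the tiny SYLLABLES inventory and the names segmentations and is_strict are assumptions made for this example.

```python
# Tiny illustrative inventory; a real Pinyin syllable table is far larger.
SYLLABLES = {"xi", "an", "xian", "chang", "chan", "gan"}


def segmentations(text, syllables=SYLLABLES):
    """Return every way to split text into syllables from the inventory."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results


def is_strict(decomposition):
    """A split is strict if no non-initial syllable starts with a, e or o.

    Such a syllable would have required an apostrophe in the input,
    so a split containing one cannot be what the writer meant.
    """
    return not any(s[0] in "aeo" for s in decomposition[1:])


print(segmentations("xian"))
# [['xi', 'an'], ['xian']]
print([d for d in segmentations("xian") if is_strict(d)])
# [['xian']]
```

The same filter resolves changan to chan + gan, because chang + an would have been written with an apostrophe.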
Finally, compose(decompose(string)) is only the identity if apostrophes are placed according to the rule, because wrongly placed apostrophes are kept when composing. Use removeApostrophes() to remove separating apostrophes.
>>> def noToneApostropheRule(opInst, precedingEntity, followingEntity):
...     return precedingEntity and precedingEntity[0].isalpha() \
...         and not precedingEntity[-1].isdigit() \
...         and followingEntity[0].isalpha()
...
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('an3ma5mi5ba5ni2mou1', 'Pinyin', 'Pinyin',
...     sourceOptions={'toneMarkType': 'numbers'},
...     targetOptions={'toneMarkType': 'numbers',
...         'missingToneMark': 'fifth',
...         'pinyinApostropheFunction': noToneApostropheRule})
u"an3ma'mi'ba'ni2mou1"
Erhua (兒化音/儿化音, Erhua yin), the r-colouring of syllables, is found in the northern Chinese dialects and results from merging the formerly independent sound er with the preceding syllable. In written form the word is followed by the character 兒/儿, e.g. 頭兒/头儿.
In Pinyin the Erhua sound is quite often expressed by appending a single r to the syllable of the character preceding 兒/儿, e.g. tóur for 頭兒/头儿. This stresses the monosyllabic nature of the word, in contrast to words like 兒子/儿子 érzi, where 兒/儿 ér is a syllable of its own.
When decomposing Pinyin it is thus important to decide how to treat the r that marks r-colouring. It can be an entity of its own, reflecting that the character string writes it with a separate character, or it can be part of the syllable of the preceding character, reflecting the monosyllabic pronunciation. This can be configured at instantiation time. By default the two-syllable form is chosen, which is more general because both spellings are accepted: banr and ban r (one without a delimiter, one with; both yield two entities in this representation).
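As a rough illustration of the two-entity form, the sketch below splits a trailing Erhua r off a single syllable while leaving the independent syllable er intact. The function names are hypothetical and the logic is an assumption for illustration only; it does not reproduce how PinyinOperator decomposes strings and does not handle tone numbers.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that 'ér' compares equal to 'er'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def split_erhua(syllable):
    """Split 'banr' into ['ban', 'r']; leave 'er' and plain syllables whole.

    Hypothetical helper, not part of cjklib.
    """
    plain = strip_diacritics(syllable).lower()
    if plain.endswith("r") and plain not in ("er", "r"):
        return [syllable[:-1], syllable[-1]]
    return [syllable]


print(split_erhua("banr"))  # ['ban', 'r']
print(split_erhua("ér"))    # ['ér']
```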
Tone marks, when written in the standard form with diacritics, are placed according to the official Pinyin rules. By default, however, PinyinOperator tries to work around misplaced tone marks, e.g. *tīan’ānmén (correct: tiān’ānmén), to ease handling of malformed input. In some cases this lenient behaviour leads to a segmentation different from the strict interpretation: *hónglùo can be split into hóng *lùo (correct: hóng luò) or into hóng lù o (likewise, the first example can be split into tī an ān mén). As the latter result also stems from a wrong transcription, no means are implemented to disambiguate between the two solutions. The general behaviour is controlled by the option 'strictDiacriticPlacement'.
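For reference, the commonly stated placement rule is: the mark goes on a or e if present, on the o of ou, and otherwise on the last vowel. The sketch below applies that rule to a toneless syllable and a tone number. It is an independent illustration with hypothetical names (tone_vowel_index, place_tone), not the code PinyinOperator uses, and it assumes lowercase or simply cased input in composed (NFC) form.

```python
import unicodedata

# Combining diacritics for tones 1-4; the fifth (neutral) tone is unmarked.
TONE_COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiou\u00fc"  # a e i o u ü


def tone_vowel_index(syllable):
    """Return the index of the vowel that carries the tone mark."""
    lowered = syllable.lower()
    for vowel in "ae":
        if vowel in lowered:
            return lowered.index(vowel)
    if "ou" in lowered:
        return lowered.index("ou")
    for index in range(len(lowered) - 1, -1, -1):
        if lowered[index] in VOWELS:
            return index
    raise ValueError("no vowel found in %r" % syllable)


def place_tone(syllable, tone):
    """Return the syllable with the diacritic for the given tone number."""
    if tone not in TONE_COMBINING:
        return syllable
    index = tone_vowel_index(syllable)
    marked = syllable[:index + 1] + TONE_COMBINING[tone] + syllable[index + 1:]
    # NFC merges the vowel and the combining mark into one code point.
    return unicodedata.normalize("NFC", marked)


print(place_tone("tian", 1))  # tiān
print(place_tone("luo", 4))   # luò
```

Under this rule tian takes its mark on a and luo on o, which is why *tīan and *lùo in the examples above count as misplaced.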
Pinyin allows the letter pairs ng, zh, ch and sh to be shortened to ŋ, ẑ, ĉ and ŝ. This behaviour is controlled by the option 'shortenedLetters'.
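The mapping itself is a plain substitution, as the following sketch shows. The names SHORT_FORMS, shorten and expand are hypothetical and unrelated to how the 'shortenedLetters' option is implemented. The sketch works on one lowercase syllable at a time, since replacing across a syllable boundary (n followed by g, as in chan + gan) would be wrong.

```python
# Letter pairs and their shortened forms.
SHORT_FORMS = {
    "zh": "\u1e91",  # ẑ
    "ch": "\u0109",  # ĉ
    "sh": "\u015d",  # ŝ
    "ng": "\u014b",  # ŋ
}


def shorten(syllable):
    """Replace zh, ch, sh and ng in a single lowercase syllable."""
    for pair, short in SHORT_FORMS.items():
        syllable = syllable.replace(pair, short)
    return syllable


def expand(syllable):
    """Inverse of shorten(): restore the two-letter spellings."""
    for pair, short in SHORT_FORMS.items():
        syllable = syllable.replace(short, pair)
    return syllable


print(shorten("zhang"))  # ẑaŋ
print(expand(shorten("shi")))  # shi
```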
See also