cjklib.dictionary — High level dictionary access

New in version 0.3.

This module provides classes for easy access to well known CJK dictionaries. Queries can be done using a headword, reading or translation.

Dictionary sources yield less structured information compared to other data sources exposed in this library. Owing to this fact, a flexible system is provided to the user.

Examples

Examples how to use this module:

  • Create a dictionary instance:

    >>> from cjklib.dictionary import CEDICT
    >>> d = CEDICT()
    
  • Get dictionary entries by reading:

    >>> [e.HeadwordSimplified for e in
    ...     d.getForReading('zhi dao', reading='Pinyin', toneMarkType='numbers')]
    [u'制导', u'执导', u'指导', u'直到', u'直捣', u'知道']
    
  • Change a search strategy (here search for a reading without tones):

    >>> d = CEDICT(readingSearchStrategy=search.SimpleWildcardReading())
    >>> d.getForReading('nihao', reading='Pinyin', toneMarkType='numbers')
    []
    >>> d = CEDICT(readingSearchStrategy=search.TonelessWildcardReading())
    >>> d.getForReading('nihao', reading='Pinyin', toneMarkType='numbers')
    [EntryTuple(HeadwordTraditional=u'你好', HeadwordSimplified=u'你好', Reading=u'nǐ hǎo', Translation=u'/hello/hi/how are you?/')]
    
  • Apply a formatting strategy to remove all initial and final slashes on CEDICT translations:

    >>> from cjklib.dictionary import *
    >>> class TranslationFormatStrategy(format.Base):
    ...     def format(self, string):
    ...         return string.strip('/')
    ...
    >>> d = CEDICT(
    ...     columnFormatStrategies={'Translation': TranslationFormatStrategy()})
    >>> d.getFor(u'东京')
    [EntryTuple(HeadwordTraditional=u'東京', HeadwordSimplified=u'东京', Reading=u'Dōng jīng', Translation=u'Tōkyō, capital of Japan')]
    
  • A simple dictionary lookup tool:

    >>> from cjklib.dictionary import *
    >>> from cjklib.reading import ReadingFactory
    >>> def search(string, reading=None, dictionary='CEDICT'):
    ...     # guess reading dialect
    ...     options = {}
    ...     if reading:
    ...         f = ReadingFactory()
    ...         opClass = f.getReadingOperatorClass(reading)
    ...         if hasattr(opClass, 'guessReadingDialect'):
    ...             options = opClass.guessReadingDialect(string)
    ...     # search
    ...     d = getDictionary(dictionary, entryFactory=entry.UnifiedHeadword())
    ...     result = d.getFor(string, reading=reading, **options)
    ...     # print
    ...     for e in result:
    ...         print e.Headword, e.Reading, e.Translation
    ...
    >>> search('_taijiu', 'Pinyin')
    茅台酒(茅臺酒) máo tái jiǔ /maotai (a Chinese liquor)/CL:杯[bei1],瓶[ping2]/
    

Entry factories

Similar to SQL interfaces, entries can be returned in different fashion. An entry factory takes care of preparing the output. For this predefined factories exist: cjklib.dictionary.entry.Tuple, which is very basic, will return each entry as a tuple of its columns while the mostly used cjklib.dictionary.entry.NamedTuple will return tuple objects that are accessible by attribute also.

Formatting strategies

As reading formattings vary and many readings can be converted into each other, a formatting strategy can be applied to return the expected format. cjklib.dictionary.format.ReadingConversion provides an easy way to convert the reading given by the dictionary into the user defined reading. Other columns can also be formatted by applying a strategy, see the example above.

A hybrid approach makes it possible to apply strategies on single cells, giving a mapping from the cell name to the strategy, or a strategy that operates on the entire result entry, by giving a mapping from None to the strategy. In the latter case the formatting strategy needs to deal with the dictionary specific entry structure:

>>> from cjklib.dictionary import *
>>> d = CEDICT(columnFormatStrategies={
...     'Translation': format.TranslationFormatStrategy()})
>>> d = CEDICT(columnFormatStrategies={
...     None: format.NonReadingEntityWhitespace()})

Formatting strategies can be chained together using the cjklib.dictionary.format.Chain class.

Search strategies

Searching in natural language data is a difficult process and highly depends on the use case at hand. This task is provided by search strategies which account for the more complex parts of this module. Strategies exist for the three main parts of dictionary entries: headword, reading and translation. Additionally mixed searching for a headword partially expressed by reading information is supported and can augment the basic reading search. Several instances of search strategies exist offering basic or more sophisticated routines. For example wildcard searching is offered on top of many basic strategies offering by default placeholders '_' for a single character, and '%' for a match of zero to many characters.

Headword search strategies

Searching for headwords is the most simple among the three. Exact searches are provided by class cjklib.dictionary.search.Exact. By default class cjklib.dictionary.search.Wildcard is employed which offers wildcard searches.

Reading search strategies

Readings have more complex and unique representations. Several classes are provided here: cjklib.dictionary.search.Exact again can be used for exact matches, and cjklib.dictionary.search.Wildcard for wildcard searches. cjklib.dictionary.search.SimpleReading and cjklib.dictionary.search.SimpleWildcardReading provide similar searching for transcriptions as found e.g. in CEDICT. A more complex search is provided by cjklib.dictionary.search.TonelessWildcardReading which offers search for readings missing tonal information.

Translation search strategies

A basic search is provided by cjklib.dictionary.search.SingleEntryTranslation which finds an exact entry in a list of entries separated by slashes (‘/‘). More flexible searching is provided by cjklib.dictionary.search.SimpleTranslation and cjklib.dictionary.search.SimpleWildcardTranslation which take into account additional information placed in parantheses. These classes have even more special implementations adapted to formats found in dictionaries CEDICT and HanDeDict.

More complex ones can be implemented on the basis of extending the underlying table in the database, e.g. using full text search capabilities of the database server. One popular way is using stemming algorithms for copying with inflections by reducing a word to its root form.

Mixed reading search strategies

Special support for a string with mixed reading and headword entities is provided by mixed reading search strategies. For example 'dui4 qi3' will find all entries with headwords whose middle character out of three is '不' and whose left character is read 'dui4' while the right character is read 'qi3'.

Case insensitivity & Collations

Case insensitive searching is done through collations in the underlying database system and for databases without collation support by employing function lower(). A default case independent collation is chosen in the appropriate build method in cjklib.build.builder.

SQLite by default has no Unicode support for string operations. Optionally the ICU library can be compiled in for handling alphabetic non-ASCII characters. The DatabaseConnector can register own Unicode functions if ICU support is missing. Queries with LIKE will then use function lower(). This compatibility mode has a negative impact on performance and as it is not needed for dictionaries like EDICT or CEDICT it is disabled by default.

Functions

Classes