cjklib.dictionary
— High level dictionary access¶
New in version 0.3.
This module provides classes for easy access to well known CJK dictionaries. Queries can be done using a headword, reading or translation.
Dictionary sources yield less structured information than the other data sources exposed in this library. For this reason the module offers a flexible system of entry factories, formatting strategies and search strategies, described in the sections below.
Examples¶
Examples of how to use this module:
Create a dictionary instance:
>>> from cjklib.dictionary import CEDICT
>>> d = CEDICT()
Get dictionary entries by reading:
>>> [e.HeadwordSimplified for e in
...     d.getForReading('zhi dao', reading='Pinyin', toneMarkType='numbers')]
[u'制导', u'执导', u'指导', u'直到', u'直捣', u'知道']
Change a search strategy (here search for a reading without tones):
>>> from cjklib.dictionary import search
>>> d = CEDICT(readingSearchStrategy=search.SimpleWildcardReading())
>>> d.getForReading('nihao', reading='Pinyin', toneMarkType='numbers')
[]
>>> d = CEDICT(readingSearchStrategy=search.TonelessWildcardReading())
>>> d.getForReading('nihao', reading='Pinyin', toneMarkType='numbers')
[EntryTuple(HeadwordTraditional=u'你好', HeadwordSimplified=u'你好', Reading=u'nǐ hǎo', Translation=u'/hello/hi/how are you?/')]
Apply a formatting strategy to remove all initial and final slashes on CEDICT translations:
>>> from cjklib.dictionary import *
>>> class TranslationFormatStrategy(format.Base):
...     def format(self, string):
...         return string.strip('/')
...
>>> d = CEDICT(
...     columnFormatStrategies={'Translation': TranslationFormatStrategy()})
>>> d.getFor(u'东京')
[EntryTuple(HeadwordTraditional=u'東京', HeadwordSimplified=u'东京', Reading=u'Dōng jīng', Translation=u'Tōkyō, capital of Japan')]
A simple dictionary lookup tool:
>>> from cjklib.dictionary import *
>>> from cjklib.reading import ReadingFactory
>>> def search(string, reading=None, dictionary='CEDICT'):
...     # guess reading dialect
...     options = {}
...     if reading:
...         f = ReadingFactory()
...         opClass = f.getReadingOperatorClass(reading)
...         if hasattr(opClass, 'guessReadingDialect'):
...             options = opClass.guessReadingDialect(string)
...     # search
...     d = getDictionary(dictionary, entryFactory=entry.UnifiedHeadword())
...     result = d.getFor(string, reading=reading, **options)
...     # print
...     for e in result:
...         print e.Headword, e.Reading, e.Translation
...
>>> search('_taijiu', 'Pinyin')
茅台酒(茅臺酒) máo tái jiǔ /maotai (a Chinese liquor)/CL:杯[bei1],瓶[ping2]/
Entry factories¶
Similar to SQL interfaces, entries can be returned in different forms. An
entry factory takes care of preparing the output. Predefined factories exist
for this purpose: cjklib.dictionary.entry.Tuple, which is very basic, returns
each entry as a tuple of its columns, while the more commonly used
cjklib.dictionary.entry.NamedTuple returns tuple objects whose fields are
also accessible by attribute.
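The difference between the two kinds of result can be illustrated without cjklib. The following is a minimal standalone sketch using collections.namedtuple; the COLUMNS list, the sample row and both helper functions are hypothetical and only mimic the behaviour described above, they are not the library's implementation:

```python
from collections import namedtuple

# Column names as they appear in the CEDICT examples of this page.
COLUMNS = ['HeadwordTraditional', 'HeadwordSimplified', 'Reading',
           'Translation']

# A hypothetical raw database row.
ROW = ('你好', '你好', 'nǐ hǎo', '/hello/hi/how are you?/')


def tuple_factory(rows):
    """Return each entry as a plain tuple of its columns."""
    return [tuple(row) for row in rows]


def named_tuple_factory(rows, columns=COLUMNS):
    """Return each entry as a tuple whose cells are also attributes."""
    EntryTuple = namedtuple('EntryTuple', columns)
    return [EntryTuple(*row) for row in rows]


plain = tuple_factory([ROW])[0]
named = named_tuple_factory([ROW])[0]

print(plain[2])       # plain tuples are accessed by position only
print(named.Reading)  # named tuples also allow access by attribute
```

A named tuple remains a tuple, so code written against positional access keeps working when the factory is swapped.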
Formatting strategies¶
As reading formats vary and many readings can be converted into each other,
a formatting strategy can be applied to return the expected format.
cjklib.dictionary.format.ReadingConversion provides an easy way to convert
the reading given by the dictionary into the user-defined reading. Other
columns can also be formatted by applying a strategy, see the example above.
A hybrid approach makes it possible to apply strategies either to single
cells, by giving a mapping from the cell name to the strategy, or to the
entire result entry, by giving a mapping from None to the strategy. In the
latter case the formatting strategy needs to deal with the dictionary-specific
entry structure:
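>>> from cjklib.dictionary import *
>>> # column strategy: TranslationFormatStrategy is the user-defined class
>>> # from the Examples section above
>>> d = CEDICT(columnFormatStrategies={
...     'Translation': TranslationFormatStrategy()})
>>> # entry strategy: the None key addresses the whole entry
>>> d = CEDICT(columnFormatStrategies={
...     None: format.NonReadingEntityWhitespace()})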
Formatting strategies can be chained together using the
cjklib.dictionary.format.Chain class.
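How column strategies, entry strategies and chaining interact can be sketched in plain Python. All class and function names below are hypothetical stand-ins, and the order in which column and entry strategies are applied is an assumption made for this sketch, not a statement about cjklib's internals:

```python
class StripSlashes:
    """Column strategy: remove leading and trailing slashes."""
    def format(self, string):
        return string.strip('/')


class SlashesToSemicolons:
    """Column strategy: turn the remaining slashes into '; '."""
    def format(self, string):
        return string.replace('/', '; ')


class Chain:
    """Apply several strategies one after another."""
    def __init__(self, *strategies):
        self.strategies = strategies

    def format(self, value):
        for strategy in self.strategies:
            value = strategy.format(value)
        return value


class StripCellWhitespace:
    """Entry strategy: receives the whole entry, here a dict."""
    def format(self, entry):
        return {column: cell.strip() for column, cell in entry.items()}


def apply_strategies(entry, strategies):
    """Format a dict-based entry; the key None addresses the whole entry."""
    entry = dict(entry)
    # Assumption: column strategies run first, the entry strategy last.
    for column, strategy in strategies.items():
        if column is not None:
            entry[column] = strategy.format(entry[column])
    if None in strategies:
        entry = strategies[None].format(entry)
    return entry


entry = {'Headword': ' 你好 ', 'Translation': '/hello/hi/how are you?/'}
formatted = apply_strategies(entry, {
    'Translation': Chain(StripSlashes(), SlashesToSemicolons()),
    None: StripCellWhitespace(),
})
print(formatted)
```

The entry strategy has to know the shape of the entry (a dict in this sketch), which mirrors the remark above that such strategies deal with the dictionary-specific entry structure.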
Search strategies¶
Searching in natural language data is a difficult process and depends highly
on the use case at hand. This task is handled by search strategies, which
make up the more complex parts of this module. Strategies exist for the
three main parts of dictionary entries: headword, reading and translation.
Additionally, mixed searching for a headword partially expressed by reading
information is supported and can augment the basic reading search. Several
search strategies exist, offering basic or more sophisticated routines. For
example, wildcard searching is offered on top of many basic strategies, by
default with the placeholders '_' for a single character and '%' for a match
of zero to many characters.
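The semantics of the two default placeholders can be sketched by translating a wildcard pattern into a regular expression. This is a standalone illustration; the helper names are hypothetical and cjklib itself may well delegate such matching to the database instead:

```python
import re


def wildcard_to_regex(pattern, single='_', multiple='%'):
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == single:
            parts.append('.')       # exactly one character
        elif char == multiple:
            parts.append('.*')      # zero to many characters
        else:
            parts.append(re.escape(char))
    return re.compile(''.join(parts), re.DOTALL)


def wildcard_match(pattern, string):
    """Return True if the whole string matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(string) is not None


print(wildcard_match('_道', '知道'))
print(wildcard_match('%酒', '茅台酒'))
```

Note that '_' requires exactly one character, so '_道' does not match '道' alone, whereas '%酒' also matches the bare '酒'.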
Headword search strategies¶
Searching for headwords is the simplest of the three. Exact searches are
provided by class cjklib.dictionary.search.Exact. By default class
cjklib.dictionary.search.Wildcard is employed, which offers wildcard
searches.
Reading search strategies¶
Readings have more complex and unique representations. Several classes are
provided here: cjklib.dictionary.search.Exact can again be used for exact
matches, and cjklib.dictionary.search.Wildcard for wildcard searches.
cjklib.dictionary.search.SimpleReading and
cjklib.dictionary.search.SimpleWildcardReading provide similar searching for
transcriptions as found e.g. in CEDICT.
A more complex search is provided by
cjklib.dictionary.search.TonelessWildcardReading, which offers search for
readings missing tonal information.
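The idea behind a toneless search can be illustrated with plain string normalisation. The sketch below is an assumption-laden simplification for Pinyin only: it strips tone diacritics and tone digits and ignores syllable boundaries, whereas the actual strategy works on readings through cjklib's reading support. The function names are hypothetical:

```python
import unicodedata

# Combining marks used for the four Pinyin tones:
# macron, acute accent, caron, grave accent.
TONE_MARKS = {'\u0304', '\u0301', '\u030c', '\u0300'}


def strip_tones(reading):
    """Remove tone diacritics, tone digits and spaces from a Pinyin string."""
    decomposed = unicodedata.normalize('NFD', reading)
    kept = [c for c in decomposed
            if c not in TONE_MARKS and c not in '12345']
    # Recompose so that letters such as 'ü' keep their diaeresis.
    recomposed = unicodedata.normalize('NFC', ''.join(kept))
    return recomposed.replace(' ', '').lower()


def toneless_match(query, reading):
    """Compare a query and a dictionary reading while ignoring tones."""
    return strip_tones(query) == strip_tones(reading)


print(toneless_match('nihao', 'nǐ hǎo'))
print(toneless_match('ni3 hao3', 'nǐ hǎo'))
```

Dropping the spaces makes some strings ambiguous (for example 'xian' against 'xi an'), which is one reason a real implementation needs to be aware of the reading's syllable structure.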
Translation search strategies¶
A basic search is provided by
cjklib.dictionary.search.SingleEntryTranslation, which finds an exact entry
in a list of entries separated by slashes ('/'). More flexible searching is
provided by cjklib.dictionary.search.SimpleTranslation and
cjklib.dictionary.search.SimpleWildcardTranslation, which take into account
additional information placed in parentheses.
More specialised implementations of these classes exist, adapted to the
formats found in the dictionaries CEDICT and HanDeDict.
More complex strategies can be implemented by extending the underlying table in the database, e.g. using the full text search capabilities of the database server. One popular approach is to use stemming algorithms for coping with inflections by reducing a word to its root form.
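The difference between matching a single slash-separated entry and additionally ignoring parenthesised information can be sketched as follows. This is a standalone illustration with hypothetical function names; the matching rules are assumptions for this sketch and are simpler than what the strategies named above implement:

```python
import re


def split_translations(translation):
    """Split a CEDICT-style translation cell into its single entries."""
    return [part for part in translation.split('/') if part]


def single_entry_match(query, translation):
    """Match only if the query equals one complete entry."""
    return query in split_translations(translation)


def simple_match(query, translation):
    """Match complete entries, also with parenthesised additions removed."""
    for part in split_translations(translation):
        stripped = re.sub(r'\s*\([^)]*\)\s*', ' ', part).strip()
        if query == part or query == stripped:
            return True
    return False


cell = '/maotai (a Chinese liquor)/CL:杯[bei1],瓶[ping2]/'
print(single_entry_match('maotai', cell))
print(simple_match('maotai', cell))
```

With the sample cell, the strict variant does not find 'maotai' because the entry carries an explanation in parentheses, while the more lenient variant does.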
Mixed reading search strategies¶
Special support for a string with mixed reading and headword entities is
provided by mixed reading search strategies. For example, 'dui4 不 qi3'
will find all entries with three-character headwords whose middle character
is '不', whose left character is read 'dui4' and whose right character is
read 'qi3'.
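The matching rule from the example can be sketched position by position. The function below is hypothetical and assumes that query tokens are separated by spaces and that each headword character corresponds to exactly one syllable of the reading; the sample readings are illustrative, not taken from a dictionary:

```python
def mixed_match(query, headword, reading):
    """Match a query that mixes headword characters and reading syllables.

    Each query token must equal either the character or the syllable
    found at the same position of the entry.
    """
    tokens = query.split()
    syllables = reading.split()
    if not (len(tokens) == len(headword) == len(syllables)):
        return False
    return all(token == char or token.lower() == syllable.lower()
               for token, char, syllable in zip(tokens, headword, syllables))


# Illustrative entries given as (headword, reading with tone numbers).
print(mixed_match('dui4 不 qi3', '对不起', 'dui4 bu5 qi3'))
print(mixed_match('dui4 不 qi3', '对得起', 'dui4 de5 qi3'))
```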
Case insensitivity & Collations¶
Case insensitive searching is done through collations in the underlying
database system and, for databases without collation support, by employing
the function lower(). A default case-independent collation is chosen in the
appropriate build method in cjklib.build.builder.
SQLite by default has no Unicode support for string operations. Optionally
the ICU library can be compiled in for handling alphabetic non-ASCII
characters. The DatabaseConnector can register its own Unicode functions if
ICU support is missing. Queries with LIKE will then use the function lower().
This compatibility mode has a negative impact on performance, and as it is
not needed for dictionaries like EDICT or CEDICT it is disabled by default.
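The general technique of registering a Unicode-aware function with SQLite can be shown with Python's standard sqlite3 module. This sketch does not use cjklib's DatabaseConnector; the table, its contents and the function name unicode_lower are hypothetical and chosen for illustration only:

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Register a Python function that lowercases with full Unicode support.
# The name 'unicode_lower' is an assumption made for this sketch.
conn.create_function(
    'unicode_lower', 1, lambda s: s.lower() if s is not None else None)

conn.execute('CREATE TABLE Dict (Reading TEXT)')
conn.executemany('INSERT INTO Dict VALUES (?)',
                 [('Dōng jīng',), ('Ōu zhōu',)])


def find_reading(pattern):
    """Case insensitive LIKE search that also folds non-ASCII letters."""
    sql = ('SELECT Reading FROM Dict '
           'WHERE unicode_lower(Reading) LIKE unicode_lower(?)')
    return [row[0] for row in conn.execute(sql, (pattern,))]


print(find_reading('ōu zhōu'))
print(find_reading('dōng%'))
```

Calling back into Python for every compared row is comparatively slow, which is in line with the performance remark above.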