Title: | A Statistical Framework for the Analysis of Japanese Kanji Characters |
---|---|
Description: | Various tools and data sets that support the study of kanji, including their morphology, decomposition and concepts of distance and similarity between them. |
Authors: | Dominic Schuhmacher [aut, cre] , Lennart Finke [aut] |
Maintainer: | Dominic Schuhmacher <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.14.1.9000 |
Built: | 2024-11-06 05:29:53 UTC |
Source: | https://github.com/dschuhmacher/kanjistat |
All CJK characters in the file(s) found at the specified path are substituted by their Unicode escape sequences (\u + 4 digit hex number or \U + 8 digit hex number where necessary).
cjk_escape(path, outdir = NULL, verbose = TRUE)
path |
the path to a directory or a single file. |
outdir |
the directory where the output files are written. Defaults to the subdirectory
|
verbose |
whether to print a message for each output file. |
If path is a directory, the replacement is performed for all files at that location (subdirectories are ignored). If outdir is the same as path, the original files are overwritten without warning.
If path is a file, the replacement is limited to this file. If outdir is the same as dirname(path), the file is overwritten without warning.
No return value, called for side effects.
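The substitution described above can be illustrated on a single string with base R alone. This is a minimal sketch, not the package's implementation: the helper name escape_non_ascii is hypothetical, and for simplicity it escapes every non-ASCII character rather than only CJK characters.

```r
# Hypothetical helper (not part of kanjistat): replace each non-ASCII
# character of a single string by its Unicode escape sequence.
escape_non_ascii <- function(x) {
  cps <- utf8ToInt(x)                            # integer codepoints
  chars <- vapply(cps, intToUtf8, character(1))  # one string per character
  esc <- ifelse(cps <= 0xFFFF,
                sprintf("\\u%04x", cps),         # \u + 4 digit hex number
                sprintf("\\U%08x", cps))         # \U + 8 digit hex number
  paste(ifelse(cps < 128L, chars, esc), collapse = "")
}

escaped <- escape_non_ascii("a\u51b7b")
cat(escaped, "\n")
```

ASCII characters pass through unchanged, so ordinary R code in a file would be left intact by such a replacement.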
Given codepoints cp, the function codepointToKanji transforms them to UTF-8, which will typically show as the actual characters the codepoints stand for.
Vice versa, given (UTF-8 encoded) kanji kan, the function kanjiToCodepoint transforms them to Unicode codepoints.
codepointToKanji(cp, concat = FALSE)
kanjiToCodepoint(kan, character = FALSE)
cp |
a vector of character strings or objects of class |
concat |
logical. Shall the returned characters be concatenated? |
kan |
a vector of kanjis (strings of length 1) or a single string of length >= 1 of kanjis. |
character |
logical. Shall the returned codepoints be of class "character" or hexmode. |
For codepointToKanji a character vector of kanji. For kanjiToCodepoint a vector of hexadecimal numbers (class hexmode).
codepointToKanji(c("51b7", "6696", "71b1"))
kanjiToCodepoint("冷暖熱")
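For readers who want to see what the two conversions amount to, here is a minimal base-R sketch of the round trip. It does not call the package functions and makes no claim about how they are implemented; the variable names are illustrative only.

```r
# Codepoints as hex strings -> characters -> codepoints again (base R only).
cp <- c("51b7", "6696", "71b1")

# Hex string -> integer codepoint -> UTF-8 character
kan <- vapply(strtoi(cp, base = 16L), intToUtf8, character(1))

# UTF-8 character -> integer codepoint -> class "hexmode"
back <- as.hexmode(vapply(kan, utf8ToInt, integer(1), USE.NAMES = FALSE))

paste(kan, collapse = "")  # the three characters as one string
format(back)               # hex digits as character strings
```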
List distances to nearest neighbors of a given kanji in terms of a reference distance (which is currently only the stroke edit distance) and compare with values in terms of another distance (currently only the component transport distance, a.k.a. kanji distance).
compare_neighborhoods( kan, refdist = "strokedit", refnn = 10, compdist = "kanjidist", compnn = 0, ... )
kan |
a kanji (currently only as a single UTF-8 character). |
refdist |
the name of the reference distance (currently only "strokedit"). |
refnn |
the number of nearest neighbors in terms of the reference distance. |
compdist |
a character vector. The name(s) of one or several other distances to compare with (currently only "kanjidist"). |
compnn |
the number of nearest neighbors in terms of the other distance(s). If this is positive it is assumed that the suggested package kanjistat.data is available. |
... |
further parameters that are passed to |
A matrix of distances with refnn + compnn columns named by the nearest neighbors of kan (first in terms of the reference distance, then the other distances) and 1 + length(compdist) rows named by the type of distance.
This is only a first draft of the function and its interface and details may change considerably in the future.
As there is currently no precomputed kanjidist matrix, there is a huge difference in computation time between setting compnn = 0 (only kanji distances to the refnn nearest neighbors in terms of refdist have to be computed) and setting compnn to any value $> 0$ (kanji distances to all 2135 other Jouyou kanji have to be computed in order to determine the compnn nearest neighbors; depending on the system and parameter settings this can take roughly anywhere between 2 minutes and an hour).
# compare_neighborhoods("晴", refnn=5, compo_seg_depth=4, approx="pcweighted",
#                       compnn=0, minor_warnings=FALSE)
Accept any interpretable representation of kanji in terms of index numbers, UTF-8 character strings of length 1, UTF-8 codepoints or kanjivec objects and convert it to all or any of these formats.
convert_kanji( key, output = c("all", "index", "character", "hexmode", "kanjivec"), simplify = TRUE )
key |
an atomic vector or list of kanji in any combination of formats. |
output |
a string describing the desired output. |
simplify |
logical. Whether to simplify the output to an atomic vector or keep the structure of the original vector. In either case it depends on output whether this is possible. |
Index numbers are in terms of the order in kbase. UTF-8 codepoints are usually of class "hexmode", but character strings starting with "0x" or "0X" are also accepted in the key.
For output = "kanjivec", the GitHub package kanjistat.data has to be available or an error is returned. For output = "all", component kanjivec is set to NA if kanjistat.data is not available.
A vector of the same length as key. If simplify is TRUE, this is an atomic vector for output = "index", "character" or "hexmode", and a list for output = "kanjivec" or "all". If simplify is FALSE, the original structure (atomic or list) is kept whenever possible.
convert_kanji(as.hexmode("99ac"))
convert_kanji("0x99ac") # same
convert_kanji(500, "character") == kbase$kanji[500] # TRUE
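The equivalence of a hexmode key and a "0x"-prefixed string can be mimicked in base R. The sketch below is an illustration under stated assumptions, not the package's code: the helper as_codepoint is hypothetical and handles only these two key formats.

```r
# Hypothetical helper (not part of kanjistat): normalise a codepoint given as
# a hexmode object or as a string such as "0x99ac" to class "hexmode".
as_codepoint <- function(key) {
  if (inherits(key, "hexmode")) return(key)
  stopifnot(is.character(key), grepl("^0[xX][0-9a-fA-F]+$", key))
  as.hexmode(strtoi(sub("^0[xX]", "", key), base = 16L))
}

cp1 <- as_codepoint(as.hexmode("99ac"))
cp2 <- as_codepoint("0x99ac")
as.integer(cp1) == as.integer(cp2)    # both keys denote the same codepoint
kanji <- intToUtf8(as.integer(cp2))   # the corresponding character
```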
Precomputed kanji distances
dstrokedit
dyehli
Symmetric sparse matrices containing distances between a key kanji, its ten nearest neighbors and
possibly some other close kanji.
For dstrokedit, these are the stroke edit distances according to Yencken and Baldwin (2008).
For dyehli, these are the bag-of-radicals distances according to Yeh and Li (2002).
Both are instances of the S4 class dsCMatrix (symmetric sparse matrices in column-compressed format) with 2133 rows and 2133 columns.
All pre-2010 jouyou kanji that are also post-2010 jouyou kanji are included. The indices are those from kbase.
Datasets from https://lars.yencken.org/datasets, made available under the Creative Commons Attribution 3.0 Unported licence.
Computed as part of Yencken, Lars (2010) Orthographic support for passing the reading hurdle in Japanese. PhD Thesis, University of Melbourne, Melbourne, Australia.
Yeh, Su-Ling and Li, Jing-Ling (2002). Role of structure and component in judgements of visual similarity of Chinese characters. Journal of Experimental Psychology: Human Perception and Performance, 28(4), 933–947.
Yencken, Lars, & Baldwin, Timothy (2008). Measuring and predicting orthographic associations: Modelling the similarity of Japanese kanji. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 1041-1048.
# Find index for kanji 部
bu_index <- match("部", kbase$kanji)
# Look up available stroke edit distances for 部.
non_zero <- which(dstrokedit[bu_index,] != 0)
sed <- dstrokedit[non_zero, bu_index]
names(sed) <- kbase[non_zero,]$kanji
sort(sed)
# Look up available bag-of-radicals distances for 部.
non_zero <- which(dyehli[bu_index,] != 0)
bord <- dyehli[non_zero, bu_index]
names(bord) <- kbase[non_zero,]$kanji
sort(bord)
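The lookup pattern in this example (treat zero entries as "no distance stored", then sort the remaining ones) can be tried without the package data. The following toy sketch uses a small dense base-R matrix with made-up labels and values; the real objects are sparse dsCMatrix instances indexed by kbase.

```r
# Toy stand-in for a precomputed distance matrix; labels and values are
# hypothetical. A zero off the diagonal means that no distance is stored.
labels <- c("A", "B", "C", "D")
d <- matrix(0, 4, 4, dimnames = list(labels, labels))
d["A", "B"] <- d["B", "A"] <- 0.2
d["A", "D"] <- d["D", "A"] <- 0.5

stored <- which(d["A", ] != 0)        # neighbours of "A" with a stored distance
neighbours <- sort(d["A", stored])    # nearest neighbour first
neighbours
```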
A sample list of kanjivec objects
fivebetas
fivebetas is a list of five kanjivec objects representing the basic kanji 部,障,陪,郵,陣 containing "beta" components, which in fact come from two different classical radicals:
阜–>⻖ on the left: mound, small village
邑–>⻏ on the right: large village
The list has been generated with the function kanjivec with parameter flatten="intelligent" from the corresponding files in the KanjiVG database by Ulrich Apel (https://kanjivg.tagaini.net/).
oldpar <- par(mfrow = c(1,5), mai = rep(0,4))
invisible( lapply(fivebetas, plot, seg_depth = 2) )
par(oldpar)
Sample lists of kanjimat objects
fivetrees1
fivetrees2
fivetrees3
fivetrees1, fivetrees2 and fivetrees3 are lists of five kanjimat objects each, representing the same five basic kanji 校,木,休,林,相, each containing a tree component. Their matrices are antialiased 64 x 64 pixel representations of the kanji. The size is chosen as a compromise between aesthetics and memory/computational cost, such as for kmatdist.
All of them are in handwriting style fonts.
fivetrees1 is in a Kyoukasho font (schoolbook style),
fivetrees2 is in a Kaisho font (regular script calligraphy font),
fivetrees3 is in a Gyousho font (semi-cursive calligraphy font).
Each of the three is an object of class list of length 5.
The lists have been generated with the function kanjimat using the Mac OS pre-installed YuKyokasho font (fivetrees1), as well as the freely available fonts nagayama_kai by Norio Nagayama and KouzanBrushFontGyousyo by Aoyagi Kouzan.
oldpar <- par(mfrow = c(3,5))
invisible( lapply(fivetrees1, plot) )
invisible( lapply(fivetrees2, plot) )
invisible( lapply(fivetrees3, plot) )
par(oldpar)
The strokes are the leaves of the kanjivec stroketree. They consist of a two-column matrix giving a discretized path for the stroke in the unit square with further attributes.
get_strokes(kvec, which = 1:kvec$nstrokes, simplify = TRUE)
kvec |
an object of class |
which |
a numeric vector specifying the numbers of the strokes that are to be returned. Defaults to all strokes. |
simplify |
logical. Shall only the stroke be returned if |
Usually a list of strokes with attributes. Regardless of whether which is ordered or contains duplicates, the returned list will always contain the strokes in their natural order without duplicates. If which has length 1 and simplify = TRUE, the list is avoided, and only the single stroke is returned.
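The selection rule for which and simplify can be sketched on a plain list. This is an illustration only, assuming strokes are stored as list elements in their natural order; the function pick_strokes is hypothetical and is not how get_strokes accesses the stroketree.

```r
# Hypothetical stand-in for the selection logic: natural order, no duplicates,
# and a bare element instead of a list when a single stroke is requested.
pick_strokes <- function(strokes, which = seq_along(strokes), simplify = TRUE) {
  idx <- sort(unique(which))
  res <- strokes[idx]
  if (simplify && length(which) == 1L) res[[1]] else res
}

strokes <- list("s1", "s2", "s3", "s4")    # placeholders for stroke matrices
two <- pick_strokes(strokes, c(3, 1, 3))   # strokes 1 and 3, in that order
one <- pick_strokes(strokes, 2)            # the single stroke itself
```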
kanji <- fivebetas[[5]]
get_strokes(kanji, c(3,10)) # the two long vertical strokes in 陣
The strokes are the leaves of the kanjivec stroketree. They consist of a two-column matrix giving a discretized path for the stroke in the unit square with further attributes.
get_strokes_compo(kvec, which = c(1, 1))
kvec |
an object of class |
which |
a vector of length 2 specifying the index of the component, i.e. the component used
is |
A list of strokes with attributes.
kanji <- fivebetas[[5]]
# get the three strokes of the component ⻖ in 陣
rad <- get_strokes_compo(kanji, c(2,1))
plot(0.5, 0.5, xlim=c(0,1), ylim=c(0,1), type="n", asp=1, xaxs="i", yaxs="i", xlab="", ylab="")
invisible(lapply(rad, lines, lwd=4))
The tibbles kbase and kmorph provide basic and morphologic information, respectively, for all kanji contained in the KANJIDIC2 file (see below)
kbase
kmorph
kbase is a tibble with 13,108 rows and 13 variables:
the kanji
the Unicode codepoint
the number of strokes
one of four classes: "kyouiku", "jouyou", "jinmeiyou" or "hyougai"
a number from 1-11, basically a finer version of class, same as in KANJIDIC2, except that we assigned an 11 for all hyougaiji (rather than an NA value)
at what level the kanji appears in the Nihon Kanji Nouryoku Kentei (Kanken)
at what level the kanji appears in the Japanese Language Proficiency Test (Nihongo Nouryoku Shiken)
at what level the kanji is learned on the kanji learning website Wanikani
the frequency rank (1 = most frequent) "based on several averages (Wikipedia, novels, newspapers, ...)"
the frequency rank (1 = most frequent) based on newspaper data (2501 most frequent kanji over four years in the Mainichi Shimbun)
a single ON reading in katakana
a single kun reading in hiragana
a single English meaning of the kanji
kmorph is a tibble with 13,108 rows and 15 variables:
the kanji
the number of strokes
the traditional (Kangxi) radical used for indexing kanji (one of 214)
the variant of the radical if it is different, otherwise NA
the Nelson radical if it differs from the traditional one, otherwise NA
ideographic description character (plus sometimes a number or a letter) describing the shape of the kanji
visible components of the kanji; originally from KRADFILE
the kanji's SKIP code
a single English meaning of the kanji (same as in kbase)
The single ON and kun readings and the single meaning are for easy identification of the more difficult kanji. They are the first entry in the KANJIDIC2 file, which may not always be the most important one. For full readings/meanings use the function lookup or consult a dictionary.
Most of the data is directly from the KANJIDIC2 file.
https://www.edrdg.org/wiki/index.php/KANJIDIC_Project
Variables jlpt, frank, idc, components were taken from the Kanjium database
https://github.com/mifunetoshiro/kanjium
Variable components is originally from RADKFILE/KRADFILE.
https://www.edrdg.org/
The use of this data is covered in each case by a Creative Commons BY-SA 4.0 License. See the package's LICENSE file for details and copyright holders.
Variable "class" is derived from "grade".
Variable "kanken" was compiled based on the Wikipedia description of the test levels (as of September 2022).
The kanji distance is based on matching hierarchical component structures in a nesting-free way across all levels. The cost for matching individual components is a cost for registering the components (i.e. aligning their position, scale and aspect ratio) plus the (relative unbalanced) Wasserstein distance between the registered components.
kanjidist( k1, k2, compo_seg_depth1 = 3, compo_seg_depth2 = 3, p = 1, C = 0.2, approx = c("grid", "pc", "pcweighted"), type = c("rtt", "unbalanced", "balanced"), size = 48, lwd = 2.5, density = 30, verbose = FALSE, minor_warnings = TRUE )
k1 , k2
|
two objects of type |
compo_seg_depth1 , compo_seg_depth2
|
two integers |
p |
the order of the Wasserstein distance used for matching components. All distances and
the penalty (if any) are taken
to the |
C |
the penalty for extra mass if |
approx |
what kind of approximation is used for matching components. If this is |
type |
the type of Wasserstein distance used for matching components based on the grid or
point cloud approximation chosen. |
size |
side length of the bitmaps used for matching components (if |
lwd |
linewidth for drawing the components in these bitmaps (if |
density |
approximate number of discretization points per unit line length (if |
verbose |
logical. Whether to print detailed information on the cost for all pairs of components and the final matching. |
minor_warnings |
logical. Should minor_warnings be given. If |
For the precise definition and details see the reference below. Parameter C
corresponds to in the paper.
The kanji distance, a non-negative number.
The interface and details of this function will change in the future. Currently only a minimal set of parameters can be passed. The other parameters are fixed exactly as in the "prototype distance" (4.1) of the reference below for better or worse.
There is a certain tendency that exact matches of components are rather strongly favored (if the KanjiVG elements agree this can overrule the unbalanced Wasserstein distance) and the penalties for translation/scaling/distortion of components are somewhat mild.
The computation time is rather high (depending on the settings and kanji up to several seconds per kanji pair). This can be alleviated somewhat by keeping the compo_seg_depth parameters at 3 or lower and setting size = 32 (which goes well with lwd=1.8).
Future versions will use a much faster line-based optimal transport algorithm and further speed-ups.
Dominic Schuhmacher (2023).
Distance maps between Japanese kanji characters based on hierarchical optimal transport.
ArXiv, doi:10.48550/arXiv.2304.02493
if (requireNamespace("ROI.plugin.glpk")) {
  kanjidist(fivebetas[[4]], fivebetas[[5]])
  kanjidist(fivebetas[[4]], fivebetas[[5]], verbose=TRUE)
  # faster and similar:
  kanjidist(fivebetas[[4]], fivebetas[[5]], compo_seg_depth1=2, compo_seg_depth2=2,
            size=32, lwd=1.8, verbose=TRUE)
  # slower and similar:
  kanjidist(fivebetas[[4]], fivebetas[[5]], size=64, lwd=3.2, verbose=TRUE)
}
Individual distances are based on kanjidist.
kanjidistmat( klist, klist2 = NULL, compo_seg_depth = 3, p = 1, C = 0.2, approx = c("grid", "pc", "pcweighted"), type = c("rtt", "unbalanced", "balanced"), size = 48, lwd = 2.5, density = 30, verbose = FALSE, minor_warnings = FALSE )
klist |
a list of |
klist2 |
an optional second list of |
compo_seg_depth |
integer |
p , C , type , approx , size , lwd , density , verbose , minor_warnings
|
the same as for the function |
A matrix of dimension length(klist) x length(klist2) having as its (i,j)-th entry the distance between klist[[i]] and klist2[[j]]. If klist2 is not provided it is assumed to be equal to klist, but computation is more efficient as only the upper triangular part is computed and then symmetrized with diagonal zero.
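The saving from computing only the upper triangular part can be sketched for an arbitrary distance function. This is a generic illustration, not the package's code: pairwise_dist and the toy distance d are hypothetical names.

```r
# Generic sketch: full cross matrix if ys is given, otherwise compute only the
# upper triangle and symmetrize, leaving zeros on the diagonal.
pairwise_dist <- function(xs, ys = NULL, d) {
  if (!is.null(ys)) {
    return(outer(seq_along(xs), seq_along(ys),
                 Vectorize(function(i, j) d(xs[[i]], ys[[j]]))))
  }
  n <- length(xs)
  m <- matrix(0, n, n)
  for (i in seq_len(max(n - 1L, 0L))) {
    for (j in (i + 1L):n) m[i, j] <- d(xs[[i]], xs[[j]])
  }
  m + t(m)
}

d <- function(a, b) abs(a - b)       # toy distance on numbers
xs <- list(1, 4, 6)
sym <- pairwise_dist(xs, d = d)      # 3 calls to d
full <- pairwise_dist(xs, xs, d)     # 9 calls to d, same result
```

For n objects the one-list form needs n(n-1)/2 distance evaluations instead of n^2, which matters when a single evaluation takes seconds.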
The same precautions apply as for kanjidist.
kanjidistmat(fivebetas)
Create a (list of) kanjimat object(s), i.e. bitmap representations of a kanji using a certain font-family and other typographical parameters.
kanjimat( kanji, family = NULL, size = NULL, margin = 0, antialias = TRUE, save = FALSE, overwrite = FALSE, simplify = TRUE, ... )
kanji |
a (vector of) character string(s) containing kanji. |
family |
the font-family to be used. For details see vignette. |
size |
the sidelength of the (square) bitmap |
margin |
numeric. Extra margin around the character. Defaults to 0 which leaves a relatively slim margin. Positive values increase this margin, negative values decrease it (which usually cuts off part of the kanji). |
antialias |
logical. Shall antialiasing be performed? |
save |
logical or character. If FALSE return the (list of) kanjimat object(s). Otherwise save the result as an rds file in the working directory (as kmatsave.rds) or under the file path provided. |
overwrite |
logical. If FALSE return an error (before any computations are done) if the designated file path already exists. Otherwise an existing file is overwritten. |
simplify |
logical. Shall a single kanjimat object be returned (instead of a list of one) if |
... |
further arguments passed to png. This is for extensibility. The only argument
that may currently be used is |
A list of objects of class kanjimat or, if only one kanji was specified and simplify is TRUE, a single object of class kanjimat. If save = TRUE, the same is (saved and) still returned invisibly.
If no font family is provided, the default Chinese font WenQuanYi Micro Hei that comes with the package showtext is used. This means that the characters will typically be recognizable, but quite often look odd as Japanese characters. We strongly advise that a Japanese font be used as detailed above.
res <- kanjimat(kanji="藤", size = 128)
Create a (list of) kanjivec object(s). Each object is a representation of the kanji as a tree of strokes based on .svg files from the KanjiVG database containing further, derived information.
kanjivec( kanji, database = NULL, flatten = "intelligent", bezier_discr = c("svgparser", "eqtimed", "eqspaced"), save = FALSE, overwrite = FALSE, simplify = TRUE )
kanji |
a (vector of) character string(s) of one or several kanji. |
database |
the path to a local copy of (a subset of) the KanjiVG database. It is expected
that the svg files reside at this exact location (not in a subdirectory). If |
flatten |
logical. Should nodes that are only-children be fused with their parents? Alternatively one of the strings "intelligent", "inner" or "leaves". Although the first is the default it is experimental and the precise meaning will change in the future; see details. |
bezier_discr |
character. How to discretize the Bézier curves describing the strokes. If "svgparser" (the only option available prior to kanjistat 0.12.0), code from the non-CRAN package svgparser is used for discretizing at equal time steps. The new choices "eqtimed" and "eqspaced" discretize into fewer points (and allow for more customization underneath). The former creates discretization points at equal time steps, the latter at equal distance steps (to a good approximation). |
save |
logical or character. If FALSE return the (list of) kanjivec object(s). Otherwise save the result as an rds file in the working directory (as kvecsave.rds) or under the file path provided. |
overwrite |
logical. If FALSE return an error (before any computations are done) if the designated file path already exists. Otherwise an existing file is overwritten. |
simplify |
logical. Shall a single kanjivec object be returned (instead of a list of one) if |
A kanjivec object contains detailed information on the strokes of which an individual kanji is composed, including their order, a segmentation into reasonable components ("radicals" in a more general sense of the word), classification of individual strokes, and both vector data and interpolated points to recreate the actual stroke in a Kyoukasho style font. For more information on the original data see http://kanjivg.tagaini.net/. That data is licenced under Creative Commons BY-SA 3.0 (see licence file of this package).
The original .svg files sometimes contain additional <g> elements that provide information about the current group of strokes rather than establishing a new subgroup of its own. This happens typically for information that establishes coherence with another part of the tree (by noting that the current subgroup is also part 2 of something else), but also for variant information. With the option flatten = TRUE the extra hierarchy level in the tree is avoided, while the original information in the KanjiVG file is kept. This is achieved by fusing only-children to their parents, giving the new node the name of the child and all its attributes, but prefixing p. to the attribute names of the parent (the parents' "names" attribute is discarded, but can be reconstructed from the parents' id). Removal of several hierarchies in sequence can lead to attribute names with multiple p. in front. Fusing to parents is suppressed if the parent is the root of the hierarchy (typically for one-stroke kanji), as this could lead to confusing results.
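The attribute renaming that results from fusing can be sketched with plain named lists standing in for the attribute sets of a parent and its only child. The helper fuse_only_child and all attribute values are hypothetical, and the sketch ignores the special handling of the parent's "names" attribute.

```r
# Hypothetical illustration: the fused node keeps the child's attributes and
# receives the parent's attributes with the prefix "p.".
fuse_only_child <- function(parent_attrs, child_attrs) {
  names(parent_attrs) <- paste0("p.", names(parent_attrs))
  c(child_attrs, parent_attrs)
}

parent <- list(id = "g3", part = "2")          # made-up attribute values
child  <- list(id = "g4", element = "X")
fused  <- fuse_only_child(parent, child)
names(fused)

# Fusing again with a further only-child produces multiple "p." prefixes.
grandchild <- list(id = "g5")
fused2 <- fuse_only_child(fused, grandchild)
names(fused2)
```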
The options flatten = "inner" and flatten = "leaves" implement the above behavior only for the corresponding type of node (inner nodes or leaves). The option flatten = "intelligent" tries to find out in more sophisticated ways which flattening is desirable and which is not (it will flatten rather conservatively). Currently nodes without an element attribute that have only one child are flattened away (one example where this is reasonable is in kanji kbase[187, ]), as are nodes with an element attribute and only one child if this child is also an inner node and has the same element and part attribute as the parent, but both have no number (this would be problematic for any component-building code in the particular case of kanji kbase[1111, ]).
A kanjivec object has components
char
the kanji (a single character)
hex
its Unicode codepoint (integer of class hexmode)
padhex
the Unicode codepoint padded with zeros to five digits (mode character)
family
the font on which the data is based. Currently only "schoolbook" (to be extended with "kaisho" at some point)
nstrokes
the number of strokes in the kanji
ncompos
a vector of the number of components at each depth of the tree
nveins
the number of veins in the component structure
strokedend
the decomposition tree of the kanji as an object of class dendrogram
components
the component structure by segmentation depth (components can overlap) in terms of KanjiVG elements and their depth-first tree coordinates
veins
the veins in the component structure. Each vein is represented as a two-column matrix that lists in its rows the indices of components (starting at the root, which in the component indexing is c(1,1))
stroketree
the decomposition tree of the kanji, a list containing the full information of the KanjiVG file (except some top level attributes)
stroketree is a close representation of the KanjiVG svg file as a list object with some serious nesting of sublists. The XML attributes become attributes of the list and its elements. The user will usually not have to look at or manipulate stroketree directly, but strokedend and components are derived from it and other functions may process it further.
The main differences to the svg file are
the actual strokes are not only given as d-attributes describing Bézier curves, but also as two-column matrices describing discretizations of these curves. These matrices are the actual contents of the innermost lists in stroketree, but are more conveniently accessed via the function get_strokes. Starting with version 0.13.0, there is also an additional attribute "beziermat", which describes the Bézier curves for the stroke in a 2 x (1+3n) matrix format. The first column is the start point, then each triplet of columns stands for control point 1, control point 2 and end point (=start point of the next Bézier curve if any).
The positions of the stroke numbers (for plotting) are saved as an attribute strokenum_coords to the entire stroke tree rather than a separate element.
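To make the 2 x (1+3n) layout of the "beziermat" attribute concrete, the following sketch evaluates the n cubic Bézier curves of such a matrix at equal time steps. It is an illustration under the layout described above, with a made-up matrix; eval_bezier is a hypothetical helper and not the discretization code used by the package.

```r
# Hypothetical helper: evaluate each cubic Bezier segment of a 2 x (1+3n)
# matrix at the time points s and stack the resulting points row-wise.
eval_bezier <- function(bm, s = seq(0, 1, length.out = 5)) {
  n <- (ncol(bm) - 1L) / 3L
  # Bernstein weights of start point, control points 1 and 2, and end point
  w <- rbind((1 - s)^3, 3 * (1 - s)^2 * s, 3 * (1 - s) * s^2, s^3)
  pts <- lapply(seq_len(n), function(k) {
    # columns of segment k: its start point is the end point of segment k-1
    p <- bm[, (3L * k - 2L):(3L * k + 1L), drop = FALSE]
    t(p %*% w)
  })
  do.call(rbind, pts)
}

# Made-up example with n = 2 segments: a horizontal, then a vertical line.
bm <- cbind(c(0, 0), c(1, 0), c(2, 0), c(3, 0), c(3, 1), c(3, 2), c(3, 3))
pts <- eval_bezier(bm, s = c(0, 0.5, 1))
pts
```

The result is a two-column matrix of points, the same shape as the discretized stroke paths; the shared point between consecutive segments appears twice.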
strokedend is easier to examine and work with due to various convenience functions for dendrograms in the packages stats and dendextend, including str and plot.dendrogram. The function plot.kanjivec with option type = "dend" is a wrapper for plot.dendrogram with reasonable presets for various options.
The label-attributes of the nodes of strokedend are taken from the element (for inner nodes) and type (for leaves) attributes of the .svg files. They consist of UTF-8 characters representing kanji parts and a combination of UTF-8 characters for representing strokes and may not display well in all CJK fonts (see details of plot.kanjivec). If element and type are missing in the .svg file, the label assigned is the second part of the id-attribute, e.g. g5 or s9.
The components at a given level can be plotted, see plot.kanjivec with type = "kanji". Both components and veins serve mainly for the computation of kanji distances.
A list of objects of class kanjivec or, if only one kanji was specified and simplify is TRUE, a single object of class kanjivec. If save = TRUE, the same is (saved and) still returned invisibly.
if (interactive()) {
  # Try to load the svg file for the kanji from GitHub.
  res <- kanjivec("藤", database=NULL)
  str(res)
}
fivebetas # sample kanjivec data
str(fivebetas[[1]])
This gives the dissimilarity of pixel-images of the kanji based on how far mass (or "ink") has to be transported to transform one image into the other.
kmatdist( k1, k2, p = 1, C = 0.2, type = c("unbalanced", "balanced"), output = c("dist", "all") )
k1 , k2
|
two objects of type |
p |
the order of the Wasserstein distance. All distances and a potential penalty are taken
to the |
C |
the penalty for extra mass if |
type |
the type of Wasserstein metric. |
output |
the requested output. See return value below. |
If output = "dist", a single non-negative number: the unbalanced or balanced Wasserstein distance between the kanji. If output = "all", a list with detailed information on the transport plan and the disposal of pixel mass. See unbalanced for details.
res <- kmatdist(fivetrees1[[1]], fivetrees1[[5]], p=1, C=0.1, output="all")
plot(res, what="plan", angle=20, lwd=1.5)
plot(res, what="trans")
plot(res, what="extra")
plot(res, what="inplace")
Apply kmatdist to every pair of kanjimat objects to compute the unbalanced or balanced Wasserstein distance.
kmatdistmat( klist, klist2 = NULL, p = 1, C = 0.2, type = c("unbalanced", "balanced") )
klist |
a list of |
klist2 |
an optional second list of |
p , C , type
|
the same as for the function |
A matrix of dimension length(klist) x length(klist2) having as its (i,j)-th entry the distance between klist[[i]] and klist2[[j]]. If klist2 is not provided it is assumed to be equal to klist, but the computation is more efficient as only the upper triangular part is computed and then symmetrized with diagonal zero.
kmatdistmat(fivetrees1)
kmatdistmat(fivetrees1, fivetrees1) # same result but slower
kmatdistmat(fivetrees1, fivetrees2) # note the smaller values on the diagonal
Data set of all kanji readings and meanings from the KANJIDIC2 dataset in an R list format. For convenient access to this data use function lookup.
kreadmean
An object of class list
of length 13108.
KANJIDIC2 file by Jim Breen and The Electronic Dictionary Research
and Development Group (EDRDG)
https://www.edrdg.org/wiki/index.php/KANJIDIC_Project
The use of this data is covered by the Creative Commons BY-SA 4.0 License.
Return readings and meanings or information from kbase or kmorph.
lookup(kanji, what = c("readmean", "basic", "morphologic"))
kanji |
a (vector of) character strings containing kanji. |
what |
the sort of information to display. |
This is a very basic interface for a quick lookup of information based on exact knowledge of the kanji (provided by a Japanese input method or its UTF-8 code). Most of the information is based on the KANJIDIC2 file by EDRDG (see thank you page). Please use one of the many excellent online kanji dictionaries (see e.g.) for more sophisticated lookup methods and more detailed results.
If what is "readmean", the information is output with cat and there is no return value (invisible NULL). In the other cases the appropriate subsets of the tables kbase and kmorph are returned.
Dominic Schuhmacher [email protected]
lookup(c("晴", "曇", "雨"))
lookup("晴曇雨") # same
Set or examine global kanjistat options.
kanjistat_options(...)
get_kanjistat_option(x)
... |
any number of options specified as |
x |
name of an option given as character string. |
kanjistat_options returns the list of all set options if there is no function argument. Otherwise it returns the list of all old options. get_kanjistat_option returns the current value set for option x or NULL if the option is not set.
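The return values described above can be mimicked with a small closure-based option store. This is only a sketch of the documented behavior: make_option_store and the placeholder value "some_font_family" are hypothetical, and the actual kanjistat implementation may store its options differently.

```r
# Hypothetical option store mimicking the documented return values;
# this is not the kanjistat implementation.
make_option_store <- function() {
  opts <- list()
  set <- function(...) {
    new <- list(...)
    if (length(new) == 0L) return(opts)   # no argument: list of all set options
    old <- opts
    opts[names(new)] <<- new              # set or overwrite the named options
    invisible(old)                        # otherwise: the old options
  }
  get <- function(x) opts[[x]]            # NULL if option x is not set
  list(set = set, get = get)
}

store <- make_option_store()
before <- store$get("default_font")                   # option not set yet
old <- store$set(default_font = "some_font_family")   # placeholder value
current <- store$get("default_font")
all_set <- store$set()
```

Keeping the old options as the return value makes it easy to restore a previous state after a temporary change.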
Plot kanjimat object
## S3 method for class 'kanjimat'
plot(
  x,
  mode = c("dark", "light"),
  col = gray(seq(0, 1, length.out = 256)),
  ...
)
x |
object of class kanjimat. |
mode |
character string. If "dark" the original grayscale values are used, if "light" they are inverted. With the default grayscale color scheme the kanji is plotted white-on-black for "dark" and black-on-white for "light". |
col |
a vector of colors. Typically 256 values are enough to keep the full information of an (antialiased) kanjimat object. |
... |
further parameters passed to |
No return value, called for side effects.
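The effect of mode on the default grayscale scheme can be illustrated without any kanji data. The sketch below assumes grayscale values in [0, 1] with high values marking the character; the matrix m and the helper to_colors are hypothetical and not part of kanjistat.

```r
# Toy 3 x 3 grayscale matrix with values in [0, 1] (hypothetical data,
# not a kanjimat object); high values are assumed to mark the character.
m <- matrix(c(0, 1, 0,
              1, 1, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)
col <- gray(seq(0, 1, length.out = 256))   # default color vector from the usage above

# Map grayscale values to colors; "light" inverts the values first.
to_colors <- function(m, mode = c("dark", "light"), col) {
  mode <- match.arg(mode)
  v <- if (mode == "dark") m else 1 - m
  idx <- 1L + round(v * (length(col) - 1L))
  matrix(col[idx], nrow = nrow(m))
}

dark  <- to_colors(m, "dark", col)    # character in white on black
light <- to_colors(m, "light", col)   # character in black on white
```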
Plot kanjivec objects
## S3 method for class 'kanjivec'
plot(
  x,
  type = c("kanji", "dend"),
  seg_depth = 0,
  palette = "Dark 3",
  pal.extra = 0,
  numbers = FALSE,
  offset = c(0.025, 0),
  family = NULL,
  lwd = 8,
  ...
)
x |
an object of class kanjivec. |
type |
either "kanji" or "dend". Whether to plot the actual kanji, coloring strokes
according to levels of segmentation, or to plot a representation of the tree structure
underlying this segmentation. Among the following named parameters, only |
seg_depth |
an integer. How many steps down the segmentation hierarchy we use
different colors for different groups. If zero (the default), only one color is used
that can be specified with |
palette |
a valid name of a hcl palette (one of |
pal.extra |
an integer. How many extra colors are picked in the specified palette. If this is 0 (the default), palette is used with as many colors as we have components. Since many hcl palettes run from dark to light colors, the last (few) components may be too light. Increasing pal.extra then makes the component colors somewhat more similar, but the last component darker. |
numbers |
logical. Shall the stroke numbers be displayed? |
offset |
the (x,y)-offset for the numbers relative to the positions from kanjivg saved in the kanjivec object. Either a vector of length 2 specifying some fixed offset for all numbers or a matrix of dimension kanjivec$nstrokes times 2. |
family |
the font-family for labeling the nodes if |
lwd |
the usual line width graphics parameter. |
... |
further parameters passed to |
Setting up nice labels for the nodes if type = "dend" is not easy. For many font families it appears that some "kanji components" cannot be displayed in plots, even with the help of package showtext and even if the font contains glyphs for the corresponding codepoints that display correctly in text documents. This concerns, in increasing order of severity, the Unicode blocks 2F00–2FDF (Kangxi Radicals), 2E80–2EFF (CJK Radicals Supplement) and 31C0–31EF (CJK Strokes). For the strokes it seems nearly impossible, which is why leaves are simply annotated with the stroke numbers. For the others it is up to the user to find a suitable font and pass it via the argument family. The default family = NULL first tries to use default_font if this option has been set (via kanjistat_options) and otherwise uses wqy-microhei, the Chinese default font that comes with package showtext and cannot display any radicals from the supplement. On a Mac the experience is that "hiragino_sans" works well. In addition there is the issue of font size, which is currently not judiciously set and may be too large for some (especially on-screen) devices. Passing the parameter cex (via ...) fixes this.
No return value, called for side effects.
kanji <- fivebetas[[2]]
plot(kanji, type = "kanji", seg_depth = 2)
plot(kanji, type = "dend") # gives a warning if get_kanjistat_option("default_font") is NULL
Write kanji to a graphics device.
plotkanji( kanji, device = "default", family = NULL, factor = 10, width = NULL, height = NULL, ... )
kanji |
a vector of class character specifying one or several kanji to be plotted. |
device |
the type of graphics device where the kanji is plotted. Defaults to the
user's default type according to |
family |
the font family or families used for writing the kanji. Make sure to add the font(s)
first by using |
factor |
a magnification factor applied to the font size (typically 12 points). |
width , height
|
the dimensions of the device. |
... |
further parameters passed to the function opening the device (such as a file name for devices that create a file). |
This function writes one or several kanji to a graphics device in an arbitrary font that has been registered, i.e., added to the database in package sysfonts. Use font_add to register a font and font_families to verify which fonts are available. For further information see Working with Japanese fonts in vignette("kanjistat", package = "kanjistat").
plotkanji uses the package showtext to write the kanji in a large font at the center of a new device of the specified type. Specify device = "current" to write the kanji to the current device. It is now recommended to simply use graphics::text in combination with showtext::showtext_auto instead.
No return value, called for side effects.
If no font family is provided, the default Chinese font WenQuanYi Micro Hei that comes with the package showtext is used. This means that the characters will typically be recognizable, but will quite often look odd as Japanese characters. We strongly advise using a Japanese font as detailed above.
plotkanji("滝")
plotkanji("犬猫魚")
Precomputed kanji distances
pooled_similarity
A tibble containing kanji similarity judgments by 3 "native or native-like" speakers of Japanese. For each row, the pivot kanji was compared to a list of potential distractors. From the distractors, the subjects selected one character which they found particularly easy to confuse with the pivot. For the exact methodology, see the original study referenced below.
Datasets from https://lars.yencken.org/datasets, made available under the Creative Commons Attribution 3.0 Unported licence.
Collected as part of Yencken, Lars (2010) Orthographic support for passing the reading hurdle in Japanese. PhD Thesis, University of Melbourne, Melbourne, Australia.
Yencken, Lars, & Baldwin, Timothy (2008). Measuring and predicting orthographic associations: Modelling the similarity of Japanese kanji. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 1041-1048.
# Get kanji characters that were found to be easily confused with 大.
pooled_similarity[pooled_similarity$selected == "大", ]$pivot
Print basic information about a kanjivec object
## S3 method for class 'kanjivec'
print(x, dend = FALSE, ...)
x |
an object of class kanjivec. |
dend |
whether to print the structure of the |
... |
further parameters passed to |
No return value, called for side effects.
Perform basic validity checks and transform data to a standardized list or keep as an object of class xml_document (package xml2).
read_kanjidic2(fpath = NULL, output = c("list", "xml"))
fpath |
the path to a local KANJIDIC2 file. If |
output |
one of |
KANJIDIC2 contains detailed information on all of the 13108 kanji in three main Japanese standards (JIS X 0208, 0212 and 0213). The KANJIDIC files have been compiled and maintained by Jim Breen since 1991, with the help of various other people. The copyright is now held by the Electronic Dictionary Research and Development Group (EDRDG). The files are made available under the Creative Commons BY-SA 4.0 license. See https://www.edrdg.org/wiki/index.php/KANJIDIC_Project for details on the contents of the files and their license.
If output = "xml", some minimal checks are performed (high level structure and total number of kanji). If output = "list", additional validity checks of the lower level structure are performed. Most are in accordance with the file's Document Type Definition (DTD). Some additional checks concern common patterns that hold for the current KANJIDIC2 file (as of December 2023) and seem unlikely to change in the near future. This includes that there is always at most one rmgroup entry in reading_meaning. Informative warnings are provided if any of these additional checks fail.
If output = "xml", the exact XML document obtained from xml2::read_xml. If output = "list", a list of lists (the individual kanji), each with the following seven components.
literal: a single UTF-8 character representing the kanji.
codepoint: a named character vector giving the available codepoints in the Unicode and JIS standards.
radical: a named numeric vector giving the radical number(s), in the range 1 to 214. The number named classical is as recorded in the KangXi Zidian (1716); if there is a number named nelson_c, the kanji was reclassified in Nelson's Modern Reader's Japanese-English Character Dictionary (1962/74).
misc: a list with six components:
  grade: the kanji grade level. 1 through 6 indicates a kyouiku kanji and the grade in which the kanji is taught in Japanese primary school. 8 indicates one of the remaining jouyou kanji learned in junior high school, and 9 or 10 are jinmeiyou kanji. The remaining (hyougai) kanji have NA as their entry.
  stroke_count: the stroke count of the kanji, including the radical. If more than one, the first is considered the accepted count, while subsequent ones are common miscounts.
  variant: a named character vector giving either a cross-reference code to another kanji, usually regarded as a variant, or an alternative indexing code for the current kanji. The type of variant is given in the name.
  freq: the frequency rank (1 = most frequent) based on newspaper data. NA if not among the 2500 most frequent.
  rad_name: a character vector. For a kanji that is a radical itself, the name(s) of the radical (if there are any), otherwise of length 0.
  jlpt: the Japanese Language Proficiency Test level according to the old four-level system that was in place before 2010. A value from 4 (most elementary) to 1 (most advanced).
dic_number: a named character vector (possibly of length 0) giving the index numbers (for some kanji with letters attached) of the kanji in various dictionaries, textbooks and flashcard collections (specified by the name). For Morohashi's Dai Kan-Wa Jiten, the volume and page number is also provided in the format moro.VOL.PAGE.
query_code: a named character vector giving the codes of the kanji in various query systems (specified by the name). For Halpern's SKIP code, possible misclassifications (if any) of the kanji are also noted in the format mis.skip.TYPE, where TYPE indicates the type of misclassification.
reading_meaning: a (possibly empty) list containing zero or more rmgroup components creating groups of readings and meanings (in practice there is never more than one rmgroup currently) as well as a component nanori giving a character vector (possibly of length 0) of readings only associated with names. Each rmgroup is a list with entries:
  reading: a (possibly empty) list of entries named from among pinyin, korean_r, korean_h, vietnam, ja_on and ja_kun, each containing a character vector of the corresponding readings.
  meaning: a (possibly empty) list of entries named with two-letter (ISO 639-1) language codes, each containing a character vector of the corresponding meanings.
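To show how the nested list can be navigated, the following sketch builds a small mock entry by hand and translates the grade coding described above into category names. The object entry is a hand-built illustration containing only a subset of the components (it is not read from the KANJIDIC2 file), and grade_category is a hypothetical helper that is not part of kanjistat.

```r
# Hand-built mock of one entry (illustrative subset of the components above;
# not read from the KANJIDIC2 file).
entry <- list(
  literal = "日",
  misc = list(grade = 1, stroke_count = 4),
  reading_meaning = list(
    rmgroup = list(
      reading = list(ja_on = c("ニチ", "ジツ"), ja_kun = "ひ"),
      meaning = list(en = c("day", "sun"))
    ),
    nanori = character(0)
  )
)

# Hypothetical helper: translate the grade coding into category names.
grade_category <- function(grade) {
  out <- rep("hyougai", length(grade))           # NA and anything else
  out[grade %in% 1:6] <- "kyouiku"
  out[grade %in% 8] <- "jouyou (junior high school)"
  out[grade %in% 9:10] <- "jinmeiyou"
  out
}

cat_entry <- grade_category(entry$misc$grade)
on_readings <- entry$reading_meaning$rmgroup$reading$ja_on
cats <- grade_category(c(1, 6, 8, 9, 10, NA))
```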
if (interactive()) { read_kanjidic2("kanjidic2.xml") }
Sample kanji from a set
samplekan( set = c("kyouiku", "jouyou", "jinmeiyou", "kanjidic"), size = 1, replace = FALSE, prob = NULL )
set |
a character string specifying the set of kanjis to sample from. |
size |
a positive number, the number of samples. |
replace |
logical. Sample with replacement? |
prob |
currently without effect. |
A vector of length size containing the individual characters.
(sam <- samplekan(size = 10))
lookup(sam)
Variants of the stroke edit distance proposed by Yencken (2010). Each kanji is encoded as a sequence of stroke types according to its stroke order, using the type attribute from the kanjiVG data. Then the edit distance (a.k.a. Levenshtein distance) between the two sequences is computed and divided by the larger of the two stroke counts.
sedist(k1, k2, type = c("full", "before_slash", "first"))
k1 , k2
|
atomic vectors or lists of kanji in any format that can be treated by |
type |
the type of stroke edit distance to compute. See details. |
The kanjiVG type attribute is a single string composed of a CJK strokes Unicode character, an optional Latin letter providing further information and possibly a variant (another CJK strokes character with optional letter) separated by "/". If type is "full", a match is only counted if two strings are exactly the same; "before_slash" ignores any slash and what comes after it; "first" only considers the first character of each string (so the first CJK stroke character) when counting matches.
The stroke edit distance used by Yencken (2010) is obtained by setting type = "full" (the default), except that the underlying kanjiVG data has significantly changed since then. Comparing with the values in dstrokedit we get an agreement of 96.3 percent, whereas the remaining distances differ by a small amount (usually 1-2 edit operations).
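The three matching rules and the normalization can be sketched in base R as follows. The helpers simplify_stroke, lev and stroke_edit_dist are hypothetical names, the two stroke sequences are made up for illustration, and the code is not the implementation used by sedist; it only follows the description above.

```r
# Reduce kanjiVG-style stroke type strings according to `type`.
simplify_stroke <- function(s, type = c("full", "before_slash", "first")) {
  type <- match.arg(type)
  switch(type,
    full = s,                            # compare complete strings
    before_slash = sub("/.*$", "", s),   # drop the slash and what follows
    first = substr(s, 1L, 1L)            # keep the first character only
  )
}

# Plain Levenshtein distance between two vectors of tokens.
lev <- function(a, b) {
  n <- length(a)
  m <- length(b)
  D <- matrix(0, n + 1L, m + 1L)
  D[, 1L] <- 0:n
  D[1L, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(a[i] != b[j])
      D[i + 1L, j + 1L] <- min(D[i, j + 1L] + 1,   # deletion
                               D[i + 1L, j] + 1,   # insertion
                               D[i, j] + cost)     # match or substitution
    }
  }
  D[n + 1L, m + 1L]
}

# Edit distance divided by the larger number of strokes.
stroke_edit_dist <- function(s1, s2, type = "full") {
  a <- simplify_stroke(s1, type)
  b <- simplify_stroke(s2, type)
  lev(a, b) / max(length(a), length(b))
}

# Made-up stroke sequences; \u{31d0}, \u{31d1}, \u{31d4}, \u{31cf} are CJK Strokes.
s1 <- c("\u{31d0}", "\u{31d1}a", "\u{31d4}/\u{31cf}")
s2 <- c("\u{31d0}", "\u{31d1}", "\u{31d4}")
d_full  <- stroke_edit_dist(s1, s2, "full")
d_slash <- stroke_edit_dist(s1, s2, "before_slash")
d_first <- stroke_edit_dist(s1, s2, "first")
```

For these sequences the distance shrinks as the matching rule becomes more lenient, since more stroke types count as equal.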
A length(k1) x length(k2) matrix of stroke edit distances.
Requires kanjistat.data package.
Yencken, Lars (2010). Orthographic support for passing the reading hurdle in Japanese.
PhD Thesis, University of Melbourne, Australia
ind1 <- 384
k1 <- convert_kanji(ind1, "character")
ind2 <- which(dstrokedit[ind1, ] > 0) # dstrokedit contains only the "closest" kanji
k2 <- convert_kanji(ind2, "character")
row_a <- dstrokedit[ind1, ind2]
if (requireNamespace("kanjistat.data", quietly = TRUE)) {
  row_b <- sedist(k1, k2)
  mat <- rbind(row_a, row_b)
  rownames(mat) <- c(k1, k1)
  colnames(mat) <- k2
  mat
}
Compactly display the structure of a kanjivec object
## S3 method for class 'kanjivec'
str(object, ...)
object |
an object of class kanjivec. |
... |
further parameters passed to |
No return value, called for side effects.