Improve documentation for string encodings #16

@ghost

Description

To get the id for a string that contains non-ASCII characters, the function RcppCWB::cl_str2id() will not yield a result if the encoding of the incoming string differs from the encoding of the corpus. This can happen frequently if you work under a UTF-8 locale (the default on macOS and Linux) and your corpus is latin1-encoded (a common encoding for CWB corpora). See the following examples.

So this fails ...

cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = "über", registry = registry())

To make it work, you need to iconv the input string to the encoding of the corpus.

cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = iconv("über", from = "UTF-8", to = "latin1"), registry = registry())
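The conversion step can be wrapped in a small helper so that it is not repeated at every call site. This is only a sketch: `to_corpus_encoding()` is a hypothetical name, not part of RcppCWB, and the corpus encoding is passed in by hand (assumed to be "latin1" here) rather than looked up from the registry.

```r
# Hypothetical helper (not part of RcppCWB): recode a character vector
# to the encoding of the corpus before it is passed to cl_str2id().
# Assumption: the caller knows the corpus encoding, e.g. "latin1".
to_corpus_encoding <- function(str, corpus_encoding = "latin1") {
  # Bring everything to UTF-8 first, whatever the declared encoding is
  utf8 <- enc2utf8(str)
  # Then recode to the corpus encoding; ASCII-only strings are unaffected
  iconv(utf8, from = "UTF-8", to = corpus_encoding)
}

# "\u00fc" is the letter u with diaeresis, written as an escape so that
# the example does not depend on the encoding of the source file
query <- to_corpus_encoding(paste0("\u00fc", "ber"))
Encoding(query)
charToRaw(query)
```

The recoded string can then be used as the `str` argument in the call above.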

It might be worthwhile to let the R wrapper for the C function cl_str2id() check whether the encoding of the incoming string is identical to the encoding of the corpus. However, that check may entail a performance loss, which is to be avoided.
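As a rough illustration of what such a check could look like, the sketch below compares the encoding R has declared for each string with the corpus encoding. Both function names are hypothetical and nothing here is taken from RcppCWB. It relies on `Encoding()`, which only reads a flag on each string, so the overhead should be small, though that has not been measured here.

```r
# Hypothetical helpers (not part of RcppCWB).

# Make names such as "UTF-8", "utf8" and "latin1" comparable
normalize_enc <- function(enc) tolower(gsub("[^[:alnum:]]", "", enc))

# TRUE for every string whose declared encoding is compatible with the corpus
encoding_matches <- function(str, corpus_encoding) {
  declared <- Encoding(str)
  # R reports "unknown" for ASCII-only strings (and for strings without a
  # declared encoding); this sketch lets them through unchecked
  undeclared <- declared == "unknown"
  undeclared | normalize_enc(declared) == normalize_enc(corpus_encoding)
}

utf8_str <- paste0("\u00fc", "ber")   # declared as UTF-8
latin1_str <- iconv(utf8_str, from = "UTF-8", to = "latin1")

encoding_matches(c(utf8_str, latin1_str, "abc"), corpus_encoding = "latin1")
```

A wrapper could issue a warning whenever this returns FALSE for any element, instead of silently returning no result.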

It is much more important that the documentation stresses that the encoding of the string needs to match the encoding of the corpus. In its present state, the documentation does not make this requirement clear.