Improve documentation for string encodings

To get the id for a string that contains non-ASCII characters, the function `RcppCWB::cl_str2id()` will not yield a result if the encoding of the incoming string differs from the encoding of the corpus. This may happen frequently if you work in a UTF-8 locale (the default on macOS and Linux) and your corpus is latin1-encoded (a common encoding of CWB corpora). See the following examples.

So this fails ...

```r
cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = "über", registry = registry())
```
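The mismatch is visible at the byte level. The following is a minimal base R sketch, independent of RcppCWB, under the assumption that the lookup compares the raw bytes of the input against the corpus lexicon. The variable names are only illustrative.

```r
# "über", written with a Unicode escape so the example does not depend on
# the encoding of the source file; enc2utf8() makes sure it is UTF-8
token_utf8 <- enc2utf8("\u00fcber")
token_latin1 <- iconv(token_utf8, from = "UTF-8", to = "latin1")

# UTF-8 stores "ü" as two bytes (c3 bc), latin1 as a single byte (fc)
bytes_utf8 <- charToRaw(token_utf8)
bytes_latin1 <- charToRaw(token_latin1)

length(bytes_utf8)   # 5
length(bytes_latin1) # 4
identical(bytes_utf8, bytes_latin1) # FALSE
```

Both strings print as "über", but they are different byte sequences, so a UTF-8 input cannot match an entry stored in latin1.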

To make it work, you need to convert the input string to the encoding of the corpus with `iconv()`.

```r
cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = iconv("über", from = "UTF-8", to = "latin1"), registry = registry())
```

It might be worthwhile to let the R wrapper for the C function `cl_str2id()` check whether the encoding of the incoming string is identical with the encoding of the corpus. However, such a check may entail a performance loss that is to be avoided.
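As a rough sketch of what such a conversion step could look like on the R side: the helper `as_corpus_encoding()` below is hypothetical and not part of RcppCWB, and it assumes that the caller already knows the corpus encoding and passes it in as `corpus_encoding`. It uses base R only.

```r
# Hypothetical helper (not part of RcppCWB): re-encode a character vector
# to the encoding of the corpus before it is handed to cl_str2id().
# Assumption: the caller knows the corpus encoding, e.g. "latin1".
as_corpus_encoding <- function(x, corpus_encoding = "latin1") {
  # Normalise to UTF-8 first, so strings marked as latin1 or given in the
  # native encoding are all handled the same way
  x_utf8 <- enc2utf8(x)
  if (toupper(corpus_encoding) %in% c("UTF-8", "UTF8")) {
    return(x_utf8)
  }
  # iconv() is vectorised; strings that cannot be represented become NA
  iconv(x_utf8, from = "UTF-8", to = corpus_encoding)
}

token <- as_corpus_encoding("\u00fcber", corpus_encoding = "latin1")
charToRaw(token) # fc 62 65 72

# With a latin1-encoded corpus installed, the lookup would then read:
# cl_str2id(corpus = "MIGPARL", p_attribute = "word", str = token, registry = registry())
```

The cost of this step is one pass over the input vector, so whether the overhead matters depends on how many strings are looked up per call.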

It is much more important that the documentation stresses that the encoding of the string needs to conform to the encoding of the corpus. In its present state, the documentation falls short of making this requirement clear.

Improve documentation for string encodings #16
