paulish commented Sep 18, 2024

This commit reads languageDriverId from the header and uses it to choose the appropriate codepage for character conversion. I added a LanguageDriverIdToCodepage record, taken from the MS format explanation page.
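
For illustration, the lookup described above might look roughly like the following TypeScript sketch. It assumes the ID sits at byte offset 29 of the file header, as in the Visual FoxPro table layout. The table here is a small illustrative subset that should be checked against the MS page, and `codepageFromHeader` is a hypothetical helper, not the PR's actual code.

```typescript
// Illustrative subset of the language driver ID -> Windows/DOS code page table.
// Assumption: values follow the Visual FoxPro "code page mark" list; verify
// against the MS documentation before relying on them.
const LanguageDriverIdToCodepage: Record<number, number> = {
  0x01: 437,  // U.S. MS-DOS
  0x02: 850,  // International MS-DOS
  0x03: 1252, // Windows ANSI
  0x65: 866,  // Russian MS-DOS
  0xc8: 1250, // Eastern European Windows
  0xc9: 1251, // Russian Windows
};

// Assumption: the language driver ID is the single byte at offset 29.
const LANGUAGE_DRIVER_ID_OFFSET = 29;

// Hypothetical helper: returns the code page number for a header buffer,
// or undefined when the header is too short or the ID is not in the table.
function codepageFromHeader(header: Uint8Array): number | undefined {
  const id = header[LANGUAGE_DRIVER_ID_OFFSET];
  if (id === undefined) return undefined;
  return LanguageDriverIdToCodepage[id];
}
```

An unknown or zero ID yields `undefined`, so the caller can fall back to whatever encoding it would otherwise have used.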

yortus (Owner) commented Sep 21, 2024

Hi @paulish, thanks for the PR! This is interesting because there is already support for reading and writing DBF files with different character encodings, including the code pages at the link you referenced. It's done through the encoding option, which is more general and quite flexible because (a) dBase files don't have that header byte even though they do use various encodings, and (b) some files are not conformant, e.g. they use different code pages for different fields, or the field names use a different code page from the field values.

Having said that, one thing you have in this PR that isn't currently implemented in this library is reading/writing the FoxPro "code page mark" / "code page id" (what you have called Language Driver ID). That would be a good way to try to get the encoding from the file itself without having to specify the encoding option separately when opening the file.

I'd be interested in keeping the code to read/write the "code page mark" from files, but with the following changes from the PR as it is currently:

  • Move the let/const changes to a separate commit (or remove them) as those changes are unrelated and make it difficult to see the actual changes being proposed in this PR.
  • Only read/write the code page mark for FoxPro versions that support it (dBase doesn't, at least in principle).
  • If the code page mark is present in the file and no options.encoding was given, use the code page mark to determine the encoding to use for the file.
  • When writing a DBF file, determine the code page mark from the encoding rather than having a separate options.languageDriverId option. The two do the same thing, so we don't need both, and the encoding option is more general, so that's the one I'd keep.
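
A rough TypeScript sketch of the precedence these points describe, offered as one possible shape rather than the library's implementation. All names (`resolveReadEncoding`, `codePageMarkForEncoding`, `MARK_TO_ENCODING`, `FOXPRO_VERSIONS`) are hypothetical, the version bytes and encoding names are assumptions, and the mark table is only a small subset.

```typescript
// Assumption: a small subset of code page marks, keyed to iconv-style names.
const MARK_TO_ENCODING: Record<number, string> = {
  0x01: "cp437",
  0x02: "cp850",
  0x03: "cp1252",
  0x65: "cp866",
};

// Assumption: header version bytes treated as FoxPro variants carrying a mark.
const FOXPRO_VERSIONS: number[] = [0x30, 0x31, 0x32, 0xf5];

function supportsCodePageMark(version: number): boolean {
  return FOXPRO_VERSIONS.indexOf(version) !== -1;
}

// Reading: an explicit encoding option wins; otherwise use the mark when the
// file version supports it and the mark is known; otherwise use the fallback.
function resolveReadEncoding(
  version: number,
  mark: number,
  optionEncoding?: string,
  fallback: string = "ISO-8859-1", // assumed default, not confirmed here
): string {
  if (optionEncoding !== undefined) return optionEncoding;
  if (supportsCodePageMark(version)) {
    const fromMark = MARK_TO_ENCODING[mark];
    if (fromMark !== undefined) return fromMark;
  }
  return fallback;
}

// Writing: derive the mark from the encoding instead of a separate option.
// Returns 0x00 (assumed to mean "no code page") when nothing matches or the
// version does not support a mark.
function codePageMarkForEncoding(version: number, encoding: string): number {
  if (!supportsCodePageMark(version)) return 0x00;
  const wanted = encoding.toLowerCase();
  for (const key of Object.keys(MARK_TO_ENCODING)) {
    const mark = Number(key);
    if (MARK_TO_ENCODING[mark] === wanted) return mark;
  }
  return 0x00;
}
```

Keeping a single table and searching it in reverse for writes means the read and write paths cannot drift apart, at the cost of a linear scan over a table of a few dozen entries at most.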
