Add support for encodings other than UTF-8 #36497

Closed
EuclidDivisionLemma wants to merge 43 commits into zed-industries:main from EuclidDivisionLemma:add_support_for_non_utf_encodings


@EuclidDivisionLemma

@EuclidDivisionLemma EuclidDivisionLemma commented Aug 19, 2025

Add the ability to open and save files in different encodings. Closes #16965

@cla-bot cla-bot bot added the cla-signed The user has signed the Contributor License Agreement label Aug 19, 2025
@maxdeviant maxdeviant changed the title Add support for non utf encodings Add support for non-UTF encodings Aug 20, 2025
@EuclidDivisionLemma EuclidDivisionLemma force-pushed the add_support_for_non_utf_encodings branch 2 times, most recently from 02890f2 to 4f7a563 Compare August 23, 2025 14:37
@EuclidDivisionLemma EuclidDivisionLemma changed the title Add support for non-UTF encodings Add support for encodings other than UTF-8 Aug 23, 2025
@EuclidDivisionLemma EuclidDivisionLemma force-pushed the add_support_for_non_utf_encodings branch 10 times, most recently from 3b20229 to ce0128c Compare August 27, 2025 02:46
@EuclidDivisionLemma EuclidDivisionLemma marked this pull request as ready for review August 27, 2025 03:08
@EuclidDivisionLemma EuclidDivisionLemma force-pushed the add_support_for_non_utf_encodings branch 4 times, most recently from f53e006 to 4f0bfa6 Compare August 29, 2025 17:18
@zed-industries-bot

zed-industries-bot commented Aug 29, 2025

Warnings
⚠️

This PR is missing release notes.

Please add a "Release Notes" section that describes the change:

Release Notes:

- Added/Fixed/Improved ...

If your change is not user-facing, you can use "N/A" for the entry:

Release Notes:

- N/A

Generated by 🚫 dangerJS against 4330e5f

@EuclidDivisionLemma EuclidDivisionLemma force-pushed the add_support_for_non_utf_encodings branch 3 times, most recently from 0d9a756 to dbf899a Compare August 30, 2025 02:24
@CrazyboyQCD
Contributor

I wonder why this uses encoding instead of encoding_rs, since the former has not been developed for a long time.

@EuclidDivisionLemma
Author

EuclidDivisionLemma commented Aug 30, 2025

@CrazyboyQCD

Well, I considered it, but eventually decided against it, as the docs explicitly state:

Both in terms of scope and performance, the focus is on the Web.

@CrazyboyQCD
Contributor

CrazyboyQCD commented Aug 30, 2025

@EuclidDivisionLemma

The main issue is that it is unmaintained, buggy, and legacy, so I think a more modern crate would be better if you don't want to fork and maintain it.
https://rustsec.org/advisories/RUSTSEC-2021-0153.html

@EuclidDivisionLemma
Author

EuclidDivisionLemma commented Aug 30, 2025

@CrazyboyQCD

Yes, you're right. I'll definitely look into it. Also, what do you think about using ICU? It is surely more mature and has multi-threaded support. There is a C version of the library, ICU4C, and the Rust version is part of the ICU4X project.

https://docs.rs/icu/2.0.0/icu/
https://icu4x.unicode.org/

@CrazyboyQCD
Contributor

ICU4X is for i18n and is not suitable for this.

@EuclidDivisionLemma EuclidDivisionLemma marked this pull request as draft August 30, 2025 09:20
 encoding instead of replacing the invalid bytes with replacement
 characters

 - Add `encoding` field in `Workspace`
- Pass encoding to `ProjectRegistry::open_path` and set the `encoding`
field in `Project`
- Remove the parameter from `BufferStore::open_buffer` as it is not
needed
now open the file in the chosen encoding if it is valid or show the
invalid screen again if not.

(UTF-16 files aren't being handled correctly as of now)
bytes replaced with replacement characters

- Fix UTF-16 file handling

- Introduce a `ForceOpen` action to allow users to open files despite
encoding errors

- Add `force` and `detect_utf16` flags

- Update UI to provide "Accept the Risk and Open" button for invalid
encoding files
associated file was in a different encoding, rather than showing an
error.
choosing the correct encoding from `InvalidBufferView`
UI. The `encodings_ui` crate will only have UI related components in the
future.
`encodings`

- `EncodingWrapper` is replaced with `encodings::Encoding`
re-opened, while retaining the text.

- Fix an issue that prevented `InvalidBufferView` from being shown when
an incorrect encoding was chosen from the status bar.

- Centre the error message in `InvalidBufferView`.
- Implement `From` for `Encoding` and `Clone` for `EncodingOptions`
- Add a licence symlink to `encodings`
@EuclidDivisionLemma EuclidDivisionLemma force-pushed the add_support_for_non_utf_encodings branch from 4a75b86 to 4330e5f Compare November 1, 2025 09:23
@ConradIrwin
Member

@EuclidDivisionLemma

I finally got some time to sit down and make some significant changes. They are here: 7e22d05 because for some reason I am not able to push to your fork directly.

The biggest change is to remove the mutexes and have methods return the detected encoding along with the string; but I also tried to make the BOM handling safer (so Zed will not silently remove the BOM).
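
To illustrate that shape (this is not the code in 7e22d05), here is a minimal std-only sketch in which decoding returns the text together with the detected encoding and a BOM flag, so nothing has to be stashed in shared mutable state. Every name in it (`DetectedEncoding`, `Decoded`, `decode`) is hypothetical, and it only knows about UTF-8 and UTF-16; a real implementation would presumably delegate to an encoding crate.

```rust
// Illustrative sketch only: all names are hypothetical, not Zed's API.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];
const UTF16_LE_BOM: &[u8] = &[0xFF, 0xFE];
const UTF16_BE_BOM: &[u8] = &[0xFE, 0xFF];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DetectedEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// The decoded text plus what is needed to write the file back unchanged.
#[derive(Debug, PartialEq, Eq)]
struct Decoded {
    text: String,
    encoding: DetectedEncoding,
    had_bom: bool,
}

/// Decodes UTF-16 code units; returns None on odd length or unpaired surrogates.
fn decode_utf16(bytes: &[u8], little_endian: bool) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units = bytes.chunks_exact(2).map(|pair| {
        let pair = [pair[0], pair[1]];
        if little_endian {
            u16::from_le_bytes(pair)
        } else {
            u16::from_be_bytes(pair)
        }
    });
    char::decode_utf16(units).collect::<Result<String, _>>().ok()
}

/// Sniffs a BOM, decodes, and returns the result as a value.
/// Without a BOM this sketch simply assumes UTF-8.
fn decode(bytes: &[u8]) -> Option<Decoded> {
    let (encoding, had_bom, body) = if let Some(rest) = bytes.strip_prefix(UTF8_BOM) {
        (DetectedEncoding::Utf8, true, rest)
    } else if let Some(rest) = bytes.strip_prefix(UTF16_LE_BOM) {
        (DetectedEncoding::Utf16Le, true, rest)
    } else if let Some(rest) = bytes.strip_prefix(UTF16_BE_BOM) {
        (DetectedEncoding::Utf16Be, true, rest)
    } else {
        (DetectedEncoding::Utf8, false, bytes)
    };
    let text = match encoding {
        DetectedEncoding::Utf8 => std::str::from_utf8(body).ok()?.to_owned(),
        DetectedEncoding::Utf16Le => decode_utf16(body, true)?,
        DetectedEncoding::Utf16Be => decode_utf16(body, false)?,
    };
    Some(Decoded { text, encoding, had_bom })
}
```

Because the caller receives `encoding` and `had_bom` alongside `text`, it can store them on the buffer and hand them back when saving.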

Next steps:

  • Fix the rendering of the selectors
  • Add actions to reopen/save with encoding to the editor that open the modal directly so you can get there from the command palette
  • Implement a setting to show/hide the character set indicator
  • Rebuild the "Take the risk" button using a fake Latin1 encoding (encoding_rs supports this in its mem module, but doesn't expose it as an encoding for some reason). When you open the character set selector we can show Latin1 (Binary) as one of the options, and if that is selected, open the file.
  • (maybe) Stop using a picker for save_or_open and make it use a menu on the status item instead (not sure about this).
  • Add some tests! I'm particularly worried about cases where you open a file in Zed and save it and we add or remove a BOM.
  • Consider what to do when you've edited a file to contain characters that cannot be represented in the encoding. Currently this writes HTML-escaped characters, but it might also just error. Not sure what we want.
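
The two options in the last bullet can be sketched side by side. The following is a std-only illustration that assumes a strict ISO-8859-1 target (one byte per code point up to U+00FF) and decimal numeric character references as the "HTML-escaped" form; `OnUnmappable`, `UnmappableChar`, and `encode_latin1` are made-up names, not part of the PR.

```rust
// Illustrative sketch only: all names are hypothetical.

/// What to do with a character the target encoding cannot represent.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OnUnmappable {
    /// Write a decimal HTML numeric character reference such as `&#8364;`.
    NumericReference,
    /// Refuse to encode and report the offending character.
    Error,
}

#[derive(Debug, PartialEq, Eq)]
struct UnmappableChar {
    ch: char,
    /// Byte offset of the character in the UTF-8 source text.
    byte_offset: usize,
}

/// Encodes `text` as strict ISO-8859-1 (assumed target for this sketch).
fn encode_latin1(text: &str, policy: OnUnmappable) -> Result<Vec<u8>, UnmappableChar> {
    let mut out = Vec::with_capacity(text.len());
    for (byte_offset, ch) in text.char_indices() {
        let code = u32::from(ch);
        if code <= 0xFF {
            // Representable: the code point is the byte value.
            out.push(code as u8);
        } else {
            match policy {
                OnUnmappable::NumericReference => {
                    out.extend_from_slice(format!("&#{};", code).as_bytes());
                }
                OnUnmappable::Error => return Err(UnmappableChar { ch, byte_offset }),
            }
        }
    }
    Ok(out)
}
```

With the `Error` policy the caller gets the character and its offset, which would be enough to point the user at the problem instead of silently writing escaped text into the file.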

Thanks again for all your work on this. If you get time to build some of this, that would be great; otherwise I'll try and pick it up when I get some time.

@ConradIrwin
Member

Hey! I'm going to close this PR for now, as I realistically don't have the time to get this merged right now.

If you'd like to build on this again, I'd love something more like 7e22d05 where we can avoid the shared mutable state; but there are still a lot of details to iron out.

Thank you again for your contributions here, and I hope to work with you again the next time!

@EuclidDivisionLemma
Author

EuclidDivisionLemma commented Nov 21, 2025

I'm sorry that I couldn't respond to the last comment immediately, as I was caught up with something else. I really wish to make further contributions. I understand that significant portions of the code have been changed, but I still wish to be a part of it, as it matters to me. I can try, if you could tell me exactly what you want me to work on.

@ConradIrwin
Member

Amazing, thank you!

I want to take an approach more like this commit: 7e22d05 (with no Mutexes that allow state to change implicitly) and flesh out the rest of the functionality to make sure it's working.

The other question in the back of my mind is about UTF-8 files that start with a zero-width space. Should we interpret that as a BOM and hide it from the editor (as my commit did), or should we assume that very few people actually use UTF-8 BOMs, and just pass this to the editor as a file that starts with a zero-width space?
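
The two answers to that question can be compared in a small std-only sketch. It assumes the leading bytes are EF BB BF (U+FEFF in UTF-8); `BomPolicy`, `LoadedUtf8`, `load_utf8`, and `save_utf8` are hypothetical names used only for illustration.

```rust
// Illustrative sketch only: all names are hypothetical.
const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BomPolicy {
    /// Treat a leading U+FEFF as a BOM: hide it from the editor, restore it on save.
    HideAndRestore,
    /// Treat a leading U+FEFF as ordinary text that the editor sees.
    KeepAsText,
}

#[derive(Debug, PartialEq, Eq)]
struct LoadedUtf8 {
    text: String,
    /// True when a BOM was stripped on load and must be written back.
    restore_bom: bool,
}

fn load_utf8(bytes: &[u8], policy: BomPolicy) -> Option<LoadedUtf8> {
    let (body, restore_bom) = match (policy, bytes.strip_prefix(UTF8_BOM)) {
        (BomPolicy::HideAndRestore, Some(rest)) => (rest, true),
        _ => (bytes, false),
    };
    let text = std::str::from_utf8(body).ok()?.to_owned();
    Some(LoadedUtf8 { text, restore_bom })
}

fn save_utf8(loaded: &LoadedUtf8) -> Vec<u8> {
    let mut out = Vec::with_capacity(loaded.text.len() + UTF8_BOM.len());
    if loaded.restore_bom {
        out.extend_from_slice(UTF8_BOM);
    }
    out.extend_from_slice(loaded.text.as_bytes());
    out
}
```

Under either policy an unedited file round-trips byte for byte; the difference is only whether the editor ever sees the U+FEFF character, and therefore whether a user can delete it by accident.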

The final change that I want to make is to use a fallback "binary" encoding instead of the existing "open anyway" option. I am not sure that encodings.rs provides one, but I think it would be reasonable to map bytes in the range 0x80-0xFF to the corresponding Unicode characters (U+0080-U+00FF), and the same in reverse.
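
A minimal std-only sketch of such a mapping, assuming every byte 0x00-0xFF maps to the code point with the same value (so ASCII is unchanged and 0x80-0xFF becomes U+0080-U+00FF). The function names are hypothetical and this is not the implementation mentioned later in the thread.

```rust
// Illustrative sketch only: function names are hypothetical.

/// Decodes arbitrary bytes losslessly: byte N becomes code point U+00NN.
fn decode_binary(bytes: &[u8]) -> String {
    bytes.iter().map(|&byte| char::from(byte)).collect()
}

/// Inverse mapping. Returns None if the text contains a character above
/// U+00FF, which has no single-byte representation in this scheme.
fn encode_binary(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|ch| {
            let code = u32::from(ch);
            if code <= 0xFF {
                Some(code as u8)
            } else {
                None
            }
        })
        .collect()
}
```

Decoding can never fail, so any file opens, and saving an unedited buffer reproduces the original bytes exactly. The only failure case is a user typing a character outside U+0000-U+00FF, which the caller would have to reject or handle separately.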

@EuclidDivisionLemma
Author

@ConradIrwin
I have implemented the fallback encoding. Please look into it when you have time.

Development

Successfully merging this pull request may close these issues.

Support for non UTF-8 text encodings

5 participants