From 335796d8875bfae037c495433d9db8eb17b2680b Mon Sep 17 00:00:00 2001
From: Jaggzh
Date: Tue, 24 Dec 2024 16:04:48 -0800
Subject: [PATCH] Remove zero-width spaces (U+200B) instead of corrupting them

We currently end up converting a zero-width space (ZWSP, U+200B, UTF-8
bytes \xe2\x80\x8b) into an invalid sequence (FD BF BF BD A3 AC).
We (ChatGPT and I) added code that simply removes these characters.
---
 src/content.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/content.c b/src/content.c
index 2c8ea9f..27b3dcf 100644
--- a/src/content.c
+++ b/src/content.c
@@ -104,6 +104,11 @@ void strcpy_normalize(xmlChar *dest, const xmlChar *src)
 			*dest = ' ';
 			src += isspace(*src) ? 1 : 2;
 		}
+		/* Skip zero-width space (U+200B) */
+		if (memcmp(src, "\xe2\x80\x8b", 3) == 0) {
+			--dest; // Don't copy it
+			src += 2; // Skip the extra bytes
+		}
 		if (*dest == ' ')
 			--src;
 	}
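
For reviewers, here is a minimal standalone sketch of the same idea (drop U+200B while collapsing whitespace), written without stepping `dest` backwards. It is not the code in `src/content.c`: the names `uchar`, `is_zwsp`, `space_len`, and `normalize_drop_zwsp` are hypothetical, `uchar` stands in for libxml2's `xmlChar`, and treating the 2-byte sequence C2 A0 (U+00A0) as a space is an assumption based on the `src += ... 2` step visible in the hunk. It uses `strncmp` rather than `memcmp` so the comparison stops at the terminating NUL when fewer than three bytes remain.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for libxml2's xmlChar. */
typedef unsigned char uchar;

/* True if s starts with the UTF-8 encoding of U+200B (E2 80 8B).
 * strncmp stops at the terminating NUL, so short tails are safe. */
static int is_zwsp(const uchar *s)
{
	return strncmp((const char *)s, "\xe2\x80\x8b", 3) == 0;
}

/* Length in bytes of the space-like sequence at s, or 0 if there is none.
 * Assumption: ASCII whitespace and U+00A0 (C2 A0) both count as spaces. */
static size_t space_len(const uchar *s)
{
	if (isspace(*s))
		return 1;
	if (s[0] == 0xc2 && s[1] == 0xa0)
		return 2;
	return 0;
}

/* Copy src to dest, collapsing each run of space-like characters into a
 * single ' ' and dropping zero-width spaces entirely.
 * dest must have room for at least strlen(src) + 1 bytes. */
static void normalize_drop_zwsp(uchar *dest, const uchar *src)
{
	size_t n;

	while (*src) {
		if (is_zwsp(src)) {
			src += 3;		/* drop it, write nothing */
			continue;
		}
		if (space_len(src) == 0) {
			*dest++ = *src++;	/* ordinary byte: copy as-is */
			continue;
		}
		/* Consume the whole run, including any ZWSP inside it. */
		while ((n = space_len(src)) != 0 || is_zwsp(src))
			src += n ? n : 3;
		*dest++ = ' ';
	}
	*dest = '\0';
}
```

Because the output pointer only ever moves forward here, a zero-width space at the very start of the input, or directly after a collapsed run of whitespace, cannot disturb bytes that were already written.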