Generate more unique names for new book folders (BL-15608)#7683
Generate more unique names for new book folders (BL-15608)#7683JohnThomson wants to merge 4 commits intomasterfrom
Conversation
| var copyNum = 1; | ||
| // see if this already has a name like "foo - Copy-abc12345" or "foo - Copy 3" | ||
| // (The second is obsolete, but might still be encountered.) | ||
| // If it does, remove that suffix so we get back to the base name |
There was a problem hiding this comment.
📝 Info: Greedy regex in Duplicate() could match unintended patterns in book names containing ' - Copy'
The regex ^(.+)(\s-\sCopy)(-[0-9a-f]+)?$ at line 3453 uses a greedy (.+) for the first group. If a book title legitimately contains - Copy (e.g., How to - Copy Documents), the regex will match and strip the suffix. However, this is the same behavior as the old regex ^(.+)(\s-\sCopy)(\s[0-9]+)?$, and since (.+) is greedy, it will match the last occurrence of - Copy in the string. For a name like How to - Copy Documents - Copy-abc12345, group 1 would capture How to - Copy Documents which is correct. For How to - Copy Documents (no copy suffix), the regex matches with group 1 = How to which would incorrectly strip part of the title. But this pre-existing edge case was also present in the old code, so it's not a regression.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
If this is right, it seems like a pre-existing bug we should fix now.
There was a problem hiding this comment.
It would not match the example given, because a match requires that " - Copy" is followed by either a space and decimal number or a hyphen and hex number. If the user duplicates a book whose actual title is "My book - Copy 3", Bloom will give the duplicate an initial name like "My book - Copy-1a2b3c4d". And then probably the user changes the title to "My book - Copy 4" if that's really what he wants, and the folder name will change to match.
Pathologically the book title could be something like "My book - Copy-bad" which would match the first alternative in the regex. Even then, it's not clear to me that somehow producing "My book - Copy-bad - Copy-1a2b3c4d" is better than "My book - Copy-1a2b3c4d" for a first-pass name that probably won't last.
Unless we compare the existing name with the book title (and that's only even relevant if the user isn't doing a custom folder name), I don't see how we can NOT match a title that happens to look like one of our de-duplicated names. I believe Devin is wrong that the greedy initial regex makes things worse; neither of the regex's in the final group will match "Copy" so either way the regex can only match the last occurrence of " - Copy" in the title, and then only if what follows matches a fairly restrictive pattern.
I don't think it's worth worrying about.
JohnThomson
left a comment
There was a problem hiding this comment.
@JohnThomson made 3 comments.
Reviewable status: 0 of 4 files reviewed, 3 unresolved discussions (waiting on @andrew-polk).
| var copyNum = 1; | ||
| // see if this already has a name like "foo - Copy-abc12345" or "foo - Copy 3" | ||
| // (The second is obsolete, but might still be encountered.) | ||
| // If it does, remove that suffix so we get back to the base name |
There was a problem hiding this comment.
It would not match the example given, because a match requires that " - Copy" is followed by either a space and decimal number or a hyphen and hex number. If the user duplicates a book whose actual title is "My book - Copy 3", Bloom will give the duplicate an initial name like "My book - Copy-1a2b3c4d". And then probably the user changes the title to "My book - Copy 4" if that's really what he wants, and the folder name will change to match.
Pathologically the book title could be something like "My book - Copy-bad" which would match the first alternative in the regex. Even then, it's not clear to me that somehow producing "My book - Copy-bad - Copy-1a2b3c4d" is better than "My book - Copy-1a2b3c4d" for a first-pass name that probably won't last.
Unless we compare the existing name with the book title (and that's only even relevant if the user isn't doing a custom folder name), I don't see how we can NOT match a title that happens to look like one of our de-duplicated names. I believe Devin is wrong that the greedy initial regex makes things worse; neither of the regex's in the final group will match "Copy" so either way the regex can only match the last occurrence of " - Copy" in the title, and then only if what follows matches a fairly restrictive pattern.
I don't think it's worth worrying about.
| private string GetInitialName(string parentCollectionPath, string instanceId) | ||
| { | ||
| var name = BookStorage.SanitizeNameForFileSystem(UntitledBookName); | ||
| return BookStorage.GetUniqueFolderName(parentCollectionPath, name); | ||
| return BookStorage.GetUniqueBookFolderName(parentCollectionPath, name, instanceId); |
There was a problem hiding this comment.
🚩 Behavior change: New books ALWAYS get a GUID suffix in folder name
Previously, GetInitialName called GetUniqueFolderName which would return the plain name (e.g., "Book") if no conflict existed, only appending 2, 3, etc. on conflict. Now GetUniqueBookFolderName at src/BloomExe/Book/BookStorage.cs:3533 ALWAYS appends separator + instanceSuffix (e.g., "Book-abc12345"), even when there's no naming conflict. This is intentional per the PR description (BL-15608), but it's a significant UX change: every new book folder will have a seemingly random suffix. Reviewers should confirm this is the desired behavior for all new-book scenarios.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
This is exactly what we're trying to accomplish: different users of a TC each attempting to create a different new book won't, even temporarily, have books with the same name.
| // Truncate and clean the base name if necessary | ||
| if (baseName.Length > maxBaseNameLength) | ||
| { | ||
| baseName = MiscUtils.TruncateSafely(baseName, maxBaseNameLength); | ||
| baseName = Regex.Replace(baseName, "[\\s.]+$", "", RegexOptions.Compiled); | ||
| } |
There was a problem hiding this comment.
🚩 Name truncation after suffix reservation could produce empty or whitespace-only base names
In GetUniqueBookFolderName, when baseName is truncated at line 3554 and then trailing whitespace/periods are stripped at line 3555, the result could theoretically become empty if the truncation point lands in the middle of trailing whitespace. This would produce folder names like "-abc12345" or " - Copy-abc12345". While unlikely in practice (names are sanitized before reaching this point), there's no guard against an empty baseName after cleanup. The SanitizeNameForFileSystem method at src/BloomExe/Book/BookStorage.cs:3248 does protect against empty names, but that runs before this truncation step, not after.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
SanitizeNameForFileSystem is very careful not to return a name that will Trim() to an empty string.
ac9c4c7 to
f2c9f7c
Compare
JohnThomson
left a comment
There was a problem hiding this comment.
@JohnThomson made 2 comments.
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @andrew-polk).
| private string GetInitialName(string parentCollectionPath, string instanceId) | ||
| { | ||
| var name = BookStorage.SanitizeNameForFileSystem(UntitledBookName); | ||
| return BookStorage.GetUniqueFolderName(parentCollectionPath, name); | ||
| return BookStorage.GetUniqueBookFolderName(parentCollectionPath, name, instanceId); |
There was a problem hiding this comment.
This is exactly what we're trying to accomplish: different users of a TC each attempting to create a different new book won't, even temporarily, have books with the same name.
f2c9f7c to
4346384
Compare
andrew-polk
left a comment
There was a problem hiding this comment.
@andrew-polk reviewed all commit messages and resolved 1 discussion.
Reviewable status: 0 of 4 files reviewed, 3 unresolved discussions (waiting on @JohnThomson).
JohnThomson
left a comment
There was a problem hiding this comment.
@JohnThomson made 4 comments.
Reviewable status: 0 of 4 files reviewed, 5 unresolved discussions (waiting on @andrew-polk).
src/BloomTests/Book/BookStorageTests.cs line 210 at r4 (raw file):
[Test] public void MoveBookToSafeName_NameNotNFC_NewNameConflicts_Moves()
This test (in addition to the one above) is not useful any more; name conflicts are vanishingly unlikely and hard to simulate.
src/BloomTests/Book/BookStorageTests.cs line 1136 at r4 (raw file):
[Test] public void SetBookName_FolderWithSanitizedNameAlreadyExists_AddsANumberToName()
This seems to be doing the same thing as the previous test, at least now that a conflict is so unlikely.
| // Truncate and clean the base name if necessary | ||
| if (baseName.Length > maxBaseNameLength) | ||
| { | ||
| baseName = MiscUtils.TruncateSafely(baseName, maxBaseNameLength); | ||
| baseName = Regex.Replace(baseName, "[\\s.]+$", "", RegexOptions.Compiled); | ||
| } |
There was a problem hiding this comment.
SanitizeNameForFileSystem is very careful not to return a name that will Trim() to an empty string.
| var newPathForExtraBook = BookStorage.GetAvailableDirectoryPath( | ||
| desiredPath, | ||
| null, | ||
| " - " | ||
| ); |
There was a problem hiding this comment.
🟡 MoveBookToAvailableName always appends GUID suffix even when the desired path is available
MoveBookToAvailableName now calls GetAvailableDirectoryPath, which delegates to GetUniqueBookFolderName. This function always appends separator + 8 hex chars — it never tries the base name alone first. This means that even when the desired path does not conflict with any existing folder, the book gets an unnecessary GUID suffix in its name.
Detailed Explanation and Impact
The old code called GetUniqueFolderPath(desiredPath) → GetUniqueFolderName(parent, name), which started with an empty suffix and only appended 2, 3, etc. if there was a conflict. So if the desired path was available, the book kept the clean name.
The new code at BookStorage.cs:3856-3860 calls:
var newPathForExtraBook = BookStorage.GetAvailableDirectoryPath(
desiredPath,
null,
" - "
);which calls GetUniqueBookFolderName at BookStorage.cs:3569-3580, which always constructs baseName + separator + instanceSuffix without first checking if baseName alone is available.
This concretely affects MoveBookToSafeName (BookStorage.cs:3829). Example: a book with a non-NFC name like "Some nicé book" gets sanitized to NFC "Some nicé book". If that NFC folder doesn't exist, the old code would move the book to "Some nicé book". The new code moves it to "Some nicé book - 12345678", unnecessarily polluting the name.
Similarly, MoveBookToAvailableName is called from TeamCollection.cs:204 to resolve conflicts with existing books. If the desired path is available, the book name is needlessly modified.
The modified test at BookStorageTests.cs:203 acknowledges this with the comment: "Typically we added hyphen and a guid. Not sure if that's good."
Impact: Book folders that are moved for sanitization (e.g., NFC normalization) will always receive an unnecessary GUID suffix, making book names uglier and harder to identify. This also removes the deleted test case MoveBookToSafeName_NameNotNFC_NewNameConflicts_Moves which specifically validated the conflict scenario separately from the no-conflict scenario.
Prompt for agents
In MoveBookToAvailableName (around line 3843-3877 in BookStorage.cs), before calling GetAvailableDirectoryPath, first check if the desiredPath itself is available. If it doesn't already exist, use it directly (just call MoveDirectory to it and rename the HTM file). Only call GetAvailableDirectoryPath when the desiredPath already exists and we need to find an alternative. This preserves the old behavior where sanitized names are used directly when there's no conflict. The same logic applies within GetAvailableDirectoryPath/GetUniqueBookFolderName: consider adding an option or a separate code path for callers (like MoveBookToAvailableName) that want to try the base name first before adding a GUID suffix.
Was this helpful? React with 👍 or 👎 to provide feedback.
|
|
||
| var expectedFolder = bookFolder.Replace("e\x0301", "\x00e9"); | ||
|
|
||
| Assert.That(BookStorage.MoveBookToSafeName(bookFolder), Is.EqualTo(expectedFolder)); | ||
| Assert.That(Directory.Exists(expectedFolder)); | ||
| var expectedBookPath = bookPath.Replace("e\x0301", "\x00e9"); | ||
| Assert.That(File.ReadAllText(expectedBookPath), Is.EqualTo("this is a test")); | ||
| } | ||
|
|
||
| [Test] | ||
| public void MoveBookToSafeName_NameNotNFC_NewNameConflicts_Moves() | ||
| { | ||
| var oldName = "Some nice\x0301 book"; | ||
| var bookFolder = Path.Combine(_fixtureFolder.Path, oldName); | ||
| Directory.CreateDirectory(bookFolder); | ||
| var bookPath = Path.Combine(bookFolder, oldName + ".htm"); | ||
| File.WriteAllText(bookPath, "this is a test"); | ||
|
|
||
| var desiredFolder = bookFolder.Replace("e\x0301", "\x00e9"); | ||
| Directory.CreateDirectory(desiredFolder); | ||
| var expectedFolder = desiredFolder + " 2"; | ||
|
|
||
| Assert.That(BookStorage.MoveBookToSafeName(bookFolder), Is.EqualTo(expectedFolder)); | ||
| Assert.That(Directory.Exists(expectedFolder)); | ||
| var expectedBookPath = Path.Combine( | ||
| expectedFolder, | ||
| Path.GetFileName(expectedFolder) + ".htm" | ||
| ); | ||
| // Typically we added hyphen and a guid. Not sure if that's good. | ||
| var result = BookStorage.MoveBookToSafeName(bookFolder); | ||
| Assert.That(result, Does.StartWith(expectedFolder)); | ||
| Assert.That(Directory.Exists(result)); | ||
| var expectedBookPath = Path.Combine(result, Path.GetFileName(result) + ".htm"); | ||
| Assert.That(File.ReadAllText(expectedBookPath), Is.EqualTo("this is a test")); | ||
| } |
There was a problem hiding this comment.
🚩 Deleted test MoveBookToSafeName_NameNotNFC_NewNameConflicts_Moves reduces coverage
The test MoveBookToSafeName_NameNotNFC_NewNameConflicts_Moves (old lines 209-228) was deleted entirely rather than updated. This test specifically validated the case where the sanitized name conflicts with an existing folder. With the new always-add-GUID behavior, conflict is extremely unlikely, but the test still provided value in verifying that the conflict-resolution loop in GetUniqueBookFolderName works. Consider adding a unit test for GetUniqueBookFolderName directly that validates collision handling (where the first proposed name already exists as a directory).
(Refers to lines 192-209)
Was this helpful? React with 👍 or 👎 to provide feedback.
This change is