Skip to content

Conversation

@dlmiles
Copy link
Collaborator

@dlmiles dlmiles commented Feb 18, 2025

Should the cd_fd field be managed by a new flock_close(int fd) function, where the application can hand the open FD back to flock.c to manage when the real close should occur ? Which in turn, will receive the other active FD when it is also closed (or maybe there is also a flock_fclose(FILE *fp) to manage stdio use cases). The point is flock.c needs visibility of open and closes, then it can manage the locking requirements with multiple users in same process.
Then when all uses of a file are close (active_use_count_of_this_fd==0) then flock.c will manage closing all FDs (causing file-locking release) ?

Maybe if the sequence of events can be written in a issue comment, highlight the potential data-race-data-corruption concern (the need for a file lock in the first place) in the order of events. I can better understand how the existing code meets requirements.

In this one function there are multiple fp = fopen(expandname) then fclose(fp) that occur, without being locking-aware. Then at the end of the function, there is this lock-aware section, my patch touches. The call to fclose(fp) will destroy any existing locks that may have been obtained before, on that file. The file looks to be appended/write/rewind/truncate etc... and then closed(), if there were to happen more than once the 2nd use of this function the same file, would invalidate the 1st use's locks that were laid down. (due to the close(fp) syscall getting called from fclose use above)

In short you can't mix locking-aware and locking-unaware file open/access to the same file. Either all access to a particular file should be locked/locking-aware, or all accesses to a particular file never obey any locking rules (even when another process is successfully placing locks because it is locking-aware, in this scenario we did not participate in their locking scheme, as it is a co-operative scheme).

FWIW there are other locking types and schemes that exist, linux at least supports, that maybe more suited to the design goals trying to be met.


DBio.c: CodeQL File{MayNot,Never}BeClosed.ql file-handle resource leaks

Guided by CodeQL static code analyser.

FileMayNotBeClosed.ql
FileMayNeverBeClosed.ql

Technically the FILE_LOCKS feature leaks the file handle, but maybe this isn't in a perfectly controlled way (with assurance that at some correct point in the program future, all the fd's are eventually closed)

Guided by CodeQL static code analyser.

FileMayNotBeClosed.ql
FileMayNeverBeClosed.ql

Technically the FILE_LOCKS feature leaks the file handle, but maybe
this isn't in a perfectly controlled way (with assurance that at some
correct point in the program future, all the fd's are eventually closed)
@RTimothyEdwards
Copy link
Owner

Per my comment on the previous PR:

"FWIW there are other locking types and schemes that exist, linux at least supports, that maybe more suited to the design goals trying to be met."

---Yes, could not agree more.

@dlmiles
Copy link
Collaborator Author

dlmiles commented Feb 20, 2025

The flock use-cases needs to be understood in respect of the comments #374 (comment) and the flock.c implementation.

Merge Status: merge on hold, pending rework expected (review this status in 4 weeks)
Quality: not ready or complete
Risk: high-impact
Level of Testing: not started (needs targeted testing)

@RTimothyEdwards
Copy link
Owner

I have to recall conversations from around 1992, but I think the problem with having a lock file was that lock files could get left behind if the program crashed, but a file lock handled by the system would be released if the program crashed. The file locking mechanism implemented was the only one (that I am aware of) that was available at the time. Please enumerate the options.

@dlmiles
Copy link
Collaborator Author

dlmiles commented Feb 25, 2025

Yes that maybe considered an issue, but the policy at a slightly later time (~1995) was to create the lockfile and write into it the fprintf(lockfile, "%5d\n", getpid()) value of the owner (space padded somewhat, but usually the creation uses O_EXCL which does most of the work, so I never really understood why it need to be padded, presumably to overwrite a shorter looking PID under overwrite non-O_CREATE circumstances).

At the time another process that found the lockfile existed, would read it, then could send a kill(0, pid_from_lockfile) to see if this returned an error code (ESRCH) or was successful (0). This is an indicator the process is still alive. There is the concern of eventual PID number recycling, in those days maybe it was a 16bit range for PID.

This was the basic design that allowed stale lockfile to be spotted and removed, automatically in the 99.99% of cases, due to the owning PID not being alive.

Then to enhance this, use of fcntl(F_SETLK) like now, but on the lockfile itself. This does not suffer from the potential PID get recycled concern (for the reasons you state) and a long running process is currently using that PID by the time the stale lockfile test was done.

Today the PID number space has expanded somewhat, so less likely to see recycled PID.
Today there is security capability and containerization that may limit the visibility of the PIDs a process can see, including the ability to use this method to see if a process is alive (hidden due to security information leakage).
But the fcntl() should still work.

The benefit of only locking the the *.lockfile with fcntl() is that the process is free to open/close and use multiple file handles on the main file, knowing that the whole time it is still exclusively locked (by other systems that participate in the same locking scheme)

Back in 1992 I'm sure lockfiles on NFS systems might have also been an issue but these days even that is not a problem.

Visible lockfiles I find are very useful for a human to know, correct/workaround make a decision about.

How to handle things when one is found, is a UI matter with appropriate dialogs (popup modals with options to user), or diagnostic command suggestion in the error output such as when in batch mode on how to correct it:

ERROR: Unable to acquire lock file for /path/to/problem.mag
#####: To override this you must run manually remove the file:
#####   rm -f /absolute/path/to/problem.mag.lockfile

In modern times some system will create a symlink like: ln -s username@hostname.PID .#target_file.mag this make the lockfile's existence less visible (dot prefix hidden file), the target it points to does not exist, it is used to attach the string content into the readlink() part of the symlink, it identifies the username, the system-name (helpful for NFS shares) and the PID of that system that currently is the owner of the lock.

This way I guess it is possible to check when you kill(0, pid_from_filename) you first check you are running this on the correct hostname. But also provides additional information to the user on shared filesystems.

@dlmiles dlmiles marked this pull request as draft June 27, 2025 14:56
@dlmiles
Copy link
Collaborator Author

dlmiles commented Jun 27, 2025

Moved to draft to better indicate current status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants