fix: storage_flush check thread exists before joining #554
base: next
Errors are observed during storage write when the client calls storage_flush. On the server, the writer's flush tries to join its thread. Because stop has already been called, the thread for the instance has been reset to None, so the client sees an error about calling join on a NoneType.
Tested-by: Gourav Singh <gourav.singh.ext@siemens.com>
Signed-off-by: Shivaschandra KL <shivaschandra.k-l@siemens.com>
Error observed at the client during the storage write:
gourav@gourav:~/images/lx1/example$ mtda-cli -r 134.86.254.94 storage write isar-image-installer-lx1-x86-uefi.wic
Discovered bmap file 'isar-image-installer-lx1-x86-uefi.wic.bmap'
isar-image-installer-lx1-x86-uefi.wic: [####################] 100% (650 MiB read, 5.70 GiB written, 36 KiB/s)
'storage write' failed! ('NoneType' object has no attribute 'join')
-        self._thread.join()
+        if self._thread is not None:
+            self.mtda.debug(2, "storage.writer.flush(): waiting on thread...")
+            self._thread.join()
This just shortens the window of the race condition, right?
Yes, this should shorten and also ensure
Why ensure? We still have a TOCTOU issue here if I understand the underlying problem correctly. How can it be, that the self._thread object does not exist anymore?
The issue we observed was not consistently reproducible. When we ran the storage write as below, we often got the error, but not always.
gourav@gourav:~/images/lx1/example$ mtda-cli -r 134.86.254.94 storage write isar-image-installer-lx1-x86-uefi.wic
Discovered bmap file 'isar-image-installer-lx1-x86-uefi.wic.bmap'
isar-image-installer-lx1-x86-uefi.wic: [####################] 100% (650 MiB read, 5.70 GiB written, 36 KiB/s)
'storage write' failed! ('NoneType' object has no attribute 'join')
The issue was intermittent, and no corruption of the image was observed.
I looked at the stop method and saw that it uses similar logic: it checks that the thread is not None and then joins.
That is why I used the term "ensure". I said "shortens" because the check also reduces the time window if the join methods are called simultaneously.
I am not sure this is a TOCTOU issue, because the while loop makes sure the writing is complete, so the thread is no longer in use. At the moment I suspect that stop is being called somewhere, because only stop can set the thread to None, but I am not sure why or how.
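For reference, one common way to close the check-then-join window raised above is to snapshot the attribute into a local variable before testing it. The sketch below is only an illustration: the `Writer` class, its lock and the `start` helper are hypothetical stand-ins, not the MTDA storage writer, and it assumes `stop` is the only place that resets `_thread`.

```python
import threading
import time


class Writer:
    """Hypothetical stand-in for the storage writer (not the MTDA class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._thread = None

    def start(self):
        # Start the worker first, then publish it, so that nobody can
        # join a thread that has not been started yet.
        thread = threading.Thread(target=time.sleep, args=(0.05,))
        thread.start()
        with self._lock:
            self._thread = thread

    def stop(self):
        # Swap the attribute out under the lock, join outside of it.
        with self._lock:
            thread, self._thread = self._thread, None
        if thread is not None:
            thread.join()

    def flush(self):
        # Take a local snapshot: a concurrent stop() may reset
        # self._thread, but it cannot change what 'thread' refers to
        # between the check and the join.
        with self._lock:
            thread = self._thread
        if thread is not None:
            thread.join()
        return True
```

With this shape, `flush` and `stop` may both join the same thread object, which `threading.Thread.join` permits.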
> I suspect that stop is being called somewhere, because only stop can set the thread to None, but I am not sure why or how.
You could print a traceback whenever stop is called to debug this. Anyway, I approved the MR as it solves the issue for us. Once the root cause is found, we can still revert the patch.
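A minimal sketch of the traceback suggestion, using the standard `traceback` module. The `Writer` class, its `log` parameter and the `unexpected_caller` function are hypothetical and exist only to make the example self-contained; in the real code the output would more likely be routed through the project's own debug logging.

```python
import io
import traceback


class Writer:
    """Hypothetical stand-in; only the stop() instrumentation matters."""

    def __init__(self, log=None):
        self._thread = None
        # Any file-like object works here; sys.stderr is the usual choice.
        self._log = log if log is not None else io.StringIO()

    def stop(self):
        # Record the call stack of whoever invoked stop().
        self._log.write("stop() called from:\n")
        traceback.print_stack(file=self._log)
        self._thread = None


def unexpected_caller(writer):
    # Stands in for whichever code path calls stop() unexpectedly.
    writer.stop()
```

Calling `unexpected_caller(Writer(log))` writes the stack frames leading to `stop()` into `log`, so the caller that resets the thread shows up by name.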