Description
⚠️ This issue respects the following points: ⚠️
- This is a bug, not a question or a configuration/webserver/proxy issue.
- This issue is not already reported on Github OR Nextcloud Community Forum (I've searched it).
- Nextcloud Server is up to date. See Maintenance and Release Schedule for supported versions.
- I agree to follow Nextcloud's Code of Conduct.
Bug description
We think we have identified a massive (at least on our instance) performance bottleneck when using groupfolders. The issue, however, is not within the groupfolders app but within the core.
This is because the oc_filecache entry with path="__groupfolders" gets its etag updated A LOT, even though that directory is never mounted in any user's filesystem (only subfolders of it are ever mounted), so its etag can, as far as we know, never be requested over DAV. This causes massive lock contention on this single database row.
To avoid this performance problem, we think Storage implementations need a way to exclude certain directories from automatic etag propagation when it is guaranteed that their etag is never needed.
We think this issue affects our instance more than most, as most files of the instance are in groupfolders (and there are a lot of files), but it should apply to every instance with groupfolders.
It is expected that editing many files within the same directory causes a lot of etag updates, because all parent directories need to be updated in the filecache, and some lock contention is to be expected there.
But this goes way beyond that: if ANY file in ANY groupfolder is changed, this single database entry must be updated. As you might expect, this is terrible for database performance, because lock contention goes through the roof.
The reason this does not happen with non-groupfolder files is that each user's home directory is its own storage, so the etag propagation does not continue up into some filecache entry shared by every home folder, because no such entry exists. The updates therefore only contend in the database with writes within the same user's home folder, not with writes in every home folder.
But groupfolders uses a single storage with subdirectories for every groupfolder, and only those subdirectories are actually mounted.
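To make this difference visible, here is a minimal sketch of two queries one could run against the database (assuming the default oc_ table prefix and local home storages; the exact id prefixes are an assumption, not taken from our instance):

```sql
-- Each user's home directory is its own storage: one oc_storages row per user
-- (the local storage backend uses ids like 'home::<user>'; object storage
-- uses a different prefix).
SELECT COUNT(*) AS home_storages
FROM oc_storages
WHERE id LIKE 'home::%';

-- All groupfolder contents hang off a single storage: this should return
-- exactly one storage id, and every one of these rows propagates its etag
-- changes up to the same '__groupfolders' row.
-- (The backslashes escape '_', which is otherwise a single-character wildcard
--  in LIKE patterns. This scan can be slow on a large filecache.)
SELECT storage, COUNT(*) AS filecache_rows
FROM oc_filecache
WHERE path = '__groupfolders'
   OR path LIKE '\_\_groupfolders/%'
GROUP BY storage;
```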
Backstory
Finding this bottleneck took a lot of debugging, so the following walks you through our steps:
Recently we experienced "frequent" performance drops on a Nextcloud server with about 1000 users per hour. This happened a couple of times per week and the server was not able to recover by itself. We had to restart the database and/or PHP-FPM services.
We tried many things to fix this issue, mostly focused on making MariaDB faster through configuration optimization and upgrading the SSD the database is on. But we could only reduce how often the problem occurs; the fundamental issue did not go away. So we focused on finding out which behavior of Nextcloud makes the database lock up.
After collecting the MariaDB process list multiple times in the degraded state, we saw a pattern emerge. Specifically, with `select trx_id, trx_state, CURRENT_TIME(), trx_wait_started, trx_started, trx_rows_locked, trx_query from information_schema.INNODB_TRX where trx_state = 'LOCK WAIT';` we noticed that all queries waiting for a lock were trying to update etag attributes with the same parent ID (26614568). Apart from the concrete values, the queries all look the same:
```sql
UPDATE `oc_filecache` SET `etag` = '67c2f8b9194e1', `checksum` = '', `parent` = '26614568', `permissions` = '31' WHERE (`fileid` = 26614851) AND (((`etag` <> '67c2f8b9194e1') OR (`etag` IS NULL)) OR ((`checksum` <> '') OR (`checksum` IS NULL)) OR ((`parent` <> '26614568') OR (`parent` IS NULL)) OR ((`permissions` <> '31') OR (`permissions` IS NULL)))
```
Now, which query is this one waiting on for a lock? We think it is the etag propagator, whose queries appear in the slow query log when the issue occurs:
```sql
UPDATE `oc_filecache` SET `mtime` = GREATEST(`mtime`, 1741625955), `etag` = '67cf1a7eec988' WHERE (`storage` = 39980) AND (`path_hash` IN ('d41d8cd98f00b204e9800998ecf8427e', '29ff0edf73a32cb03e437d88fd049245', '6fad44a07c7385ecb3759c98c9f1dcf0', '848f3c7d9b51833a665d5a6ac8b42b22', '14be412a0b1a10ccdf396f2da67342a8', '558438a62cc164946ed7028269eb679f', 'ff6626a3ae3178146fd8eb2ae1c87c96'));
```
The propagator updates multiple folders' etags in one query, and a query can never be partially applied: either every matched row is updated, or all of them have to wait. We think the rows it locks include the very row the waiting queries above reference as their parent. One of the matched rows is:
```
MariaDB [wolke]> SELECT fileid, path FROM oc_filecache WHERE storage = 39980 AND path_hash="29ff0edf73a32cb03e437d88fd049245" LIMIT 1;
+----------+----------------+
| fileid   | path           |
+----------+----------------+
| 26614568 | __groupfolders |
+----------+----------------+
1 row in set (0.000 sec)
```
This row (fileid 26614568) is the groupfolder root. So if the groupfolder root is slow to update because of lock contention, many subfolders stay locked for longer, causing other queries to wait for a lock for a very long time (which in our case escalates until the database deadlocks).
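To verify which transaction is actually holding the blocking lock, the InnoDB lock-wait tables can be joined; a minimal sketch, assuming MariaDB (which still ships information_schema.INNODB_LOCK_WAITS and INNODB_LOCKS; on MySQL 8 the equivalent data lives in performance_schema.data_lock_waits):

```sql
-- For every waiting transaction, show its query, the blocking transaction's
-- query, and the locked record, to check whether the contended record is the
-- '__groupfolders' row.
SELECT
    r.trx_id    AS waiting_trx,
    r.trx_query AS waiting_query,
    b.trx_id    AS blocking_trx,
    b.trx_query AS blocking_query,
    l.lock_table,
    l.lock_index,
    l.lock_data
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.INNODB_LOCKS l ON l.lock_id = w.requested_lock_id;
```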
So the cause of the problem seems to be row lock contention in the MariaDB oc_filecache table on this specific groupfolder root row. It is known that the filecache table can be a performance bottleneck (#42920), and we know there is work under way in this area to enable sharding for this table. We opened this issue because we don't think this problem can be solved with sharding (or should not need to be solved with it, because a better solution is possible).
When the issue happens, the nextcloud.log is "flooded" with the following message:
An exception occurred while executing a query: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
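This error means a statement waited on a row lock for longer than innodb_lock_wait_timeout (50 seconds by default); the configured value can be checked with:

```sql
-- Default is 50 seconds; every one of these log lines corresponds to a
-- statement that waited at least this long for a row lock.
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
```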
We can see that the etag of the oc_filecache entry for path "__groupfolders" changes all the time, presumably whenever any folder or file in any groupfolder is changed.
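This churn can be observed directly by polling the row while users are active; a minimal sketch, assuming the default oc_ table prefix:

```sql
-- Run this repeatedly during normal usage: etag and mtime keep changing even
-- though '__groupfolders' itself is never mounted or requested over DAV.
SELECT fileid, storage, etag, mtime, FROM_UNIXTIME(mtime) AS mtime_readable
FROM oc_filecache
WHERE path = '__groupfolders';
```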
Steps to reproduce
Reproducing is rather difficult. We only see this occur randomly when many users are active concurrently. We cannot provide a specific procedure right now that reproduces this problem reliably.
Expected behavior
Nextcloud's architecture and query behaviour should try to avoid lock contention. This is not always possible, but in this case we believe it is.
Nextcloud Server version
29
Operating system
Debian/Ubuntu
PHP engine version
PHP 8.2
Web server
Nginx
Database engine version
MariaDB
Is this bug present after an update or on a fresh install?
None
Are you using the Nextcloud Server Encryption module?
Encryption is Disabled
What user-backends are you using?
- Default user-backend (database)
- LDAP/ Active Directory
- SSO - SAML
- Other
Configuration report
{
"system": {
"instanceid": "***REMOVED SENSITIVE VALUE***",
"passwordsalt": "***REMOVED SENSITIVE VALUE***",
"secret": "***REMOVED SENSITIVE VALUE***",
"trusted_domains": [
"nextcloud.example.com"
],
"datadirectory": "***REMOVED SENSITIVE VALUE***",
"filesystem_check_changes": 0,
"overwrite.cli.url": "https:\/\/nextcloud.example.com",
"dbtype": "mysql",
"version": "29.0.11.1",
"dbname": "***REMOVED SENSITIVE VALUE***",
"dbhost": "***REMOVED SENSITIVE VALUE***",
"dbport": "3306",
"dbtableprefix": "oc_",
"mysql.utf8mb4": true,
"dbuser": "***REMOVED SENSITIVE VALUE***",
"dbpassword": "***REMOVED SENSITIVE VALUE***",
"installed": true,
"mail_smtpmode": "smtp",
"mail_smtpauthtype": "LOGIN",
"mail_domain": "***REMOVED SENSITIVE VALUE***",
"maintenance": false,
"loglevel": 2,
"logfile": "\/var\/log\/nextcloud.log",
"theme": "",
"filelocking.ttl": 3600,
"filelocking.enabled": true,
"redis": {
"host": "***REMOVED SENSITIVE VALUE***",
"port": 0,
"dbindex": 0,
"timeout": 1.5
},
"memcache.local": "\\OC\\Memcache\\Redis",
"memcache.locking": "\\OC\\Memcache\\Redis",
"memcache.distributed": "\\OC\\Memcache\\Redis",
"updater.release.channel": "stable",
"preview_max_x": "2048",
"preview_max_y": "2048",
"mail_sendmailmode": "pipe",
"mail_from_address": "***REMOVED SENSITIVE VALUE***",
"mail_smtpauth": 1,
"mail_smtphost": "***REMOVED SENSITIVE VALUE***",
"mail_smtpname": "***REMOVED SENSITIVE VALUE***",
"mail_smtppassword": "***REMOVED SENSITIVE VALUE***",
"mail_smtpsecure": "ssl",
"mail_smtpport": "465",
"data-fingerprint": "490ffc7bacdce6eb676bcbc2013215ca",
"integrity.check.disabled": false,
"simpleSignUpLink.shown": false,
"default_phone_region": "DE",
"maintenance_window_start": 1,
"profile.enabled": false,
"trusted_proxies": "***REMOVED SENSITIVE VALUE***",
"trashbin_retention_obligation": "30,auto"
}
}
List of activated Apps
Enabled:
- activity: 2.21.1
- admin_audit: 1.19.0
- announcementcenter: 7.0.1
- bruteforcesettings: 2.9.0
- calendar: 4.7.16
- cloud_federation_api: 1.12.0
- comments: 1.19.0
- contacts: 6.0.3
- dav: 1.30.1
- deck: 1.13.4
- drawio: 3.0.3
- federatedfilesharing: 1.19.0
- files: 2.1.1
- files_antivirus: 5.6.1
- files_downloadlimit: 2.0.0
- files_pdfviewer: 2.10.0
- files_reminders: 1.2.0
- files_sharing: 1.21.0
- files_trashbin: 1.19.0
- files_versions: 1.22.0
- firstrunwizard: 2.18.0
- groupfolder_filesystem_snapshots: 1.3.1
- groupfolders: 17.0.10
- customgroupfoldermanagement: 2.3.1
- keeweb: 0.6.20
- logreader: 2.14.0
- lookup_server_connector: 1.17.0
- mass_share: 1.0.0
- nextcloud_announcements: 1.18.0
- notifications: 2.17.0
- oauth2: 1.17.1
- polls: 7.3.2
- privacy: 1.13.0
- provisioning_api: 1.19.0
- related_resources: 1.4.0
- richdocuments: 8.4.10
- serverinfo: 1.19.0
- settings: 1.12.0
- sharebymail: 1.19.0
- systemtags: 1.19.0
- tasks: 0.16.1
- text: 3.10.1
- theming: 2.4.0
- theming_customcss: 1.18.0
- twofactor_backupcodes: 1.18.0
- updatenotification: 1.19.1
- user_saml: 6.5.0
- viewer: 2.3.0
- workflowengine: 2.11.0
Nextcloud Signing status
Nextcloud Logs
Additional info
No response