
[Bug]: database deadlock due to unnecessary etag updates #51483

@svenseeberg

Description



Bug description

We think we have identified a massive (at least on our instance) performance bottleneck when using groupfolders. But the issue is not within the groupfolders app, it is within the core.

This is because the path="__groupfolders" oc_filecache entry gets updated A LOT to change the etag, even though it is never mounted in any user's filesystem (only subfolders of it are ever mounted) and its etag can therefore, as far as we know, never be requested over DAV. This causes massive lock contention on this single database row.

We think Storage implementations need a way to exclude certain directories from the automatic etag propagation when it is guaranteed that their etag is never needed, to avoid this performance problem.

We think this issue affects our instance more than most, as most files of the instance are in groupfolders (and there are a lot of files), but it should apply to every instance with groupfolders.

It is expected that editing many files within the same directory causes a lot of etag updates, because all parent directories need to be updated in the filecache, and some lock contention is to be expected.
But this goes way beyond that: if ANY file in ANY groupfolder is changed, this single entry in the database must be updated. As you might expect, this is terrible for database performance, because lock contention goes through the roof.

The reason this does not happen with non-groupfolder files is that each user's home directory is its own storage, so the etag updater does not propagate to some filecache entry shared by every home folder, because no such entry exists. The updates therefore only contend in the database with writes within the same user's home folder, not with every home folder.
Groupfolders, however, uses a single storage with a subdirectory for every groupfolder, and only those subdirectories are actually mounted.
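
One way to see this asymmetry in the database (a rough sketch; the exact oc_mounts layout may differ between versions) is to count how many mount points reference each storage. Each home storage should show up only once, while the single storage backing all groupfolders should show up once per groupfolder per user it is mounted for:

-- Storages referenced by many mount points are shared storages.
-- We would expect the storage that contains __groupfolders to be at
-- or near the top of this list, far ahead of any home storage.
SELECT storage_id, COUNT(*) AS mount_points
FROM oc_mounts
GROUP BY storage_id
ORDER BY mount_points DESC
LIMIT 10;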

Backstory

Finding this bottleneck took a lot of debugging, so the following walks you through our steps:

Recently we experienced "frequent" performance drops on a Nextcloud server with about 1000 users per hour. This happened a couple of times per week and the server was not able to recover by itself. We had to restart the database and/or PHP-FPM services.

We tried many things to fix this issue, mostly focused on making MariaDB faster through configuration optimization and by upgrading the SSD the database is on. But we could only reduce how often the problem occurs; the fundamental issue did not go away. So we focused on finding out which behavior of Nextcloud makes the database lock up.

After collecting the MariaDB process list multiple times in the degraded state, we saw a pattern emerge. Specifically, with

select trx_id, trx_state, CURRENT_TIME(), trx_wait_started, trx_started, trx_rows_locked, trx_query from information_schema.INNODB_TRX where trx_state = 'LOCK WAIT';

we noticed that all queries waiting for a lock were trying to update etag attributes with the same parent ID (26614568). Apart from the concrete values, the queries all look the same:

UPDATE `oc_filecache` SET `etag` = '67c2f8b9194e1', `checksum` = '', `parent` = '26614568', `permissions` = '31' WHERE (`fileid` = 26614851) AND (((`etag` <> '67c2f8b9194e1') OR (`etag` IS NULL)) OR ((`checksum` <> '') OR (`checksum` IS NULL)) OR ((`parent` <> '26614568') OR (`parent` IS NULL)) OR ((`permissions` <> '31') OR (`permissions` IS NULL)))

Now, which query is this query waiting on to get a lock? We think it is the etag propagator, whose queries appear in the slow query log when the issue occurs:

UPDATE `oc_filecache` SET `mtime` = GREATEST(`mtime`, 1741625955), `etag` = '67cf1a7eec988' WHERE (`storage` = 39980) AND (`path_hash` IN ('d41d8cd98f00b204e9800998ecf8427e', '29ff0edf73a32cb03e437d88fd049245', '6fad44a07c7385ecb3759c98c9f1dcf0', '848f3c7d9b51833a665d5a6ac8b42b22', '14be412a0b1a10ccdf396f2da67342a8', '558438a62cc164946ed7028269eb679f', 'ff6626a3ae3178146fd8eb2ae1c87c96'));
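
For reference, the slow query log that surfaced these propagator queries is plain MariaDB functionality and can be enabled at runtime (given sufficient privileges); the one-second threshold below is just an example, nothing here is Nextcloud-specific:

-- Log every statement slower than one second; check where the log is written.
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;
SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';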

The propagator updates the etags of multiple folders in one query, and a query can never be partially applied: either every matched row is updated, or all of them have to wait. We think one of the rows matched by this query is also involved in the waiting queries shown above; one of the matched rows is:

MariaDB [wolke]> SELECT fileid, path FROM oc_filecache WHERE storage = 39980 AND path_hash="29ff0edf73a32cb03e437d88fd049245" LIMIT 1;
+----------+----------------+
| fileid   | path           |
+----------+----------------+
| 26614568 | __groupfolders |
+----------+----------------+
1 row in set (0.000 sec)

the groupfolder root. So if the groupfolder root is slow to update because of lock contention, many subfolders stay locked for longer, which causes other queries to wait for a lock for a very long time (and in our case this escalates until the database effectively deadlocks).
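
To confirm which transaction is actually holding the lock that others are waiting for, MariaDB's information_schema can be queried directly (a generic diagnostic, nothing Nextcloud-specific; these tables exist in MariaDB, while MySQL 8 moved the equivalent into performance_schema):

-- Show each waiting query together with the query that blocks it.
SELECT r.trx_id          AS waiting_trx,
       r.trx_query       AS waiting_query,
       b.trx_id          AS blocking_trx,
       b.trx_query       AS blocking_query,
       b.trx_rows_locked AS blocking_rows_locked
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;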

So the cause of the problem seems to be lock contention in the MariaDB database on this specific groupfolder root row in the oc_filecache table. It is known that the filecache table can be a performance bottleneck (#42920), and we know there is work underway in this area to enable sharding for this table. We opened this issue because we don't think this problem can be solved with sharding (or at least should not need to be, because a better solution is possible).

When the issue happens, the nextcloud.log is "flooded" with the following message:

An exception occurred while executing a query: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction

We can see that the etag of the oc_filecache entry for path "__groupfolders" is changed all the time, presumably any time any folder or file in any groupfolder is changed.
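
This is easy to observe directly: repeatedly running the following while users work in different groupfolders shows the etag of this one row changing over and over (storage 39980 is the one from the queries above):

-- Watch the single contended row; its etag changes on any write in any
-- groupfolder, even though __groupfolders itself is never mounted.
SELECT fileid, etag, FROM_UNIXTIME(mtime) AS mtime
FROM oc_filecache
WHERE storage = 39980 AND path = '__groupfolders';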

Steps to reproduce

Reproducing is rather difficult. We only see this occur randomly when many users are active concurrently. We cannot currently provide a specific procedure that reproduces the problem reliably.

Expected behavior

Nextcloud's architecture and query behaviour should try to avoid lock contention. This is not always possible, but in this case we believe it is.

Nextcloud Server version

29

Operating system

Debian/Ubuntu

PHP engine version

PHP 8.2

Web server

Nginx

Database engine version

MariaDB

Is this bug present after an update or on a fresh install?

None

Are you using the Nextcloud Server Encryption module?

Encryption is Disabled

What user-backends are you using?

  • Default user-backend (database)
  • LDAP/ Active Directory
  • SSO - SAML
  • Other

Configuration report

{
    "system": {
        "instanceid": "***REMOVED SENSITIVE VALUE***",
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": [
            "nextcloud.example.com"
        ],
        "datadirectory": "***REMOVED SENSITIVE VALUE***",
        "filesystem_check_changes": 0,
        "overwrite.cli.url": "https:\/\/nextcloud.example.com",
        "dbtype": "mysql",
        "version": "29.0.11.1",
        "dbname": "***REMOVED SENSITIVE VALUE***",
        "dbhost": "***REMOVED SENSITIVE VALUE***",
        "dbport": "3306",
        "dbtableprefix": "oc_",
        "mysql.utf8mb4": true,
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "installed": true,
        "mail_smtpmode": "smtp",
        "mail_smtpauthtype": "LOGIN",
        "mail_domain": "***REMOVED SENSITIVE VALUE***",
        "maintenance": false,
        "loglevel": 2,
        "logfile": "\/var\/log\/nextcloud.log",
        "theme": "",
        "filelocking.ttl": 3600,
        "filelocking.enabled": true,
        "redis": {
            "host": "***REMOVED SENSITIVE VALUE***",
            "port": 0,
            "dbindex": 0,
            "timeout": 1.5
        },
        "memcache.local": "\\OC\\Memcache\\Redis",
        "memcache.locking": "\\OC\\Memcache\\Redis",
        "memcache.distributed": "\\OC\\Memcache\\Redis",
        "updater.release.channel": "stable",
        "preview_max_x": "2048",
        "preview_max_y": "2048",
        "mail_sendmailmode": "pipe",
        "mail_from_address": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpauth": 1,
        "mail_smtphost": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpname": "***REMOVED SENSITIVE VALUE***",
        "mail_smtppassword": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpsecure": "ssl",
        "mail_smtpport": "465",
        "data-fingerprint": "490ffc7bacdce6eb676bcbc2013215ca",
        "integrity.check.disabled": false,
        "simpleSignUpLink.shown": false,
        "default_phone_region": "DE",
        "maintenance_window_start": 1,
        "profile.enabled": false,
        "trusted_proxies": "***REMOVED SENSITIVE VALUE***",
        "trashbin_retention_obligation": "30,auto"
    }
}

List of activated Apps

Enabled:
  - activity: 2.21.1
  - admin_audit: 1.19.0
  - announcementcenter: 7.0.1
  - bruteforcesettings: 2.9.0
  - calendar: 4.7.16
  - cloud_federation_api: 1.12.0
  - comments: 1.19.0
  - contacts: 6.0.3
  - dav: 1.30.1
  - deck: 1.13.4
  - drawio: 3.0.3
  - federatedfilesharing: 1.19.0
  - files: 2.1.1
  - files_antivirus: 5.6.1
  - files_downloadlimit: 2.0.0
  - files_pdfviewer: 2.10.0
  - files_reminders: 1.2.0
  - files_sharing: 1.21.0
  - files_trashbin: 1.19.0
  - files_versions: 1.22.0
  - firstrunwizard: 2.18.0
  - groupfolder_filesystem_snapshots: 1.3.1
  - groupfolders: 17.0.10
  - customgroupfoldermanagement: 2.3.1
  - keeweb: 0.6.20
  - logreader: 2.14.0
  - lookup_server_connector: 1.17.0
  - mass_share: 1.0.0
  - nextcloud_announcements: 1.18.0
  - notifications: 2.17.0
  - oauth2: 1.17.1
  - polls: 7.3.2
  - privacy: 1.13.0
  - provisioning_api: 1.19.0
  - related_resources: 1.4.0
  - richdocuments: 8.4.10
  - serverinfo: 1.19.0
  - settings: 1.12.0
  - sharebymail: 1.19.0
  - systemtags: 1.19.0
  - tasks: 0.16.1
  - text: 3.10.1
  - theming: 2.4.0
  - theming_customcss: 1.18.0
  - twofactor_backupcodes: 1.18.0
  - updatenotification: 1.19.1
  - user_saml: 6.5.0
  - viewer: 2.3.0
  - workflowengine: 2.11.0

Nextcloud Signing status

Nextcloud Logs

Additional info

No response
