Skip to content

Search hydrates all results before applying limit causing slow results on large sites + suggestions #13215

@jonathan-bird

Description

@jonathan-bird

Bug description

When investigating why my default.json search index which is 1.6MB was taking 3-4 seconds to load even with a simple:

return \Statamic\Facades\Search::index($index)
  ->ensureExists()
  ->search($query)
  ->limit($limit)
  ->get();

I got digging and realised that getBaseItems in QueryBuilder hydrates all results and transforms them before applying the limit. This was crazy to me because the lookup() returns results already sorted by score (highest first). So when we do ->take(20), we get the top 20 most relevant results.

I came up with this (rough-ish) solution which got my search results down to roughly 200ms - not great but much better than 3-4s:

// Get the search index and perform raw lookup (fast - searches JSON file)
$searchIndex = StatamicSearchFacade::index($index);
$searchIndex->ensureExists();

$rawResults = $searchIndex->lookup($query);

if ($rawResults->isEmpty()) {
    return collect();
}

// Limit results BEFORE hydration to avoid loading hundreds of entries
$limitedResults = $rawResults->take($limit);

// Extract entry IDs from the limited results
$ids = $limitedResults
  ->map(fn ($result) => Str::after($result['reference'], 'entry::'))
  ->values()
  ->all();

// Hydrate only the entries we need
$entries = Entry::query()->whereIn('id', $ids)->get()->keyBy->id();

// Return entries in the same order as search results (by relevance score)
return collect($ids)
  ->map(fn ($id) => $entries->get($id))
  ->filter();

Then I got digging more and noticed in the IteratorBuilder the getFilteredItems always hydrates all results for everything too (entries etc). I understand due to being flat file, you have to do this in case it has orderBy or no limit, but you could potentially improve it to be smarter via batching if there were limit + where without order by. This would instantly fix a lot of search:

protected function getFilteredItems()
{
  // No limit or has orderBy - can't optimize, need all items
  if (!$this->limit || $this->orderBys) {
    $items = $this->getBaseItems();
    return $this->filterWheres($items);
  }

  // No wheres - just limit base items directly
  if (empty($this->wheres)) {
    return $this->getBaseItems()->slice($this->offset ?? 0, $this->limit);
  }

  // Has limit AND wheres - batch hydrate until we have enough
  return $this->getFilteredItemsInBatches();
}

This would be much faster on large sites on many queries while still supporting the current approach.

Let me know your thoughts - I'd be happy to give a PR a go, just wanted to get your thoughts on the issue before I went down a rabbit hole.

How to reproduce

return \Statamic\Facades\Search::index($index)
  ->ensureExists()
  ->search($query)
  ->limit($limit)
  ->get();

You'd need ~1000 basic entries with title + content to see that .

Environment

Environment
Application Name: xxxx
Laravel Version: 12.40.2
PHP Version: 8.4.6
Composer Version: 2.8.4
Environment: production
Debug Mode: OFF
URL: xxxx
Maintenance Mode: OFF
Timezone: Australia/Brisbane
Locale: en

Cache
Config: CACHED
Events: CACHED
Routes: CACHED
Views: CACHED

Drivers
Broadcasting: log
Cache: redis
Database: mysql
Logs: stack / daily
Mail: smtp
Queue: redis
Session: redis

Storage
public/storage: NOT LINKED

Livewire
Livewire: v3.7.0

Statamic
Addons: 7
Sites: 1
Stache Watcher: Disabled (auto)
Static Caching: full
Version: 5.69.0 PRO

Statamic Addons
aerni/advanced-seo: 2.13.2
aerni/livewire-forms: 9.10.1
digital-bird/statamic-toc: 1.0.2
marcorieser/statamic-livewire: 4.5.3
mitydigital/sitemapamic: 3.4.1
mitydigital/statamic-two-factor: 2.7.3

Installation

Fresh statamic/statamic site via CLI

Additional details

No response

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions