Downscale images before embedding #4609

joelochlann · 2026-01-29T10:08:32Z

Depends on #4608, currently diffed to that branch

How to test

- Run `npm run test:integration` and see the large image tests now pass.
- Deploy to TEST and look at the logging for images coming in. Check:
  - How much do they get resized to?
  - How many attempts does it take?
  - Do we ever run out of attempts?
  - What is the CPU and memory usage of the lambda?

joelochlann · 2026-01-29T10:14:56Z

image-embedder-lambda/src/index.ts

+  // We use sqrt because scaling affects both dimensions
+  const sizeRatio = MAX_IMAGE_SIZE_BYTES / imageBytes.length;
+  // Be conservative: target 80% of max size to account for compression variance
+  let scaleFactor = Math.sqrt(sizeRatio * 0.8);


Need to think through if this square root thing is actually right... 🤔

You have a 4x4 image = 16 pixels = (let's pretend one byte per pixel, uncompressed) = 16 bytes
You want to halve the size to 8 bytes
If you simply halve the width & height to 2x2, you'll get 4 pixels = 4 times smaller than 16 bytes

Whereas if you scale width & height by 1/sqrt(2) you get:
about 2.8 * 2.8 = 7.84 = about 8

OK I get it!
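
The arithmetic above can be checked with a small sketch. The helper name `scaleFactorForTargetBytes` and its `safetyMargin` parameter are made up for illustration; the formula is the one in the diff, and it assumes (as the comment does) that byte size is roughly proportional to pixel count.

```typescript
// Hypothetical helper wrapping the formula from the diff:
// bytes scale with width * height, so each side scales by the square root.
function scaleFactorForTargetBytes(
  currentBytes: number,
  maxBytes: number,
  safetyMargin: number = 0.8, // the diff targets 80% of the limit
): number {
  const sizeRatio = maxBytes / currentBytes;
  return Math.sqrt(sizeRatio * safetyMargin);
}

// The 4x4 example: halve 16 bytes to 8 bytes, with no safety margin.
const factor = scaleFactorForTargetBytes(16, 8, 1);
const side = 4 * factor;

console.log(factor.toFixed(3)); // 0.707, i.e. 1/sqrt(2)
console.log((side * side).toFixed(2)); // 8.00 pixels, half of 16
```

Rounding the side to 2.8 first gives the 7.84 figure quoted above.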

github-actions · 2026-01-29T11:06:08Z

Deploy build 13915 of `media-service::grid::all` to TEST

All deployment options

From guardian/actions-riff-raff.

paperboyo · 2026-01-29T12:13:12Z

Brute force seems the only way to ensure they will fall below certain filesize indeed. Sharp is a good choice (based on Vips) if lambda can haz any.

Without either reading about model behaviour with different resolution/compression and other characteristics (eg. does transparency even matter?) or doing a lot of tests, it’s impossible to know what are the best compromises (better larger and more compressed or the other way around? fine to bake transparency into JPEGs or is transparency taken seriously? etc).

One thing I can’t think wouldn’t be sensible is to convert everything to sRGB (this, I think).

joelochlann · 2026-01-30T09:38:39Z

Thanks @paperboyo, those are great points.

Interestingly, for Cohere v4 (which we may move to soon), "images > 2,458,624 pixels are downsampled to that size; images < 3,136 pixels are upsampled".

This makes me think that I should actually target 2.4 megapixels first, and see what size those typically come out at. If they're usually under 5 MiB, I think that's a reasonable default for downscaling. We can iteratively downscale for edge cases.
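
As a rough sketch of what targeting a pixel budget could look like: the function name `fitWithinPixelBudget` is hypothetical, and the only figure taken from the thread is the 2,458,624-pixel limit quoted above. Flooring the dimensions is a choice made here so the result never exceeds the budget.

```typescript
// Pixel limit quoted in the comment above for Cohere v4.
const COHERE_V4_MAX_PIXELS = 2_458_624;

// Hypothetical helper: shrink dimensions so width * height fits the budget,
// keeping the aspect ratio. Images already within budget are left alone.
function fitWithinPixelBudget(
  width: number,
  height: number,
  maxPixels: number = COHERE_V4_MAX_PIXELS,
): { width: number; height: number } {
  const pixels = width * height;
  if (pixels <= maxPixels) {
    return { width, height };
  }
  // Same square-root reasoning as for bytes: area scales with the square of the side.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    width: Math.floor(width * scale),
    height: Math.floor(height * scale),
  };
}

const fitted = fitWithinPixelBudget(3454, 2303);
console.log(`${fitted.width}x${fitted.height}`);
```

Whether the resulting files are usually under 5 MiB still has to be measured on real images.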

joelochlann · 2026-01-30T09:56:12Z

@paperboyo forgive my image ignorance, why is converting everything to sRGB a good idea? Would this save image bytes?

paperboyo · 2026-01-30T10:27:10Z

> why is converting everything to sRGB a good idea

This is but a hunch. Our corpus may stray further from sRGB, which is the most prevalent colourspace on the web. The hunch is that robots weren’t trained to make better decisions about what they are looking at thanks to colourspace. It’s possible they just won’t see as well, or will barf completely on exotic colour models.
Impossible to know for sure without looking into the robot’s intestines, ofc.

> Would this save image bytes?

Converting would change pixel values, so that pixel differences that look the same would translate to similar numerical differences. Without conversion, they may give you a smaller numerical difference: relatively smaller the wider the colourspace is compared with sRGB (sRGB is the smallest space).
Converting to the (s)RGB colour model will make the pixels comparable; otherwise we will feed robots CMYK, 1-bit and Lab. Rarely, but sometimes. Again, unless they convert themselves (or can work on different models: unlikely), this should be safer.

I think.

ellenmuller

Just some comments - not blockers :) Worked a dream when I tested it!!! 😍

ellenmuller · 2026-01-30T11:49:07Z

image-embedder-lambda/src/index.ts

+  let result: Uint8Array = imageBytes;
+  let attempts = 0;
+  const maxAttempts = 5;
+
+  while (result.length > MAX_IMAGE_SIZE_BYTES && attempts < maxAttempts) {
+    attempts++;


I might prefer this as a for loop, eg:

for ( let attempts = 1; attempts <= maxAttempts && result.length > MAX_IMAGE_SIZE_BYTES; attempts++ )

Kind of personal preference, but it reads nicer imo!

I think I prefer the while loop. Or perhaps even a recursive function...
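
For comparison, here is a sketch of the for-loop form. It is not the PR's code: the `encode` callback stands in for the inline sharp call so the loop can be shown on its own, and shrinking the factor by 0.8 per attempt is an assumption, since the diff excerpt does not show how `scaleFactor` changes between attempts.

```typescript
// Hypothetical callback: produces the image re-encoded at the given scale.
type Encode = (scaleFactor: number) => Promise<Uint8Array>;

async function downscaleUntilFits(
  imageBytes: Uint8Array,
  maxBytes: number,
  encode: Encode,
  maxAttempts: number = 5,
): Promise<{ result: Uint8Array; attempts: number }> {
  let result: Uint8Array = imageBytes;
  // Initial factor as in the diff: sqrt of the byte ratio with a 20% margin.
  let scaleFactor = Math.sqrt((maxBytes / imageBytes.length) * 0.8);
  let attempts = 0;

  // The loop header carries both stopping conditions and the counter.
  for (; attempts < maxAttempts && result.length > maxBytes; attempts++) {
    result = await encode(scaleFactor);
    scaleFactor *= 0.8; // assumption: shrink further on each retry
  }

  return { result, attempts };
}
```

The caller still has to check `result.length` afterwards, because the loop can run out of attempts without getting under the limit.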

ellenmuller · 2026-01-30T11:52:35Z

image-embedder-lambda/src/index.ts

+    const newWidth = Math.round(originalWidth * scaleFactor);
+    const newHeight = Math.round(originalHeight * scaleFactor);
+
+    console.log(`Attempt ${attempts}: scaling to ${newWidth}x${newHeight} (factor: ${scaleFactor.toFixed(3)})`);
+
+    let pipeline = sharp(imageBytes).resize(newWidth, newHeight, {
+      fit: "inside",
+      withoutEnlargement: true,
+    });
+
+    // Output in the same format
+    if (mimeType === "image/jpeg") {
+      pipeline = pipeline.jpeg({ quality: 85 });
+    } else if (mimeType === "image/png") {
+      pipeline = pipeline.png({ compressionLevel: 9 });
+    }
+
+    result = new Uint8Array(await pipeline.toBuffer());
+    console.log(`Result size: ${result.length} bytes`);
+


This could probably be split off into its own function?

Yeah I agree that this function is a little long, let me see if I can re-structure a bit
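
One possible split, sketched under assumptions: keep the sharp call where it is, and pull the pure decisions (target dimensions and encoder settings) into a helper that can be unit tested without any image I/O. `planResize` and `ResizePlan` are hypothetical names; the quality and compression values are the ones in the diff.

```typescript
// Hypothetical description of a single resize attempt.
type ResizePlan = {
  width: number;
  height: number;
  format: "jpeg" | "png" | "unchanged";
  options: { quality?: number; compressionLevel?: number };
};

function planResize(
  originalWidth: number,
  originalHeight: number,
  scaleFactor: number,
  mimeType: string,
): ResizePlan {
  const width = Math.round(originalWidth * scaleFactor);
  const height = Math.round(originalHeight * scaleFactor);

  // Output in the same format, with the settings used in the diff.
  if (mimeType === "image/jpeg") {
    return { width, height, format: "jpeg", options: { quality: 85 } };
  }
  if (mimeType === "image/png") {
    return { width, height, format: "png", options: { compressionLevel: 9 } };
  }
  // Other types get no explicit encoder settings, matching the if/else in the diff.
  return { width, height, format: "unchanged", options: {} };
}
```

The caller would then pass `plan.width`, `plan.height` and `plan.options` to the existing sharp pipeline.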

joelochlann · 2026-01-30T13:02:37Z

OK, I got curious (always the best and worst thing that can happen) and had a look at what was happening in detail, rather than just the test pass/fail.

The file size change is much more aggressive than I would expect, coming out e.g. 0.14 times the size despite a scale factor of 0.9:

  console.log
    Image size 5242970 bytes exceeds 5242880 limit, downscaling...

      at downscaleImageIfNeeded (src/index.ts:109:11)

  console.log
    Original size 3454x2303, 7.95 MP

      at downscaleImageIfNeeded (src/index.ts:115:11)

  console.log
    Attempt 1: scaling to 3089x2060, 6.36 MP (factor: 0.894)

      at downscaleImageIfNeeded (src/index.ts:136:13)

  console.log
    Result size: 740976 bytes

      at downscaleImageIfNeeded (src/index.ts:153:13)

  console.log
    Downscaled from 5242970 to 740976 bytes

Perhaps this isn't a bad thing if the resulting image still looks OK, but I should (a) understand why this is happening and (b) eyeball the resulting images.
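
To put numbers on the gap, here is a small calculation using only the figures from the log above. The interpretation in the comments is a guess, not something the thread confirms: if bytes tracked pixels, the file would shrink to about 0.8 of its size, so most of the reduction presumably comes from re-encoding rather than from the resize.

```typescript
// Figures copied from the log output above.
const originalBytes = 5_242_970;
const resultBytes = 740_976;
const originalPixels = 3454 * 2303;
const resizedPixels = 3089 * 2060;

// What the sqrt formula was aiming for: roughly 0.80 of the pixels.
const pixelRatio = resizedPixels / originalPixels;
// What actually happened to the file size: roughly 0.14.
const byteRatio = resultBytes / originalBytes;

// Bytes per pixel before and after; a large drop here points at the
// encoder settings (an assumption) rather than the change in dimensions.
const bytesPerPixelBefore = originalBytes / originalPixels;
const bytesPerPixelAfter = resultBytes / resizedPixels;

console.log(pixelRatio.toFixed(2), byteRatio.toFixed(2));
console.log(bytesPerPixelBefore.toFixed(2), bytesPerPixelAfter.toFixed(2));
```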

Downscale images to get them under 5 MiB

2f7ad1b

joelochlann requested a review from a team as a code owner January 29, 2026 10:08

joelochlann added the feature Departmental tracking: work on a new feature label Jan 29, 2026

Upgrade sharp

0991e73

ellenmuller approved these changes Jan 30, 2026

Merge branch 'js-integration-test-for-large-images' into js-downscale-images-before-embedding

aa449ca

Base automatically changed from js-integration-test-for-large-images to main January 30, 2026 12:42

Simpler downscaling to Cohere v4 resolution, add more tests, and prettierrc

686db76
