Skip to content

Mangled file content when multipart-POSTing a file with a "text/*" content type #403

@mciasuen

Description

@mciasuen

Reading a UTF-8 CSV and attempting to upload it with needle via a multipart POST can cause non-ASCII characters inside the CSV data to be replaced by other characters.

Steps to Reproduce:

# node --version
v14.18.1
'use strict';

const childproc = require('child_process');
const needle = require('needle');
const net = require('net');

async function postAndCapture(buffer, content_type) {
	const port = 49152 + Math.floor(Math.random() * 16384);
	const proxy = net.createServer();
	await new Promise((res, rej) => {
		proxy.on('error', rej);
		proxy.listen(port, res);
	});
	proxy.on('connection', sock => {
		const buffs = [];
		let pending;
		sock.on('data', x => {
			buffs.push(x);
			if (!pending)
				pending = setImmediate(() => sock.end('HTTP/1.1 200 OK\r\n\r\n'));
		});
		sock.on('close', () => {
			const proc = childproc.spawn('hexdump', ['-C'], {
				stdio: ['pipe', 'inherit', 'inherit']
			});
			proc.stdin.end(Buffer.concat(buffs));
		});
	});
	await needle('post', `http://localhost:${port}/`, {
		file: {
			buffer,
			content_type,
			filename: new Date().getTime() + '.csv'
		}
	}, {
		multipart: true,
	});
	proxy.close();
}

(async () => {
	// File content as a buffer, as it would be read directly from a file.
	const csvFile = Buffer.from('77u/Ikh5cGhlbiIsIkVtIERhc2giDQoiLSIsIuKAlCINCg==', 'base64');

	// Send with CSV type, as some web API might require.
	// -> needle heuristically detects a "text" type, tries to re-encode the
	// CSV payload, causing the UTF-8 em dash to be replaced by an ASCII
	// control character, and corrupting the payload.
	await postAndCapture(csvFile, 'text/csv');

	// Sending with application/octet-stream works around the issue, but prevents
	// us from sending the correct Content-Type, which might not work for all use-cases.
	await postAndCapture(csvFile, 'application/octet-stream');
})();

Expected

  • Changing the Content-Type of the uploaded file should not affect the file content.

Observed

With application/octet-stream...

...the relevant section of the hex dump shows:

00000110  54 50 43 4c 49 45 4e 54  0d 0a 43 6f 6e 74 65 6e  |TPCLIENT..Conten|
00000120  74 2d 44 69 73 70 6f 73  69 74 69 6f 6e 3a 20 66  |t-Disposition: f|
00000130  6f 72 6d 2d 64 61 74 61  3b 20 6e 61 6d 65 3d 22  |orm-data; name="|
00000140  66 69 6c 65 22 3b 20 66  69 6c 65 6e 61 6d 65 3d  |file"; filename=|
00000150  22 31 36 35 31 39 33 37  34 32 38 39 38 38 2e 63  |"1651937428988.c|
00000160  73 76 22 0d 0a 43 6f 6e  74 65 6e 74 2d 54 72 61  |sv"..Content-Tra|
00000170  6e 73 66 65 72 2d 45 6e  63 6f 64 69 6e 67 3a 20  |nsfer-Encoding: |
00000180  62 69 6e 61 72 79 0d 0a  43 6f 6e 74 65 6e 74 2d  |binary..Content-|
00000190  54 79 70 65 3a 20 61 70  70 6c 69 63 61 74 69 6f  |Type: applicatio|
000001a0  6e 2f 6f 63 74 65 74 2d  73 74 72 65 61 6d 0d 0a  |n/octet-stream..|
000001b0  0d 0a ef bb bf 22 48 79  70 68 65 6e 22 2c 22 45  |.."Hyphen","E|
000001c0  6d 20 44 61 73 68 22 0d  0a 22 2d 22 2c 22 e2 80  |m Dash".."-",".|
000001d0  94 22 0d 0a 0d 0a 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |."....----------|

The em dash is encoded as 0x80 0x94, a valid UTF-8 code sequence.

With text/csv...

...the relevant section of the hex dump shows:

00000110  54 50 43 4c 49 45 4e 54  0d 0a 43 6f 6e 74 65 6e  |TPCLIENT..Conten|
00000120  74 2d 44 69 73 70 6f 73  69 74 69 6f 6e 3a 20 66  |t-Disposition: f|
00000130  6f 72 6d 2d 64 61 74 61  3b 20 6e 61 6d 65 3d 22  |orm-data; name="|
00000140  66 69 6c 65 22 3b 20 66  69 6c 65 6e 61 6d 65 3d  |file"; filename=|
00000150  22 31 36 35 31 39 33 37  34 32 38 39 36 36 2e 63  |"1651937428966.c|
00000160  73 76 22 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |sv"..Content-Typ|
00000170  65 3a 20 74 65 78 74 2f  63 73 76 0d 0a 0d 0a ff  |e: text/csv....|
00000180  22 48 79 70 68 65 6e 22  2c 22 45 6d 20 44 61 73  |"Hyphen","Em Das|
00000190  68 22 0d 0a 22 2d 22 2c  22 14 22 0d 0a 0d 0a 2d  |h".."-","."....-|

The em dash is encoded as 0x14, an obscure ASCII control character, which apparently chokes some CSV parsers.


Uploading the CSV file as application/octet-stream may work as a workaround for some APIs but may not work in all cases, e.g. where a provider accepts multiple formats and uses the Content-Type header to actually differentiate which parser to use.

According to https://github.com/tomas/needle/blob/master/lib/multipart.js#L45, needle tries to heuristically determine if it should process and re-encode the payload data based on the content-type; there is apparently no way to instruct it to skip this re-encoding and send the data exactly as-is while still using a text content-type.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions