Skip to content

ATS SM hang when trying to get chunked content from cache #3881

@wjnforever

Description

@wjnforever

Recently in our CDN project, we found an issue when client do a continuous access from Nginx to ATS(both 5.3.2 and 6.x), for example like
'for i in {1..10}; do curl -v -H"host:test.nginx.com" "http://test.nginx.com" -x127.1:80 -o /dev/null; done'
"127.1:80" is Nginx interface which will be proxied to ATS, and then be proxied to final origin server. While origin server reply with different response and one of them is empty(content-length=0), then after this empty response replace the previous non-empty cache in ATS, client get stuck until download timeout.
We try to simulate the problem with the following simple website

from BaseHTTPServer import BaseHTTPRequestHandler
import random
a = 1
class TodoHandler(BaseHTTPRequestHandler):
protocol_version = 'HTTP/1.1'
def do_GET(self):
global a
#r = random.randint(0,1)
if a % 2 == 0:
r = 0
else:
r = 1
self.send_response(200)
##r = 0
if r == 0:
uu = "Not Found"
self.send_header("Connection", "keep-alive")
self.send_header("Content-type", "text/html")
self.send_header("Content-Length", str(len(uu)))
self.send_header("Cache-Control", "max-age=10")
self.send_header("Expires", "Thu, 01 Jan 1970 00:00:00 GMT")
self.send_header("Accept-Ranges", "bytes")
self.end_headers()
print(a, ": length", str(len(uu), r)
self.wfile.write(uu)
else:
#self.send_header("Content-type", "video/mp4")
self.send_header("Connection", "keep-alive")
self.send_header("Cache-Control", "max-age=10")
self.send_header("Content-Length", 0)
self.send_header("Accept-Ranges", "bytes")
self.send_header("Connection", "keep-alive")
self.end_headers()
print(a, ": length 0", r)
a += 1

if name == 'main':
# Start a simple server, and loop forever
from BaseHTTPServer import HTTPServer
server = HTTPServer(('0.0.0.0', 8081), TodoHandler)
print("Starting server, use to stop")
server.serve_forever()

After checking code and log, we found this zero length response will lead to a resident alternate rewritten
int CacheVC::updateVector(int /* event ATS_UNUSED /, Event * / e ATS_UNUSED */)
if (small_doc && have_res_alt && (fragment || (f.update && !total_len))) {
...
f.rewrite_resident_alt = 1;
write_len = doc->data_len();
Debug("cache_update_alt", "rewriting resident alt size: %d key: %X, first_key: %X", write_len, doc->key.slice32(0),
first_key.slice32(0));
}

but only replace response header in CacheHTTPInfoVector and still keep the previous content which is not as desired.
static int agg_copy(char *p, CacheVC *vc)
#ifdef HTTP_CACHE
if (vc->f.rewrite_resident_alt) {
ink_assert(vc->f.use_first_key);
Doc *res_doc = (Doc *)vc->first_buf->data();
res_alt_blk = new_IOBufferBlock(vc->first_buf, res_doc->data_len(), sizeofDoc + res_doc->hlen);
doc->key = res_doc->key;
doc->total_len = res_doc->data_len();
}
#endif

#ifdef HTTP_CACHE
if (vc->f.rewrite_resident_alt)
iobufferblock_memcpy(doc->data(), vc->write_len, res_alt_blk, 0);
else
#endif

As a result when found content-length(0) in header doesn't match the object length in cache , ATS make it a chunked reponse with previous content.
void HttpTransact::handle_content_length_header(State *s, HTTPHdr *header, HTTPHdr *base)
if ((int64_t)s->cache_info.object_read->object_size_get() == cl) {
s->hdr_info.trust_response_cl = true;
} else {
DebugTxn("http_trans", "Content Length header and cache object size mismatch."
"Disabling keep-alive");
s->hdr_info.trust_response_cl = false;
}
Then this transform stuck at the user agent [VC_EVENT_WRITE_READY] until timeout finally, and couldn't serve next request.
Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (http_chunk) creating a chunk of size 9 bytes
[Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (http_redirect) [HttpTunnel::producer_handler] enable_redirection: [1 0 0] event: 102
[Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (http) [3] [&HttpSM::tunnel_handler_cache_read, VC_EVENT_READ_COMPLETE]
[Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (cache_close) do_io_close 0x7f887f9c0000 -1 1
[Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (cache_free) free 0x7f887f9c0000
[Jun 22 14:08:09.805] Server {0x7f8895a28800} DEBUG: (http_tunnel) [3] consumer_handler [user agent VC_EVENT_WRITE_READY]

Now we simply modify agg_copy
from
if (vc->f.update && vc->total_len) {
CacheHTTPInfo *http_info = vc->write_vector->get(vc->alternate_index);
http_info->object_size_set(vc->total_len);
}
to
if (vc->f.update) {
CacheHTTPInfo *http_info = vc->write_vector->get(vc->alternate_index);
http_info->object_size_set(vc->total_len);
}
which keep object size as received one to make it work but don't know if already patched and why this stuck happen.
pls refer to attached log for more info.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions