Description
Some notes on the investigation so far. I know this is a lot of blather; you can skim down to the last half for the juiciest bits. It's not completely solved, but I think I'm close, and I believe the actions I've listed at the bottom will get us to resolution.
Previous threads:
- Filecoin Slack #ecosystem-dev channel has threads (here & here)
- fix: remove TraverseLinksOnlyOnce on piece CID application-research/filclient#91
- CAR offset writer ipld/go-car#290 (comment)
Thankfully Alvin (via stuberman and lodge) was able to provide a failing DAG so we can dig deeper into the nature of the failure: a ~32G DAG hanging off bafybeietzjzc4vudlevv6k6sxdixhp5nnmfblyjhqheyjyd4d3uluvqdgm. Comparing the CAR that ipfs dag export produces (a proper exhaustive-selector export, what we would expect for a well-formed CAR) with the one that Boost apparently has on its end, where it's reporting the mismatch, we can see:
- They are the same size
- They contain the same blocks
- The blocks are out of order
The ordering problem can be seen just by looking at the first few blocks. Here's what we expect (ipfs dag export):
bafybeietzjzc4vudlevv6k6sxdixhp5nnmfblyjhqheyjyd4d3uluvqdgm
bafkreigbnoobwwzjdl4yoccgyyeyybltkqqq5v7uix45zwi5rjvs64xwfy
bafkreih5rbh2rwzbg2v66rxpyjr755ycub246asuxa55k7ey3xlk457xiq
bafkreig6vmc5bg6lyn65k6jbteqqpugkhzxglbryoiyfiwwn5qef4hyiuu
bafkreibd3ult3qdw4xty3bj4n2kmryn3m6x7kshkjnb7dacbuxi5snj6ki
bafkreid73ihqui4bjpsse7ytsakpxg5r5yfhafsr35dyq67u2mojtvn6nq
bafkreifz4ya5gbziqtyxi4pua3tplk6d2awklicb74pdkmre65vioigvkq
bafkreidq4n2g6bqgct2s35naypliplkhfm3ljt2lxckfi5qxisr4wc6sta
bafkreietfxevvqzi3eqi7s4jhtcsomlmjyrbwqtndcka3d3wxbodoh7qmm
bafkreiav5y32trhvy6ifzfp2cnhfekgblagj3cjmfhe3wrtxdi2amzmogi
Here's what we get in Boost:
bafybeietzjzc4vudlevv6k6sxdixhp5nnmfblyjhqheyjyd4d3uluvqdgm
bafkreigbnoobwwzjdl4yoccgyyeyybltkqqq5v7uix45zwi5rjvs64xwfy
bafkreih5rbh2rwzbg2v66rxpyjr755ycub246asuxa55k7ey3xlk457xiq
bafkreicutg7gtaizq62pdbbon77hlkegerj4tiawnd5dwyrsfuahr3iw4q
bafkreidiaru5reixnumqnfyy3u5kpwdhmr3fu4gn4khuf77ffsl6hb7com
bafkreiav7ocqndlrlqb2hfkdedxuvzzlkazx4kf7cfzwmq5u4riwfuxvgq
bafkreihp42rqoi2bkwxnyeukuzkg6aj4gtzjlfwqlpstavmzh26fz3osza
bafkreigtzeu45gj6mk4y7vw4hmvelbqwifaovilsvgcdpgrtmcarmisgqq
bafkreid5cv2juopgreb6mqltxvxfrvgwia2d3qewp7ulg7y77dtevvwwem
bafkreiaxtiy6zxy5rm3hpytcpztkc2atyzhi4tsxuygi6pspwvdc324kri
This list of expected links can be confirmed by just looking at the root block's links with ipfs dag get bafybeietzjzc4vudlevv6k6sxdixhp5nnmfblyjhqheyjyd4d3uluvqdgm | jq .Links[].Hash[] | head.
The second and third blocks are the same in both lists, and then the lists diverge. Both of those initial links are just Bytes with no links of their own, so this isn't a case of a traverser deciding to go down a different pathway; the traversers should just be walking the links in the root block in order.
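As a quick illustration of the comparison above, here is a minimal sketch that finds the first position where two block orderings disagree. The helper name `firstDivergence` is made up for this example, and only the first four CIDs from each list above are included.

```go
package main

import "fmt"

// firstDivergence returns the zero-based index of the first position where
// the two orderings differ, or -1 if they agree over their common length.
// (Hypothetical helper, not part of any IPLD library.)
func firstDivergence(a, b []string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			return i
		}
	}
	return -1
}

func main() {
	// First four blocks as listed for `ipfs dag export`.
	expected := []string{
		"bafybeietzjzc4vudlevv6k6sxdixhp5nnmfblyjhqheyjyd4d3uluvqdgm",
		"bafkreigbnoobwwzjdl4yoccgyyeyybltkqqq5v7uix45zwi5rjvs64xwfy",
		"bafkreih5rbh2rwzbg2v66rxpyjr755ycub246asuxa55k7ey3xlk457xiq",
		"bafkreig6vmc5bg6lyn65k6jbteqqpugkhzxglbryoiyfiwwn5qef4hyiuu",
	}
	// First four blocks as listed for Boost.
	boost := []string{
		"bafybeietzjzc4vudlevv6k6sxdixhp5nnmfblyjhqheyjyd4d3uluvqdgm",
		"bafkreigbnoobwwzjdl4yoccgyyeyybltkqqq5v7uix45zwi5rjvs64xwfy",
		"bafkreih5rbh2rwzbg2v66rxpyjr755ycub246asuxa55k7ey3xlk457xiq",
		"bafkreicutg7gtaizq62pdbbon77hlkegerj4tiawnd5dwyrsfuahr3iw4q",
	}
	fmt.Println(firstDivergence(expected, boost)) // prints 3 (the fourth block)
}
```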
I wrote a simple program to "traverse" these links in the various ways that may matter, just from that root block, and keep on getting the same, stable ordering:
- Raw links list coming out of a go-codec-dagpb decode
- Raw links list coming out of a go-merkledag DecodeProtobuf
- Traversal using go-ipld-prime
- Traversal using go-merkledag
Most of the tooling in the path to make CARs uses go-ipld-prime's traversals which in turn will be relying on go-codec-dagpb. But there is a dependency in boost for a custom branch of go-car @ ipld/go-car#290 that uses go-merkledag's Walk and other legacy pieces to load and decode blocks. So there's a suspicion that the use of the legacy stack may be involved here.
In version 0.4.0 of go-merkledag, the underlying mechanics of protobuf decode were swapped out to use go-codec-dagpb, so since that version we should even have the same decoding path.
BUT prior to 0.4.0 it turns out we had a sneaky decode-sort of links going on whenever you decode a DAG-PB block. This is not something that we factored in to the DAG-PB spec or go-codec-dagpb—links are only sorted on encode. And in a go-ipld-prime world, your Node decode ordering will dictate your traversal ordering. I'm going to add some clarifications to the spec about this @ ipld/ipld#233.
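To show why decode ordering matters for CAR output, here is a toy sketch of a depth-first walk over an in-memory DAG. The `node` type, `walk`, and `exampleDAG` are all hypothetical stand-ins, not go-merkledag or go-ipld-prime APIs; the `sortOnDecode` flag only imitates the idea of a decoder that sorts links by name before handing them to the traverser.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical in-memory block: an id (doubling as the link name)
// plus child links in the order they appear in the encoded bytes.
type node struct {
	id    string
	links []*node
}

// walk does a depth-first, pre-order traversal and records each block id in
// visit order, which is the order a CAR writer would emit the blocks.
// With sortOnDecode set, links are stable-sorted by name first, imitating a
// decoder that sorts on decode.
func walk(n *node, sortOnDecode bool, out *[]string) {
	*out = append(*out, n.id)
	links := append([]*node(nil), n.links...) // copy, leave the block untouched
	if sortOnDecode {
		sort.SliceStable(links, func(i, j int) bool { return links[i].id < links[j].id })
	}
	for _, child := range links {
		walk(child, sortOnDecode, out)
	}
}

// exampleDAG builds a root whose links were written in numeric order,
// which is not the byte-wise sorted order once names reach two digits.
func exampleDAG() *node {
	root := &node{id: "root"}
	for _, name := range []string{"0", "1", "2", "10"} {
		root.links = append(root.links, &node{id: name})
	}
	return root
}

func main() {
	var asEncoded, sortedOnDecode []string
	walk(exampleDAG(), false, &asEncoded)
	walk(exampleDAG(), true, &sortedOnDecode)
	fmt.Println(asEncoded)      // [root 0 1 2 10]
	fmt.Println(sortedOnDecode) // [root 0 1 10 2]
}
```

Same blocks, same total size, different order: two tools that disagree on whether to sort at decode time will produce different CAR bytes for the same DAG.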
This shouldn't be a problem under normal circumstances, but we also have to deal with badly sorted or unsorted DAG-PB Links, since we're not being strict about rejecting blocks with unsorted Links lists. And it turns out that the failure case we have here is one of those. If we pull out the Name for each of the links that appear in the first blocks past the root in the CAR, we can see what's going on:
ipfs dag export:
"0" bafkreigbnoobwwzjdl4yoccgyyeyybltkqqq5v7uix45zwi5rjvs64xwfy
"1" bafkreih5rbh2rwzbg2v66rxpyjr755ycub246asuxa55k7ey3xlk457xiq
"2" bafkreig6vmc5bg6lyn65k6jbteqqpugkhzxglbryoiyfiwwn5qef4hyiuu
"3" bafkreibd3ult3qdw4xty3bj4n2kmryn3m6x7kshkjnb7dacbuxi5snj6ki
"4" bafkreid73ihqui4bjpsse7ytsakpxg5r5yfhafsr35dyq67u2mojtvn6nq
"5" bafkreifz4ya5gbziqtyxi4pua3tplk6d2awklicb74pdkmre65vioigvkq
"6" bafkreidq4n2g6bqgct2s35naypliplkhfm3ljt2lxckfi5qxisr4wc6sta
"7" bafkreietfxevvqzi3eqi7s4jhtcsomlmjyrbwqtndcka3d3wxbodoh7qmm
"8" bafkreiav5y32trhvy6ifzfp2cnhfekgblagj3cjmfhe3wrtxdi2amzmogi
Boost:
"0" bafkreigbnoobwwzjdl4yoccgyyeyybltkqqq5v7uix45zwi5rjvs64xwfy
"1" bafkreih5rbh2rwzbg2v66rxpyjr755ycub246asuxa55k7ey3xlk457xiq
"10" bafkreicutg7gtaizq62pdbbon77hlkegerj4tiawnd5dwyrsfuahr3iw4q
"100" bafkreidiaru5reixnumqnfyy3u5kpwdhmr3fu4gn4khuf77ffsl6hb7com
"101" bafkreiav7ocqndlrlqb2hfkdedxuvzzlkazx4kf7cfzwmq5u4riwfuxvgq
"102" bafkreihp42rqoi2bkwxnyeukuzkg6aj4gtzjlfwqlpstavmzh26fz3osza
"103" bafkreigtzeu45gj6mk4y7vw4hmvelbqwifaovilsvgcdpgrtmcarmisgqq
"104" bafkreid5cv2juopgreb6mqltxvxfrvgwia2d3qewp7ulg7y77dtevvwwem
"105" bafkreiaxtiy6zxy5rm3hpytcpztkc2atyzhi4tsxuygi6pspwvdc324kri
- The first list is giving the list of links in the order they appear in the bytes, but Boost is doing them in sorted order.
- This isn't normally a problem because we expect DAG-PB encoders to sort before encoding, so the order they appear in the bytes is the sorted order, so in "normal" cases we wouldn't see this mismatch.
- There obviously exists a DAG-PB encoder that's producing alternatively sorted Links lists that's triggering these failures. This isn't awesome, it's why we have specs and also why we encourage use of existing, battle-hardened codecs. But to be clear: our systems should be able to account for this, the problems we are having arise when we have different decode paths in our tooling.
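The two name orderings above can be reproduced with a small sketch. It assumes the unknown encoder wrote links named "0", "1", "2", ... in numeric order, and that the legacy decode path stable-sorts links by Name as byte strings. The `Link` type, the helper names, and the link count of 200 are all assumptions for illustration; the real root block's link count isn't stated here.

```go
package main

import (
	"fmt"
	"sort"
)

// Link is a hypothetical stand-in for a DAG-PB link; only Name matters here.
type Link struct {
	Name string
}

// wireOrder returns n links named "0".."n-1" in numeric order, which is the
// order we assume the encoder wrote them into the block bytes.
func wireOrder(n int) []Link {
	links := make([]Link, n)
	for i := range links {
		links[i] = Link{Name: fmt.Sprint(i)}
	}
	return links
}

// decodeSorted imitates a decode step that stable-sorts links by Name.
// Go compares strings byte-wise, so "10" sorts before "2".
func decodeSorted(wire []Link) []Link {
	out := append([]Link(nil), wire...)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// names returns the Name of the first k links.
func names(links []Link, k int) []string {
	out := make([]string, 0, k)
	for _, l := range links[:k] {
		out = append(out, l.Name)
	}
	return out
}

func main() {
	wire := wireOrder(200) // 200 is an assumed link count
	fmt.Println(names(wire, 9))               // [0 1 2 3 4 5 6 7 8]
	fmt.Println(names(decodeSorted(wire), 9)) // [0 1 10 100 101 102 103 104 105]
}
```

The second line matches the Name sequence seen in the Boost CAR, which is consistent with a sort being applied somewhere between decode and traversal.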
Re-running my test program against go-merkledag@0.3.2 and doing a Walk produces the same order we're seeing out of Boost.
Unfortunately I haven't figured out why Boost is doing this sorting. Even in v1.0.0 I can only see it pulling in >v0.4.0 versions of go-merkledag, and I've confirmed that this effect only appears for versions <v0.4.0. Perhaps there's some dependency jumbling that's going on to bring it in.
I see three things to do next:
- Figure out how/why Boost might be using an older go-merkledag to do this traversal (perhaps a weird Go dependency shuffle, perhaps this isn't actually coming out of CarOffsetWriter but some other CAR creation path I'm not seeing?)
- I think we should prioritise getting "Add a 'skip' parameter to writev1 so that the beginning of a car can …" (ipld/go-car#291) over the line and replacing CarOffsetWriter here with that. We really shouldn't be using go-merkledag for these kinds of things; we've been using ipld-prime traversals for CAR creation since the Filecoin launch (primarily through go-car's SelectiveCar).
- (lower priority) Figure out what DAG-PB encoder is producing these blocks: is it one that PL controls, or some other, or do we have a bug in sorting? As far as I'm aware we're sorting consistently across implementations and have been doing so forever. (Initial guess: https://github.com/Jorropo/linux2ipfs/blob/262ac5bb774b681babe85c944d69ee44f8505436/main.go#L504-L510 - @Jorropo - do you know if this is being used much in the wild? Do you have a way of checking whether it might be involved in these failing deals?)