This question is mainly for Fredrik -- I found a potential issue with the tar files generated under /data2/sincere/crawled/phar. Many files inside the same tar file share exactly the same file name, which means that if I use "tar xvf" to extract it, I only get the last part of each post, because each part overwrites the previous one. The information is there, but I am unsure (1) how to extract it correctly and (2) why multiple files have the same name (is it intentional or not?). Here is an example that I put in Dropbox -- there are only two posts, but each post has multiple parts (all with the same name) --
https://www.dropbox.com/sh/m2vp2v6vdhcyab5/AAB7dVRztjkxK5GSwEmzKg4ca?dl=0
I figured out a way to extract the original JSON -- I did "tar xOvf ../Samsung_Mobile-114219621960016.tar 114219621960016_2012-12-14T13_114219621960016_375711795852975.json > a1.json"
I moved both a1.json and a2.json to the same Dropbox folder as well. The tar file contains about 20-21 files belonging to only TWO posts (i.e., each post was broken into multiple files), so a1.json (and likewise a2.json) is the single concatenated file for ONE post. I had to use the O option in tar to redirect the output to stdout; otherwise, each part would overwrite the previous one during the untar process because they all share the same file name.
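For reference, the same "concatenate every part that shares a name" extraction could be scripted for a whole archive instead of one member at a time. This is only a rough sketch, not how the crawler itself works: the function name `concat_duplicate_members` is made up for illustration, and it assumes that the parts are stored in the archive in the right order and that plain concatenation reproduces the post (which is what the "tar xO" result above suggests).

```python
import tarfile
from collections import defaultdict


def concat_duplicate_members(tar_path):
    """Return {member name: bytes}, joining all members that share a name.

    Assumption: parts appear in the archive in the order they should be
    concatenated, as with `tar xO`.
    """
    parts = defaultdict(list)
    # "r:*" lets tarfile detect plain or compressed archives on its own.
    with tarfile.open(tar_path, "r:*") as tar:
        # Iterating the archive yields every member, including ones whose
        # name repeats; a lookup by name would only return one of them.
        for member in tar:
            if not member.isfile():
                continue
            parts[member.name].append(tar.extractfile(member).read())
    return {name: b"".join(chunks) for name, chunks in parts.items()}
```

Calling `concat_duplicate_members("Samsung_Mobile-114219621960016.tar")` would then give one byte string per post name, which can be written out as a1.json, a2.json, and so on.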
Update -- @fredrike, now I think we have a bigger problem -- once a tar file grows large enough (~128MB), it gets gzipped, but in the gzipped file only *** ONE *** of the files sharing the same name was kept!!! This means that potentially most of the JSONs under the gz directory contain only a small fraction of a complete post. I will post the compressed file to the folder as well so you can see. Very bad news for our JSONs today, especially the bigger ones. I believe that we still need to develop a program to re-generate the JSONs for all the posts from SINCERE.dB, after all.
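To get a feel for how many archives are affected, something like the following could count how often each member name occurs in a given tar or tar.gz file. Again, this is just a sketch under assumptions: the helper names `member_name_counts` and `duplicated_names` are hypothetical, and it only reports what is actually stored in the archive, so a gzipped file that already lost its duplicate parts will simply show a count of 1 for each name.

```python
import tarfile
from collections import Counter


def member_name_counts(archive_path):
    """Count how many regular-file members carry each name in the archive."""
    # "r:*" handles both plain .tar and gzipped archives transparently.
    with tarfile.open(archive_path, "r:*") as tar:
        return Counter(m.name for m in tar if m.isfile())


def duplicated_names(archive_path):
    """Return only the names that occur more than once, with their counts."""
    counts = member_name_counts(archive_path)
    return {name: n for name, n in counts.items() if n > 1}
```

Comparing the counts for a still-uncompressed tar file against those for a gzipped one of the same page would show whether the parts were dropped at compression time.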