same file name for all parts of one post JSON in the tar file #7

@sfelixwu

This question is mainly for Fredrik. I found a potential issue with the tar files generated under /data2/sincere/crawled/phar: many entries within the same tar file share exactly the same file name, which means that if I use "tar xvf" to extract it, each entry overwrites the previous one and I only get the last part of a post. The information is there, but I am unsure (1) how to extract it correctly and (2) why multiple files have the same name (intentional or not). Here is an example that I put in Dropbox. It contains only two posts, but each post has multiple parts (all with the same name):

https://www.dropbox.com/sh/m2vp2v6vdhcyab5/AAB7dVRztjkxK5GSwEmzKg4ca?dl=0

I figured out a way to extract the original JSON -- I did "tar xOvf ../Samsung_Mobile-114219621960016.tar 114219621960016_2012-12-14T13_114219621960016_375711795852975.json > a1.json"

I moved both a1.json and a2.json to the same Dropbox folder as well. The tar file contains about 20-21 files belonging to only TWO posts (i.e., each post was broken into multiple files), so a1.json (and likewise a2.json) is the single concatenated file for ONE post. I had to use the O option in tar to redirect the output to stdout; otherwise, the content would be overwritten during the untar process because of the identical file names.
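
For reference, the same reassembly could probably be scripted rather than done one name at a time. Below is a minimal Python sketch using the standard tarfile module. It assumes that the parts of a post appear in the archive in the right order and that simply concatenating them gives back the original JSON (which is what the "tar xO" approach above relies on as well). The function name reassemble_members is made up for illustration.

```python
import tarfile
from collections import defaultdict


def reassemble_members(tar_path):
    """Return {member_name: bytes}, joining same-named entries in archive order.

    Assumption: the pieces of one post are stored in order, so plain
    concatenation restores the original JSON text.
    """
    parts = defaultdict(list)
    # "r:*" lets tarfile detect on its own whether the archive is compressed.
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Passing the TarInfo object (not the name) reads this exact entry,
            # even when several entries share one name.
            with tar.extractfile(member) as fh:
                parts[member.name].append(fh.read())
    return {name: b"".join(chunks) for name, chunks in parts.items()}
```

Each value in the returned dictionary could then be written to its own file, so nothing gets overwritten the way it does with a plain "tar xvf".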

Update: @fredrike, I now think we have a bigger problem. Once a tar file grows large enough (~128MB), it gets gzipped, but in the gzip file only *** ONE *** of the files sharing the same name was kept!!! This means that potentially most of the JSONs under the gz directory contain only a small fraction of a complete post. I will post the compressed file to the folder as well so you can see. Very bad news for our JSONs today, especially for the bigger ones. I believe that we still need to develop a program to regenerate JSONs for all the posts from SINCERE.dB after all.
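
To get a rough idea of how many archives are affected, one option is to count how many entries each archive holds per file name and compare the uncompressed tar files with the gzipped ones. The sketch below is only a starting point: it assumes the gzip files are gzipped tar archives that Python's tarfile module can open, and the function names are hypothetical.

```python
import tarfile
from collections import Counter


def count_members_by_name(archive_path):
    """Count how many regular-file entries the archive holds under each name."""
    counts = Counter()
    # "r:*" opens both plain and gzip-compressed tar archives
    # (assumption: the gz files are gzipped tars).
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                counts[member.name] += 1
    return counts


def duplicated_names(archive_path):
    """Return only the names that occur more than once, with their counts."""
    counts = count_members_by_name(archive_path)
    return {name: n for name, n in counts.items() if n > 1}
```

If a post shows many parts in the plain tar file but only a single entry in the corresponding gzip file, that would be consistent with the loss described above.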
