
alternative output format based on JSONlines #34

@oepen

Description


Background

I would like to suggest an alternative output mode that replaces BASE64 encoding with a representation based on JSONlines. This would lower the storage footprint and allow adding annotations to each segment. Additionally, it could support downstream workflows that avoid creating large numbers of smallish (language-specific) files, and it would likely reduce compute for encoding and decoding.

One of my motivations for suggesting JSONlines-based output is to add in-line annotations on each output ‘text’ (i.e. each line in the files written by warc2text, which I understand corresponds to one HTML document). An obvious candidate would be the result of language detection (possibly including an indication of confidence, if available). One could maybe also record the value of the HTML lang attribute on the document, if deemed relevant. Having both available downstream could enable filtering to comparatively ‘pure’ mono-lingual corpora.

Another aspect in which we might want to preserve more information from text extraction relates to HTML parsing. For different downstream use cases – e.g. preparing mono-lingual corpora for LM training vs. discovery of parallel texts for downstream MT work – I can imagine different heuristics in treating various HTML elements. Currently, my understanding is that warc2text outputs newline-separated ‘segments’, for example for <li>, <input>, <select>, and probably <td> elements. Many of these will prototypically be very short ‘utterances’ with linguistic properties quite unlike regular ‘sentences’.

In selecting text for LM training, one might want to capitalize on ‘running’ text. Length thresholding seems an obvious candidate heuristic, but I could also imagine selectively discarding certain HTML elements, seeking to extract smaller but higher-quality mono-lingual text corpora. To enable downstream experimentation along these lines, warc2text would have to record the HTML element corresponding to each newline-separated segment in its output text.
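To make this concrete, a downstream filter over the proposed JSONlines records could be quite small. The following sketch assumes the keys h and s from the example further below, where h holds one HTML context per newline-separated segment in s; the element list and length threshold are purely illustrative.

import json
import sys

# Illustrative heuristics only: discard segments extracted from list-like or
# form-like elements, as well as very short segments, when selecting ‘running’ text.
SKIP_ELEMENTS = {"li", "input", "select", "option", "td"}
MIN_CHARACTERS = 40

for line in sys.stdin:
    record = json.loads(line)
    segments = record["s"].split("\n")
    elements = record.get("h", [None] * len(segments))
    for element, segment in zip(elements, segments):
        if element in SKIP_ELEMENTS or len(segment) < MIN_CHARACTERS:
            continue
        sys.stdout.write(segment + "\n")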

Observations

I picked a random sample of 137 WARC files from the WIDE15 copy on Cirrus: in October 2022, there were 21340 corresponding text files. I uncompressed and concatenated the text files into one large file, called base64, which comes to 314 GB. Running base64 -d on this data (in my Cirrus home directory, which appears to be a comparatively slow file system) takes 34m11s (of which 24m21s cpu and 6m38s system time) and reduces the file size to 236 GB (75%); call this file txt. Compression using gzip on these two variants yields

  • base64: 404m3s (311m42s cpu, 4m27s system time), for a resulting file base64.gz of 116 GB (36.9%)
  • txt: 272m57s (249m2s cpu, 3m18s system time), for a resulting file txt.gz of 82 GB (34.7%)

All of the above were run on cirrus-login1 (which appeared very lightly loaded while I was running these experiments). This suggests that BASE64 encoding adds substantial overhead in space and time: some 33% in volume and some 49% in (wall-clock) compression time.

I also tried gauging the cost associated with locating and opening many small files on the Cirrus file system. Using cat > /dev/null on either the 21340 individual compressed files or the combined base64.gz gave timings of 29m15s (with virtually no cpu and system time) and 4m4s, respectively. I believe there is substantial overhead introduced by using the file system to bifurcate into language-specific files, potentially multiplying our tens of thousands of WARC files by a factor of around 200. In my experience, HPC operators tend to be wary of data collections organized as millions of smallish files.

JSONlines

My suggestion is to add an alternative line-oriented output format based on JSONlines. Escaping of newlines in this scheme would follow standard JSON string conventions, which mirror ANSI C (\n, meaning that literal backslashes and double quotes also need to be backslash-escaped; in principle this applies to other control characters as well, but I am guessing warc2text may not output any, rather normalizing e.g. tabs or form-feeds into plain whitespace).

A candidate enriched output from warc2text could look somewhat like the following:

{"o": 42, "l": "nob", "cld2": ["nob", -0.1681], "h": ["p", "li", "li"], "s": "\"foo\"\n\\bar\\\nbaz"}
{"o": 4711, "l": "nno", "cld2": ["nob", -1.7369], "h": ["p"]" "s": "Eg skriver nynorsk."}

In addition to everything discussed in the introduction above (the lang attribute, CLD2 prediction including log-probabilities, and a list of HTML contexts), the example introduces one additional key, o, holding the byte offset into the underlying WARC file, i.e. the starting position of the record from which each segment was extracted. This could serve as a (simple and compact) unique key pointing back to the full WARC record, in case one eventually wanted to extract additional meta-information, say the corresponding URL and time of capture.
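As an illustration of that back-reference (not part of the proposal itself), and assuming a reader like warcio, resolving an offset to its record could look roughly as follows; the file name and offset are hypothetical.

from warcio.archiveiterator import ArchiveIterator

def record_at(warc_path, offset):
    """Seek to a record boundary in a (gzipped) WARC file and parse one record."""
    # Assumes each WARC record is its own gzip member (the common convention for
    # web-archive files), so seeking to a record offset lands on a valid stream.
    with open(warc_path, "rb") as stream:
        stream.seek(offset)
        for record in ArchiveIterator(stream):
            return (record.rec_headers.get_header("WARC-Target-URI"),
                    record.rec_headers.get_header("WARC-Date"))

# e.g. record_at("example.warc.gz", 4711) would return the URL and capture time
# of the record starting at byte offset 4711.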

As a side note, if warc2text outputs were no longer bifurcated by language codes, one could also consider leaving language identification to a later stage (or asking for, and interpolating, the opinions of multiple tools). In principle, one might want to experiment with different setups here, which would likely become easier in practice, either way, if warc2text output were not organized into separate, language-specific files.

Discussion

For various downstream scenarios, one can discuss the trade-offs in recording additional information in-line, together with the text, vs. writing it into a separate file, where corresponding records could be sequentially aligned, i.e. by running line index. For current consumers of warc2text outputs, much of the additional information proposed above would likely be superfluous, i.e. unnecessary ‘fluff’ that each consumer would have to read over and discard.

In terms of total volume, the additional information may in fact still add less than the current BASE64 encoding (which limits itself to an unnecessarily narrow code book). Nevertheless, it might offer better scalability and greater flexibility over time (e.g. to communicate even more information from the original web page into downstream processing) to keep the actual text and JSON annotations as separate, aligned files.
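If the separate-files route were taken, consuming the two streams together would remain simple, pairing records by line index; the file names here are hypothetical.

import json

# Read text and annotations from two line-aligned JSONlines files.
with open("text.jsonl", encoding="utf-8") as text_file, \
        open("meta.jsonl", encoding="utf-8") as meta_file:
    for text_line, meta_line in zip(text_file, meta_file):
        text = json.loads(text_line)
        meta = json.loads(meta_line)
        # ... combine text and meta as needed ...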

Even if so, there may be benefits in shifting to a JSONlines encoding for the text itself, notably reduced space (a more compact encoding) and a wider choice of tools for decoding warc2text outputs: either off-the-shelf JSONlines readers or a ‘leaner’ decoder (making some specific assumptions about JSON conventions in the file) that one may hope will reduce processing time, certainly compared to the current BASE64 decoding. I attach a code snippet below to illustrate that point (sadly purporting to be a text file, to work around limitations on possible attachments).

jsonl.txt
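For readers not following the attachment, a minimal sketch of such a ‘leaner’ reader (not the attached snippet; assuming each line is a JSON object whose s value holds the text) could be:

import gzip
import json
import sys

# Stream a compressed JSONlines file and recover the plain text, relying on an
# off-the-shelf JSON parser instead of BASE64 decoding.
with gzip.open(sys.argv[1], "rt", encoding="utf-8") as lines:
    for line in lines:
        sys.stdout.write(json.loads(line)["s"] + "\n")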
