
alternative output format based on JSONlines #34

@oepen

Description


Background

I would like to suggest an alternative output mode that replaces BASE64 encoding with a representation based on JSONlines. This would lower the storage footprint and allow adding annotations to each segment. Additionally, it could support downstream workflows that avoid creating large numbers of smallish (language-specific) files, and it would likely reduce compute for encoding and decoding.

One of my motivations for suggesting JSONlines-based output is to add in-line annotations on each output ‘text’ (i.e. each line in the files written by warc2text, which I understand corresponds to one HTML document). An obvious candidate would be the result of language detection (possibly including an indication of confidence, if available). One could maybe also record the value of the HTML lang attribute on the document, if deemed relevant. Having both available downstream could enable filtering to comparatively ‘pure’ mono-lingual corpora.

Another aspect in which we might want to preserve more information from text extraction relates to HTML parsing. For different downstream use cases – e.g. preparing mono-lingual corpora for LM training vs. discovery of parallel texts for downstream MT work – I can imagine different heuristics in treating various HTML elements. Currently, my understanding is that warc2text outputs newline-separated ‘segments’, for example for <li>, <input>, <select>, and probably <td> elements. Many of these will prototypically be very short ‘utterances’ with linguistic properties quite unlike regular ‘sentences’.

In selecting text for LM training, one might want to capitalize on ‘running’ text. Length thresholding seems an obvious candidate heuristic, but I could also imagine selectively discarding certain HTML elements, seeking to extract smaller but higher-quality mono-lingual text corpora. To enable downstream experimentation along these lines, warc2text would have to record the HTML element corresponding to each newline-separated segment in its output text.
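To make this concrete, a downstream filter over the proposed JSONlines records could be quite small. The following sketch assumes the keys h and s from the example further below, where h holds one HTML context per newline-separated segment in s; the element list and length threshold are purely illustrative.

import json
import sys

# Illustrative heuristics only: discard segments extracted from list-like or
# form-like elements, as well as very short segments, when selecting ‘running’ text.
SKIP_ELEMENTS = {"li", "input", "select", "option", "td"}
MIN_CHARACTERS = 40

for line in sys.stdin:
    record = json.loads(line)
    segments = record["s"].split("\n")
    elements = record.get("h", [None] * len(segments))
    for element, segment in zip(elements, segments):
        if element in SKIP_ELEMENTS or len(segment) < MIN_CHARACTERS:
            continue
        sys.stdout.write(segment + "\n")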

Observations

I picked a random sample of 137 WARC files from the WIDE15 copy on Cirrus: in October 2022, there were 21340 corresponding text files. I uncompressed and concatenated the text files into one large file, called base64, which comes to 314 GB. Running base64 -d on this data (in my Cirrus home directory, which appears to be a comparatively slow file system) takes 34m11s (of which 24m21s cpu and 6m38s system time) and reduces the file size to 236 GB (75%); call this file txt. Compression using gzip on these two variants yields

  • base64: 404m3s (311m42s cpu, 4m27s system time), for a resulting file base64.gz of 116 GB (36.9%)
  • txt: 272m57s (249m2s cpu, 3m18s system time), for a resulting file txt.gz of 82 GB (34.7%)

All of the above were run on cirrus-login1 (which appeared very lightly loaded while I was running these experiments). This suggests that BASE64 encoding adds substantial overhead in space and time: some 33% in volume and some 49% in (wall-clock) compression time.

I also tried gauging the cost associated with locating and opening many small files on the Cirrus file system. Using cat > /dev/null on either the 21340 individual compressed files or the combined base64.gz gave timings of 29m15s (with virtually no cpu and system time) and 4m4s, respectively. I believe there is substantial overhead introduced by using the file system to bifurcate into language-specific files, potentially multiplying our tens of thousands of WARC files by a factor of around 200. In my experience, HPC operators tend to be wary of data collections organized as millions of smallish files.

JSONlines

My suggestion is to add an alternative line-oriented output format based on JSONlines. Escaping of newlines in this scheme would follow standard JSON string conventions, which mirror ANSI C (\n, meaning that literal backslashes and double quotes also need to be backslash-escaped; in principle this applies to other control characters as well, but I am guessing warc2text may not output any, rather normalizing e.g. tabs or form-feeds into plain whitespace).

A candidate enriched output from warc2text could look somewhat like the following:

{"o": 42, "l": "nob", "cld2": ["nob", -0.1681], "h": ["p", "li", "li"], "s": "\"foo\"\n\\bar\\\nbaz"}
{"o": 4711, "l": "nno", "cld2": ["nob", -1.7369], "h": ["p"]" "s": "Eg skriver nynorsk."}

In addition to everything discussed in the introduction above (the lang attribute, CLD2 prediction including log-probabilities, and a list of HTML contexts), the example introduces one additional key, o, holding the byte offset into the underlying WARC file, i.e. the starting position of the record from which each segment was extracted. This could serve as a (simple and compact) unique key pointing back to the full WARC record, in case one eventually wanted to extract additional meta-information, say the corresponding URL and time of capture.
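As an illustration of that back-reference (not part of the proposal itself), and assuming a reader like warcio, resolving an offset to its record could look roughly as follows; the file name and offset are hypothetical.

from warcio.archiveiterator import ArchiveIterator

def record_at(warc_path, offset):
    """Seek to a record boundary in a (gzipped) WARC file and parse one record."""
    # Assumes each WARC record is its own gzip member (the common convention for
    # web-archive files), so seeking to a record offset lands on a valid stream.
    with open(warc_path, "rb") as stream:
        stream.seek(offset)
        for record in ArchiveIterator(stream):
            return (record.rec_headers.get_header("WARC-Target-URI"),
                    record.rec_headers.get_header("WARC-Date"))

# e.g. record_at("example.warc.gz", 4711) would return the URL and capture time
# of the record starting at byte offset 4711.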

As a side note, if warc2text outputs were no longer bifurcated by language codes, one could also consider leaving language identification to a later stage (or asking for, and interpolating, the opinions of multiple tools). In principle, one might want to experiment with different setups here, which would likely become easier in practice, either way, if warc2text output were not organized into separate, language-specific files.

Discussion

For various downstream scenarios, one can discuss the trade-offs in recording additional information in-line, together with the text, vs. writing it into a separate file, where corresponding records could be sequentially aligned, i.e. by running line index. For current consumers of warc2text outputs, much of the additional information proposed above would likely be superfluous, i.e. unnecessary ‘fluff’ that each consumer would have to read over and discard.

In terms of total volume, the additional information may in fact still add less than the current BASE64 encoding (which limits itself to an unnecessarily narrow code book). Nevertheless, it might offer better scalability and greater flexibility over time (e.g. to communicate even more information from the original web page into downstream processing) to keep the actual text and JSON annotations as separate, aligned files.
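If the separate-files route were taken, consuming the two streams together would remain simple, pairing records by line index; the file names here are hypothetical.

import json

# Read text and annotations from two line-aligned JSONlines files.
with open("text.jsonl", encoding="utf-8") as text_file, \
        open("meta.jsonl", encoding="utf-8") as meta_file:
    for text_line, meta_line in zip(text_file, meta_file):
        text = json.loads(text_line)
        meta = json.loads(meta_line)
        # ... combine text and meta as needed ...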

Even if so, there may be benefits in shifting to a JSONlines encoding for the text itself, notably reduced space (a more compact encoding) and a wider choice of tools for decoding warc2text outputs: either off-the-shelf JSONlines readers or a ‘leaner’ decoder (making some specific assumptions about JSON conventions in the file) that one may hope will reduce processing time, certainly compared to the current BASE64 decoding. I attach a code snippet below to illustrate that point (sadly purporting to be a text file, to work around limitations on possible attachments).

jsonl.txt
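For readers not following the attachment, a minimal sketch of such a ‘leaner’ reader (not the attached snippet; assuming each line is a JSON object whose s value holds the text) could be:

import gzip
import json
import sys

# Stream a compressed JSONlines file and recover the plain text, relying on an
# off-the-shelf JSON parser instead of BASE64 decoding.
with gzip.open(sys.argv[1], "rt", encoding="utf-8") as lines:
    for line in lines:
        sys.stdout.write(json.loads(line)["s"] + "\n")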
