From 96b2a5b9d175a15896ac4d8d8bc920a92a494c57 Mon Sep 17 00:00:00 2001
From: Jonathan Speiser
Date: Thu, 21 Sep 2023 21:54:59 -0400
Subject: [PATCH] fixes minor typo in data prep README

---
 data_prep/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/data_prep/README.md b/data_prep/README.md
index fb2138a..16493c6 100644
--- a/data_prep/README.md
+++ b/data_prep/README.md
@@ -10,7 +10,7 @@ We follow the [Llama paper](https://arxiv.org/abs/2302.13971) and tried our best
 
 ### Commoncrawl
 
-We downlaod five dumps from Commoncrawl, and run the dumps through the official [`cc_net` pipeline](https://github.com/facebookresearch/cc_net).
+We download five dumps from Commoncrawl, and run the dumps through the official [`cc_net` pipeline](https://github.com/facebookresearch/cc_net).
 We then deduplicate on the paragraph level, and filter out low quality text using a linear classifier trained to classify paragraphs as Wikipedia references or random Commoncrawl samples.