Deduplicate webaratas by acheronw · Pull Request #45 · DavidNemeskey/cc_corpus

acheronw · 2023-04-12T06:04:03Z

No description provided.

DavidNemeskey

Not really requesting changes, just asking questions. :)

DavidNemeskey · 2023-04-20T09:34:18Z

scripts/create_droplist.py

+    input:
+    https://nepszava.hu/json/cikk.json?id=1001322_elindult-a-bekemenet
+    output:
+    http://nepszava.hu/1001322_elindult-a-bekemenet


I was wondering: is this systematic in the sense that we don't get the json-type URLs in our data? Would it make sense to keep both the input and the output?

Hmmm...

In nepszava we have the following patterns:

the one seen in the output

same, just with https

the same, but with cikk/ inserted between the domain and the path.

These three are duplicates of each other, but not every article exists in all three forms (some do).

I just realized that if we want to keep the documents from webaratás, then we should filter out all three variants. :(
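To make the point about filtering all three variants concrete, here is a minimal sketch of how they could be generated from one JSON-style URL. The function name `nepszava_variants` is hypothetical and is not the PR's `nepszava_transformation`. The thread does not say which scheme the cikk/ variant uses, so the sketch assumes both http and https may occur and emits four URLs.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def nepszava_variants(json_url: str) -> List[str]:
    """Return the page-URL variants for one nepszava.hu JSON article URL.

    Hypothetical helper: the URL shapes come from the review thread,
    the scheme of the cikk/ variant is an assumption.
    """
    parts = urlsplit(json_url)
    ids = parse_qs(parts.query).get('id')
    if parts.path != '/json/cikk.json' or not ids:
        raise ValueError(f'Not a nepszava JSON article URL: {json_url}')
    slug = ids[0]  # e.g. 1001322_elindult-a-bekemenet
    host = parts.netloc
    variants = []
    for scheme in ('http', 'https'):
        # Plain variant: the one seen in the output, with either scheme.
        variants.append(f'{scheme}://{host}/{slug}')
    for scheme in ('http', 'https'):
        # Variant with cikk/ between the domain and the path (scheme assumed).
        variants.append(f'{scheme}://{host}/cikk/{slug}')
    return variants
```

Putting every variant on the droplist would cover an article no matter which of its forms ended up in the crawl.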

A fourth pattern is:
http://nepszava.hu/articles/article.php?id=238077

These have 6-digit IDs, while the rest have 7-digit IDs, so they don't seem to match any of the other three patterns.
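One way to check whether the two ID spaces overlap is to pull the numeric ID out of each URL shape and compare the sets. This is only a sketch: the regular expressions are assumptions based on the example URLs in this thread, and `article_id` is a hypothetical helper, not part of the PR.

```python
import re
from typing import Optional

# Slug-style URLs, optionally with cikk/ after the domain: <id>_<slug>
SLUG_ID = re.compile(r'^https?://nepszava\.hu/(?:cikk/)?(\d+)_')
# Older article.php URLs: ?id=<id>
LEGACY_ID = re.compile(r'^https?://nepszava\.hu/articles/article\.php\?id=(\d+)$')


def article_id(url: str) -> Optional[str]:
    """Return the numeric article ID in the URL, or None if no pattern fits."""
    for pattern in (SLUG_ID, LEGACY_ID):
        match = pattern.match(url)
        if match:
            return match.group(1)
    return None
```

Collecting the IDs per pattern over the whole URL list and intersecting the two sets would show whether any article appears in both forms.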

Common Crawl has no URLs containing nepszava.hu/json.

DavidNemeskey · 2023-04-20T09:35:39Z

scripts/create_droplist.py

+    if args.transformation == 'nepszava':
+        transf = nepszava_transformation
+    elif args.transformation == '888hu':
+        transf = hu888_transformation


Are these the only domains we can handle?

Good question.

There are no further patterns to exploit in the query-parameter part of the URLs.

There might be useful patterns in the path part of the URLs...
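If more domains do turn up, the if/elif chain could become a lookup table so that adding a domain is a one-line change. A minimal sketch, assuming the option is called `--transformation` (the diff only shows `args.transformation`). The two function bodies are placeholders, because the real implementations are not shown in this review.

```python
import argparse
from typing import Callable, Dict


def nepszava_transformation(url: str) -> str:
    """Placeholder for the PR's function; the real body is not shown here."""
    return url


def hu888_transformation(url: str) -> str:
    """Placeholder for the PR's function; the real body is not shown here."""
    return url


# Maps the command-line value to the function that rewrites the URLs.
TRANSFORMATIONS: Dict[str, Callable[[str], str]] = {
    'nepszava': nepszava_transformation,
    '888hu': hu888_transformation,
}


def parse_arguments(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # choices makes argparse reject unsupported domains with a clear error.
    parser.add_argument('--transformation', required=True,
                        choices=sorted(TRANSFORMATIONS))
    return parser.parse_args(argv)


args = parse_arguments(['--transformation', 'nepszava'])
transf = TRANSFORMATIONS[args.transformation]
```

With `choices` derived from the table, the help text also lists the supported domains automatically.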

acheronw added 2 commits April 11, 2023 14:36

Nepszava url transformation

0545e6d

Added transformation for 888.hu

7b96b17

acheronw requested a review from DavidNemeskey April 12, 2023 06:04

DavidNemeskey requested changes Apr 20, 2023

Merge branch 'master' into deduplicate_webaratas

8bdfb3e

acheronw wants to merge 3 commits into master from deduplicate_webaratas