Deduplicate webaratas#45

Open
acheronw wants to merge 3 commits into master from deduplicate_webaratas

Conversation

@acheronw
Collaborator

No description provided.

@acheronw acheronw requested a review from DavidNemeskey April 12, 2023 06:04
Owner

@DavidNemeskey DavidNemeskey left a comment

Not really requesting changes, just asking questions. :)

input:
https://nepszava.hu/json/cikk.json?id=1001322_elindult-a-bekemenet
output:
http://nepszava.hu/1001322_elindult-a-bekemenet
Owner

I was wondering: is this systematic in the sense that we don't get the json-type URLs in our data? Would it make sense to keep both the input and the output?

Collaborator Author

Hmmm...

In nepszava we have the following patterns:

  • the one seen in the output
  • same, just with https
  • the same, but with cikk/ inserted between the domain and the path.

These three are duplicates of each other, but not every article exists in all three forms (some do).

I just realized that if we want to keep the documents from webaratás, then we should filter out all three variants. :(

A fourth pattern is:
http://nepszava.hu/articles/article.php?id=238077

These have 6-digit IDs, while the rest have 7-digit IDs, so they don't seem to correspond to the same articles.

The Common Crawl data has no URLs containing nepszava.hu/json.
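To make the variants above concrete, here is a minimal sketch of how the JSON-style URL could be mapped to the article form, and how the http, https and cikk/ variants could be collapsed to a single key for filtering. The function names `json_url_to_article_url` and `canonical_key` are hypothetical, and the regex only assumes that the article slug starts with a numeric ID followed by an underscore; the actual `nepszava_transformation` in this PR may well differ.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumption: article paths look like /<digits>_<slug>, optionally under cikk/.
_ARTICLE_RE = re.compile(r'^/(?:cikk/)?(\d+_[^/]+)/?$')


def json_url_to_article_url(url):
    """Map a nepszava.hu/json/cikk.json?id=... URL to the article URL form.

    Returns None if the URL does not follow that pattern.
    """
    parts = urlsplit(url)
    if parts.netloc != 'nepszava.hu' or parts.path != '/json/cikk.json':
        return None
    ids = parse_qs(parts.query).get('id')
    if not ids:
        return None
    return f'http://nepszava.hu/{ids[0]}'


def canonical_key(url):
    """Collapse the http, https and cikk/ variants to one comparison key.

    Returns None for URLs that do not match, such as the older
    articles/article.php?id=... pattern.
    """
    parts = urlsplit(url)
    if parts.netloc != 'nepszava.hu':
        return None
    match = _ARTICLE_RE.match(parts.path)
    return match.group(1) if match else None
```

With something like this, the keys of the transformed webaratás URLs could be collected into a set, and any Common Crawl document whose `canonical_key` is in that set would be dropped regardless of which of the three variants it uses.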

if args.transformation == 'nepszava':
    transf = nepszava_transformation
elif args.transformation == '888hu':
    transf = hu888_transformation
Owner

Are these the only domains we can handle?

Collaborator Author

Good question.

There are no further usable patterns in the query-parameter parts of the URLs.

There might be useful patterns in the path parts of the URLs...
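One way to look for such path patterns would be to count how often each leading path segment occurs per domain. This is only a sketch: `path_prefix_counts` is a hypothetical helper, not part of this PR, and it assumes the URLs are available as a plain iterable of strings.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_prefix_counts(urls):
    """Count (domain, leading path segment) pairs over an iterable of URLs.

    A URL with a single path segment is counted under the empty prefix ''.
    """
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split('/') if s]
        # Only treat the first segment as a prefix if something follows it.
        prefix = segments[0] if len(segments) > 1 else ''
        counts[(parts.netloc, prefix)] += 1
    return counts
```

Frequent prefixes on a domain (cikk or articles on nepszava.hu, for instance) would then be candidates for a path-based transformation.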
