DavidNemeskey left a comment
Not really requesting changes, just asking questions. :)
input:
    https://nepszava.hu/json/cikk.json?id=1001322_elindult-a-bekemenet
output:
    http://nepszava.hu/1001322_elindult-a-bekemenet
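The input/output pair above could be produced by a mapping along these lines. This is only a sketch, not the PR's `nepszava_transformation`: the helper name `json_url_to_article_url` is hypothetical, and it assumes the article slug is always carried in the `id` query parameter of `/json/cikk.json`.

```python
from urllib.parse import urlsplit, parse_qs


def json_url_to_article_url(url):
    """Map a nepszava.hu JSON endpoint URL to its article URL.

    Hypothetical helper for illustration; returns None for any URL
    that does not look like the JSON-type URL in the example.
    """
    parts = urlsplit(url)
    # Assumption: the slug lives in the ``id`` query parameter.
    ids = parse_qs(parts.query).get('id')
    if parts.path != '/json/cikk.json' or not ids:
        return None
    # The expected output in the example uses plain http.
    return 'http://{}/{}'.format(parts.netloc, ids[0])
```

Returning `None` for unrecognised URLs is a design choice of this sketch, so that callers can decide whether to keep or drop them.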
I was wondering: is this systematic, in the sense that JSON-type URLs never appear in our data? Would it make sense to keep both the input and the output?
Hmmm...
For nepszava we have the following URL patterns:
- the one seen in the output
- the same, but with https
- the same, but with cikk/ inserted between the domain and the path.
These three are duplicates of each other, but not every article exists in all three forms (some do).
I just realized that if we want to keep the documents from webaratás, then we should filter out all three variants. :(
A fourth pattern is:
http://nepszava.hu/articles/article.php?id=238077
These have 6-digit ids, while the rest have 7-digit ids, so they don't seem to correspond to the other three patterns.
Common Crawl has no URLs containing nepszava.hu/json.
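To filter out all three duplicate variants at once, one option is to reduce each URL to a single canonical key before comparing. The sketch below is an assumption-laden illustration, not code from this PR: `canonical_nepszava_url` is a hypothetical name, and the regex assumes that current articles always have a 7-digit id followed by an underscore and a slug, as in the example above.

```python
import re
from urllib.parse import urlsplit

# Assumption: current article paths are "<7-digit id>_<slug>",
# optionally preceded by "cikk/".
ARTICLE_PATH = re.compile(r'^/(?:cikk/)?(\d{7}_[^/?#]+)$')


def canonical_nepszava_url(url):
    """Collapse the http, https and cikk/ variants into one form.

    Hypothetical helper; returns None for anything else, including
    the legacy /articles/article.php?id=... pattern.
    """
    parts = urlsplit(url)
    if parts.netloc != 'nepszava.hu':
        return None
    match = ARTICLE_PATH.match(parts.path)
    if match is None:
        return None
    return 'http://nepszava.hu/' + match.group(1)
```

Two URLs are then duplicates exactly when their canonical forms are equal and not `None`; the 6-digit legacy URLs fall through untouched.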
if args.transformation == 'nepszava':
    transf = nepszava_transformation
elif args.transformation == '888hu':
    transf = hu888_transformation
Are these the only domains we can handle?
Good question.
There are no further patterns to exploit in the query-parameter parts of URLs.
There might be useful patterns in the path parts of URLs...
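If more domains do turn up, the if/elif chain could be replaced by a lookup table so that adding a domain is a one-line change. A minimal sketch, assuming each transformation is a function from URL to URL: the `register` decorator, `TRANSFORMATIONS` and `get_transformation` are hypothetical names, and the two function bodies are placeholders standing in for the real ones in this PR.

```python
TRANSFORMATIONS = {}


def register(name):
    """Register a transformation under a command-line name (hypothetical)."""
    def decorator(func):
        TRANSFORMATIONS[name] = func
        return func
    return decorator


@register('nepszava')
def nepszava_transformation(url):
    return url  # placeholder body; the real logic lives in the PR


@register('888hu')
def hu888_transformation(url):
    return url  # placeholder body; the real logic lives in the PR


def get_transformation(name):
    """Look up a transformation, failing loudly on unknown names."""
    try:
        return TRANSFORMATIONS[name]
    except KeyError:
        raise ValueError('unknown transformation: {!r}'.format(name)) from None
```

With this in place the dispatch would reduce to something like `transf = get_transformation(args.transformation)`, and `sorted(TRANSFORMATIONS)` could feed the argument parser's list of allowed values.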