So, BASE (http://www.base-search.net) indexes many repositories that have an OAI-PMH interface, which is already very useful and covers most of the respectable repositories.
But unfortunately, many repositories do not provide such an interface. And even when they provide one, the protocol is far from perfect: for instance, it is hard to detect accurately whether a full text is freely available behind a metadata record. Reaching out to repository admins to encourage them to expose their metadata correctly is, in my experience, not effective at all.
If we want to go beyond this, I think we need to crawl!
- One (conceptually) simple option would be to crawl for PDF files, extract metadata from them, and dump it into a proper database with indexes. I think this is what CiteSeerX does, but only for papers within a particular field. In my experience the metadata that comes out of this is quite noisy. (See the first sketch after this list.)
- The other option we have discussed would be to leverage existing scrapers (Zotero) to extract cleaner metadata from HTML pages. Zotero does the scraping pretty well, but I have no clue what crawling software we should use. Any idea? The Scrapy framework looks nice, but I'm a complete newcomer in this field, so I have probably missed better options. (See the second sketch after this list.)
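
To make the first option more concrete, here is a minimal sketch, assuming Python with the `requests` and `pypdf` packages and a placeholder URL. It only reads the fields embedded in the PDF itself, which is exactly the kind of thing that ends up noisy:

```python
# Minimal sketch of option 1: fetch a PDF and read whatever metadata is
# embedded in it. Assumes the `requests` and `pypdf` packages; the URL is a
# placeholder. Embedded fields are often missing or wrong, which is a big
# part of why this approach yields noisy metadata.
import io

import requests
from pypdf import PdfReader


def extract_pdf_metadata(pdf_url: str) -> dict:
    """Download a PDF and return its embedded metadata fields."""
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()

    reader = PdfReader(io.BytesIO(response.content))
    info = reader.metadata or {}

    return {
        "url": pdf_url,
        "title": info.get("/Title"),
        "author": info.get("/Author"),
        "num_pages": len(reader.pages),
    }


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(extract_pdf_metadata("https://example.org/some-paper.pdf"))
```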
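
And a sketch of the second option, assuming Scrapy for the crawling plus a locally running Zotero translation-server doing the actual scraping. The `/web` endpoint, port 1969, and the JSON response shape are assumptions taken from that project's README, and the seed URL is a placeholder:

```python
# Sketch of option 2: a Scrapy spider that hands every page it visits to a
# locally running Zotero translation-server, which does the actual metadata
# scraping with Zotero's translators. Endpoint, port, and response shape are
# assumptions based on the translation-server README; the seed URL is a
# placeholder.
import json

import scrapy

TRANSLATION_SERVER = "http://localhost:1969/web"  # assumed default port


class RepositorySpider(scrapy.Spider):
    name = "repository"
    start_urls = ["https://example-repository.org/"]  # placeholder seeds

    def parse(self, response):
        # Ask the translation server to scrape metadata for this page.
        yield scrapy.Request(
            url=TRANSLATION_SERVER,
            method="POST",
            body=response.url,
            headers={"Content-Type": "text/plain"},
            callback=self.parse_metadata,
            dont_filter=True,  # every page posts to the same endpoint URL
        )

        # Keep crawling by following links found on the page.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

    def parse_metadata(self, response):
        # The server is expected to return a JSON array of Zotero items.
        for record in json.loads(response.text):
            yield record  # an item pipeline could then store/index these
```

One nice property of this split is that the crawling part (Scrapy or whatever we end up using) and the scraping part (Zotero's translators) stay independent, so either side can be swapped out later.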
How many resources (servers) would we need for this? Where could we get them?