Indexing OAI-PMH non-compliant repositories #8

BASE (http://www.base-search.net) indexes many repositories that expose an OAI-PMH interface. This is already very useful and covers most of the respectable repositories.

Unfortunately, many repositories do not provide such an interface. Even when they do, the protocol is far from perfect: for instance, it is hard to tell reliably whether a full text is freely available behind a metadata record. In my experience, reaching out to repository admins to encourage them to expose their metadata correctly is not effective at all.
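
To illustrate the full-text problem, here is a minimal Python sketch that parses one Dublin Core (`oai_dc`) record. The record itself is made up, and the "ends with `.pdf`" check is only a naive heuristic I am assuming for the sake of the example, not something the protocol defines. The point is that the record lists identifiers but says nothing about whether any of them leads to a freely accessible full text.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses carrying Dublin Core (oai_dc) metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up record, for illustration only (not from a real repository).
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:repo.example.org:1234</identifier></header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example paper</dc:title>
      <dc:identifier>http://repo.example.org/handle/1234</dc:identifier>
      <dc:identifier>doi:10.0000/example</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>
"""


def identifiers(record_xml):
    """Return all dc:identifier values found in one OAI-PMH record."""
    root = ET.fromstring(record_xml.strip())
    return [el.text.strip() for el in root.findall(".//dc:identifier", NS) if el.text]


def candidate_fulltext_urls(record_xml):
    """Naive heuristic (an assumption): keep identifiers that look like PDF links."""
    return [u for u in identifiers(record_xml) if u.lower().endswith(".pdf")]


if __name__ == "__main__":
    print(identifiers(SAMPLE_RECORD))
    print(candidate_fulltext_urls(SAMPLE_RECORD))
```

For this sample the heuristic finds nothing, even though the landing page might well link to a PDF. Deciding that would require fetching the page itself.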

If we want to go beyond this, I think we need to crawl!
- One (conceptually) simple option would be to crawl for PDF files, extract metadata from them, and store it in a proper indexed database. I think this is what CiteSeerX does, but only for papers within a particular field. In my experience, the metadata that comes out of this is quite noisy.
- The other option we have discussed would be to leverage existing scrapers (Zotero) to extract cleaner metadata from HTML pages. Zotero does the scraping pretty well, but I have no clue what crawling software we should use. Any ideas? The scrapy framework looks nice, but I'm a complete newcomer in this field, so I have probably missed better options.
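
As a rough idea of what the second option could look like for a single page, here is a sketch using only the Python standard library. It does not use Zotero or scrapy; it only collects the `citation_*` meta tags that many repository landing pages embed. The sample page and the function names are hypothetical, and a real crawler would still need fetching, link discovery, and politeness rules on top of this.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = attrs.get("content")
        # A field may repeat (e.g. several authors), so store lists.
        if name.startswith("citation_") and content:
            self.fields.setdefault(name, []).append(content.strip())


def extract_citation_metadata(html):
    """Return a dict mapping each citation_* field name to its list of values."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.fields


# Made-up landing page, for illustration only.
SAMPLE_PAGE = """
<html><head>
<meta name="citation_title" content="An example paper">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_author" content="Roe, Richard">
<meta name="citation_pdf_url" content="http://repo.example.org/1234/paper.pdf">
</head><body>...</body></html>
"""

if __name__ == "__main__":
    for field, values in extract_citation_metadata(SAMPLE_PAGE).items():
        print(field, values)
```

When a page carries a `citation_pdf_url` tag, it also gives a direct hint about where the full text lives, which is the piece of information that is hard to get from the OAI-PMH record alone.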

How many resources (servers) would we need for this? Where could we get them?

