
Indexing OAI-PMH non-compliant repositories #8

@wetneb

So, BASE (http://www.base-search.net) indexes many repositories that have an OAI-PMH interface, which is already very useful and covers most of the respectable repositories.

Unfortunately, many repositories do not provide such an interface. Even when they do, the protocol is far from perfect: for instance, it is hard to tell reliably whether a full text is freely available behind a given metadata record. Reaching out to repository admins to encourage them to expose their metadata correctly is, in my experience, not effective at all.
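
To make the full-text problem concrete, here is a minimal sketch that parses a single `oai_dc` record with the Python standard library. The record itself is made up for illustration (the `repo.example.org` URLs and field values are assumptions, not taken from any real repository); only the three namespace URIs are the standard OAI-PMH and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH / Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, invented for this example.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repo.example.org:1234</identifier>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example paper</dc:title>
      <dc:identifier>http://repo.example.org/handle/1234</dc:identifier>
      <dc:format>application/pdf</dc:format>
      <dc:rights>All rights reserved</dc:rights>
    </oai_dc:dc>
  </metadata>
</record>"""


def summarize(record_xml):
    """Return the Dublin Core fields that might hint at full-text access."""
    root = ET.fromstring(record_xml)
    dc = root.find("oai:metadata/oai_dc:dc", NS)

    def values(tag):
        return [el.text for el in dc.findall("dc:" + tag, NS)]

    return {
        "title": values("title"),
        "identifier": values("identifier"),
        "format": values("format"),
        "rights": values("rights"),
    }


summary = summarize(SAMPLE_RECORD)
for field, vals in summary.items():
    print(field, vals)
```

In this made-up record, `dc:identifier` points at a landing page and `dc:format` says `application/pdf`, yet nothing says whether a PDF can actually be downloaded without a login or an embargo. That is the kind of ambiguity I mean.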

If we want to go beyond this, I think we need to crawl!

  • One (conceptually) simple option would be to crawl for PDF files, extract metadata from them and dump this in a proper database with indexes. I think this is what CiteSeerX does, but only for papers within a particular field. In my experience the metadata that comes out of this is quite noisy.
  • The other option we have discussed would be to leverage existing scrapers (Zotero) to extract cleaner metadata from HTML pages. Zotero does the scraping pretty well, but I have no clue what crawling software we should use. Any ideas? The scrapy framework looks nice, but I'm a complete newcomer in this field, so I have probably missed better options.
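
As a rough illustration of what a crawler would do on each page it visits, here is a minimal standard-library sketch. It is neither Zotero nor scrapy: it only collects links ending in `.pdf` (the first option) and any `citation_*` meta tags it finds (a simple stand-in for the second option, assuming the page embeds such tags, which many do not). The sample page and the `repo.example.org` URLs are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageScanner(HTMLParser):
    """Collect PDF links and citation_* meta tags from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            url = urljoin(self.base_url, attrs["href"])
            if urlparse(url).path.lower().endswith(".pdf"):
                self.pdf_links.append(url)
        elif tag == "meta":
            name = attrs.get("name") or ""
            if name.startswith("citation_"):
                self.meta.setdefault(name, []).append(attrs.get("content") or "")


# Hypothetical landing page, invented for this example.
SAMPLE_PAGE = """<html><head>
<meta name="citation_title" content="An example paper">
<meta name="citation_author" content="Doe, Jane">
</head><body>
<a href="/files/1234.pdf">Full text</a>
<a href="/about.html">About</a>
</body></html>"""

scanner = PageScanner("http://repo.example.org/handle/1234")
scanner.feed(SAMPLE_PAGE)
print(scanner.pdf_links)
print(scanner.meta)
```

A real crawler would still need fetching, politeness (robots.txt, rate limits), deduplication and a URL frontier on top of this, which is exactly the part I would rather take from an existing framework.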

How many resources (servers) would we need for this? Where could we get them?
