Configure Multiple Access Points For Multiple CDX Collections

Introduction

This document describes the step-by-step configuration of separate access points for individual collections. Each collection is a set of ARC/WARC files that is indexed in CDX files. To save storage space, ARC/WARC files can be compressed, in which case they have the file extension .arc.gz or .warc.gz.

To illustrate the configuration, we will use an example with two collections, art and news. Each collection has a couple of .warc.gz files (other supported formats would work as well). Suppose these collections are stored in the following directory structure:

$ tree /archives
/archives
└── collections
    ├── art
    │   ├── art-20140313083412-000.warc.gz
    │   └── art-20140422132637-001.warc.gz
    └── news
        ├── news-20140315112738-000.warc.gz
        └── news-20140418034624-001.warc.gz

Suppose that our Wayback server has the domain name wayback.example.com and we want to set up three access points as follows:

/art/ access point only searches in the art collection.
/news/ access point only searches in the news collection.
/all/ access point searches in all the collections and gives the composite result.

Indexing

The default Wayback server comes pre-configured to use a BDB (Berkeley DB) index, which enables automatic indexing of small collections and is suitable for a single access point. For large-scale collections with multiple access points, manually generated CDX indexes are preferred.

In this case we need one or more CDX indexes for each collection, along with path indexes. A path index is a simple sorted text file with two columns separated by a TAB: the first column contains the ARC/WARC file name and the second column contains the corresponding full path to the file (or the full path including the domain name if the file is on a remote host). A utility called cdx-indexer ships with the Wayback download (in the bin directory) and generates a CDX index from ARC/WARC files. For large collections we might want to write a script that automates CDX generation by calling the shipped cdx-indexer script internally.
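The path index format described above can be sketched in a few lines of Python. This is only an illustration, not a tool shipped with Wayback: the function names build_path_index and write_path_index are hypothetical, the list of file extensions is an assumption, and the plain code-point sort assumes that byte-order sorting (as produced by LC_ALL=C sort) is what the server expects.

```python
import os


def build_path_index(collection_dir):
    """Return sorted 'file name<TAB>full path' lines for the ARC/WARC files
    found under collection_dir (hypothetical helper, not part of Wayback)."""
    # Assumed set of extensions; adjust to the formats you actually store.
    suffixes = (".arc", ".arc.gz", ".warc", ".warc.gz")
    lines = []
    for root, _dirs, files in os.walk(collection_dir):
        for name in files:
            if name.endswith(suffixes):
                full_path = os.path.abspath(os.path.join(root, name))
                lines.append(name + "\t" + full_path)
    # Sorting Python strings compares code points, which matches byte order
    # for plain ASCII file names.
    return sorted(lines)


def write_path_index(collection_dir, index_path):
    """Write the path index for one collection to index_path."""
    with open(index_path, "w", encoding="utf-8", newline="\n") as out:
        for line in build_path_index(collection_dir):
            out.write(line + "\n")
```

For the example layout, calling write_path_index("/archives/collections/art", "/archives/path-idx/art-path-idx.txt") would produce one line per art WARC file.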

[TODO: Write a separate guide to describe the CDX generation.]

Suppose that we have generated one CDX file and one path index file for the art collection, and the same for the news collection. There can be more than one CDX file for each collection, but for the sake of simplicity we keep one CDX file per collection. We have also created an additional path index file that lists the files and paths of both collections (this can be created by merging the two path index files and sorting the result). Suppose that our archives directory now has the following directory structure:
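Merging the two per-collection path indexes into the combined one could look roughly like the sketch below. The helper name merge_path_index_files is hypothetical, and dropping duplicate lines is an assumption made for convenience rather than a documented requirement.

```python
def merge_path_index_files(input_paths, output_path):
    """Merge several path index files into one sorted, de-duplicated file.

    Returns the number of distinct lines written (hypothetical helper).
    """
    merged = set()
    for path in input_paths:
        with open(path, encoding="utf-8") as src:
            for line in src:
                line = line.rstrip("\n")
                if line:  # skip blank lines
                    merged.add(line)
    with open(output_path, "w", encoding="utf-8", newline="\n") as out:
        # Code-point order, equivalent to byte order for ASCII file names.
        for line in sorted(merged):
            out.write(line + "\n")
    return len(merged)
```

With the example layout, the inputs would be the art and news path index files and the output would be all-path-idx.txt.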

$ tree /archives
/archives
├── collections
│   ├── art
│   │   ├── art-20140313083412-000.warc.gz
│   │   └── art-20140422132637-001.warc.gz
│   └── news
│       ├── news-20140315112738-000.warc.gz
│       └── news-20140418034624-001.warc.gz
├── cdx-idx
│   ├── index-art.cdx
│   └── index-news.cdx
└── path-idx
    ├── art-path-idx.txt
    ├── news-path-idx.txt
    └── all-path-idx.txt

Configuration

First, we need to install Apache Tomcat if it is not already installed. Once Tomcat is up and running, it will have a webapps directory. In our case it is located at /var/lib/tomcat7/webapps, but the location may differ depending on how Tomcat is configured on your machine. Next, we need to obtain the latest copy of OpenWayback and install it. Please refer to the How to Install guide for further details. In this setup we assume that you have installed Wayback as the ROOT application. You can choose any other name, but the configuration is easier for the ROOT application.

Now we will focus on the configuration files in the WEB-INF directory of the Wayback application. Let's look at the default wayback.xml file. Comments and unnecessary commented-out blocks have been removed to reduce the number of lines:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans-3.0.xsd"
       default-init-method="init">

  <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="properties">
      <value>
        wayback.basedir=/tmp/wayback
        wayback.urlprefix=http://localhost:8080/wayback/
      </value>
    </property>
  </bean>

  <bean id="waybackCanonicalizer" class="org.archive.wayback.util.url.AggressiveUrlCanonicalizer" />

  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB">
    <property name="bdbPath" value="${wayback.basedir}/file-db/db/" />
    <property name="bdbName" value="DB1" />
    <property name="logPath" value="${wayback.basedir}/file-db/db.log" />
  </bean>

<!--
  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
    <property name="path" value="${wayback.basedir}/path-index.txt" />
  </bean>
-->

  <import resource="BDBCollection.xml"/>
<!--
  <import resource="CDXCollection.xml"/>
  <import resource="RemoteCollection.xml"/>
  <import resource="NutchCollection.xml"/>
-->

  <import resource="ArchivalUrlReplay.xml"/>

  <bean name="+" class="org.archive.wayback.webapp.ServerRelativeArchivalRedirect">
    <property name="matchPort" value="8080" />
    <property name="useCollection" value="true" />
  </bean>

  <bean name="standardaccesspoint" class="org.archive.wayback.webapp.AccessPoint">
    <property name="accessPointPath" value="http://localhost:8080/wayback/"/>
    <property name="internalPort" value="8080"/>
    <property name="serveStatic" value="true" />
    <property name="bounceToReplayPrefix" value="false" />
    <property name="bounceToQueryPrefix" value="false" />
    <property name="replayPrefix" value="${wayback.urlprefix}" />
    <property name="queryPrefix" value="${wayback.urlprefix}" />
    <property name="staticPrefix" value="${wayback.urlprefix}" />

    <property name="collection" ref="localbdbcollection" />
<!--
    <property name="collection" ref="localcdxcollection" />
-->

    <property name="replay" ref="archivalurlreplay" />
    <property name="query">
      <bean class="org.archive.wayback.query.Renderer">
        <property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
      </bean>
    </property>

    <property name="uriConverter">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
        <property name="replayURIPrefix" value="${wayback.urlprefix}"/>
      </bean>
    </property>

    <property name="parser">
      <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser">
        <property name="maxRecords" value="10000" />
      </bean>
    </property>
  </bean>

</beans>
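When editing a file like this, it can help to check which collection each access point bean refers to. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name list_access_points is hypothetical, and the sketch assumes the access points are top-level bean elements of class org.archive.wayback.webapp.AccessPoint, as in the listing above.

```python
import xml.etree.ElementTree as ET

# Namespace of the Spring <beans> schema used by wayback.xml.
BEANS_NS = "{http://www.springframework.org/schema/beans}"
ACCESS_POINT_CLASS = "org.archive.wayback.webapp.AccessPoint"


def list_access_points(xml_text):
    """Return (bean name, accessPointPath, collection ref) tuples for every
    top-level AccessPoint bean in the given wayback.xml text."""
    root = ET.fromstring(xml_text)
    result = []
    for bean in root.findall(BEANS_NS + "bean"):
        if bean.get("class") != ACCESS_POINT_CLASS:
            continue
        # Index the direct <property> children by their name attribute.
        props = {p.get("name"): p for p in bean.findall(BEANS_NS + "property")}
        path = props["accessPointPath"].get("value") if "accessPointPath" in props else None
        collection = props["collection"].get("ref") if "collection" in props else None
        result.append((bean.get("name") or bean.get("id"), path, collection))
    return result
```

Run against the default file shown above, this should report a single access point named standardaccesspoint that references localbdbcollection. XML comments are ignored by the parser, so commented-out alternatives do not appear in the result.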

[A work in progress...]

