pgEdge Document Loader

pgEdge Document Loader is a command-line tool for loading documents from various formats into PostgreSQL databases. Full documentation is available here.

The pgEdge Document Loader automatically converts documents (HTML, Markdown, reStructuredText, and DocBook SGML/XML) to Markdown format and loads them into a PostgreSQL database with extracted metadata.

Features

  • Multiple Format Support: HTML, Markdown, reStructuredText, and DocBook SGML/XML
  • Git Repository Support: Clone and process docs directly from Git repositories
  • Automatic Conversion: All formats converted to Markdown
  • Metadata Extraction: Titles, filenames, timestamps
  • Flexible Input: Single file, directory, glob patterns, or Git repository URL
  • Database Flexibility: Configurable column mappings
  • Custom Metadata Columns: Add fixed values to custom columns for every row
  • Update Mode: Update existing rows or insert new ones
  • Transactional: All-or-nothing processing with automatic rollback
  • Secure: Password from environment, .pgpass, or interactive prompt
  • Configuration Files: Reusable YAML configuration

Document Loader Quickstart

Before installing and using pgEdge Document Loader, download and install:

  • Go 1.23 or later
  • PostgreSQL 14 or later
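
As a quick sanity check of the prerequisites above, the following shell sketch compares locally installed versions against the stated minimums. It assumes that go and psql are on your PATH and that your sort supports the -V (version sort) option. The helper name version_ge is invented for this example, and psql --version reports the client version, which may differ from the server you eventually connect to.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to B.
# Assumes a sort that supports -V (GNU coreutils, recent BSD/macOS).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# "go version" prints a line such as "go version go1.23.4 linux/amd64".
if command -v go >/dev/null 2>&1; then
  go_ver=$(go version | awk '{print $3}' | sed 's/^go//')
  if version_ge "$go_ver" 1.23; then
    echo "Go $go_ver: OK"
  else
    echo "Go $go_ver: older than 1.23"
  fi
else
  echo "Go not found on PATH"
fi

# "psql --version" prints a line such as "psql (PostgreSQL) 16.2".
# This is the client version only, not the server version.
if command -v psql >/dev/null 2>&1; then
  pg_ver=$(psql --version | awk '{print $3}')
  if version_ge "$pg_ver" 14; then
    echo "PostgreSQL client $pg_ver: OK"
  else
    echo "PostgreSQL client $pg_ver: older than 14"
  fi
else
  echo "psql not found on PATH"
fi
```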

Getting started with pgEdge Document Loader involves three steps:

  1. Install the tool.
  2. Create a table in your Postgres database to hold the loaded content.
  3. Run the pgedge-docloader executable.

Installing pgEdge Document Loader

Use the following commands to download and build pgedge-docloader:

git clone https://github.com/pgedge/pgedge-docloader.git
cd pgedge-docloader
make build
make install
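
Where make install places the binary depends on the project's Makefile, which is not shown here. As a minimal sketch, the following checks whether the executable is reachable through your PATH before you move on. The helper name on_path is made up for this example.

```shell
# on_path NAME: succeeds when NAME resolves to a command in the current shell.
on_path() {
  command -v "$1" >/dev/null 2>&1
}

if on_path pgedge-docloader; then
  echo "pgedge-docloader found at $(command -v pgedge-docloader)"
else
  echo "pgedge-docloader is not on PATH; add the install directory to PATH"
fi
```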

Creating a Postgres Table

Before invoking Document Loader, you must configure a Postgres database and create a table with the appropriate columns to hold the extracted documentation content:

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT NOT NULL,
    source BYTEA,
    filename TEXT UNIQUE,
    modified TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Invoking pgedge-docloader

When invoking pgedge-docloader, you can specify configuration preferences on the command line, or with a configuration file.

The following command invokes Document Loader on the command line:

# Load Markdown files into PostgreSQL
pgedge-docloader \
  --source ./docs \
  --db-host localhost \
  --db-name mydb \
  --db-user myuser \
  --db-table documents \
  --col-doc-content content \
  --col-file-name filename

To manage deployment preferences in a configuration file, save your settings in a YAML file, and then pass the --config option when invoking pgedge-docloader:

# Create config.yml
cat > config.yml <<EOF
source: "./docs"
db-host: localhost
db-name: mydb
db-user: myuser
db-table: documents
col-doc-content: content
col-file-name: filename
update: true
EOF

# Run with a configuration file
export PGPASSWORD=mypassword
pgedge-docloader --config config.yml
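
The example above supplies the password through PGPASSWORD. Because the feature list also mentions .pgpass, here is a hedged sketch of adding an entry in PostgreSQL's standard password-file format (hostname:port:database:username:password). The helper name add_pgpass_entry and port 5432 (the PostgreSQL default) are assumptions for illustration; the host, database, and user reuse the values from the configuration above. Values that contain a colon or backslash need backslash escaping, which this sketch does not handle.

```shell
# add_pgpass_entry FILE HOST PORT DB USER PASSWORD
# Appends one entry in the PostgreSQL password-file format and restricts
# the file to its owner; libpq ignores password files with looser
# permissions on Unix systems.
add_pgpass_entry() {
  printf '%s:%s:%s:%s:%s\n' "$2" "$3" "$4" "$5" "$6" >> "$1"
  chmod 600 "$1"
}

# Example call (writes to the default location; port 5432 is assumed):
#   add_pgpass_entry "$HOME/.pgpass" localhost 5432 mydb myuser mypassword
```

With an entry like this in place, the PGPASSWORD export should no longer be needed, assuming the tool reads the default ~/.pgpass location.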

For a comprehensive Quickstart Guide, see the project documentation.

Developer Notes

This project is under active development. See the documentation for the latest features and updates.

The pgEdge Document Loader Makefile includes targets that run the test suite and invoke the Go linter. Use the following commands:

Running Tests

make test

Linting

make lint

Your contributions are welcome! Please feel free to submit issues and pull requests.

Support

License

This project is licensed under the PostgreSQL License.
