This AI quickstart demonstrates how to use lakeFS as an AI data control plane for Red Hat OpenShift AI using the fraud-detection tutorial workflow.
You will deploy MinIO (object storage) and lakeFS, run the fraud-detection notebooks in OpenShift AI, and then repeat the workflow on a new version of the data to show how lakeFS enables reproducibility, safe experimentation, and governed promotion of AI data and model artifacts.
This quickstart intentionally separates responsibilities:
- **Data plane (object storage)**: MinIO / S3 stores the bytes: datasets, models, and pipeline artifacts.
- **Control plane (lakeFS)**: lakeFS adds Git-like semantics (branch, commit, merge, revert) and lineage metadata on top of the data in object storage.
- **Compatibility**: lakeFS exposes an S3-compatible API, so OpenShift AI and S3-native tools can use it as a drop-in endpoint without code changes.
After running this quickstart you can answer questions like:
- “Which exact dataset version trained the model that’s currently served?”
- “What changed between the dataset used for model v1 and v2?”
- “Can we reproduce last month’s metrics exactly?”
- “Can we roll back immediately if a bad data update ships?”
The purpose of this AI quickstart is to highlight the benefits of data versioning, provided by lakeFS, in an AI/ML environment. lakeFS lets a data engineer manage the lifecycle of data with the same workflow a developer uses to manage source code with Git. Like source code, data can be versioned, branched, merged, and pulled from a repository, even though the bytes themselves remain in backend object storage.
The quickstart allows a demonstrator to quickly deploy both object storage (MinIO) and lakeFS, which serves as a Git-like gateway that data engineers use for data access. The following steps can be run quickly:
- Deploy MinIO (object storage) and lakeFS (S3-compatible versioning gateway)
- Configure OpenShift AI to use lakeFS as its S3 endpoint (data connection)
- Run the fraud-detection notebooks to:
- load training data from lakeFS
- train a model
- write the model artifact back to lakeFS
- Create a lakeFS branch for a data change (e.g., updated labels / new transactions)
- Write updated training data to the branch, commit it, and retrain
- Compare results across versions, then merge the branch to promote (or revert/discard)
- (Optional) Run a pipeline that reads/writes through lakeFS so pipeline outputs are also versioned
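The branch / commit / merge flow above can also be driven from the command line. Below is a minimal sketch using the `lakectl` CLI; the repository (`fraud`) and branch (`exp-01`) names are illustrative, and it assumes `lakectl` is already configured with your lakeFS endpoint and credentials.

```
# Create an experiment branch from main
$ lakectl branch create lakefs://fraud/exp-01 --source lakefs://fraud/main

# Upload updated training data to the branch and commit it
$ lakectl fs upload -s ./transactions.parquet lakefs://fraud/exp-01/data/transactions.parquet
$ lakectl commit lakefs://fraud/exp-01 -m "Updated transactions for retraining"

# Compare the branch against main, then promote (merge) or discard it
$ lakectl diff lakefs://fraud/main lakefs://fraud/exp-01
$ lakectl merge lakefs://fraud/exp-01 lakefs://fraud/main
```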
This quickstart was developed and tested on an OpenShift cluster with the following components and resources, which can be considered the minimum requirements.
| Node Type | Qty | vCPU | Memory (GB) |
|---|---|---|---|
| Control Plane | 3 | 8 | 16 |
| Worker | 3 | 8 | 16 |
**Note:** A GPU is not required for this quickstart.
This quickstart was tested with the following software versions:
| Software | Version |
|---|---|
| Red Hat OpenShift | 4.20.5 |
| Red Hat OpenShift Service Mesh | 2.5.11-0 |
| Red Hat OpenShift Serverless | 1.37.0 |
| Red Hat OpenShift AI | 2.25 |
| helm | 3.17.1 |
| lakeFS | 1.73.0 |
| MinIO | TBD |
The user performing this quickstart should have the ability to create a project in OpenShift and OpenShift AI. This requires the `admin` cluster role (it does not require `cluster-admin`).
The process is very simple. Just follow the steps below.
The steps assume the following pre-requisite products and components are deployed and functional with required permissions on the cluster:
- Red Hat OpenShift Container Platform
- Red Hat OpenShift Service Mesh
- Red Hat OpenShift Serverless
- Red Hat OpenShift AI
- User has `admin` permissions in the cluster
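Once logged in to the cluster, you can verify the project self-provisioning permission with `oc auth can-i`, as a quick sanity check:

```
# Returns "yes" if the current user is allowed to request new projects
$ oc auth can-i create projectrequests
```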
- Clone this repo
$ git clone https://github.com/rh-ai-quickstart/Fraud-Detection-data-versioning-with-lakeFS.git
- Change to the `deploy` directory
$ cd Fraud-Detection-data-versioning-with-lakeFS/deploy
- Log in to the OpenShift cluster:
$ oc login --token=<user_token> --server=https://api.<openshift_cluster_fqdn>:6443
- Make sure `deploy.sh` is executable and run it, passing it the name of the project in which to install. It can be an existing or new project. In this example, it will deploy to the `lakefs` project.
# Make script executable
$ chmod +x deploy.sh
# Run script passing it the project in which to install
$ ./deploy.sh lakefs
Use the route to access the lakeFS browser-based UI.
- Leave the username set to `admin`
- Enter your email address (or a bogus email address)
- Download the `access_key_id` and `secret_access_key` displayed on the new page, as they will not be accessible later on
- Go back to the login page and log in using those credentials.
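Before wiring these credentials into an OpenShift AI data connection, you can sanity-check them against the lakeFS S3 gateway with any S3 client. For example, with the AWS CLI (the hostname below is a placeholder for your lakeFS route):

```
# List lakeFS repositories (exposed as S3 buckets) through the S3-compatible endpoint
$ export AWS_ACCESS_KEY_ID=<access_key_id>
$ export AWS_SECRET_ACCESS_KEY=<secret_access_key>
$ aws s3 ls --endpoint-url https://<lakefs_route_hostname>
```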
The project the apps were installed in can be deleted, which will delete all of the resources in it, including deployments, secrets, pods, configmaps, etc.
$ oc delete project lakefs
lakeFS exposes an S3-compatible API. In S3 terms:
- **Bucket** = lakeFS repository
- **First path segment** = branch
- Object paths follow: `s3://[REPOSITORY]/[BRANCH]/PATH/TO/OBJECT`
Example:
- Training data: `s3://fraud/main/data/transactions.parquet`
- Experiment data: `s3://fraud/exp-01/data/transactions.parquet`
- Model artifact: `s3://fraud/exp-01/models/fraud/1/model.onnx`
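Because lakeFS speaks the S3 protocol, any S3-native tool can address these versioned paths directly. A sketch with the AWS CLI (hostname and paths are illustrative, matching the examples above):

```
# Fetch the same object path from two branches, i.e. two versions of the training data
$ aws s3 cp s3://fraud/main/data/transactions.parquet ./transactions_main.parquet --endpoint-url https://<lakefs_route_hostname>
$ aws s3 cp s3://fraud/exp-01/data/transactions.parquet ./transactions_exp01.parquet --endpoint-url https://<lakefs_route_hostname>
```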
In real AI platforms, the point isn’t just versioning—it’s controlled promotion:
- Protect `main` so changes only arrive via merges
- Add pre-merge hooks (Actions) to enforce data quality checks (schema, format, PII scanning)
- Merge = “publish” approved data/model artifacts to consumers
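As an illustration, a pre-merge Action is just a YAML file committed under `_lakefs_actions/` in the repository. The sketch below uses a placeholder webhook URL for your own validation service and blocks merges into `main` unless the check passes:

```
# Sketch of a pre-merge hook; commit this file to the repository under _lakefs_actions/
# (for example, on a branch that is then merged into main)
$ cat > pre-merge-checks.yaml <<'EOF'
name: pre-merge data quality checks
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: data_quality_check
    type: webhook
    properties:
      url: "http://<your-validation-service>/validate"
EOF
```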
- Product: OpenShift AI
- Partner: lakeFS
- Partner product: lakeFS
- Business challenge: Fraud detection
