From 299da20ed2116664da160289ee8300fa14d1a5c8 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 5 Jan 2026 16:55:34 +0800 Subject: [PATCH 1/9] docs: Add guide for Lance REST integration with Spark and Ray --- docs/lance-rest-spark-ray-integration.md | 112 +++++++++++++++++++++++ 1 file changed, 112 insertions(+) create mode 100644 docs/lance-rest-spark-ray-integration.md diff --git a/docs/lance-rest-spark-ray-integration.md b/docs/lance-rest-spark-ray-integration.md new file mode 100644 index 00000000000..3087f67c1f2 --- /dev/null +++ b/docs/lance-rest-spark-ray-integration.md @@ -0,0 +1,112 @@ +--- +title: "Lance REST integration with Spark and Ray" +slug: /lance-rest-spark-ray-integration +keywords: + - lance + - lance-rest + - spark + - ray + - integration +license: "This software is licensed under the Apache License version 2." +--- + +## Overview + +This guide shows how to use the Lance REST service from Apache Gravitino with the Lance Spark connector (`lance-spark`) and the Lance Ray connector (`lance-ray`). It builds on the Lance REST service setup described in [Lance REST service](/lance-rest-service). + +## Compatibility matrix + +| Gravitino version (Lance REST) | Supported lance-spark versions | Supported lance-ray versions | +|--------------------------------|--------------------------------|------------------------------| +| 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | + +:::note +- Update this matrix when newer Gravitino versions (for example 1.2.0) are released. +- Align connector versions with the Lance REST service bundled in the target release. +::: + +## Prerequisites + +- Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). +- A Lance catalog created in Gravitino, for example `lance_catalog`. +- Downloaded `lance-spark` bundle JAR that matches your Spark version (set the absolute path in the examples below). 
+- Python environments with required packages: + - Spark: `pyspark` + - Ray: `ray`, `lance-namespace`, `lance-ray` + +## Using Lance REST with Spark + +The example below starts a local PySpark session that talks to Lance REST and creates a table through Spark SQL. + +```python +from pyspark.sql import SparkSession +import os +import logging +logging.basicConfig(level=logging.INFO) + +# Point to your downloaded lance-spark bundle. +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar --conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" --master local[1] pyspark-shell" + +# Create the Lance catalog named "lance_catalog" in Gravitino beforehand. +spark = SparkSession.builder \ + .appName("lance_rest_example") \ + .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceNamespaceSparkCatalog") \ + .config("spark.sql.catalog.lance.impl", "rest") \ + .config("spark.sql.catalog.lance.uri", "http://localhost:9101/lance") \ + .config("spark.sql.catalog.lance.parent", "lance_catalog") \ + .config("spark.sql.defaultCatalog", "lance") \ + .getOrCreate() + +spark.sparkContext.setLogLevel("DEBUG") + +# Create schema and table, write, then read data. +spark.sql("create database schema") +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION '/tmp/schema/sample.lance/' +TBLPROPERTIES ('format' = 'lance') +""") +spark.sql(""" +insert into schema.sample values(1, 1.1) +""") +spark.sql("select * from schema.sample").show() +``` + +:::note +- Keep the Lance REST service reachable from Spark executors. +- Replace the JAR path with the actual location on your machine or cluster. +- Add your own JVM debugging flags only when needed. +::: + +## Using Lance REST with Ray + +The snippet below writes and reads a Lance dataset through the Lance REST namespace. 
+ +```python +import ray +import lance_namespace as ln +from lance_ray import read_lance, write_lance + +ray.init() + +namespace = ln.connect("rest", {"uri": "http://localhost:9101/lance"}) + +data = ray.data.range(1000).map(lambda row: {"id": row["id"], "value": row["id"] * 2}) + +write_lance(data, namespace=namespace, table_id=["lance_catalog", "schema", "my_table"]) +ray_dataset = read_lance(namespace=namespace, table_id=["lance_catalog", "schema", "my_table"]) + +result = ray_dataset.filter(lambda row: row["value"] < 100).count() +print(f"Filtered count: {result}") +``` + +:::note +- Ensure the target Lance catalog (`lance_catalog`) and schema (`schema`) already exist in Gravitino. +- The table path is represented as `["catalog", "schema", "table"]` when using Lance Ray helpers. +::: + +## Troubleshooting + +- **TypeError `_BaseLanceDatasink.on_write_start()` when using Ray 2.53.3**: downgrade Ray to `2.40.0` (for example, `pip install 'ray==2.40.0'`). +- **ValueError “too many values to unpack” from `lance_ray` datasink**: this is a known issue in `lance-ray`; track progress at [lance-format/lance-ray#68](https://github.com/lance-format/lance-ray/issues/68). From dcf29e9efc3636fc2854b647fa97f6f8272ead1c Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 5 Jan 2026 20:19:20 +0800 Subject: [PATCH 2/9] fix --- docs/lakehouse-generic-lance-table.md | 28 +++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/docs/lakehouse-generic-lance-table.md b/docs/lakehouse-generic-lance-table.md index 62f7b6ef229..1c973dc5020 100644 --- a/docs/lakehouse-generic-lance-table.md +++ b/docs/lakehouse-generic-lance-table.md @@ -302,3 +302,31 @@ done Other table operations (load, alter, drop, truncate) follow standard relational catalog patterns. See [Table Operations](./manage-relational-metadata-using-gravitino.md#table-operations) for details. 
+### Using Lance tables with MinIO +To use Lance tables stored in MinIO with Gravitino, pass the MinIO endpoint and credentials as `lance.storage.*` table properties. The example below creates such a table; `lance.storage.allow_http` is set to `true` because the example endpoint uses plain HTTP. + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ + -H "Content-Type: application/json" -d '{ + "name": "lance_table", + "comment": "Example Lance table", + "columns": [ + { + "name": "id", + "type": "integer", + "comment": "Primary identifier", + "nullable": false + } + ], + "properties": { + "format": "lance", + "location": "s3://bucket1/lance_table", + "lance.storage.access_key_id": "ak", + "lance.storage.endpoint": "http://minio:9000", + "lance.storage.secret_access_key": "sk", + "lance.storage.allow_http": "true" + } +}' http://localhost:8090/api/metalakes/test/catalogs/lance_catalog/schemas/schema/tables + +``` + From 35cee21fd4be43f34ebffd42eae9222aaf809ca2 Mon Sep 17 00:00:00 2001 From: yuqi Date: Mon, 5 Jan 2026 20:21:51 +0800 Subject: [PATCH 3/9] fix --- docs/lance-rest-spark-ray-integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/lance-rest-spark-ray-integration.md b/docs/lance-rest-spark-ray-integration.md index 3087f67c1f2..fe1c2a27731 100644 --- a/docs/lance-rest-spark-ray-integration.md +++ b/docs/lance-rest-spark-ray-integration.md @@ -12,7 +12,7 @@ license: "This software is licensed under the Apache License version 2." ## Overview -This guide shows how to use the Lance REST service from Apache Gravitino with the Lance Spark connector (`lance-spark`) and the Lance Ray connector (`lance-ray`). It builds on the Lance REST service setup described in [Lance REST service](/lance-rest-service). +This guide shows how to use the Lance REST service from Apache Gravitino with the Lance Spark connector (`lance-spark`) and the Lance Ray connector (`lance-ray`). It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). 
## Compatibility matrix From 3e66bffed9db0435cdd640de176ead48ddb61213 Mon Sep 17 00:00:00 2001 From: yuqi Date: Tue, 6 Jan 2026 12:09:45 +0800 Subject: [PATCH 4/9] fix --- docs/lance-rest-spark-ray-integration.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/docs/lance-rest-spark-ray-integration.md b/docs/lance-rest-spark-ray-integration.md index fe1c2a27731..e4f177b1722 100644 --- a/docs/lance-rest-spark-ray-integration.md +++ b/docs/lance-rest-spark-ray-integration.md @@ -79,6 +79,22 @@ spark.sql("select * from schema.sample").show() - Add your own JVM debugging flags only when needed. ::: +The storage location in the example above is local path, if you want to use cloud storage, please refer to the following Minio example: + +```python +spark.sql(""" +create table schema.sample(id int, score float) +USING lance +LOCATION 's3://bucket/tmp/schema/sample.lance/' +TBLPROPERTIES ( + 'format' = 'lance', + 'lance.storage.access_key_id' = 'ak', + 'lance.storage.endpoint' = 'http://minio:9000', + 'lance.storage.secret_access_key' = 'sk', + 'lance.storage.allow_http'= 'true' + ) +``` + ## Using Lance REST with Ray The snippet below writes and reads a Lance dataset through the Lance REST namespace. From 4b830458fb9e6a364aefc8c98551bd38fc9366d6 Mon Sep 17 00:00:00 2001 From: yuqi Date: Wed, 7 Jan 2026 21:23:23 +0800 Subject: [PATCH 5/9] fix --- docs/lance-rest-spark-ray-integration.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/lance-rest-spark-ray-integration.md b/docs/lance-rest-spark-ray-integration.md index e4f177b1722..b3c17eff6c7 100644 --- a/docs/lance-rest-spark-ray-integration.md +++ b/docs/lance-rest-spark-ray-integration.md @@ -99,6 +99,12 @@ TBLPROPERTIES ( The snippet below writes and reads a Lance dataset through the Lance REST namespace. +```shell +pip install lance-ray +``` +Please note that ray will also be installed if not already present. 
Currently lance-ray only tested with ray version 2.50.0, please +ensure ray version compatibility in your environment. + ```python import ray import lance_namespace as ln @@ -121,8 +127,3 @@ print(f"Filtered count: {result}") - Ensure the target Lance catalog (`lance_catalog`) and schema (`schema`) already exist in Gravitino. - The table path is represented as `["catalog", "schema", "table"]` when using Lance Ray helpers. ::: - -## Troubleshooting - -- **TypeError `_BaseLanceDatasink.on_write_start()` when using Ray 2.53.3**: downgrade Ray to `2.40.0` (for example, `pip install 'ray==2.40.0'`). -- **ValueError “too many values to unpack” from `lance_ray` datasink**: this is a known issue in `lance-ray`; track progress at [lance-format/lance-ray#68](https://github.com/lance-format/lance-ray/issues/68). From 587d49e2a7fd7e76d6fcfb5d591343898f956927 Mon Sep 17 00:00:00 2001 From: yuqi Date: Fri, 9 Jan 2026 20:06:11 +0800 Subject: [PATCH 6/9] fix --- docs/lance-rest-service.md | 5 ++++- docs/lance-rest-spark-ray-integration.md | 15 ++++++++------- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/docs/lance-rest-service.md b/docs/lance-rest-service.md index 4b0db579f16..19179a8bcc7 100644 --- a/docs/lance-rest-service.md +++ b/docs/lance-rest-service.md @@ -233,7 +233,6 @@ URL encoded: lance_catalog%24schema%24table01 - Currently supports only **two levels of namespaces** before tables - Tables **cannot** be nested deeper than schema level - Parent catalog must be created in Gravitino before using Lance REST API -- Metadata operations require Gravitino server to be available - Namespace deletion is recursive and irreversible ::: @@ -399,3 +398,7 @@ ns.create_table(create_table_request, body) + +## Integration with Spark and Ray + +Please refer to [lance-rest-spark-ray-integration](./lance-rest-spark-ray-integration.md) for detailed instructions on using Lance REST service with Apache Spark and Ray. 
\ No newline at end of file diff --git a/docs/lance-rest-spark-ray-integration.md b/docs/lance-rest-spark-ray-integration.md index b3c17eff6c7..58ed177ddc5 100644 --- a/docs/lance-rest-spark-ray-integration.md +++ b/docs/lance-rest-spark-ray-integration.md @@ -20,9 +20,9 @@ This guide shows how to use the Lance REST service from Apache Gravitino with th |--------------------------------|--------------------------------|------------------------------| | 1.1.1 | 0.0.10 – 0.0.15 | 0.0.6 – 0.0.8 | +These version ranges represent combinations that are expected to be compatible. Only a subset of versions within each range may have been explicitly tested, so you should verify a specific connector version in your own environment. :::note -- Update this matrix when newer Gravitino versions (for example 1.2.0) are released. -- Align connector versions with the Lance REST service bundled in the target release. +The compatibility information in this section applies to Gravitino 1.1.1. For newer Gravitino versions, refer to that release's documentation and ensure that your `lance-spark` and `lance-ray` versions are compatible with the Lance REST service bundled with your Gravitino distribution. ::: ## Prerequisites @@ -79,7 +79,7 @@ spark.sql("select * from schema.sample").show() - Add your own JVM debugging flags only when needed. 
::: -The storage location in the example above is local path, if you want to use cloud storage, please refer to the following Minio example: +The storage location in the example above is local path, if you want to use cloud storage, please refer to the following MinIO example: ```python spark.sql(""" @@ -91,8 +91,8 @@ TBLPROPERTIES ( 'lance.storage.access_key_id' = 'ak', 'lance.storage.endpoint' = 'http://minio:9000', 'lance.storage.secret_access_key' = 'sk', - 'lance.storage.allow_http'= 'true' - ) + 'lance.storage.allow_http' = 'true' + )""") ``` ## Using Lance REST with Ray @@ -102,8 +102,9 @@ The snippet below writes and reads a Lance dataset through the Lance REST namesp ```shell pip install lance-ray ``` -Please note that ray will also be installed if not already present. Currently lance-ray only tested with ray version 2.50.0, please -ensure ray version compatibility in your environment. +Ray is installed along with `lance-ray` if it is not already present. `lance-ray` is currently tested only with Ray versions 2.41.0 to 2.50.0, so make sure the Ray version in your environment falls within that range. + +After installing `lance-ray`, you can run the following Ray script: ```python import ray From d44226aa16963e47c0a1920d9ad92620790fc199 Mon Sep 17 00:00:00 2001 From: yuqi Date: Sat, 10 Jan 2026 15:07:16 +0800 Subject: [PATCH 7/9] fix --- docs/lance-rest-spark-ray-integration.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/lance-rest-spark-ray-integration.md b/docs/lance-rest-spark-ray-integration.md index 58ed177ddc5..8eaf7aca765 100644 --- a/docs/lance-rest-spark-ray-integration.md +++ b/docs/lance-rest-spark-ray-integration.md @@ -28,7 +28,7 @@ The compatibility information in this section applies to Gravitino 1.1.1. For ne ## Prerequisites - Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). -- A Lance catalog created in Gravitino, for example `lance_catalog`. 
+- A Lance catalog created in Gravitino or via Lance REST namespace API(see `CreateNamespace` in [docs](./lance-rest-service.md)) for example `lance_catalog`. - Downloaded `lance-spark` bundle JAR that matches your Spark version (set the absolute path in the examples below). - Python environments with required packages: - Spark: `pyspark` @@ -117,6 +117,7 @@ namespace = ln.connect("rest", {"uri": "http://localhost:9101/lance"}) data = ray.data.range(1000).map(lambda row: {"id": row["id"], "value": row["id"] * 2}) +# The `schema` namespace must already exist; create it beforehand via the Lance REST API or the Gravitino API. write_lance(data, namespace=namespace, table_id=["lance_catalog", "schema", "my_table"]) ray_dataset = read_lance(namespace=namespace, table_id=["lance_catalog", "schema", "my_table"]) From 7aa0c5ca55138de8f2091630cc7a576eaf2a6ece Mon Sep 17 00:00:00 2001 From: yuqi Date: Sat, 10 Jan 2026 15:08:27 +0800 Subject: [PATCH 8/9] fix --- docs/lance-rest-spark-ray-integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/lance-rest-spark-ray-integration.md b/docs/lance-rest-spark-ray-integration.md index 8eaf7aca765..19c4b880026 100644 --- a/docs/lance-rest-spark-ray-integration.md +++ b/docs/lance-rest-spark-ray-integration.md @@ -28,7 +28,7 @@ The compatibility information in this section applies to Gravitino 1.1.1. For ne ## Prerequisites - Gravitino server running with Lance REST service enabled (default endpoint: `http://localhost:9101/lance`). -- A Lance catalog created in Gravitino or via Lance REST namespace API(see `CreateNamespace` in [docs](./lance-rest-service.md)) for example `lance_catalog`. +- A Lance catalog created in Gravitino, either through the Lance REST namespace API (see `CreateNamespace` in the [Lance REST service docs](./lance-rest-service.md)) or through the Gravitino REST API, for example `lance_catalog`. - Downloaded `lance-spark` bundle JAR that matches your Spark version (set the absolute path in the examples below). 
- Python environments with required packages: - Spark: `pyspark` From 2a6a23f484b15b5d932cfd1f5f93830b7bc56e4c Mon Sep 17 00:00:00 2001 From: yuqi Date: Sat, 10 Jan 2026 15:11:29 +0800 Subject: [PATCH 9/9] fix --- docs/lance-rest-spark-ray-integration.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/docs/lance-rest-spark-ray-integration.md b/docs/lance-rest-spark-ray-integration.md index 19c4b880026..adb5a6c9332 100644 --- a/docs/lance-rest-spark-ray-integration.md +++ b/docs/lance-rest-spark-ray-integration.md @@ -12,7 +12,7 @@ license: "This software is licensed under the Apache License version 2." ## Overview -This guide shows how to use the Lance REST service from Apache Gravitino with the Lance Spark connector (`lance-spark`) and the Lance Ray connector (`lance-ray`). It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). +This guide shows how to use the Lance REST service from Apache Gravitino with the [Lance Spark connector](https://lance.org/integrations/spark/) (`lance-spark`) and the [Lance Ray connector](https://lance.org/integrations/ray/) (`lance-ray`). It builds on the Lance REST service setup described in [Lance REST service](./lance-rest-service). ## Compatibility matrix @@ -73,12 +73,6 @@ insert into schema.sample values(1, 1.1) spark.sql("select * from schema.sample").show() ``` -:::note -- Keep the Lance REST service reachable from Spark executors. -- Replace the JAR path with the actual location on your machine or cluster. -- Add your own JVM debugging flags only when needed. -::: - The storage location in the example above is local path, if you want to use cloud storage, please refer to the following MinIO example: ```python @@ -129,3 +123,8 @@ print(f"Filtered count: {result}") - Ensure the target Lance catalog (`lance_catalog`) and schema (`schema`) already exist in Gravitino. 
- The table path is represented as `["catalog", "schema", "table"]` when using Lance Ray helpers. ::: + + +## Other engines + +Lance REST can also be used with other engines that support the Lance format, such as DuckDB and pandas. Refer to the Lance [integration documentation](https://lance.org/integrations/datafusion/) for details on how to connect to Lance REST from those engines. \ No newline at end of file