76 changes: 75 additions & 1 deletion _posts/Connectors/2020-03-18-Pravega-Spark-Connectors.md
@@ -67,8 +67,82 @@ cd spark-connectors
ls -lhR ~/.m2/repository/io/pravega/pravega-connectors-spark
```

## Samples

A set of code examples demonstrates the capabilities of Pravega as a data stream storage system for Apache Spark.

### Getting Started

#### Install Operating System

Install Ubuntu 18.04 LTS. Other operating systems can also be used, but the commands below have only been tested on this version.

#### Install Java 8

```
sudo apt-get install openjdk-8-jdk
```

#### Install Docker and Docker Compose

See [https://docs.docker.com/install/linux/docker-ce/ubuntu/](https://docs.docker.com/install/linux/docker-ce/ubuntu/) and [https://docs.docker.com/compose/install/](https://docs.docker.com/compose/install/).

#### Run Pravega

This will run a development instance of Pravega locally. Note that the default standalone Pravega used for development is likely insufficient for testing video because it stores all data in memory and quickly runs out of memory. Using the procedure below, all data will be stored in a small HDFS cluster in Docker.

In the command below, replace x.x.x.x with the IP address of a local network interface such as eth0.

```
cd
git clone https://github.com/pravega/pravega
cd pravega
git checkout r0.7
cd docker/compose
export HOST_IP=x.x.x.x
docker-compose up -d
```

You can view the Pravega logs with `docker-compose logs --follow`. You can view the stream files stored on HDFS with `docker-compose exec hdfs hdfs dfs -ls -h -R /`.

#### Instructions

1) Install Apache Spark

This will install a development instance of Spark locally.

Download `https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz`.

```
mkdir -p ~/spark
cd ~/spark
tar -xzvf ~/Downloads/spark-2.4.6-bin-hadoop2.7.tgz
ln -s spark-2.4.6-bin-hadoop2.7 current
export PATH="$HOME/spark/current/bin:$PATH"
```

By default, the script `run_spark_app.sh` will use an in-process Spark mini-cluster that is started with the Spark job (`--master local[2]`); a sketch of such a session follows.
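
For illustration only, a minimal PySpark sketch of creating such an in-process session (the application name is hypothetical):

```
from pyspark.sql import SparkSession

# "local[2]" runs Spark inside the driver process with two worker
# threads, so no external cluster is required.
spark = (SparkSession.builder
    .master("local[2]")
    .appName("pravega-example")  # hypothetical name
    .getOrCreate())
```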

2) Build and Install the Spark Connector

This will build the Spark Connector and publish it to your local Maven repository.

```
cd
git clone https://github.com/pravega/spark-connectors
cd spark-connectors
./gradlew install
ls -lhR ~/.m2/repository/io/pravega/pravega-connectors-spark
```

3) Run the Examples

The `spark-connector-examples` repository provides code that connects Spark with Pravega. It contains multiple examples demonstrating use cases for Spark jobs against Pravega streams; a hedged sketch of one such job follows.
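
A hedged sketch of a streaming write to a Pravega stream (the `pravega` format and option names follow the connector's documented usage; the controller URI, scope, stream, and checkpoint path are placeholders to verify against the examples repository):

```
# Assumes `df` is a streaming DataFrame whose rows carry an `event`
# column holding the payload to write.
query = (df.writeStream
    .format("pravega")
    .option("controller", "tcp://127.0.0.1:9090")
    .option("scope", "examples")
    .option("stream", "mystream")
    .option("checkpointLocation", "/tmp/spark-checkpoint")
    .start())
query.awaitTermination()
```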

## Source
- spark-connectors: [https://github.com/pravega/spark-connectors](https://github.com/pravega/spark-connectors)

- spark-connector-examples: [https://github.com/pravega/spark-connector-examples](https://github.com/pravega/spark-connector-examples)

## Documentation

101 changes: 101 additions & 0 deletions _posts/Connectors/copy.md
@@ -0,0 +1,101 @@
---
layout: post
category: Connectors
tags: [pravega, spark, connector]
subtitle: Enable Spark to read and write Pravega streams
technologies: [Pravega, Spark]
img: spark.png
license: Apache
support: Community
author:
name: Luis Liu
description: SDP App Developer
image:
css:
js:
---
This post introduces the Pravega Spark connectors that read and write [Pravega](http://pravega.io/) Streams with [Apache Spark](http://spark.apache.org/), a high-performance analytics engine for batch and streaming data.
<!--more-->

The connectors can be used to build end-to-end stream processing pipelines (see [Samples](https://github.com/pravega/pravega-samples)) that use Pravega as the stream storage and message bus, and Apache Spark for computation over the streams.



## Features & Highlights

- **Exactly-once processing guarantees** for both Reader and Writer, supporting **end-to-end exactly-once processing pipelines**.

- A Spark V2 data source micro-batch reader connector allows Spark Streaming applications to read Pravega Streams (a read sketch follows this list).
Pravega stream cuts are used to reliably recover from failures and provide exactly-once semantics.

- A Spark base relation data source batch reader connector allows Spark batch applications to read Pravega Streams.

- A Spark V2 data source stream writer allows Spark Streaming applications to write to Pravega Streams.
Writes are contained within Pravega transactions, providing exactly-once semantics.

- Seamless integration with Spark's checkpoints.

- Parallel Readers and Writers supporting high throughput and low latency processing.

- The Reader supports reassembling chunked events, allowing events of up to 2 GiB.
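
A hedged sketch of reading a Pravega stream with the micro-batch reader (the `pravega` format and option names follow the connector's documented usage; the controller URI, scope, and stream names are placeholders):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pravega-reader").getOrCreate()

# Read a Pravega stream as a micro-batch streaming source.
df = (spark.readStream
    .format("pravega")
    .option("controller", "tcp://127.0.0.1:9090")
    .option("scope", "examples")
    .option("stream", "mystream")
    .load())

# The payload arrives in a binary `event` column; cast it for the console.
query = (df.selectExpr("CAST(event AS STRING)")
    .writeStream
    .outputMode("append")
    .format("console")
    .start())
```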

## Limitations

- The current implementation of this connector does *not* guarantee that events with the same routing key
are returned in a single partition.
If your application requires this, you must repartition the DataFrame by the routing key and sort within the
partition by `segment_id` and `offset`, as sketched after this list.

- Continuous reader support is not available. The micro-batch reader uses the Pravega batch API and works well for
applications with latency requirements above 100 milliseconds.

- The initial batch in the micro-batch reader will contain the entire Pravega stream as of the start time.
There is no rate limiting functionality.

- Read-after-write consistency is currently *not* guaranteed.
Be cautious if your workflow requires multiple chained Spark batch jobs.
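
A sketch of the repartitioning workaround from the first limitation (`routing_key` is assumed to be a column your application derives from the event payload; `segment_id` and `offset` come from the connector's schema):

```
# Group events with the same routing key into one partition and order
# them by their position in the Pravega stream.
ordered = (df
    .repartition("routing_key")
    .sortWithinPartitions("segment_id", "offset"))
```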

## Build and Install the Spark Connector

This will build the Spark Connector and publish it to your local Maven repository.

```
cd
git clone https://github.com/pravega/spark-connectors
cd spark-connectors
./gradlew install
ls -lhR ~/.m2/repository/io/pravega/pravega-connectors-spark
```

## Samples

A set of code examples demonstrates the capabilities of Pravega as a data stream storage system for Apache Spark.

The environment used is Ubuntu 18.04 LTS, but other operating systems can also be used. Java 8, Docker, Pravega, and a local installation of Spark are also required.

Once the Spark Connector build is published to the local Maven repository, it can be used to run the spark-connector-examples.

The following samples are available (a minimal sketch of the first appears after the list):
- PySpark batch job that reads events from the file *sample_data.json* and writes to a Pravega stream
- PySpark batch job that reads from a Pravega stream and writes to the console
- PySpark Streaming job that writes generated data to a Pravega stream
- PySpark Streaming job that reads from a Pravega stream and writes to the console
- PySpark Streaming job that reads from a Pravega stream and writes to another Pravega stream
- Java Spark Streaming job that reads from a Pravega stream and writes to the console
- PySpark Streaming job in a Spark Cluster
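
As a minimal sketch of the first sample (the file name comes from the list above; the scope, stream, controller URI, and payload serialization are placeholders to check against the repository):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-pravega").getOrCreate()

# Read the sample events, then serialize each row to a JSON string in
# an `event` column, the payload the connector writes to the stream.
df = spark.read.json("sample_data.json")

(df.selectExpr("to_json(struct(*)) AS event")
    .write
    .format("pravega")
    .option("controller", "tcp://127.0.0.1:9090")
    .option("scope", "examples")
    .option("stream", "mystream")
    .save())
```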

## Source
- spark-connectors: [https://github.com/pravega/spark-connectors](https://github.com/pravega/spark-connectors)

## Documentation

To learn more about how to build and use the Spark Connector library, refer to
[Pravega Samples](https://github.com/pravega/pravega-samples).

## Reference
[http://blog.madhukaraphatak.com/spark-datasource-v2-part-1/](http://blog.madhukaraphatak.com/spark-datasource-v2-part-1/)

## License

The Spark connectors for Pravega are 100% open source and community-driven. All components are available
under the [Apache 2 License](https://www.apache.org/licenses/LICENSE-2.0.html) on GitHub.