76 changes: 75 additions & 1 deletion _posts/Connectors/2020-03-18-Pravega-Spark-Connectors.md
@@ -67,8 +67,82 @@ cd spark-connectors
ls -lhR ~/.m2/repository/io/pravega/pravega-connectors-spark
```

## Samples

A set of code examples demonstrates the capabilities of Pravega as a data stream storage system for Apache Spark.

### Getting Started

#### Install Operating System

Install Ubuntu 18.04 LTS. Other operating systems can also be used, but the commands below have only been tested on this version.

#### Install Java 8

```
sudo apt-get install openjdk-8-jdk
```

#### Install Docker and Docker Compose

See [https://docs.docker.com/install/linux/docker-ce/ubuntu/](https://docs.docker.com/install/linux/docker-ce/ubuntu/) and [https://docs.docker.com/compose/install/](https://docs.docker.com/compose/install/).

#### Run Pravega

This will run a development instance of Pravega locally. Note that the default standalone Pravega used for development is likely insufficient for testing video because it stores all data in memory and quickly runs out of memory. Using the procedure below, all data will be stored in a small HDFS cluster in Docker.

In the command below, replace x.x.x.x with the IP address of a local network interface such as eth0.

```
cd
git clone https://github.com/pravega/pravega
cd pravega
git checkout r0.7
cd docker/compose
export HOST_IP=x.x.x.x
docker-compose up -d
```

You can view the Pravega logs with `docker-compose logs --follow`. You can view the stream files stored on HDFS with `docker-compose exec hdfs hdfs dfs -ls -h -R /`.

#### Instructions

1) Install Apache Spark

This will install a development instance of Spark locally.

Download `https://www.apache.org/dyn/closer.lua/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz`.

```
mkdir -p ~/spark
cd ~/spark
tar -xzvf ~/Downloads/spark-2.4.6-bin-hadoop2.7.tgz
ln -s spark-2.4.6-bin-hadoop2.7 current
export PATH="$HOME/spark/current/bin:$PATH"
```

By default, the script `run_spark_app.sh` will use an in-process Spark mini-cluster that is started with the Spark job (`--master local[2]`); a sketch of such a session follows.
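
For illustration only, a minimal PySpark sketch of creating such an in-process session (the application name is hypothetical):

```
from pyspark.sql import SparkSession

# "local[2]" runs Spark inside the driver process with two worker
# threads, so no external cluster is required.
spark = (SparkSession.builder
    .master("local[2]")
    .appName("pravega-example")  # hypothetical name
    .getOrCreate())
```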

2) Build and Install the Spark Connector

This will build the Spark Connector and publish it to your local Maven repository.

```
cd
git clone https://github.com/pravega/spark-connectors
cd spark-connectors
./gradlew install
ls -lhR ~/.m2/repository/io/pravega/pravega-connectors-spark
```

3) Run the Examples

The `spark-connector-examples` repository provides code that connects Spark with Pravega. It contains multiple examples demonstrating use cases for Spark jobs against Pravega streams; a hedged sketch of one such job follows.
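
A hedged sketch of a streaming write to a Pravega stream (the `pravega` format and option names follow the connector's documented usage; the controller URI, scope, stream, and checkpoint path are placeholders to verify against the examples repository):

```
# Assumes `df` is a streaming DataFrame whose rows carry an `event`
# column holding the payload to write.
query = (df.writeStream
    .format("pravega")
    .option("controller", "tcp://127.0.0.1:9090")
    .option("scope", "examples")
    .option("stream", "mystream")
    .option("checkpointLocation", "/tmp/spark-checkpoint")
    .start())
query.awaitTermination()
```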

## Source
- spark-connectors: [https://github.com/pravega/spark-connectors](https://github.com/pravega/spark-connectors)

- spark-connector-examples: [https://github.com/pravega/spark-connector-examples](https://github.com/pravega/spark-connector-examples)

## Documentation

101 changes: 101 additions & 0 deletions _posts/Connectors/copy.md
@@ -0,0 +1,101 @@
---
layout: post
category: Connectors
tags: [pravega, spark, connector]
subtitle: Enable Spark to read and write Pravega streams
technologies: [Pravega, Spark]
img: spark.png
license: Apache
support: Community
author:
name: Luis Liu
description: SDP App Developer
image:
css:
js:
---
This post introduces the Pravega Spark connectors that read and write [Pravega](http://pravega.io/) Streams with [Apache Spark](http://spark.apache.org/), a high-performance analytics engine for batch and streaming data.
<!--more-->

The connectors can be used to build end-to-end stream processing pipelines (see [Samples](https://github.com/pravega/pravega-samples)) that use Pravega as the stream storage and message bus, and Apache Spark for computation over the streams.



## Features & Highlights

- **Exactly-once processing guarantees** for both Reader and Writer, supporting **end-to-end exactly-once processing pipelines**.

- A Spark V2 data source micro-batch reader connector allows Spark Streaming applications to read Pravega Streams (a read sketch follows this list).
Pravega stream cuts are used to reliably recover from failures and provide exactly-once semantics.

- A Spark base relation data source batch reader connector allows Spark batch applications to read Pravega Streams.

- A Spark V2 data source stream writer allows Spark Streaming applications to write to Pravega Streams.
Writes are contained within Pravega transactions, providing exactly-once semantics.

- Seamless integration with Spark's checkpoints.

- Parallel Readers and Writers supporting high throughput and low latency processing.

- The Reader supports reassembling chunked events, allowing events of up to 2 GiB.
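
A hedged sketch of reading a Pravega stream with the micro-batch reader (the `pravega` format and option names follow the connector's documented usage; the controller URI, scope, and stream names are placeholders):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pravega-reader").getOrCreate()

# Read a Pravega stream as a micro-batch streaming source.
df = (spark.readStream
    .format("pravega")
    .option("controller", "tcp://127.0.0.1:9090")
    .option("scope", "examples")
    .option("stream", "mystream")
    .load())

# The payload arrives in a binary `event` column; cast it for the console.
query = (df.selectExpr("CAST(event AS STRING)")
    .writeStream
    .outputMode("append")
    .format("console")
    .start())
```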

## Limitations

- The current implementation of this connector does *not* guarantee that events with the same routing key
are returned in a single partition.
If your application requires this, you must repartition the DataFrame by the routing key and sort within the
partition by `segment_id` and `offset`, as sketched after this list.

- Continuous reader support is not available. The micro-batch reader uses the Pravega batch API and works well for
applications with latency requirements above 100 milliseconds.

- The initial batch in the micro-batch reader will contain the entire Pravega stream as of the start time.
There is no rate limiting functionality.

- Read-after-write consistency is currently *not* guaranteed.
Be cautious if your workflow requires multiple chained Spark batch jobs.
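
A sketch of the repartitioning workaround from the first limitation (`routing_key` is assumed to be a column your application derives from the event payload; `segment_id` and `offset` come from the connector's schema):

```
# Group events with the same routing key into one partition and order
# them by their position in the Pravega stream.
ordered = (df
    .repartition("routing_key")
    .sortWithinPartitions("segment_id", "offset"))
```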

## Build and Install the Spark Connector

This will build the Spark Connector and publish it to your local Maven repository.

```
cd
git clone https://github.com/pravega/spark-connectors
cd spark-connectors
./gradlew install
ls -lhR ~/.m2/repository/io/pravega/pravega-connectors-spark
```

## Samples

A set of code examples demonstrates the capabilities of Pravega as a data stream storage system for Apache Spark.

The environment used is Ubuntu 18.04 LTS, but other operating systems can also be used. Java 8, Docker, Pravega, and a local installation of Spark are also required.

Once the Spark Connector build is published to the local Maven repository, it can be used to run the spark-connector-examples.

The following samples are available (a minimal sketch of the first appears after the list):
- PySpark batch job that reads events from the file *sample_data.json* and writes to a Pravega stream
- PySpark batch job that reads from a Pravega stream and writes to the console
- PySpark Streaming job that writes generated data to a Pravega stream
- PySpark Streaming job that reads from a Pravega stream and writes to the console
- PySpark Streaming job that reads from a Pravega stream and writes to another Pravega stream
- Java Spark Streaming job that reads from a Pravega stream and writes to the console
- PySpark Streaming job in a Spark Cluster
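
As a minimal sketch of the first sample (the file name comes from the list above; the scope, stream, controller URI, and payload serialization are placeholders to check against the repository):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-pravega").getOrCreate()

# Read the sample events, then serialize each row to a JSON string in
# an `event` column, the payload the connector writes to the stream.
df = spark.read.json("sample_data.json")

(df.selectExpr("to_json(struct(*)) AS event")
    .write
    .format("pravega")
    .option("controller", "tcp://127.0.0.1:9090")
    .option("scope", "examples")
    .option("stream", "mystream")
    .save())
```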

## Source
- spark-connectors: [https://github.com/pravega/spark-connectors](https://github.com/pravega/spark-connectors)

## Documentation

To learn more about how to build and use the Spark Connector library, refer to
[Pravega Samples](https://github.com/pravega/pravega-samples).

## Reference
[http://blog.madhukaraphatak.com/spark-datasource-v2-part-1/](http://blog.madhukaraphatak.com/spark-datasource-v2-part-1/)

## License

The Spark connectors for Pravega are 100% open source and community-driven. All components are available
under the [Apache 2 License](https://www.apache.org/licenses/LICENSE-2.0.html) on GitHub.