diff --git a/README.md b/README.md index 7daf069..9f8704d 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,11 @@ +This library is implemented originally by Katsunori Kanda [potix2/spark-google-spreadsheets](https://github.com/potix2/spark-google-spreadsheets) and all benefits for it should be addressed to him. + +The changes which were introduced in this fork: + +1. Usage OAuth 2.0 to access Google APIs; +2. Upgrade Spark to version 3.1.1; +3. Miscellaneous code improvements. + # Spark Google Spreadsheets Google Spreadsheets datasource for [SparkSQL and DataFrames](http://spark.apache.org/docs/latest/sql-programming-guide.html) @@ -6,28 +14,23 @@ Google Spreadsheets datasource for [SparkSQL and DataFrames](http://spark.apache ## Notice -The version 0.4.0 breaks compatibility with previous versions. You must -use a ** spreadsheetId ** to identify which spreadsheet is to be accessed or altered. -In older versions, spreadsheet name was used. - -If you don't know spreadsheetId, please read the [Introduction to the Google Sheets API v4](https://developers.google.com/sheets/guides/concepts). +Before you start using this library, please read the [Introduction to the Google Sheets API v4](https://developers.google.com/sheets/guides/concepts) +to understand all basic concepts. ## Requirements -This library supports different versions of Spark: - ### Latest compatible versions | This library | Spark Version | -| ------------ | ------------- | -| 0.1.x | 3.1.1 | +|--------------| ------------- | +| 0.1.1 | 3.1.1 | ## Linking Using SBT: ``` -libraryDependencies += "com.github.riskidentdms" %% "spark-google-spreadsheets" % "0.1.0" +libraryDependencies += "com.github.riskidentdms" %% "spark-google-spreadsheets" % "0.1.1" ``` Using Maven: @@ -36,27 +39,64 @@ Using Maven: com.github.riskidentdms spark-google-spreadsheets_2.12 - 0.1.0 + 0.1.1 ``` -## SQL API +## Using Google application credentials +This library uses OAuth 2.0 to access Google APIs: [Using OAuth 2.0 to Access Google APIs](https://developers.google.com/identity/protocols/oauth2) + +Please read this article in order to set up OAuth 2.0 in your Google Service Account: [Setting up OAuth 2.0](https://support.google.com/cloud/answer/6158849) -TBD: Should be updated +Keep in mind that you have to use the JSON key type, when you create a Service Account key. +A JSON file that contains the private key should be downloaded and stored securely because this key can't be recovered if lost. + +There are two ways of providing authentication credentials to your application code namely: + +- by providing the path to the JSON file that contains private key described above + +```scala +import com.github.riskidentdms.spark.google.spreadsheets.Credentials +val credentials = Credentials.credentialsFromFile("path_to_key_json") +``` +or by adding an input option for the underlying data source + +```scala +.option("credentialsPath", "path_to_key_json") +``` + +```sql +OPTIONS(credentialsPath "path_to_key_json") +``` + +- by providing JSON String that contains private key described above + +```scala +import com.github.riskidentdms.spark.google.spreadsheets.Credentials +Credentials.credentialsFromJsonString("json_string") +``` + +```scala +.option("credentialsJson", "json_key") +``` + +```sql +OPTIONS(credentialsJson 'json_key') +``` + +## Usage examples +### SQL API ```sql CREATE TABLE cars USING com.github.riskidentdms.spark.google.spreadsheets OPTIONS ( path "/worksheet1", - serviceAccountId "xxxxxx@developer.gserviceaccount.com", - credentialPath "/path/to/credential.p12" + credentialsPath "path_to_key_json" ) ``` -## Scala API - -TBD: Should be updated +### Scala API ```scala import org.apache.spark.sql.SparkSession @@ -68,46 +108,50 @@ val sqlContext = SparkSession.builder() // Creates a DataFrame from a specified worksheet val df = sqlContext.read. - format("com.github.riskidentdms.spark.google.spreadsheets"). - option("serviceAccountId", "xxxxxx@developer.gserviceaccount.com"). - option("credentialPath", "/path/to/credential.p12"). - load("/worksheet1") + format("com.github.riskidentdms.spark.google.spreadsheets") + .option("credentialsPath", "path_to_key_json") + .load("/worksheet1") // Saves a DataFrame to a new worksheet df.write. - format("com.github.riskidentdms.spark.google.spreadsheets"). - option("serviceAccountId", "xxxxxx@developer.gserviceaccount.com"). - option("credentialPath", "/path/to/credential.p12"). - save("/newWorksheet") + format("com.github.riskidentdms.spark.google.spreadsheets") + .option("credentialsPath", "path_to_key_json") + .save("/newWorksheet") ``` -### Using Google default application credentials -TBD: Should be updated - -Provide authentication credentials to your application code by setting the environment variable -`GOOGLE_APPLICATION_CREDENTIALS`. The variable should be set to the path of the service account json file. - - ```scala import org.apache.spark.sql.SparkSession val sqlContext = SparkSession.builder() - .master("local[2]") + .master("local[*]") .appName("SpreadsheetSuite") .getOrCreate().sqlContext // Creates a DataFrame from a specified worksheet -val df = sqlContext.read. - format("com.github.riskidentdms.spark.google.spreadsheets"). - load("/worksheet1") +val df = sqlContext.read + .format("com.github.riskidentdms.spark.google.spreadsheets") + .option("credentialsPath", "path_to_key_json") + .load("/worksheet1") ``` More details: https://cloud.google.com/docs/authentication/production -## License +## Local testing + +You have to do some preparations in order to be able to run tests from your machine: -Copyright 2016-2018, Katsunori Kanda +1. Upload `files/SpreadsheetSuite.xlsx` to the [Google Spreadsheet](https://docs.google.com/spreadsheets). + +2. Export the spreadsheet ID of previously uploaded document as `TEST_SPREADSHEET_ID` environment variable. +The spreadsheet ID you can find in the URL of the opened document. The pattern of the URL looks like that: + +`https://docs.google.com/spreadsheets/d/` + +3. Export the JSON private key as `OAUTH_JSON` environment variable. +Please see details [here](<#Using Google application credentials>). + +## License Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at diff --git a/files/SpreadsheetSuite.xlsx b/files/SpreadsheetSuite.xlsx new file mode 100644 index 0000000..5463bb5 Binary files /dev/null and b/files/SpreadsheetSuite.xlsx differ