First contact with spark-jobserver
This post describes how to test a spark-jobserver instance locally. It is derived from the README. First impressions are great.
→ Server side
git clone git@github.com:spark-jobserver/spark-jobserver.git
→ Activate Basic Auth
# add this at the end of config/local.conf
shiro {
authentication = on
# absolute path to shiro config file, including file name
config.path = "config/shiro.ini.basic"
}
# limit the number of concurrent jobs per context (default 8)
spark {
jobserver.max-jobs-per-context = 2
}
More details are in the source code.
→ Increase the maximum jar size
# in the spark section
short-timeout = 60 s
# in the root section
spray.can.server {
parsing.max-content-length = 300m
idle-timeout = 400s
request-timeout = 300s
}
→ Start the server
sbt
# then run this inside the sbt shell
job-server-extras/reStart config/local.conf --- -Xmx8g
→ Client
# compile the jar
sbt job-server-tests/package
# upload the jar and give it the name "test-binary"
curl --basic --user 'user:pwd' -X POST localhost:8090/binaries/test-binary -H "Content-Type: application/java-archive" --data-binary @job-server-tests/target/scala-2.11/job-server-tests_2.11-0.9.1-SNAPSHOT.jar
# create the context (pass the basic auth)
curl --basic --user 'user:pwd' -d "" "localhost:8090/contexts/test-context?num-cpu-cores=4&memory-per-node=512m"
# get the contexts
curl -k --basic --user 'user:pwd' https://localhost:8090/contexts
# get the jobs
curl -k --basic --user 'user:pwd' https://localhost:8090/jobs
# run a job based on the test-binary file
# also make it synchronous and increase the timeout
curl --basic --user 'user:pwd' -d "input.string = a b c a b see" "localhost:8090/jobs?appName=test-binary&classPath=spark.jobserver.WordCountExample&context=test-context&sync=true&timeout=100000"
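For reference, the WordCountExample invoked above reads the input.string parameter from the job config. Here is a rough sketch of such a job, assuming the new job API in spark.jobserver.api (the shipped spark.jobserver.WordCountExample in job-server-tests may differ in the details):
import scala.util.Try
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.scalactic._
import spark.jobserver.api.{JobEnvironment, SingleProblem, SparkJob, ValidationProblem}

// Counts the words passed through the "input.string" job parameter
object MyWordCount extends SparkJob {
  type JobData = Seq[String]
  type JobOutput = collection.Map[String, Long]

  // validate() turns the raw job config into typed JobData, or rejects the job
  def validate(sc: SparkContext, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] =
    Try(config.getString("input.string").split(" ").toSeq)
      .map(Good(_))
      .getOrElse(Bad(One(SingleProblem("No input.string param"))))

  // runJob() receives the validated data; the returned map is serialized in the response
  def runJob(sc: SparkContext, runtime: JobEnvironment, data: JobData): JobOutput =
    sc.parallelize(data).countByValue()
}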
→ Async Client
curl --basic --user 'user:pwd' -d "input.string = a b c a b see" "localhost:8090/jobs?appName=omop-spark-job&classPath=io.frama.parisni.spark.omop.job.WordCountExample&context=test-context&sync=false&timeout=100000"
{
"duration": "Job not done yet",
"classPath": "io.frama.parisni.spark.omop.job.WordCountExample",
"startTime": "2020-06-12T16:43:46.980+02:00",
"context": "test-context",
"status": "STARTED",
"jobId": "d492de45-dc1e-4d08-996c-19b47fdb986f",
"contextId": "47607883-d90c-456e-8884-d160f4c41480"
}
curl -X GET --basic --user 'user:pwd' "localhost:8090/jobs/d492de45-dc1e-4d08-996c-19b47fdb986f"
{
"duration": "0.157 secs",
"classPath": "io.frama.parisni.spark.omop.job.WordCountExample",
"startTime": "2020-06-12T16:43:46.980+02:00",
"context": "test-context",
"result": {
"b": 2,
"a": 2,
"see": 1,
"c": 1
},
"status": "FINISHED",
"jobId": "d492de45-dc1e-4d08-996c-19b47fdb986f",
"contextId": "47607883-d90c-456e-8884-d160f4c41480"
}
→ Enabling a Spark context with Hive support
By extending the SparkSessionJob class and creating the context as shown below, you get Hive support (a sketch of such a job follows the context-create call):
context-create:
curl -X POST -k --basic --user '${USER}:${PWD}' "${HOST}:${PORT}/contexts/test-context?num-cpu-cores=4&memory-per-node=512m&context-factory=spark.jobserver.context.SessionContextFactory"
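Here is a minimal sketch of a Hive-enabled job, assuming the SparkSessionJob trait shipped in job-server-extras; the object name and the catalog query are only illustrative:
import com.typesafe.config.Config
import org.apache.spark.sql.SparkSession
import org.scalactic._
import spark.jobserver.SparkSessionJob
import spark.jobserver.api.{JobEnvironment, ValidationProblem}

// Receives a SparkSession (instead of a bare SparkContext) from the
// SessionContextFactory, so Hive catalog queries work out of the box
object ListHiveTables extends SparkSessionJob {
  type JobData = Config
  type JobOutput = Seq[String]

  def validate(spark: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = Good(config)

  // Lists the tables known to the metastore for the current database
  def runJob(spark: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput =
    spark.catalog.listTables().collect().map(_.name).toSeq
}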
Complex result (arrays and maps can be returned as objects):
{
"jobId": "4e85111f-98aa-4736-8153-11f937d4fd2f",
"result": [{
"list": 1,
"count": "800",
"resourceType": "Patient"
}, {
"list": 2,
"count": "100",
"resourceType": "Condition"
}]
}
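Such a result can be obtained by returning plain Scala collections from runJob; a Seq of Maps appears to be serialized as an array of JSON objects. A hypothetical sketch (same shape as the job above, only the return type changes; the values are hard-coded for illustration):
import com.typesafe.config.Config
import org.apache.spark.sql.SparkSession
import org.scalactic._
import spark.jobserver.SparkSessionJob
import spark.jobserver.api.{JobEnvironment, ValidationProblem}

// Returning a Seq of Maps produces an array of JSON objects, as in the
// "result" field shown above (field names and values are illustrative only)
object ComplexResultJob extends SparkSessionJob {
  type JobData = Config
  type JobOutput = Seq[Map[String, Any]]

  def validate(spark: SparkSession, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = Good(config)

  def runJob(spark: SparkSession, runtime: JobEnvironment, data: JobData): JobOutput =
    Seq(
      Map("list" -> 1, "count" -> "800", "resourceType" -> "Patient"),
      Map("list" -> 2, "count" -> "100", "resourceType" -> "Condition"))
}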
→ Passing config parameters
One way is to store them in the env.conf file, in the passthrough section. They are then accessible from the runtime.contextConfig variable within the runJob method:
val ZK = runtime.contextConfig.getString("passthrough.the.variable")
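For instance, assuming a hypothetical zookeeper.url entry placed in the passthrough block of env.conf (under spark.context-settings; check your template), a job could read it back like this:
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.scalactic._
import spark.jobserver.api.{JobEnvironment, SparkJob, ValidationProblem}

// Hypothetical env.conf entry (the key name is a placeholder):
//   spark.context-settings.passthrough {
//     zookeeper.url = "host1:2181,host2:2181"
//   }
object ReadPassthrough extends SparkJob {
  type JobData = Config
  type JobOutput = String

  def validate(sc: SparkContext, runtime: JobEnvironment, config: Config):
      JobData Or Every[ValidationProblem] = Good(config)

  // The passthrough block surfaces in runtime.contextConfig at run time
  def runJob(sc: SparkContext, runtime: JobEnvironment, data: JobData): JobOutput =
    runtime.contextConfig.getString("passthrough.zookeeper.url")
}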
→ Passing extra options
This can be done with the env.sh file. In this example, we configure basic auth for SolrCloud.
MANAGER_EXTRA_SPARK_CONFS="spark.executor.extraJavaOptions=-Dsolr.httpclient.config=<path-to>/auth_qua.txt -Dhdp.version=2.6.5.0-292|spark.driver.extraJavaOptions=-Dhdp.version=2.6.5.0-292,-Dsolr.httpclient.config=<path-to>/auth_qua.txt,spark.yarn.submit.waitAppCompletion=false|spark.files=$appdir/log4jcluster.properties,$conffile"