Conversation
- Overrides in SnappyTaskSchedulerImpl to track per-executor cores used by a job and cap them to the number of physical cores on a node.
- Combined some maps in TaskSchedulerImpl to recover the performance lost due to the above and to improve further compared to the base TaskSchedulerImpl.
- The property "spark.scheduler.limitJobCores=false" can be set to revert to the previous behaviour.
rishitesh
left a comment
Some comments and clarifications sought.
    bid
    case Some(b) => b._blockId = msg.blockManagerId; b
    }
    sc.taskScheduler.asInstanceOf[SnappyTaskSchedulerImpl].addBlockId(executorId, blockId)
The SnappyTaskSchedulerImpl.addBlockId() method has a condition blockId.numProcessors < blockId.executorCores. When called from here, that condition will never be satisfied.
The "case None" is for a corner one where blockManager gets added before executor. For normal cases onExecutorAdded will be invoked first where number of physical cores have been properly initialized so addBlockId will work fine. Will add the handling for that case in onExecutorAdded and invoke addBlockId from the Some() match case there.
Will also add removal in onExecutorRemoved.
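For illustration, a minimal sketch of the handling described in this reply. Only onExecutorAdded, onExecutorRemoved and addBlockId come from the discussion; the pending map, the signatures and everything else are assumptions, not the actual SnappyTaskSchedulerImpl code.

    import scala.collection.concurrent.TrieMap

    // Sketch only: resolve the corner case where the BlockManager registers before
    // the executor, and clean up on executor removal. Names other than
    // onExecutorAdded/onExecutorRemoved/addBlockId are hypothetical.
    class BlockIdTrackingSketch {
      // executorId -> blockId seen before the executor itself was registered
      private val pendingBlockIds = new TrieMap[String, String]()

      def addBlockId(executorId: String, blockId: String): Unit = {
        // the real code compares blockId.numProcessors with blockId.executorCores here
      }

      def removeBlockId(executorId: String): Unit = ()

      def onExecutorAdded(executorId: String): Unit = {
        // physical cores are properly initialized by now, so the cap can be applied
        pendingBlockIds.remove(executorId).foreach(addBlockId(executorId, _))
      }

      def onExecutorRemoved(executorId: String): Unit = {
        // undo the bookkeeping, as promised in the reply above
        removeBlockId(executorId)
        pendingBlockIds.remove(executorId)
      }
    }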
    private val lookupExecutorCores = new ToLongFunction[String] {
      override def applyAsLong(executorId: String): Long = {
        maxExecutorTaskCores.get(executorId) match {
          case null => Int.MaxValue // no restriction
Wouldn't defaultParallelism be a better choice than Int.MaxValue?
A null here means that the cores defined for the executor are less than or equal to the physical cores on the machine, or that job-core limiting has been explicitly disabled. Both cases imply the same thing, namely that no limit should be put on tasks on a node, so this essentially falls back to Spark's TaskSchedulerImpl behaviour.
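To make that reply concrete, a simplified sketch of why a null lookup can safely mean "no restriction". The limitJobCores flag, registerExecutor and the map initialization are assumptions for illustration; only maxExecutorTaskCores appears in the actual diff.

    import java.util.concurrent.ConcurrentHashMap

    // Sketch: an entry is recorded only when a cap actually applies, so a null
    // lookup means "no per-job limit" and scheduling falls back to the stock
    // TaskSchedulerImpl behaviour.
    class CoreLimitRegistrationSketch(limitJobCores: Boolean) {
      private val maxExecutorTaskCores = new ConcurrentHashMap[String, java.lang.Long]()

      def registerExecutor(executorId: String, executorCores: Int, physicalCores: Int): Unit = {
        if (limitJobCores && executorCores > physicalCores) {
          // configured cores exceed physical cores: cap a single job to physical cores
          maxExecutorTaskCores.put(executorId, java.lang.Long.valueOf(physicalCores.toLong))
        }
        // otherwise no entry is added, so lookupExecutorCores sees null => unrestricted
      }
    }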
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val (stageAvailableCores, stageTaskSets) = stageCoresAndAttempts.computeIfAbsent(
      stage, createNewStageMap)
Should we not be setting the manager for the stage task set? I can see stageTaskSets(taskSet.stageAttemptId) = manager in the original TaskSchedulerImpl.
Yes, that is done below at line 112.
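For illustration, a rough sketch of the combined per-stage map and the manager registration discussed here. "Manager", createNewStageMap's shape and the value types are stand-ins, not the real implementation.

    import java.util.concurrent.ConcurrentHashMap
    import java.util.function.{Function => JFunction}
    import scala.collection.mutable

    // Sketch: a single map keyed by stageId carries both the per-executor cores
    // still available to the stage and its attemptId -> manager entries, replacing
    // the separate maps kept by the base TaskSchedulerImpl.
    object StageMapSketch {
      case class Manager(stageAttemptId: Int) // stand-in for TaskSetManager

      type StageEntry = (mutable.HashMap[String, Int],   // executorId -> available cores
                         mutable.HashMap[Int, Manager])  // attemptId  -> manager

      private val stageCoresAndAttempts = new ConcurrentHashMap[Int, StageEntry]()

      private val createNewStageMap = new JFunction[Int, StageEntry] {
        override def apply(stage: Int): StageEntry =
          (new mutable.HashMap[String, Int](), new mutable.HashMap[Int, Manager]())
      }

      def submitAttempt(stage: Int, manager: Manager): Unit = {
        val (_, stageTaskSets) = stageCoresAndAttempts.computeIfAbsent(stage, createNewStageMap)
        // the registration the review asks about (done at line 112 in the actual change)
        stageTaskSets(manager.stageAttemptId) = manager
      }
    }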
    taskIdExecutorAndManager.justPut(tid, execId -> taskSet)
    executorIdToRunningTaskIds(execId).add(tid)
    if (availableCores ne null) {
      availableCores.addValue(execId, -CPUS_PER_TASK)
Can we put an assertion here similar to assert(availableCpus(i) >= 0)? It might catch some erroneous updates.
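A self-contained sketch of where such an assertion could sit; the availableCores structure here is a plain map for illustration only, the actual one in the PR differs.

    import scala.collection.mutable

    // Sketch: decrement the per-executor available cores for a launched task and
    // assert the count never goes negative, mirroring assert(availableCpus(i) >= 0)
    // in Spark's TaskSchedulerImpl, so erroneous updates surface early.
    object AvailableCoresSketch {
      private val CPUS_PER_TASK = 1
      private val availableCores = mutable.HashMap.empty[String, Int]

      def onExecutorRegistered(execId: String, cores: Int): Unit =
        availableCores(execId) = cores

      def onTaskLaunched(execId: String): Unit = {
        availableCores(execId) = availableCores.getOrElse(execId, 0) - CPUS_PER_TASK
        assert(availableCores(execId) >= 0,
          s"negative available cores for executor $execId")
      }
    }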
Force-pushed from 8b43301 to 2b254d9
Force-pushed from 2c254f0 to 0f2888f
Force-pushed from a466d26 to ea127bd
Force-pushed from 99ec79c to c7b84fa
See some details in the JIRA: https://jira.snappydata.io/browse/SNAP-2231

These changes limit the maximum cores given to a job to the physical cores on a machine. With the default of (2 * physical cores) in the cluster, this leaves the remaining cores free for any other concurrent jobs, which is especially important for short point-lookup queries. The changes also improve performance for disk-intensive queries: for example, a 30-50% improvement was measured in TPCH load and in some queries when cores were limited to physical cores and a lot of data had overflowed to disk.

Question: should the default cores in ExecutorInitiator be increased to (4 * physical cores) to allow for more concurrency?
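As a concrete illustration of the capping arithmetic described above (the node size is made up):

    object CoreCapArithmetic extends App {
      // Illustration only: with the default of 2 * physical cores configured for an
      // executor, a single job is now capped at the physical core count, leaving the
      // remainder free for concurrent jobs such as short point-lookup queries.
      val physicalCores = 16                       // hypothetical node
      val executorCores = 2 * physicalCores        // default configuration = 32
      val maxCoresPerJob = math.min(executorCores, physicalCores)  // capped to 16
      val coresLeftForOthers = executorCores - maxCoresPerJob      // 16 stay free
      println(s"job cap = $maxCoresPerJob, free for other jobs = $coresLeftForOthers")
    }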
Changes proposed in this pull request

- Overrides in SnappyTaskSchedulerImpl to track per-executor cores used by a job and cap them to the number of physical cores on a node.
- Combined some maps in TaskSchedulerImpl to recover the performance lost due to the above and to improve further compared to the base TaskSchedulerImpl.
- The property "spark.scheduler.limitJobCores=false" can be set to revert to the previous behaviour (see the sketch below).
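A minimal sketch of reverting to the previous behaviour via the property listed above. Only the property name comes from this PR; setting it through SparkConf (rather than a properties file) and the app name are just one possible usage.

    import org.apache.spark.SparkConf

    object DisableJobCoreLimit {
      def main(args: Array[String]): Unit = {
        // Disable the per-job core cap added by this PR and fall back to the
        // stock TaskSchedulerImpl scheduling behaviour.
        val conf = new SparkConf()
          .setAppName("limit-job-cores-example")          // hypothetical app name
          .set("spark.scheduler.limitJobCores", "false")
        // pass `conf` when constructing the SparkContext / SnappyContext as usual
      }
    }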
Patch testing
precheckin -Pstore -Pspark
TODO: working on porting Spark's TaskScheduler unit tests
ReleaseNotes.txt changes
document the new property and behaviour
Other PRs
TIBCOSoftware/snappy-spark#96