SAMZA-2444: JobModel save in CoordinatorStreamStore resulting flush for each message #1259

alnzng · 2020-01-23T22:04:08Z

Symptom

When a Samza job creates a large number of tasks/partitions, the AM can take a long time to save the job metadata in a single run, which may cause a timeout exception.
We observed that for a Samza job with over 37k tasks/partitions, the AM takes over 10 minutes to save the metadata in a single run.

Cause

JobModelManager uses CoordinatorStreamStore.put() to save job metadata, and put() flushes after every message. The flush operation is expensive, especially when the remote server is suffering from performance issues.

Changes

Separate flush from the put/putAll/delete functions in CoordinatorStreamStore.
Explicitly call flush after calling put/putAll/delete in the related classes.
Batch-write task partition assignment information to the metadata store.
Batch-write task container information to the metadata store.
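
The decoupling described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual CoordinatorStreamStore code: the class name SketchStreamStore, its fields, and the flush counter are all hypothetical, and flush() here only stands in for the expensive remote call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a stream-backed metadata store; not Samza's actual class. */
class SketchStreamStore {
  // Writes buffered since the last flush; a null value marks a delete.
  private final List<String[]> pending = new ArrayList<>();
  // State that is visible only after flush() has run.
  private final Map<String, String> committed = new HashMap<>();
  private int flushCount = 0;

  /** Buffers a write. It does not flush. */
  void put(String key, String value) {
    pending.add(new String[] {key, value});
  }

  /** Buffers a delete. It does not flush. */
  void delete(String key) {
    pending.add(new String[] {key, null});
  }

  /** Applies all buffered updates in order; stands in for the costly remote flush. */
  void flush() {
    for (String[] entry : pending) {
      if (entry[1] == null) {
        committed.remove(entry[0]);
      } else {
        committed.put(entry[0], entry[1]);
      }
    }
    pending.clear();
    flushCount++;
  }

  int flushCount() {
    return flushCount;
  }

  int committedSize() {
    return committed.size();
  }
}
```

With this shape, a caller that writes one entry per task and then calls flush() once pays for a single flush, regardless of how many tasks there are.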

Tests

All unit tests and integration tests pass.

API Changes

Replace writeTaskPartitionAssignment with the new batch write method writeTaskPartitionAssignments in TaskPartitionAssignmentManager.
Replace writeTaskContainerMapping with the new batch write method writeTaskContainerMappings in TaskAssignmentManager.
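
To illustrate the shape of such a batch write method, here is a minimal sketch. Only the method name writeTaskContainerMappings comes from this PR; its parameter type, the SketchMetadataStore interface, the CountingStore class, and the choice to flush inside the manager are assumptions made for the example, and the real TaskAssignmentManager may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal store interface in which updates and flush are separate calls. */
interface SketchMetadataStore {
  void putAll(Map<String, String> entries);

  void flush();
}

/** In-memory implementation that counts flushes; for illustration only. */
class CountingStore implements SketchMetadataStore {
  final Map<String, String> data = new HashMap<>();
  int flushes = 0;

  @Override
  public void putAll(Map<String, String> entries) {
    data.putAll(entries);
  }

  @Override
  public void flush() {
    flushes++;
  }
}

/** Hypothetical manager; the signature below is assumed, not taken from Samza. */
class SketchTaskAssignmentManager {
  private final SketchMetadataStore store;

  SketchTaskAssignmentManager(SketchMetadataStore store) {
    this.store = store;
  }

  /** Writes every task-to-container mapping in one batch, then flushes once. */
  void writeTaskContainerMappings(Map<String, String> taskToContainer) {
    if (taskToContainer.isEmpty()) {
      return; // nothing to write, so skip the flush as well
    }
    store.putAll(taskToContainer);
    store.flush();
  }
}
```

Taking the whole map in one call lets the manager issue a single flush per batch instead of one per task.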

Upgrade Instructions

None

Usage Instructions

None

prateekm · 2020-01-24T17:57:08Z

@bharathkk and @lakshmi-manasa-g can you take a look?

xinyuiscool · 2020-01-27T17:46:34Z

Now that I have looked at the coordinator stream impl of the metadata store, it seems we shouldn't couple flush with put/putAll/delete/deleteAll. It should be done as a separate call at the end of updates. The current impl is hideous and can cause further perf problems down the road.

samza-api/src/main/java/org/apache/samza/metadatastore/MetadataStore.java

samza-core/src/main/java/org/apache/samza/coordinator/metadatastore/CoordinatorStreamStore.java

samza-core/src/main/java/org/apache/samza/container/grouper/task/TaskAssignmentManager.java

alnzng · 2020-01-27T18:29:40Z

Now that I have looked at the coordinator stream impl of the metadata store, it seems we shouldn't couple flush with put/putAll/delete/deleteAll. It should be done as a separate call at the end of updates. The current impl is hideous and can cause further perf problems down the road.

@xinyuiscool Do you mean we should remove the flush calls from put/putAll/delete/deleteAll and let the callers of those functions decide when to flush?

xinyuiscool · 2020-01-27T18:54:55Z

Yes, I think that's a typical store api. Not sure why it was doing all the flushes before. @bharathkk might have some context about it, but he is on vacation.

alnzng · 2020-01-27T19:24:04Z

@lakshmi-manasa-g Thanks for the review. After a discussion with @xinyuiscool and @prateekm, we decided to switch to another solution that moves the flush out of all update methods in CoordinatorStreamStore. So the code you commented on above will be removed. I will let you know when the new changes are done.

prateekm · 2020-01-27T22:24:27Z

Alan, this PR seems to be solving a similar problem. #1125

Can you check how the problem / solution in that PR relates to this one?

alnzng · 2020-01-27T22:34:43Z

Alan, this PR seems to be solving a similar problem. #1125

Can you check how the problem / solution in that PR relates to this one?

@prateekm
Just checked PR #1125. I believe it solves exactly the same performance issue as this one.
However, it does so with batch updates, whereas the current PR decouples flush from the update methods.

If we solve the issue with this PR, we can close #1125 later.

alnzng · 2020-01-28T00:51:36Z

The changes are done and the details are updated in the PR description. Please help review, thanks!
@xinyuiscool @prateekm @lakshmi-manasa-g

...re/src/main/java/org/apache/samza/container/grouper/task/TaskPartitionAssignmentManager.java

samza-core/src/main/java/org/apache/samza/coordinator/metadatastore/CoordinatorStreamStore.java

...re/src/main/java/org/apache/samza/container/grouper/task/TaskPartitionAssignmentManager.java

mynameborat · 2020-01-28T22:55:05Z

Yes, I think that's a typical store api. Not sure why it was doing all the flushes before. @bharathkk might have some context about it, but he is on vacation.

@alnzng here is the PR for context - #1112
+1 to splitting flush and the API. Refer to the comments in the above PR regarding separating out flush and put APIs

lakshmi-manasa-g

thank you for addressing the comments.

alnzng · 2020-02-04T00:57:33Z

@xinyuiscool @prateekm
Can either of you help check and merge the PR if there are no new comments? Thanks.

xinyuiscool

LGTM. Thanks for the quick fix!

alnzng requested review from prateekm and xinyuiscool January 24, 2020 00:21

lakshmi-manasa-g reviewed Jan 27, 2020

alnzng added 3 commits January 27, 2020 11:37

Remove flush operation out of put functions

b7efe8d

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>

Explicitly call flush method after calling put/putAll/delete methods

413c834

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>

Check flush call in unit tests

4f57592

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>

alnzng added 2 commits January 27, 2020 15:09

Improve performance with batch update

7466ca3

1. Batch write task partition assignments information to metadata store.
2. Batch write task container information to metadata store.

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>

Fix checkstyle issue

a669601

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>

alnzng force-pushed the SAMZA-2444 branch from cd0aaac to a669601 Compare January 27, 2020 23:15

Fix unit test failures

e34e917

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>

lakshmi-manasa-g reviewed Jan 28, 2020

Remove duplicated codes and useless log

dfe49bc

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>

lakshmi-manasa-g approved these changes Jan 29, 2020

xinyuiscool approved these changes Feb 5, 2020

xinyuiscool merged commit ca641dc into apache:master Feb 5, 2020

alnzng deleted the SAMZA-2444 branch February 5, 2020 18:02
