Skip to content

Conversation

@alnzng
Copy link
Contributor

@alnzng alnzng commented Jan 23, 2020

Symptom

When Samza's job creates lots of tasks/partitions, it can take over a long time for AM to save the metadata in a run which may cause timeout exception.
We observed if the Samza's job has over 37k tasks/partitions, it takes over 10 min for AM to save it in a run.

Cause

JobModelManager uses CoordinatorStreamStore.put() to save job metadata information which does flush for each message, and the flush operation is heavy especially when the remote server suffering the performance issues.

Changes

  • Separate flush from put/putAll/delete functions in CoordinatorStreamStore.
  • Explicity call flush after call put/putAll/delete in related classes
  • Batch write task partition assignments information to metadata store.
  • Batch write task container information to metadata store.

Tests

  • All unit tests and integration tests are passed

API Changes

  • Replace writeTaskPartitionAssignment with new batch write method writeTaskPartitionAssignments in TaskPartitionAssignmentManager
  • Replace writeTaskContainerMapping with new batch write method writeTaskContainerMappings in TaskAssignmentManager.

Upgrade Instructions

None

Usage Instructions

None

@prateekm
Copy link
Contributor

@bharathkk and @lakshmi-manasa-g can you take a look?

@xinyuiscool
Copy link
Contributor

Now I looked at the coodinator stream impl for metadatstore, it seems we shouldn't couple flush with put/putall/delete/deleteall. It should be done as a separate call at the end of updates. The current impl is hideous and can cause further perf problems down the road.

@alnzng
Copy link
Contributor Author

alnzng commented Jan 27, 2020

Now I looked at the coodinator stream impl for metadatstore, it seems we shouldn't couple flush with put/putall/delete/deleteall. It should be done as a separate call at the end of updates. The current impl is hideous and can cause further perf problems down the road.

@xinyuiscool Do you mean we should remove flush calls from put/putall/delete/deleteall? And let the callers of those functions decide when to do flush operation?

@xinyuiscool
Copy link
Contributor

Yes, I think that's a typical store api. Not sure why it was doing all the flushes before. @bharathkk might have some context about it, but he is on vacation.

@alnzng
Copy link
Contributor Author

alnzng commented Jan 27, 2020

@lakshmi-manasa-g Thanks for the review. Had one discussion with @xinyuiscool and @prateekm , we decided to change with another solution which will move the flush out of all update methods in CoordinatorStreamStore. So above codes with your comments will be removed. I will let you know when the new changes are done.

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>
Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>
Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>
@prateekm
Copy link
Contributor

Alan, this PR seems to be solving a similar problem. #1125

Can you check how the problem / solution in that PR relates to this one?

@alnzng
Copy link
Contributor Author

alnzng commented Jan 27, 2020

Alan, this PR seems to be solving a similar problem. #1125

Can you check how the problem / solution in that PR relates to this one?

@prateekm
Just checked the PR: #1125. I believe it is solving exactly the same performance issue with this one.
But it is doing with batch updates way, and current PR is doing the way to decouple flush and update methods.

If we solve the issue with this PR, later we can close it.

1. Batch write task partition assignments information to metadata store.
2. Batch write task container information to metadata store.

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>
Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>
Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>
@alnzng
Copy link
Contributor Author

alnzng commented Jan 28, 2020

The changes are done and updated the details in the PR's description, please help review, thanks!
@xinyuiscool @prateekm @lakshmi-manasa-g

Signed-off-by: Alan Zhang <shuai.xyz@gmail.com>
@mynameborat
Copy link
Contributor

Yes, I think that's a typical store api. Not sure why it was doing all the flushes before. @bharathkk might have some context about it, but he is on vacation.

@alnzng here is the PR for context - #1112
+1 to splitting flush and the API. Refer to the comments in the above PR regarding separating out flush and put APIs

Copy link
Contributor

@lakshmi-manasa-g lakshmi-manasa-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for addressing the comments.

@alnzng
Copy link
Contributor Author

alnzng commented Feb 4, 2020

@xinyuiscool @prateekm
Can any of you help check and merge the PR if no new comments? Thanks.

Copy link
Contributor

@xinyuiscool xinyuiscool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the quick fix!

@xinyuiscool xinyuiscool merged commit ca641dc into apache:master Feb 5, 2020
@alnzng alnzng deleted the SAMZA-2444 branch February 5, 2020 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants