[Magma] - Opening bucket during recovery takes a long time

Description

Script to Repro

Steps to Repro
1. Create a 5 node cluster
2021-07-13 15:05:12,207 | test | INFO | pool-3-thread-6 | [table_view:display:72] Rebalance Overview
+-----------------+----------+-----------------------+---------------+--------------+
| Nodes           | Services | Version               | CPU           | Status       |
+-----------------+----------+-----------------------+---------------+--------------+
| 172.23.98.196   | kv       | 7.1.0-1066-enterprise | 1.73779708663 | Cluster node |
| 172.23.98.195   | None     |                       |               | <--- IN ---  |
| 172.23.120.206  | None     |                       |               | <--- IN ---  |
| 172.23.104.186  | None     |                       |               | <--- IN ---  |
| 172.23.120.201  | None     |                       |               | <--- IN ---  |
+-----------------+----------+-----------------------+---------------+--------------+

2. Create bucket/scopes/collections/data
2021-07-13 15:10:04,253 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
+---------------------------------------------------------------+-----------+-----------------+----------+------------+-----+---------+-------------+------------+------------+
| Bucket                                                        | Type      | Storage Backend | Replicas | Durability | TTL | Items   | RAM Quota   | RAM Used   | Disk Used  |
+---------------------------------------------------------------+-----------+-----------------+----------+------------+-----+---------+-------------+------------+------------+
| r4dK9wYeZBZ%iqABnPUVxh6g5r2Sj4CAH8N89or9vhAsV3Sv3AK-22-897000 | couchbase | magma           | 2        | none       | 0   | 3000000 | 10485760000 | 2407566136 | 1992997674 |
+---------------------------------------------------------------+-----------+-----------------+----------+------------+-----+---------+-------------+------------+------------+

3. Do a graceful failover/delta recovery/rebalance of node 172.23.104.186.

4. Do a graceful failover/delta recovery/rebalance of another node, 172.23.120.201 (a REST sketch of this failover/recovery/rebalance cycle follows the notes below).

This rebalance fails as shown below.

Some points worth noting:
1. This is fairly easy to reproduce, but not on every run: we hit it in at least 3 out of 10 attempts across the 10-15 runs I have done, with each run containing 15-20 tests.
2. It is only seen on the second graceful failover/delta recovery/rebalance (step 4); when it does occur on the first graceful failover/delta recovery/rebalance (step 3), it fails like .
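For reference, the bucket setup and the failover/delta recovery/rebalance cycle in steps 2-4 can be driven through the cluster REST API. The sketch below is illustrative only and is not the actual test script; the credentials, RAM quota, and choice of orchestrator node are assumptions, and it uses the standard /pools/default/buckets, /controller/startGracefulFailover, /controller/setRecoveryType and /controller/rebalance endpoints.

```python
# Illustrative driver for steps 2-4; node IPs, credentials and sizing are placeholders.
import time
import requests

AUTH = ("Administrator", "password")            # placeholder credentials
ORCH = "http://172.23.98.196:8091"              # orchestrator node from the table above


def create_magma_bucket(name):
    # Step 2: couchbase bucket on the magma storage backend with 2 replicas.
    requests.post(f"{ORCH}/pools/default/buckets", auth=AUTH, data={
        "name": name,
        "bucketType": "couchbase",
        "storageBackend": "magma",
        "replicaNumber": 2,
        "ramQuota": 10000,                      # MB; roughly the 10 GB quota in the stats table
    }).raise_for_status()


def failover_delta_recover_rebalance(node_ip, all_node_ips):
    # Steps 3/4: graceful failover -> delta recovery -> rebalance for a single node.
    otp = f"ns_1@{node_ip}"
    requests.post(f"{ORCH}/controller/startGracefulFailover",
                  auth=AUTH, data={"otpNode": otp}).raise_for_status()
    wait_for_rebalance()                        # graceful failover progress is reported like a rebalance
    requests.post(f"{ORCH}/controller/setRecoveryType", auth=AUTH,
                  data={"otpNode": otp, "recoveryType": "delta"}).raise_for_status()
    requests.post(f"{ORCH}/controller/rebalance", auth=AUTH, data={
        "knownNodes": ",".join(f"ns_1@{ip}" for ip in all_node_ips),
        "ejectedNodes": "",
    }).raise_for_status()
    wait_for_rebalance()


def wait_for_rebalance(poll_secs=10):
    while True:
        progress = requests.get(f"{ORCH}/pools/default/rebalanceProgress", auth=AUTH).json()
        if progress.get("status") == "none":    # "none" once the operation has finished
            return
        time.sleep(poll_secs)
```

It is the second call of this failover/recovery/rebalance cycle (step 4) that hits the failure described above.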

cbcollect_info attached.

Environment

7.1.0-1066-enterprise

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

3

Activity

Steve Watanabe October 18, 2021 at 10:06 PM

Is there a way for ns_server to determine, via stats, that the bucket is ready to have DCP connections created successfully?

Ben Huddleston October 18, 2021 at 9:57 PM

It is indeed. https://github.com/couchbase/kv_engine/blob/master/engines/ep/src/warmup.h#L112-L219 gives a good description of the various phases of warmup. After the PopulateVBucketMap phase, vBucket stats should be retrievable. However, it's not until the Done state that DCP consumers are creatable. The bulk of warmup (certainly for couchstore buckets) is in the data-loading phases that run after PopulateVBucketMap. The changes I mentioned earlier didn't alter this, though: they changed it so that after we enable traffic (mutations etc.) we also have to wait for all warmup threads to finish before we accept a DCP consumer (to prevent a race condition).

(Side note - the comment on the isFinishedLoading() function allowing creation of DcpConsumers is out of date; it should be isComplete() now - will update.)
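For illustration, below is a rough sketch of what a stats-based readiness check could look like, based on the warmup phases described above. It shells out to cbstats and reads the warmup stat group (ep_warmup_state / ep_warmup_thread); the exact path, flags, and stat names should be verified against the installed build, and this is not how ns_server currently decides readiness.

```python
# Rough readiness check: poll the "warmup" stat group until the warmup thread
# reports "complete" (the Done state above), at which point DCP consumers
# should be creatable. The cbstats path and flags are assumptions for this sketch.
import subprocess
import time

CBSTATS = "/opt/couchbase/bin/cbstats"


def warmup_stats(host, bucket, user="Administrator", password="password"):
    out = subprocess.check_output(
        [CBSTATS, f"{host}:11210", "warmup", "-b", bucket, "-u", user, "-p", password],
        text=True)
    stats = {}
    for line in out.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            stats[key.strip()] = value.strip()
    return stats


def wait_for_dcp_ready(host, bucket, poll_secs=5):
    # vBucket stats can already be visible after PopulateVBucketMap, so checking
    # vbucket state alone is not sufficient; wait for the warmup thread itself.
    while warmup_stats(host, bucket).get("ep_warmup_thread") != "complete":
        time.sleep(poll_secs)
```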

Steve Watanabe October 18, 2021 at 9:45 PM

Node .105 determines the state of "default" by obtaining stats (command 0x10) from memcached on node .111. We do this to "poll" the state of the bucket on the other node. At some point memcached returns the state of each vbucket in the stats; e.g. here are some debug logs.

and ns_server considers this to mean the bucket is ready. Is it possible memcached is returning the vbucket stats when it hasn't completed all its warmup activities (which means it'll return ETMPFAIL for DCP connections)?

Steve Watanabe October 15, 2021 at 9:39 PM
Edited

Looking at the debug.log for node .111, we see the bucket loaded and a dcp connection being started, which errors out because (per Ben) the bucket is not yet fully warmed up.

And then the bucket does get marked as warmed.

So it seems as if we're prematurely trying to establish the dcp connection. On node .105 we see the bucket has become ready, and so dcp connections will be set up (and fail). So it appears node .105's determination of the bucket being ready is wrong.

Ben Huddleston October 15, 2021 at 7:39 AM

http://review.couchbase.org/c/kv_engine/+/162170 feels like a likely cause for this.

Looking at the logs picked out, "Engine warmup is complete" is a bit of a rubbish log message: I think it gets logged per shard rather than per bucket, and with the change linked above we now require all of the warmup tasks to actually finish before the bucket allows a DCP consumer to be created. We actually want to look at the last occurrence of that message (we then see a DCPConsumer successfully open after that).

Hopefully this is enough info for you to go on. When exactly is ns_server expecting to be able to open a DCPConsumer?
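As a quick way to find the last occurrence mentioned above in a cbcollect, something like the following (purely illustrative, not part of any tooling) prints the final "Engine warmup is complete" line from a memcached.log rather than the first:

```python
# Print the LAST "Engine warmup is complete" line in a memcached.log, since the
# message is emitted per shard rather than per bucket.
def last_warmup_complete(log_path):
    last = None
    with open(log_path, errors="replace") as log:
        for line in log:
            if "Engine warmup is complete" in line:
                last = line.rstrip()
    return last


if __name__ == "__main__":
    print(last_warmup_complete("memcached.log"))
```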

Steve Watanabe October 14, 2021 at 7:36 PM

This appears to be a different cause. Looking at node 107.105, the janitor_agent has died while attempting to establish a dcp connection to node 107.111, which fails with an ETMPFAIL.

Looking at the memcached.log on node 107.111 I don't see what may be leading to the error (with my untrained memcached-eye).

Assigning this ticket to kv for further triage. Feel free to return the ticket to me afterwards.

Steve Watanabe September 20, 2021 at 9:48 PM

Based on the updates above, where it was asked "We will address this post Neo. Please let us know if that is OK." and the response was "yes post Neo is fine.", I'm changing the target to Morpheus.

Meni Hillel September 18, 2021 at 4:40 AM

For Neo, it seems that is the best we can do - increase the timeout. What would be an acceptable timeout under extreme conditions? More importantly, we should probably document what one may need to do if the timeout hits. Is there a way to estimate the duration of a given request?
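On estimating duration: one rough option (a sketch only, reusing the warmup_stats() helper from the sketch further up this page; the ep_warmup_*_count stat names are assumptions based on the warmup stat group) is to sample how many keys warmup has loaded against its estimated total and extrapolate the remaining time:

```python
# Rough warmup-progress estimate: sample the loaded key count twice and
# extrapolate time remaining from the estimated total. Stat names are assumed.
import time


def estimate_remaining_secs(host, bucket, sample_gap_secs=10):
    first = warmup_stats(host, bucket)           # helper from the sketch further up
    time.sleep(sample_gap_secs)
    second = warmup_stats(host, bucket)

    loaded_before = int(first.get("ep_warmup_key_count", 0))
    loaded_now = int(second.get("ep_warmup_key_count", 0))
    total = int(second.get("ep_warmup_estimated_key_count", 0))

    rate = (loaded_now - loaded_before) / sample_gap_secs   # keys loaded per second
    if rate <= 0:
        return None                                         # no measurable progress yet
    return (total - loaded_now) / rate
```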

Shivani Gupta September 17, 2021 at 6:19 PM

Ok, that's good. So from an end-user perspective it is not bucket creation that can take long and time out, which is good, because bucket creation will be frequent on slower developer-type VMs where people are just trying it out.

I suggest we change the title of the ticket to 'Opening a Magma bucket during recovery takes a long time'. This issue is about opening an existing bucket, not about creating a new bucket.

Sarath Lakshman September 17, 2021 at 6:15 PM
Edited

Bucket creation should not take time. It will take time only in the case of a bucket with existing data. For an empty bucket, it won't create even a single sstable until data is inserted. The ns_server/kv team can clarify cases other than delta recovery.

Shivani Gupta September 17, 2021 at 6:09 PM
Edited

So can the creation of a new bucket also take this long? Actually, I am not sure why the ticket title is 'Decrease creation time of bucket'.

Resolution

Fixed

Details

Is this a Regression?

No

Triage

Triaged

Operating System

CentOS 64-bit

Issue Impact

external

Created July 14, 2021 at 7:55 AM
Updated 4 days ago
Resolved February 10, 2022 at 3:12 PM