Comments (10)
FWIW https://github.com/scylladb/scylla-dtest/pull/4337 will prevent this from happening, but I agree we need to get to the bottom of it and understand why view building is happening only during "Draining view builder".
I'm not sure how tablets contribute to that. Maybe they change the timing.
@bhalevy perhaps the system_distributed keyspace is also using tablets when we're in tablets mode? This could provide a hint as to why the system_distributed:view_build_status writes failed, if there's a difference in replication and so on. OTOH the system writes failed too, so it's weird that this failure is tablets-specific.
The system_distributed keyspace is created with SimpleStrategy, so it can't support tablets (though we need to improve the log printout to make that clearer):
https://jenkins.scylladb.com/job/scylla-master/job/tablets/job/gating-dtest-release-with-tablets/77/artifact/logs-full.release.001/1716857989819_concurrent_schema_changes_test.py%3A%3ATestConcurrentSchemaChanges%3A%3Atest_changes_while_node_down/node1.log
INFO 2024-05-28 00:59:08,721 [shard 0:strm] migration_manager - Create new Keyspace: KSMetaData{name=system_distributed, strategyClass=org.apache.cassandra.locator.SimpleStrategy, strategyOptions={"replication_factor": "3"}, cfMetaData={}, durable_writes=true, userTypes=org.apache.cassandra.config.UTMetaData@0x600004ba72a8}
Maybe some tablets-specific task is somehow locking data structures that are also accessed by view building, delaying view building until the end of the test.
So it could be that these ERRORs were always possible, depending on view building timing; it's just that the tablets work somehow delays it.
> Maybe some tablets-specific task is somehow locking data structures that are also accessed by view building, delaying view building until the end of the test.

Cc @tgrabiec
And I wouldn't say it's the end of the test; rather, it's the node shutdown, which happens automatically when the test exits.
We can try to reproduce by stopping the nodes in the test body.
Might be similarly reproduced as:
storage_proxy - exception during mutation write to 127.0.70.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:raft (pk{00101bbaa8e036ab11ef8ff250cf0fc55735}) to commitlog): seastar::gate_closed_exception (gate closed)
reproduction with pytest:

    import asyncio

    import pytest

    from test.pylib.manager_client import ManagerClient


    @pytest.mark.asyncio
    async def test_mv_build_18929(manager: ManagerClient):
        cfg = {'enable_tablets': True}
        servers = await manager.servers_add(3, config=cfg)
        cql = manager.get_cql()
        await cql.run_async("CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")
        await cql.run_async("CREATE TABLE ks.t (pk int primary key, v int)")
        for i in range(100):
            await cql.run_async(f"insert into ks.t (pk, v) values ({i}, {i+1})")
        await manager.server_stop_gracefully(servers[2].server_id)
        await cql.run_async("CREATE materialized view ks.t_view1 AS select pk, v from ks.t where v is not null primary key (v, pk)")
        await cql.run_async("CREATE materialized view ks.t_view2 AS select pk, v from ks.t where v is not null primary key (v, pk)")
        await asyncio.sleep(10)
        await cql.run_async("DROP materialized view ks.t_view1")
        await manager.server_stop_gracefully(servers[1].server_id)
        await asyncio.sleep(10)
        await manager.server_stop_gracefully(servers[0].server_id)
        await asyncio.sleep(10)
@mlitvk at what point in your test does the problem reproduce? Maybe you don't need those 30 seconds of sleeps and 3 nodes to reproduce it (in theory, if it's a shutdown thing, it might be reproducible on just one node).
> @mlitvk at what point in your test does the problem reproduce? Maybe you don't need those 30 seconds of sleeps and 3 nodes to reproduce it (in theory, if it's a shutdown thing, it might be reproducible on just one node).

The reproduction is probably not the most efficient, but it's useful for the investigation; I can try to improve it as I understand the problem better.
The problem reproduces on the second node shutdown, when it's draining the view builder:
INFO 2024-07-03 17:25:13,837 [shard 0:main] view - Draining view builder
INFO 2024-07-03 17:25:13,838 [shard 1:main] view - Draining view builder
ERROR 2024-07-03 17:25:13,838 [shard 0: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system_distributed:view_build_status (pk{00026b730007745f7669657731}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-07-03 17:25:13,838 [shard 0: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:built_views (pk{00026b73}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-07-03 17:25:13,838 [shard 0: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:scylla_views_builds_in_progress (pk{00026b73}) to commitlog): seastar::gate_closed_exception (gate closed)
WARN  2024-07-03 17:25:13,838 [shard 1: gms] view - Unexcepted error executing build step: seastar::sleep_aborted (Sleep is aborted). Ignored.
WARN  2024-07-03 17:25:13,838 [shard 0: gms] view - Failed to cleanup view ks.t_view1: exceptions::mutation_write_failure_exception (Operation failed for system.scylla_views_builds_in_progress - received 0 responses and 1 failures from 1 CL=ONE.)
ERROR 2024-07-03 17:25:13,838 [shard 1: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:scylla_views_builds_in_progress (pk{00026b73}) to commitlog): seastar::gate_closed_exception (gate closed)
> @mlitvk at what point in your test does the problem reproduce? Maybe you don't need those 30 seconds of sleeps and 3 nodes to reproduce it (in theory, if it's a shutdown thing, it might be reproducible on just one node).
> The reproduction is probably not the most efficient, but it's useful for the investigation; I can try to improve it as I understand the problem better.

Sure, thanks, it's indeed very useful.
If later you want to make it into a regression test, you'll also need to check for the log messages that we don't want to see. Or maybe just put it in dtest, where slow tests with obsessive log checking are the norm :-)
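If it does become a regression test, the log check could be sketched roughly like this; the pattern and helper name are my assumptions, not an existing dtest API:

```python
import re

# Error signatures from this issue that the test should treat as failures.
# The exact pattern is an assumption based on the logs quoted above.
FORBIDDEN = re.compile(
    r"Could not write mutation .* to commitlog.*gate closed"
    r"|Error setting up view for building"
)

def find_forbidden_errors(log_text: str) -> list[str]:
    """Return the log lines matching the shutdown-time error signatures."""
    return [line for line in log_text.splitlines() if FORBIDDEN.search(line)]
```

The test would run the reproduction and then assert `find_forbidden_errors(node_log) == []` for each node's log.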
suspected scenario:
- creating and starting to build view1 and getting errors because a node is down
INFO 2024-05-15 01:08:01,854 [shard 0:strm] schema_tables - Creating ks_ns1.index_ns1_index id=9387f3f1-1257-11ef-98d0-09495dfd9fdd version=9387f3f2-1257-11ef-98d0-09495dfd9fdd
INFO 2024-05-15 01:08:01,862 [shard 0:strm] view - Building view ks_ns1.index_ns1_index, starting at token minimum token
WARN 2024-05-15 01:08:01,863 [shard 0:strm] view - Error applying view update to 127.0.42.2 (view: ks_ns1.index_ns1_index, base token: -8839064797231613815, view token: 8833996863197925870): exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
WARN 2024-05-15 01:08:01,863 [shard 0:strm] view - Error executing build step for base ks_ns1.cf_ns1: exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
- we retry with exponential backoff
WARN 2024-05-15 01:08:02,863 [shard 0:strm] view - Error executing build step for base ks_ns1.cf_ns1: exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
WARN 2024-05-15 01:08:04,863 [shard 0:strm] view - Error executing build step for base ks_ns1.cf_ns1: exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
- create view2
INFO 2024-05-15 01:08:11,576 [shard 0:strm] schema_tables - Creating ks_ns1.index2_ns1_index id=99542ce1-1257-11ef-98d0-09495dfd9fdd version=99542ce2-1257-11ef-98d0-09495dfd9fdd
- drop view1
INFO 2024-05-15 01:08:11,989 [shard 0:strm] database - Truncating ks_ns1.index_ns1_index with auto-snapshot
- view1 retry continues until the node shuts down, then we drain the view builder and abort the build step
WARN 2024-05-15 01:08:08,864 [shard 0:strm] view - Error executing build step for base ks_ns1.cf_ns1: exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
INFO 2024-05-15 01:08:16,534 [shard 0:main] view - Draining view builder
WARN 2024-05-15 01:08:16,534 [shard 0:strm] view - Unexcepted error executing build step: seastar::sleep_aborted (Sleep is aborted). Ignored.
- note that the database was shut down before the view builder was drained
INFO 2024-05-15 01:08:16,484 [shard 0:main] database - Flushing non-system tables
INFO 2024-05-15 01:08:16,484 [shard 0:main] database - Flushed non-system tables
INFO 2024-05-15 01:08:16,484 [shard 1:main] database - Flushing non-system tables
INFO 2024-05-15 01:08:16,494 [shard 1:main] database - Flushed non-system tables
INFO 2024-05-15 01:08:16,494 [shard 0:main] database - Flushing system tables
INFO 2024-05-15 01:08:16,494 [shard 1:main] database - Flushing system tables
INFO 2024-05-15 01:08:16,498 [shard 1:main] database - Flushed system tables
INFO 2024-05-15 01:08:16,531 [shard 0:main] database - Flushed system tables
INFO 2024-05-15 01:08:16,534 [shard 0:main] view - Draining view builder
- at this point we start to build view2; I think it was waiting until now for a semaphore held by the view1 build
INFO 2024-05-15 01:08:16,534 [shard 0:strm] view - Building view ks_ns1.index2_ns1_index, starting at token -8839064797231613815
- view2 build fails because commitlog was shutdown
ERROR 2024-05-15 01:08:16,534 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:scylla_views_builds_in_progress (pk{00066b735f6e7331}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-05-15 01:08:16,534 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system_distributed:view_build_status (pk{00066b735f6e73310010696e646578325f6e73315f696e646578}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-05-15 01:08:16,534 [shard 0:strm] view - Error setting up view for building ks_ns1.index2_ns1_index: exceptions::mutation_write_failure_exception (Operation failed for system_distributed.view_build_status - received 0 responses and 3 failures from 1 CL=ONE.)
- view1 cleanup fails for the same reason
ERROR 2024-05-15 01:08:16,535 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:built_views (pk{00066b735f6e7331}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-05-15 01:08:16,535 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:scylla_views_builds_in_progress (pk{00066b735f6e7331}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-05-15 01:08:16,535 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system_distributed:view_build_status (pk{00066b735f6e7331000f696e6465785f6e73315f696e646578}) to commitlog): seastar::gate_closed_exception (gate closed)
WARN  2024-05-15 01:08:16,535 [shard 0:strm] view - Failed to cleanup view ks_ns1.index_ns1_index: exceptions::mutation_write_failure_exception (Operation failed for system.scylla_views_builds_in_progress - received 0 responses and 1 failures from 1 CL=ONE.)
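The suspected interleaving can be modeled with a toy asyncio simulation. This is my assumption about the mechanism, not Scylla code: view2's build step queues behind a concurrency semaphore held by view1's stuck retry loop, so it only starts once drain aborts view1, by which time the commitlog gate is already closed.

```python
import asyncio

async def simulate() -> list[str]:
    events: list[str] = []
    build_sem = asyncio.Semaphore(1)  # one build step at a time
    commitlog_open = True

    async def build_view1() -> None:
        async with build_sem:
            events.append("view1: building, retrying (replica down)")
            await asyncio.sleep(3600)  # stands in for the backoff/retry loop

    async def build_view2() -> None:
        async with build_sem:  # blocked until view1 releases the semaphore
            if commitlog_open:
                events.append("view2: status written")
            else:
                events.append("view2: write failed (gate closed)")

    t1 = asyncio.create_task(build_view1())
    t2 = asyncio.create_task(build_view2())
    await asyncio.sleep(0.01)      # let view1 grab the semaphore, view2 queue

    commitlog_open = False         # database/commitlog drained first...
    t1.cancel()                    # ...then the view builder drain aborts view1
    await asyncio.gather(t1, t2, return_exceptions=True)
    return events
```

Running `asyncio.run(simulate())` shows view1 starting and view2 failing only after the "commitlog" is closed, matching the order of the log messages above.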
Basically I think we need to drain the view builder before draining the db. The current shutdown order is:

    co_await _db.invoke_on_all(&replica::database::drain);
    co_await _sys_ks.invoke_on_all(&db::system_keyspace::shutdown);
    co_await _repair.invoke_on_all(&repair_service::shutdown);
    co_await _view_builder.invoke_on_all(&db::view::view_builder::drain);
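Why the order matters can be shown with a minimal sketch (illustrative names, not Scylla's actual services): if the sink the builder writes to is drained first, the builder's final status write hits a closed gate; draining the builder first lets that write complete.

```python
import asyncio

class Commitlog:
    def __init__(self) -> None:
        self.open = True

    async def write(self, what: str) -> str:
        return f"wrote {what}" if self.open else f"gate closed for {what}"

    async def drain(self) -> None:
        self.open = False

class ViewBuilder:
    def __init__(self, log: Commitlog) -> None:
        self.log = log
        self.results: list[str] = []

    async def drain(self) -> None:
        # flush the final build status before stopping
        self.results.append(await self.log.write("view_build_status"))

async def shutdown(order: list[str]) -> list[str]:
    log = Commitlog()
    builder = ViewBuilder(log)
    services = {"db": log, "view_builder": builder}
    for name in order:
        await services[name].drain()
    return builder.results
```

With the current order, `asyncio.run(shutdown(["db", "view_builder"]))` produces the gate-closed failure; reversing it to `["view_builder", "db"]` lets the status write succeed.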