Comments (10)

bhalevy avatar bhalevy commented on September 28, 2024

FWIW https://github.com/scylladb/scylla-dtest/pull/4337 will prevent this from happening,
but I agree we need to get to the bottom of it and understand why view building happens only at the "Draining view builder" stage.
I'm not sure how tablets contribute to that.
Maybe they change the timing.

from scylladb.

bhalevy avatar bhalevy commented on September 28, 2024

@bhalevy perhaps the system_distributed keyspace is also using tablets when we're in tablets mode? This could provide a hint why the system_distributed:view_build_status writes failed, if there's a difference in replication and so on.

OTOH also system writes failed, so it's weird that this failure is only tablets specific.

The system_distributed keyspace is created with SimpleStrategy, so it can't support tablets (though we need to improve the log printout to make that clearer).
https://jenkins.scylladb.com/job/scylla-master/job/tablets/job/gating-dtest-release-with-tablets/77/artifact/logs-full.release.001/1716857989819_concurrent_schema_changes_test.py%3A%3ATestConcurrentSchemaChanges%3A%3Atest_changes_while_node_down/node1.log

INFO  2024-05-28 00:59:08,721 [shard 0:strm] migration_manager - Create new Keyspace: KSMetaData{name=system_distributed, strategyClass=org.apache.cassandra.locator.SimpleStrategy, strategyOptions={"replication_factor": "3"}, cfMetaData={}, durable_writes=true, userTypes=org.apache.cassandra.config.UTMetaData@0x600004ba72a8}

kbr-scylla avatar kbr-scylla commented on September 28, 2024

Maybe some tablets specific task is somehow locking data structures that are also accessed by view building, delaying view building until the end of the test.

So it could be that these ERRORs were always possible and depended on view building timing; it's just that the tablets code somehow delays it.

bhalevy avatar bhalevy commented on September 28, 2024

Maybe some tablets specific task is somehow locking data structures that are also accessed by view building, delaying view building until the end of the test.

Cc @tgrabiec

And I wouldn't say it's the end of the test, but rather it's the node shutdown, which happens automatically when the test exits.
We can try to reproduce it by stopping the nodes in the test body.

yarongilor avatar yarongilor commented on September 28, 2024

This might be similarly reproduced as:

storage_proxy - exception during mutation write to 127.0.70.1: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:raft (pk{00101bbaa8e036ab11ef8ff250cf0fc55735}) to commitlog): seastar::gate_closed_exception (gate closed)

https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-debug/305/testReport/schema_test/TestSchema/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split002___test_restart_with_large_tables/

mlitvk avatar mlitvk commented on September 28, 2024

Reproduction with pytest:

import asyncio

import pytest

# Assumed import path for ManagerClient in the Scylla test framework.
from test.pylib.manager_client import ManagerClient


@pytest.mark.asyncio
async def test_mv_build_18929(manager: ManagerClient):
    # Start a 3-node cluster with tablets enabled.
    cfg = {'enable_tablets': True}
    servers = await manager.servers_add(3, config=cfg)

    cql = manager.get_cql()
    await cql.run_async("CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")
    await cql.run_async("CREATE TABLE ks.t (pk int primary key, v int)")

    # Populate the base table so the views have something to build.
    for i in range(100):
        await cql.run_async(f"insert into ks.t (pk, v) values ({i}, {i+1})")

    # Take one node down before creating the views, so view updates fail and are retried.
    await manager.server_stop_gracefully(servers[2].server_id)

    await cql.run_async("CREATE materialized view ks.t_view1 AS select pk, v from ks.t where v is not null primary key (v, pk)")
    await cql.run_async("CREATE materialized view ks.t_view2 AS select pk, v from ks.t where v is not null primary key (v, pk)")

    # Drop the first view while its build is still retrying, then shut down the remaining nodes.
    await asyncio.sleep(10)
    await cql.run_async("DROP materialized view ks.t_view1")
    await manager.server_stop_gracefully(servers[1].server_id)
    await asyncio.sleep(10)
    await manager.server_stop_gracefully(servers[0].server_id)
    await asyncio.sleep(10)

nyh avatar nyh commented on September 28, 2024

@mlitvk at what point in your test does the problem reproduce? Maybe you don't need those 30 seconds of sleeps and 3 nodes to reproduce it (in theory, if it's a shutdown thing, it might be reproducible on just one node).

mlitvk avatar mlitvk commented on September 28, 2024

@mlitvk at what point in your test does the problem reproduce? Maybe you don't need those 30 seconds of sleeps and 3 nodes to reproduce it (in theory, if it's a shutdown thing, it might be reproducible on just one node).

probably the reproduction is not the most efficient but it's useful for the investigation, I can try to improve it as I understand the problem better.

the problem reproduces on the second node shutdown when it's draining the view builder

INFO  2024-07-03 17:25:13,837 [shard 0:main] view - Draining view builder
INFO  2024-07-03 17:25:13,838 [shard 1:main] view - Draining view builder
ERROR 2024-07-03 17:25:13,838 [shard 0: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system_distributed:view_build_status (pk{00026b730007745f7669657731}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-07-03 17:25:13,838 [shard 0: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:built_views (pk{00026b73}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-07-03 17:25:13,838 [shard 0: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:scylla_views_builds_in_progress (pk{00026b73}) to commitlog): seastar::gate_closed_exception (gate closed)
WARN  2024-07-03 17:25:13,838 [shard 1: gms] view - Unexcepted error executing build step: seastar::sleep_aborted (Sleep is aborted). Ignored.
WARN  2024-07-03 17:25:13,838 [shard 0: gms] view - Failed to cleanup view ks.t_view1: exceptions::mutation_write_failure_exception (Operation failed for system.scylla_views_builds_in_progress - received 0 responses and 1 failures from 1 CL=ONE.)
ERROR 2024-07-03 17:25:13,838 [shard 1: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:scylla_views_builds_in_progress (pk{00026b73}) to commitlog): seastar::gate_closed_exception (gate closed)

nyh avatar nyh commented on September 28, 2024

@mlitvk at what point in your test does the problem reproduce? Maybe you don't need those 30 seconds of sleeps and 3 nodes to reproduce it (in theory, if it's a shutdown thing, it might be reproducible on just one node).

probably the reproduction is not the most efficient but it's useful for the investigation, I can try to improve it as I understand the problem better.

Sure, thanks, it's indeed very useful.
If you later want to turn it into a regression test, you'll also need to check for the log message that we don't want to see. Or maybe just put it in dtest, where slow tests with obsessive log checking are the norm :-)
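A minimal sketch of such a log check, assuming the test can read the node log as plain text. The pattern and the helper name `find_unwanted_errors` are hypothetical, not part of any existing test framework; the sample lines are taken from the logs quoted in this thread.

```python
import re

# Hypothetical pattern: matches the storage_proxy ERROR lines quoted in this thread.
UNWANTED = re.compile(
    r"^ERROR .*storage_proxy - exception during mutation write.*gate_closed_exception"
)


def find_unwanted_errors(log_text: str) -> list[str]:
    """Return every log line that matches the unwanted shutdown error."""
    return [line for line in log_text.splitlines() if UNWANTED.search(line)]


SAMPLE_LOG = """\
INFO  2024-07-03 17:25:13,837 [shard 0:main] view - Draining view builder
ERROR 2024-07-03 17:25:13,838 [shard 0: gms] storage_proxy - exception during mutation write to 127.198.249.2: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:built_views (pk{00026b73}) to commitlog): seastar::gate_closed_exception (gate closed)
WARN  2024-07-03 17:25:13,838 [shard 1: gms] view - Unexcepted error executing build step: seastar::sleep_aborted (Sleep is aborted). Ignored.
"""

# A regression test would assert that this list is empty after shutdown.
matches = find_unwanted_errors(SAMPLE_LOG)
print(len(matches))
```

On the sample above the helper finds the single ERROR line; a regression test would instead assert that nothing matches once the fix is in.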

mlitvk avatar mlitvk commented on September 28, 2024

suspected scenario:

  1. creating and starting to build view1 and getting errors because a node is down
INFO  2024-05-15 01:08:01,854 [shard 0:strm] schema_tables - Creating ks_ns1.index_ns1_index id=9387f3f1-1257-11ef-98d0-09495dfd9fdd version=9387f3f2-1257-11ef-98d0-09495dfd9fdd
INFO  2024-05-15 01:08:01,862 [shard 0:strm] view - Building view ks_ns1.index_ns1_index, starting at token minimum token
WARN  2024-05-15 01:08:01,863 [shard 0:strm] view - Error applying view update to 127.0.42.2 (view: ks_ns1.index_ns1_index, base token: -8839064797231613815, view token: 8833996863197925870): exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
WARN  2024-05-15 01:08:01,863 [shard 0:strm] view - Error executing build step for base ks_ns1.cf_ns1: exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
  2. we retry with exponential backoff
WARN  2024-05-15 01:08:02,863 [shard 0:strm] view - Error executing build step for base ks_ns1.cf_ns1: exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)

WARN  2024-05-15 01:08:04,863 [shard 0:strm] view - Error executing build step for base ks_ns1.cf_ns1: exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
  3. create view2
INFO  2024-05-15 01:08:11,576 [shard 0:strm] schema_tables - Creating ks_ns1.index2_ns1_index id=99542ce1-1257-11ef-98d0-09495dfd9fdd version=99542ce2-1257-11ef-98d0-09495dfd9fdd
  4. drop view1
INFO  2024-05-15 01:08:11,989 [shard 0:strm] database - Truncating ks_ns1.index_ns1_index with auto-snapshot
  5. view1 retry continues until the node shuts down, then we drain the view builder and abort the build step
WARN  2024-05-15 01:08:08,864 [shard 0:strm] view - Error executing build step for base ks_ns1.cf_ns1: exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)

INFO  2024-05-15 01:08:16,534 [shard 0:main] view - Draining view builder
WARN  2024-05-15 01:08:16,534 [shard 0:strm] view - Unexcepted error executing build step: seastar::sleep_aborted (Sleep is aborted). Ignored.
  6. note that the database was shut down before draining the view builder
INFO  2024-05-15 01:08:16,484 [shard 0:main] database - Flushing non-system tables
INFO  2024-05-15 01:08:16,484 [shard 0:main] database - Flushed non-system tables
INFO  2024-05-15 01:08:16,484 [shard 1:main] database - Flushing non-system tables
INFO  2024-05-15 01:08:16,494 [shard 1:main] database - Flushed non-system tables
INFO  2024-05-15 01:08:16,494 [shard 0:main] database - Flushing system tables
INFO  2024-05-15 01:08:16,494 [shard 1:main] database - Flushing system tables
INFO  2024-05-15 01:08:16,498 [shard 1:main] database - Flushed system tables
INFO  2024-05-15 01:08:16,531 [shard 0:main] database - Flushed system tables
INFO  2024-05-15 01:08:16,534 [shard 0:main] view - Draining view builder
  7. at this point we start to build view2. I think it was waiting until now for a semaphore held by the view1 build
INFO  2024-05-15 01:08:16,534 [shard 0:strm] view - Building view ks_ns1.index2_ns1_index, starting at token -8839064797231613815
  8. view2 build fails because the commitlog was shut down
ERROR 2024-05-15 01:08:16,534 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:scylla_views_builds_in_progress (pk{00066b735f6e7331}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-05-15 01:08:16,534 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system_distributed:view_build_status (pk{00066b735f6e73310010696e646578325f6e73315f696e646578}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-05-15 01:08:16,534 [shard 0:strm] view - Error setting up view for building ks_ns1.index2_ns1_index: exceptions::mutation_write_failure_exception (Operation failed for system_distributed.view_build_status - received 0 responses and 3 failures from 1 CL=ONE.)
  9. view1 cleanup fails for the same reason
ERROR 2024-05-15 01:08:16,535 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:built_views (pk{00066b735f6e7331}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-05-15 01:08:16,535 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system:scylla_views_builds_in_progress (pk{00066b735f6e7331}) to commitlog): seastar::gate_closed_exception (gate closed)
ERROR 2024-05-15 01:08:16,535 [shard 0:strm] storage_proxy - exception during mutation write to 127.0.42.3: utils::internal::nested_exception<std::runtime_error> (Could not write mutation system_distributed:view_build_status (pk{00066b735f6e7331000f696e6465785f6e73315f696e646578}) to commitlog): seastar::gate_closed_exception (gate closed)
WARN  2024-05-15 01:08:16,535 [shard 0:strm] view - Failed to cleanup view ks_ns1.index_ns1_index: exceptions::mutation_write_failure_exception (Operation failed for system.scylla_views_builds_in_progress - received 0 responses and 1 failures from 1 CL=ONE.)
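The exponential backoff in the retry step can be checked directly from the timestamps of the four "Error executing build step" warnings quoted in the scenario. This is only a small sketch over those quoted log lines; the helper name `retry_gaps` is made up for illustration.

```python
from datetime import datetime

# Timestamps of the "Error executing build step" warnings quoted in the scenario.
RETRY_TIMESTAMPS = [
    "2024-05-15 01:08:01,863",
    "2024-05-15 01:08:02,863",
    "2024-05-15 01:08:04,863",
    "2024-05-15 01:08:08,864",
]


def retry_gaps(timestamps):
    """Return the gaps between consecutive timestamps, rounded to whole seconds."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S,%f") for t in timestamps]
    return [round((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


print(retry_gaps(RETRY_TIMESTAMPS))  # [1, 2, 4]
```

The gaps double each time (1 s, 2 s, 4 s), so the next retry would be due around 01:08:16, which is roughly when the node starts draining.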

Basically, I think we need to drain the view builder before draining the database. This is the relevant shutdown sequence, where the view builder is drained last:

    co_await _db.invoke_on_all(&replica::database::drain);
    co_await _sys_ks.invoke_on_all(&db::system_keyspace::shutdown);
    co_await _repair.invoke_on_all(&repair_service::shutdown);
    co_await _view_builder.invoke_on_all(&db::view::view_builder::drain);
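To illustrate why the order matters, here is a toy Python model of the two orderings. It is not Scylla code: `ToyNode`, `GateClosed`, and the method names are hypothetical stand-ins, and the model assumes only what the logs suggest, namely that draining the view builder triggers one last status write.

```python
class GateClosed(Exception):
    """Stand-in for seastar::gate_closed_exception."""


class ToyNode:
    def __init__(self):
        self.commitlog_open = True
        self.builder_running = True
        self.errors = []

    def drain_db(self):
        # Draining the database closes the commitlog gate.
        self.commitlog_open = False

    def write_status(self):
        # Any status write needs an open commitlog.
        if not self.commitlog_open:
            raise GateClosed("gate closed")

    def drain_view_builder(self):
        # Assumption: draining wakes the pending build step, which writes its status once more.
        if self.builder_running:
            try:
                self.write_status()
            except GateClosed as exc:
                self.errors.append(str(exc))
            self.builder_running = False


def shutdown(order):
    """Run the given drain steps in order and return the errors that were recorded."""
    node = ToyNode()
    for step in order:
        getattr(node, step)()
    return node.errors


print(shutdown(["drain_db", "drain_view_builder"]))  # ['gate closed']
print(shutdown(["drain_view_builder", "drain_db"]))  # []
```

With the database drained first, the final write hits the closed gate; with the view builder drained first, it completes while the commitlog is still open.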
