
ddl's Introduction


DDL

DDL module for Tarantool 1.10+

The DDL module enables you to describe a data schema in a declarative YAML-based format. It is a simpler alternative to describing the data schema in Lua and doesn't require deep knowledge of Lua. DDL is a built-in Cartridge module. See more details about Tarantool's data model in the documentation.

Contents

API

Set spaces format

`ddl.set_schema(schema)`
- If no spaces existed before, create them.
- If a space exists, check the space's format and indexes.
- If the format/indexes are different from those in the database,
  return an error.
- The module doesn't drop or alter any indexes.
- Spaces omitted in the DDL are ignored, the module doesn't check them.

Return values: `true` if no error, otherwise return `nil, err`

A call to `ddl.set_schema(schema)` creates a space `_ddl_sharding_key` with two fields: `space_name` with type `string` and `sharding_key` with type `array`.

Similarly for `sharding_func`: a call to `ddl.set_schema(schema)` creates a space `_ddl_sharding_func` with three fields: `space_name`, `sharding_func_name` and `sharding_func_body`, all with type `string`.

If you want to use a sharding function from some module, you need to require the module with the sharding function and set it into `_G` first. For example, to use sharding functions like `vshard.router.bucket_id_strcrc32` and `vshard.router.bucket_id_mpcrc32` from the `vshard` module, you need to require the `vshard` module.

Also, you can pass your own sharding function by defining the function name in `_G` or by specifying Lua code directly in the `body` field:

sharding_func = {
  body = 'function(key) return <...> end'
}

Your sharding function defined in `_G` should have type `function`, or `table` | `cdata` | `userdata` with a `__call` metamethod.

The function must have a prototype conforming to the following rules:

Parameters

key (number | table | string | boolean) – a sharding key.

Return value

bucket identifier (number)
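As a minimal illustration of this prototype, here is a sketch of a custom sharding function registered in `_G`. The name `my_sharding_func`, the naive byte-sum hash, and the bucket count are all illustrative assumptions, not part of the ddl module; real code would typically use one of the `vshard.router.bucket_id_*` functions.

```lua
-- Hypothetical custom sharding function: takes a sharding key,
-- returns a bucket identifier (a number in [1, bucket_count]).
local BUCKET_COUNT = 3000

local function my_sharding_func(key)
    -- naive byte-sum hash for illustration only
    local s = tostring(key)
    local sum = 0
    for i = 1, #s do
        sum = (sum + s:byte(i)) % BUCKET_COUNT
    end
    return sum + 1 -- bucket identifiers are 1-based
end

-- register in _G so the DDL schema can reference it by name
rawset(_G, 'my_sharding_func', my_sharding_func)
```

After this, the schema could refer to it as `sharding_func = 'my_sharding_func'`.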

Check compatibility

`ddl.check_schema(schema)`
- Check that a `set_schema()` call will raise no error.

Return values: `true` if no error, otherwise return `nil, err`

Get spaces format

`ddl.get_schema()`
- Scan spaces and return the database schema.

Return values: a table with space schemas (see "Schema example")

Get bucket id

`ddl.bucket_id(space_name, sharding_key)`
- Calculate bucket id for a specified space and sharding key.
Method uses sharding function specified in DDL schema.

The method is not transactional, in the sense that it catches up with
`_ddl_sharding_func` changes immediately: it may see changes that are
not committed yet, and may see a state from another transaction
which should not be visible in the current transaction.

Return values: bucket_id if no error, otherwise return `nil, err`

Input data format

format = {
    spaces = {
        [space_name] = {
            engine = 'vinyl' | 'memtx',
            is_local = true | false,
            temporary = true | false,
            format = {
                {
                    name = '...',
                    is_nullable = true | false,
                    type = 'unsigned' | 'string' | 'varbinary' |
                            'integer' | 'number' | 'boolean' |
                            'array' | 'scalar' | 'any' | 'map' |
                            'decimal' | 'double' | 'uuid' | 'datetime' |
                            'interval'
                },
                ...
            },
            indexes = {
                -- array of index parameters
                -- integer keys are used as index.id
                -- index parameters depend on the index type
                {
                    type = 'TREE'|'HASH',
                    name = '...',
                    unique = true|false, -- hash index is always unique
                    parts = {
                        -- array of part parameters
                        {
                            path = field_name.jsonpath,
                            -- may be multipath if '[*]' is used,
                            type = 'unsigned' | 'string' | 'varbinary' |
                                'integer' | 'number' | 'boolean' | 'scalar' |
                                'decimal' | 'double' | 'uuid' | 'datetime',
                            is_nullable = true | false,
                            collation = nil | 'none' |
                                'unicode' | 'unicode_ci' | '...',
                            -- collation must be set, if and only if
                            -- type == 'string'.
                            -- to see full list of collations
                            -- just run box.space._collation:select()
                        }
                    },
                    sequence = '...', -- sequence_name
                    function = '...', -- function_name
                }, {
                    type = 'RTREE',
                    name = '...',
                    unique = false, -- rtree can't be unique
                    parts = {
                        -- array with only one part parameter
                        {
                            path = field_name.jsonpath,
                            type = 'array',
                            -- rtree index must use array field
                            is_nullable = true|false,
                        }
                    },
                    dimension = number,
                    distance = 'euclid'|'manhattan',
                }, {
                    type = 'BITSET',
                    name = '...',
                    unique = false, -- bitset index can't be unique
                    parts = {
                        -- array with only one part parameter
                        {
                            path = field_name.jsonpath,
                            type = 'unsigned' | 'string',
                            -- bitset index doesn't support any other
                            -- field types
                            is_nullable = true|false,
                        }
                    },
                },
                ...
            },
            sharding_key = nil | {
                -- array of strings (field_names)
                --
                -- sharded space must have:
                -- field: {name = 'bucket_id', is_nullable = false, type = 'unsigned'}
                -- index: {
                --     name = 'bucket_id',
                --     type = 'TREE',
                --     unique = false,
                --     parts = {{path = 'bucket_id', is_nullable = false, type = 'unsigned'}}
                -- }
                --
                -- unsharded spaces must NOT have
                -- field and index named 'bucket_id'
            },
            sharding_func = 'dot.notation' | 'sharding_func_name_defined_in_G' |
                            {body = 'function(key) return <...> end'},
        },
        ...
    },
    functions = { -- Not implemented yet
        [function_name] = {
            body = '...',
            is_deterministic = true|false,
            is_sandboxed = true|false,
            is_multikey = true|false,
        },
        ...
    },
    sequences = {
        [sequence_name] = {
            start = start,
            min = min,
            max = max,
            cycle = cycle,
            cache = cache,
            step = step,
        },
    },
}

Schema example

local schema = {
    spaces = {
        customer = {
            engine = 'memtx',
            is_local = false,
            temporary = false,
            format = {
                {name = 'customer_id', is_nullable = false, type = 'unsigned'},
                {name = 'bucket_id', is_nullable = false, type = 'unsigned'},
                {name = 'fullname', is_nullable = false, type = 'string'},
            },
            indexes = {{
                name = 'customer_id',
                type = 'TREE',
                unique = true,
                parts = {
                    {path = 'customer_id', is_nullable = false, type = 'unsigned'}
                }
            }, {
                name = 'bucket_id',
                type = 'TREE',
                unique = false,
                parts = {
                    {path = 'bucket_id', is_nullable = false, type = 'unsigned'}
                }
            }, {
                name = 'fullname',
                type = 'TREE',
                unique = true,
                parts = {
                    {path = 'fullname', is_nullable = false, type = 'string'}
                }
            }},
            sharding_key = {'customer_id'},
        },
        account = {
            engine = 'memtx',
            is_local = false,
            temporary = false,
            format = {
                {name = 'account_id', is_nullable = false, type = 'unsigned'},
                {name = 'customer_id', is_nullable = false, type = 'unsigned'},
                {name = 'bucket_id', is_nullable = false, type = 'unsigned'},
                {name = 'balance', is_nullable = false, type = 'string'},
                {name = 'name', is_nullable = false, type = 'string'},
            },
            indexes = {{
                name = 'account_id',
                type = 'TREE',
                unique = true,
                parts = {
                    {path = 'account_id', is_nullable = false, type = 'unsigned'}
                }
            }, {
                name = 'customer_id',
                type = 'TREE',
                unique = false,
                parts = {
                    {path = 'customer_id', is_nullable = false, type = 'unsigned'}
                }
            }, {
                name = 'bucket_id',
                type = 'TREE',
                unique = false,
                parts = {
                    {path = 'bucket_id', is_nullable = false, type = 'unsigned'}
                }
            }},
            sharding_key = {'customer_id'},
            sharding_func = 'vshard.router.bucket_id_mpcrc32',
        },
        tickets = {
            engine = 'memtx',
            is_local = false,
            temporary = false,
            format = {
                {name = 'ticket_id', is_nullable = false, type = 'unsigned'},
                {name = 'customer_id', is_nullable = false, type = 'unsigned'},
                {name = 'bucket_id', is_nullable = false, type = 'unsigned'},
                {name = 'contents', is_nullable = false, type = 'string'},
            },
            indexes = {{
                name = 'ticket_id',
                type = 'TREE',
                unique = true,
                parts = {
                    {path = 'ticket_id', is_nullable = false, type = 'unsigned'}
                },
                sequence = 'ticket_seq',
            }, {
                name = 'customer_id',
                type = 'TREE',
                unique = false,
                parts = {
                    {path = 'customer_id', is_nullable = false, type = 'unsigned'}
                }
            }, {
                name = 'bucket_id',
                type = 'TREE',
                unique = false,
                parts = {
                    {path = 'bucket_id', is_nullable = false, type = 'unsigned'}
                }
            }},
            sharding_key = {'customer_id'},
            sharding_func = 'vshard.router.bucket_id_mpcrc32',
        },
    },
    sequences = {
        ticket_seq = {
            start = 1,
            min = 1,
            max = 10000,
            cycle = false,
        },
    },
}

Building and testing

tt rocks make
tt rocks install luatest 0.5.7
tt rocks install luacheck 0.25.0
make test -C build.luarocks ARGS="-V"

ddl's People

Contributors

0x501d, akudiyar, ananek, andreyaksenov, curiousgeorgiy, differentialorange, grishnov, gumix, hustonmmmavr, ligurio, mkostoevr, no1seman, olegrok, rosik, slavakirichenko, totktonada, ylobankov


ddl's Issues

ddl.bucket_id() speed up for vshard.router.bucket_id*() functions

We're implementing caching of sharding functions in the scope of #82.

There is a difference in how we cache a sharding function defined by name versus one defined with a full body. The latter case is the good one: we can track changes to the function with the on_replace trigger, so we just store a callable object in the cache. The former (stored name) is tricky: one can replace the implementation of the function (just rawset(_G, 'my_sharding_func', <...>)) and the caching code will not be aware of it. So we store only the name of the function.

As a result, some extra code runs each time to get the callable. It traverses the list of name chunks like {'vshard', 'router', 'bucket_id_mpcrc32'} and returns _G.vshard.router.bucket_id_mpcrc32. In measurements it shows a 1.5x performance drop, see PR #85 for numbers.
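The per-call lookup described above can be sketched in plain Lua as follows. The helper name `resolve_name` and the stub root table are illustrative assumptions; ddl's actual implementation resolves against `_G`.

```lua
-- Resolve a dotted name like 'vshard.router.bucket_id_mpcrc32' against a
-- root table (ddl uses _G), walking one name chunk at a time.
local function resolve_name(root, name)
    local obj = root
    for chunk in string.gmatch(name, '[^%.]+') do
        if type(obj) ~= 'table' then
            return nil -- a non-table encountered mid-path
        end
        obj = obj[chunk]
    end
    return obj
end

-- Stub standing in for _G.vshard:
local root = {
    vshard = {router = {bucket_id_mpcrc32 = function(key) return 1 end}},
}
local func = resolve_name(root, 'vshard.router.bucket_id_mpcrc32')
```

This walk is what runs on every call when the function is stored by name, which is where the measured overhead comes from.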

I see three ways to proceed here:

  1. Look deeper at the difference: maybe we can optimize the traversal code and make the difference negligible.
  2. Implement an exceptions list: function names that are known to be safe for caching (because they never change). Add the vshard functions (vshard.router.bucket_id_mpcrc32, vshard.router.bucket_id_strcrc32, vshard.router.bucket_id) here by default and leave the user the ability to manage the list.
  3. Ignore the difference. It is tiny in absolute numbers: 550ns vs 380ns (according to numbers in PR #85).
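Option 2 could look like the following sketch. The table and the helper name are hypothetical, not an existing ddl API.

```lua
-- Hypothetical whitelist of function names that are safe to cache because
-- their implementations never change.
local cache_safe_names = {
    ['vshard.router.bucket_id'] = true,
    ['vshard.router.bucket_id_strcrc32'] = true,
    ['vshard.router.bucket_id_mpcrc32'] = true,
}

local function is_cache_safe(name)
    return cache_safe_names[name] == true
end
```

On a cache hit for a safe name, the resolved callable could be stored directly, skipping the per-call traversal.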

Unsupported fieldno index part

Now it crashes migrations with _tcf space

reproducer:

localhost:3301> ddl.check_schema(ddl.get_schema())
---
- true
...

localhost:3301> tmp = box.schema.create_space('tmp', {if_not_exists = true}); tmp:create_index('pri')
---
...

localhost:3301> ddl.check_schema(ddl.get_schema())
---
- null
- 'spaces["tmp"].indexes["pri"].parts[1].path: bad value (string expected, got number)'
...

Allow custom fields in space:format

I can use additional fields in space:format with some metadata. For example:

localhost:3302> box.space.my_space:format({{name = 'a', type = 'string' }, {name = 'b', type = 'string', comment = 'awesome comment'}})
localhost:3302> box.space.my_space:format()
---
- [{'name': 'a', 'type': 'string'}, {'type': 'string', 'comment': 'awesome comment',
    'name': 'b'}]
...

But when I try to add this field in Cartridge DDL:

spaces:
  my_space:
    format:
      - is_nullable: false
        type: unsigned
        name: bucket_id
        comment: blabla

I see the error: spaces["my_space"].format["bucket_id"]: redundant key "comment"

Use cache for database schema in get_schema()

CRUD will use the DDL schema to obtain a sharding key via get_schema(). However, on every get_schema() call the DDL module scans spaces to build the current database schema. We need to cache the database schema.

Release 1.0.0

The most significant logic seems to be OK, so it's time to release the first major version. Any further changes will be handled according to semantic versioning rules.

And don't forget to add a changelog.

Types map, array and JSON paths are not supported in DDL schema

Types map, array, and JSON paths are not supported in the DDL schema.
We need to describe this in the DDL documentation.

https://github.com/tarantool/ddl/blob/master/ddl/check.lua#L720-L746

    -- check sharding_key path is valid
    for _, path_to_field in ipairs(space.sharding_key) do
        local path = get_path_info(path_to_field)
        if path.type ~= 'regular' then
            return nil, string.format(
                "spaces[%q].sharding_key[%q]: key containing JSONPath isn't supported yet",
                space.name, path_to_field
            )
        end

        local field = space.fields[path.field_name]
        if not field then
            return nil, string.format(
                "spaces[%q].sharding_key[%q]: invalid reference to format[%q], no such field",
                space.name, path_to_field, path.field_name
            )
        end

        if field.type == 'map' or field.type == 'array' then
            return nil, string.format(
                "spaces[%q].sharding_key[%q]: key references to field " ..
                "with %s type, but it's not supported yet",
                space.name, path_to_field, field.type
            )
        end
    end
    return true
end

Wrong assumption for sharding_index name

The vshard configuration has an option to specify a name or id of the sharding index; by default its value is 'bucket_id'. It's up to the user to change it or keep the default value [1].

The source code of ddl/check.lua has a wrong assumption about the sharding index name:

  1. https://www.tarantool.io/en/doc/latest/reference/reference_rock/vshard/vshard_ref/#confval-shard_index

Scalar types at index parts

Tarantool allows creating indexes with scalar part types that reference fields with types like string, integer or unsigned, but the current ddl version raises an error for this. Should we change this behaviour?

Add sharding_key to ddl

In order to understand how a space is sharded (if at all), I suggest adding a sharding_key field to the DDL spec. It should be a list of strings that are JSON paths to the fields used to calculate the sharding key.

If sharding_key is present, there should also be a bucket_id field in that space.

Add metadata info into DDL

As of now, different Tarantool modules and/or applications use their own mechanisms for manipulating and persisting space metadata. It seems it's time to make a single source of truth for the following metadata:

  1. Human readable space name - (string). The main difference from box.space[name] is that human readable space name may contain spaces and even a short description of space contents.
  2. Space description - (string) - Short but exact description of space in a few sentences.
  3. Field description - (string) - Short but exact description of each field of the space in a few sentences.
  4. Field validation rules - (table) - a number of rules to validate space field input data.
  5. Index description - (string) - Short but exact description of index in a few sentences.

Check input parameters and output values for sharding function

In #71 we will add a section with the sharding function to the DDL schema, and in #76 we will add a method that calculates the bucket id using the sharding function specified in the DDL schema.
I think we need to check the sharding function specified in the DDL schema for its input arguments and make sure that the function returns a number. Otherwise sharding will not work.

Regression tests contain select(nil)

When luatest is run with -c (no capture), the logs contain many warnings about using select() without limits:

2022-07-12 21:10:14.225 [6437] main/103/luatest I> set 'log_format' configuration option to "plain"
2022-07-12 21:10:14.231 [6437] main/103/luatest C> Potentially long select from space '_ddl_sharding_key' (512)                     stack traceback:                                                                                                                  
        builtin/box/schema.lua:2382: in function 'check_select_safety'
        builtin/box/schema.lua:2396: in function 'select'                                                                          
        ...geyb/sources/MRG/ddl/test/set_sharding_metadata_test.lua:180: in function 'method'                   
        ...ources/MRG/ddl/.rocks/share/tarantool/luatest/runner.lua:348: in function <...ources/MRG/ddl/.rocks/share/tarantool/luatest/runner.lua:347>    

Version: ac006ff (tag 1.6.1)

Add bucket_id calculation function

After #71 we'll have all the information necessary for calculating bucket_id from a sharding key. The proposal is to implement this helper function in the module.

Example

Schema:

TBD

Assume we serve HTTP requests:

function handler(req)
    local bucket_id = ddl.bucket_id('users', req:json()['user_id'])
    local ok, err = vshard.router.callrw(bucket_id, <..args..>)
    <...>
end

The key point is that ddl knows which sharding function is in use. Without the function we would do the following:

function handler(req)
    local bucket_id = vshard.router.bucket_id_mpcrc32(req:json()['user_id']) -- !!
    local ok, err = vshard.router.callrw(bucket_id, <..args..>)
    <...>
end

Caching of the sharding keys

We have a cache for the sharding function: #85.
Logically, we can also cache the sharding keys and get rid of the get_metadata() function.

box.space compatible API

As the DDL module is a high-level interface for compatibility across different versions of Tarantool, it must have an API maximally compatible with the structure of box.space*. Currently, ddl.get_schema() has the following problems (the list may not be complete):

  1. ddl.get_schema().spaces[] whereas box.space[]
  2. ddl.get_schema().spaces[].indexes whereas box.space[].index
  3. ddl.get_schema().spaces[].indexes doesn't contain an id field
  4. ddl.get_schema().spaces[] doesn't contain a ck_constraint description

extend sharding metadata

As of now, the ddl module creates the _ddl_sharding_key space, which contains only sharding key fields, but that's not enough for automatic data manipulation, for example by the crud module. We also need to add the following data:

  1. array of space fields which is used in sharding function to calculate sharding_key.
  2. ability to store custom sharding function.

[doc] clarify sharding_key parameter

In the readme there's an example of a space definition with this code:

sharding_key = {'customer_id'},

However, it's not clear at all what this parameter does and how it relates to bucket_id.
Please provide more information about the parameter.

Is this definition correct for a space that also defines a bucket_id field?
When is it necessary to redefine the sharding key?

ddl.check_schema fails, when run inside tarantool console

Tarantool Enterprise For Mac 2.2.1-13-g8bfcb4422
type 'help' for interactive help
tarantool> require('ddl').check_schema({})
---
- null
- '...kspace/channelcontrol/scp/.rocks/share/tarantool/ddl.lua:10: attempt to concatenate
  local ''caller_name'' (a nil value)'
...

cache: hot reload techniques support

The cache in the DDL module does not respect module hot reloads.
This entails on_replace trigger duplication if a hot reload occurs:

tarantool> box.space._ddl_sharding_func:on_replace()
---
- - 'function: 0x41d07030'
...
tarantool> for k in pairs(package.loaded) do if k:startswith('ddl') then package.loaded[k]=nil end end
---
...
tarantool> ddl = require('ddl')
---
...
tarantool> ddl.bucket_id('users', 42)
---
- 42
...
tarantool> box.space._ddl_sharding_func:on_replace()
---
- - 'function: 0x41d07030'
  - 'function: 0x41dccaf8'
...

There are some ideas on how to keep state between reloads in Consider protecting the default instance from well known hot reload implementations conf#2 (at the end of the issue description).

Wrong sharding function check?

ddl/ddl/check.lua

Lines 745 to 764 in c88c3ea

local function check_sharding_func_name(sharding_func_name)
    -- split sharding func name in dot notation by dot
    -- foo.bar.baz -> chunks: foo bar baz
    -- foo -> chunks: foo
    local chunks = string.split(sharding_func_name, '.')

    -- check that each chunk is an identifier
    for _, chunk in pairs(chunks) do
        if not check_name_isident(chunk) then
            return false
        end
    end

    local sharding_func = rawget(_G, sharding_func_name)
    if sharding_func == nil then
        return false
    end

    return is_callable(sharding_func)
end

Say, for 'foo.bar' we'll check _G['foo.bar'], not _G.foo.bar.
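The mismatch can be demonstrated in plain Lua; the stub table `G` stands in for `_G`:

```lua
-- Stub standing in for _G with a nested callable at foo.bar.
local G = {foo = {bar = function() return 1 end}}

-- What the check does: one rawget with the whole dotted string as the key.
-- There is no key literally named 'foo.bar', so this yields nil.
local by_rawget = rawget(G, 'foo.bar')

-- What it should do: walk the chunks one level at a time.
local by_walk = G
for chunk in string.gmatch('foo.bar', '[^%.]+') do
    by_walk = by_walk[chunk]
end
```

So the check rejects a perfectly valid dotted name that the traversal would resolve.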

BTW, shouldn't we use strict-mode-checked access instead of rawget(_G, <...>)? (not within this issue).

check_schema: is_nullable should be optional parameter

From tarantool space:format() documentation:

the optional is_nullable value may be either true or false

tarantool> (function()
         >     local res, err = require('ddl').check_schema({
         >         spaces = {
         >             processes = {
         >                 engine = 'memtx',
         >                 is_local = false,
         >                 temporary = false,
         >                 format = {
         >                     { name = 'foo', type = 'string' }
         >                 },
         >                 indexes = {},
         >             }
         >         }})
         >     return res, err
         > end)()
---
- null
- 'space["processes"].fields["foo"]: bad argument ''is_nullable'' (boolean expected,
  got nil)'
...

Please make it consistent with the Tarantool space format.

ddl.get_schema should return space IDs

It doesn't pair with retrieving metadata from system spaces and is required by connectors (e.g. cartridge-java).

This relates to the index metadata too, of course.

Migration with invalid schema can be applied.

An invalid migration leaves the cluster in an inconsistent state when using
the 'migrations' module in a cartridge application.

Steps to reproduce:

  1. Create cartridge application with migrations module and with 2 or more storages and bootstrap it.
  2. Create a migration file in application dir 'migrations/01_test.lua' with the following contents:
    return {
        up = function()
            local utils = require('migrator.utils')
            local f = box.schema.create_space('test', {
                format = {
                    { name = 'bucket_id', type = 'integer', is_nullable = false },
                    { name = 'ID', type = 'string', is_nullable = false },
                },
                if_not_exists = true,
            })
            f:create_index('test_reference_pk', {
                parts = { 'ID'},
                if_not_exists = true,
            })
            f:create_index('bucket_id', {
                parts = { 'bucket_id' },
                if_not_exists = true,
                unique = false
            })
            utils.register_sharding_key('test', {'bucket_id'})
            return true
        end
    }
  3. Apply migrations with the command
    curl -X POST http://localhost:8081/migrations/up
  4. The POST request returns an error, so the cluster configuration should not be committed. Error text:
 CheckSchemaError: spaces["test"].format["bucket_id"].type: bad value (unsigned expected, got integer)
  5. Connect to a storage instance and observe the 'test' space, which should not have been created.

Cluster logs:

    test.router | 2020-11-16 15:45:41.447 [25742] main/188/http/127.0.0.1:57362 I> Migrations to be applied: ["01_test.lua"]
    test.router | 2020-11-16 15:45:41.448 [25742] main/188/http/127.0.0.1:57362 I> Preparing to run migrations on localhost:3301
    test.router | 2020-11-16 15:45:41.448 [25742] main/188/http/127.0.0.1:57362 I> Preparing to run migrations on localhost:3302
    test.router | 2020-11-16 15:45:41.453 [25742] main/188/http/127.0.0.1:57362 twophase.lua:199 W> Updating config clusterwide...
    test.router | 2020-11-16 15:45:41.454 [25742] main/188/http/127.0.0.1:57362 twophase.lua:300 W> (2PC) Preparation stage...
    test.s1-master | 2020-11-16 15:45:41.454 [25751] main/118/main twophase.lua:54 W> CheckSchemaError: spaces["test"].format["bucket_id"].type: bad value (unsigned expected, got integer)
    test.s1-master | stack traceback:
    test.s1-master |        ...st/test/.rocks/share/tarantool/cartridge/ddl-manager.lua:114: in function 'validate_config'
    test.s1-master |        ...st/test/.rocks/share/tarantool/cartridge/confapplier.lua:205: in function 'validate_config'
    test.s1-master |        ...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:52: in function <...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:49>
    test.s1-master |        [C]: in function 'xpcall'
    test.s1-master |        .../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:145: in function <.../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:139>
    test.s1-master |        [C]: at 0x00608f60
    test.router | 2020-11-16 15:45:41.454 [25742] main/191/main twophase.lua:54 W> CheckSchemaError: spaces["test"].format["bucket_id"].type: bad value (unsigned expected, got integer)
    test.router | stack traceback:
    test.router |   ...st/test/.rocks/share/tarantool/cartridge/ddl-manager.lua:114: in function 'validate_config'
    test.router |   ...st/test/.rocks/share/tarantool/cartridge/confapplier.lua:205: in function 'validate_config'
    test.router |   ...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:52: in function <...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:49>
    test.router |   [C]: in function 'xpcall'
    test.router |   .../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:145: in function <.../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:139>
    test.router |   [C]: at 0x00608f60
    test.router | 2020-11-16 15:45:41.455 [25742] main/188/http/127.0.0.1:57362 twophase.lua:319 E> Error preparing for config update at localhost:3301:
    test.router | CheckSchemaError: spaces["test"].format["bucket_id"].type: bad value (unsigned expected, got integer)
    test.router | stack traceback:
    test.router |   ...st/test/.rocks/share/tarantool/cartridge/ddl-manager.lua:114: in function 'validate_config'
    test.router |   ...st/test/.rocks/share/tarantool/cartridge/confapplier.lua:205: in function 'validate_config'
    test.router |   ...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:52: in function <...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:49>
    test.router |   [C]: in function 'xpcall'
    test.router |   .../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:145: in function <.../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:139>
    test.router |   [C]: at 0x00608f60
    test.router | during net.box call to localhost:3301, function "_G.__cartridge_clusterwide_config_prepare_2pc"
    test.router | stack traceback:
    test.router |   ...t/ddltest/test/.rocks/share/tarantool/cartridge/pool.lua:151: in function <...t/ddltest/test/.rocks/share/tarantool/cartridge/pool.lua:141>
    test.router | 2020-11-16 15:45:41.455 [25742] main/188/http/127.0.0.1:57362 twophase.lua:319 E> Error preparing for config update at localhost:3302:
    test.router | CheckSchemaError: spaces["test"].format["bucket_id"].type: bad value (unsigned expected, got integer)
    test.router | stack traceback:
    test.router |   ...st/test/.rocks/share/tarantool/cartridge/ddl-manager.lua:114: in function 'validate_config'
    test.router |   ...st/test/.rocks/share/tarantool/cartridge/confapplier.lua:205: in function 'validate_config'
    test.router |   ...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:52: in function <...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:49>
    test.router |   [C]: in function 'xpcall'
    test.router |   .../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:145: in function <.../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:139>
    test.router |   [C]: at 0x00608f60
    test.router | during net.box call to localhost:3302, function "_G.__cartridge_clusterwide_config_prepare_2pc"
    test.router | stack traceback:
    test.router |   ...t/ddltest/test/.rocks/share/tarantool/cartridge/pool.lua:151: in function <...t/ddltest/test/.rocks/share/tarantool/cartridge/pool.lua:141>
    test.router | 2020-11-16 15:45:41.455 [25742] main/188/http/127.0.0.1:57362 twophase.lua:359 W> (2PC) Abort stage...
    test.router | 2020-11-16 15:45:41.455 [25742] main/188/http/127.0.0.1:57362 twophase.lua:383 E> Clusterwide config update failed
    test.router | 2020-11-16 15:45:41.455 [25742] main/188/http/127.0.0.1:57362 server.lua:745 E> unhandled error: CheckSchemaError: spaces["test"].format["bucket_id"].type: bad value (unsigned expected, got integer)
    test.router | stack traceback:
    test.router |   ...st/test/.rocks/share/tarantool/cartridge/ddl-manager.lua:114: in function 'validate_config'
    test.router |   ...st/test/.rocks/share/tarantool/cartridge/confapplier.lua:205: in function 'validate_config'
    test.router |   ...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:52: in function <...ltest/test/.rocks/share/tarantool/cartridge/twophase.lua:49>
    test.router |   [C]: in function 'xpcall'
    test.router |   .../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:145: in function <.../work/git/ddltest/test/.rocks/share/tarantool/errors.lua:139>
    test.router |   [C]: at 0x00608f60
    test.router | during net.box call to localhost:3302, function "_G.__cartridge_clusterwide_config_prepare_2pc"
    test.router | stack traceback:
    test.router |   ...t/ddltest/test/.rocks/share/tarantool/cartridge/pool.lua:151: in function <...t/ddltest/test/.rocks/share/tarantool/cartridge/pool.lua:141>
    test.router | stack traceback:
    test.router |   .../git/ddltest/test/.rocks/share/tarantool/http/server.lua:743: in function 'process_client'
    test.router |   .../git/ddltest/test/.rocks/share/tarantool/http/server.lua:1199: in function <.../git/ddltest/test/.rocks/share/tarantool/http/server.lua:1198>
    test.router |   [C]: in function 'pcall'
    test.router |   builtin/socket.lua:1081: in function <builtin/socket.lua:1079>
    test.router | request:
    test.router | POST /migrations/up? HTTP/1.1
    test.router | Host: localhost:8081
    test.router | Accept: */*
    test.router | User-agent: curl/7.68.0
    test.router |
```

ddl.get_schema must return field name in index parts

Currently there is no easy way to determine the field number or the field name from the returned "path" field: it either requires sophisticated logic (splitting the value by dot, etc.) or is completely impossible for multikey indexes.

This is needed for connectors (e.g. cartridge-java).
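To illustrate the problem, here is a naive helper (hypothetical, not part of the module) that tries to recover the top-level field name from a `path` value:

```lua
-- Hypothetical helper: recover the top-level field name from an index
-- part's `path` value by taking everything before the first '.' or '['.
local function field_name_from_path(path)
    return path:match('^([^%.%[]+)')
end

print(field_name_from_path('key'))          -- plain field name
print(field_name_from_path('data.nested'))  -- JSON path: top field only
print(field_name_from_path('data[*].tag'))  -- multikey: the leaf is lost
```

Even this only recovers the top-level field name, and never the field number.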

Fail to set schema with vshard sharding function

ddl.set_schema(schema) fails when the schema contains a vshard sharding function, regardless of whether the vshard module was required on the DDL storage. The problem is that a sharding function given in dot notation is validated by looking up each dot-separated segment in the global environment _G. In the vshard case the `router` segment (a field of the `vshard` table) is an empty table, because we store schema info on storages. So the `bucket_id_strcrc32` field resolves to nil in 'vshard.router.bucket_id_strcrc32'.
My proposal is to treat vshard sharding functions as a separate case.
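For reference, the dot-notation check described above amounts to something like this sketch (the function name is mine, not the module's):

```lua
-- Sketch of dot-notation resolution: each dot-separated segment of the
-- sharding function name is looked up in _G step by step.
local function resolve_in_G(name)
    local obj = _G
    for segment in name:gmatch('[^%.]+') do
        if type(obj) ~= 'table' then
            return nil
        end
        obj = obj[segment]
    end
    return obj
end

-- On a storage, vshard.router is an empty table, so the lookup fails:
_G.vshard = { router = {} }
assert(resolve_in_G('vshard.router.bucket_id_strcrc32') == nil)
```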

Possible bug

Hello. I looked at the code and found a place where a mistake is possible.
I think this code can't create an index with a sequence (note the misspelled `sequence` key).
What do you think?

seuence = sequence_name,

Expanding schema by adding nullable field

The ddl module is responsible for applying the schema on the cluster.
This module does not allow modifying the schema after it has been applied.

In Tarantool, there are two types of schema migration that do not require data migration:

  • adding a field to the end of a space;
  • creating an index.

To make it easier to write migrations, we should add the ability to
expand the schema by adding a nullable field to the end of a space.

Example of migrations:

Migration 1:

return {
    up = function()
        local ddl = require('ddl')
        local schema = {
            spaces = {
                keyval = {
                    engine = 'memtx',
                    is_local = false,
                    temporary = false,
                    format = {
                        { name = 'bucket_id', type = 'unsigned', is_nullable = false },
                        { name = 'key', type = 'string', is_nullable = false },
                        { name = 'value', type = 'string', is_nullable = false }
                    },
                    indexes = {{
                        name = 'pk',
                        type = 'TREE',
                        unique = true,
                        parts = {
                            { path = 'key', type = 'string', is_nullable = false }
                        }
                    }, {
                        name = 'bucket_id',
                        type = 'TREE',
                        unique = false,
                        parts = {
                            { path = 'bucket_id', type = 'unsigned', is_nullable = false }
                        }
                    }},
                    sharding_key = { 'key' }
                }
            }
        }

        assert(ddl.check_schema(schema))
        ddl.set_schema(schema)
    end
}

Migration 2:

return {
    up = function()
        local ddl = require('ddl')
        local schema = {
            spaces = {
                keyval = {
                    engine = 'memtx',
                    is_local = false,
                    temporary = false,
                    format = {
                        { name = 'bucket_id', type = 'unsigned', is_nullable = false },
                        { name = 'key', type = 'string', is_nullable = false },
                        { name = 'value', type = 'string', is_nullable = false },
                        { name = 'state', type = 'map', is_nullable = true }
                    },
                    indexes = {{
                        name = 'pk',
                        type = 'TREE',
                        unique = true,
                        parts = {
                            { path = 'key', type = 'string', is_nullable = false }
                        }
                    }, {
                        name = 'bucket_id',
                        type = 'TREE',
                        unique = false,
                        parts = {
                            { path = 'bucket_id', type = 'unsigned', is_nullable = false }
                        }
                    }},
                    sharding_key = { 'key' }
                }
            }
        }

        assert(ddl.check_schema(schema))
        ddl.set_schema(schema)
    end
}

Use transactional ddl when possible

Nowadays the DDL module implements check_schema by creating a _ddl_dummy space and then dropping it. This approach has several drawbacks:

  • It makes unnecessary transactions.
  • It can't be applied on replicas (even writable ones), because it produces replication conflicts.

Let's wrap check_schema in box.begin() / box.rollback() and ddl.set_schema in box.begin() / box.commit().

N.B. Tarantool 1.10 doesn't have transactional DDL; this will only work in 2.x.
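A minimal sketch of the proposal (assuming Tarantool 2.x with transactional DDL; `apply_schema` stands in for the module's internal apply logic and is hypothetical):

```lua
-- Validation: apply the schema inside a transaction, then roll it back,
-- so no _ddl_dummy space and no replication traffic are needed.
local function check_schema_tx(schema)
    box.begin()
    local ok, err = pcall(apply_schema, schema)
    box.rollback() -- always roll back: we only wanted the validation
    if not ok then
        return nil, err
    end
    return true
end

-- Application: the same apply logic, committed atomically.
local function set_schema_tx(schema)
    box.begin()
    local ok, err = pcall(apply_schema, schema)
    if not ok then
        box.rollback()
        return nil, err
    end
    box.commit()
    return true
end
```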

Caching of sharding function

(There is a more general task about caching, #69, but I want to eat this pie piece by piece.)

We're going to add a ddl.bucket_id() function (see #76). The function may be called quite frequently, so it's worth taking care of its performance.

The ddl.bucket_id() function needs to know the sharding function. It is costly to obtain the function declaration/definition stored in the _ddl_sharding_func space, mainly due to these actions:

  1. MsgPack decoding.
  2. loadstring() if the function is declared as code ({body = <...>}).
  3. Extra Lua GC pressure from re-creation of usually identical objects.

Ideally, obtaining the function should be just a Lua table lookup. And it is possible to achieve.

The only way to track _ddl_sharding_func changes is to set an on_replace trigger on the space1. Since it is not always possible to set a trigger when the module is just loaded2, I propose a trick.

The key idea is to generate an initial cache value and set the trigger when we access the sharding function information for the first time. After that the cache is updated 'in the background' (by the trigger) and we can just read it.
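A rough sketch of this idea (all names are hypothetical; error handling and fiber locking are omitted):

```lua
local cache -- space name -> sharding function info; nil until first access

-- Rebuild the whole cache from the _ddl_sharding_func space.
local function rebuild_cache()
    cache = {}
    for _, tuple in box.space._ddl_sharding_func:pairs() do
        -- loadstring() for {body = <...>} definitions would go here
        cache[tuple.space_name] = tuple
    end
end

local function get_sharding_func(space_name)
    if cache == nil then
        -- first access: build the cache and subscribe to further changes
        rebuild_cache()
        box.space._ddl_sharding_func:on_replace(rebuild_cache)
    end
    return cache[space_name]
end
```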

What to consider:

  • Take care of access from different fibers (see the locked() function and the synchronized module for examples).
  • But implement this locking in a way that works well with hot reload (including cartridge's, which cleans up all globals).
  • How do we remove the old trigger if our code was unloaded (especially if _G was cleaned up by cartridge)? There are some ideas on keeping state between reloads in https://github.com/tarantool/conf/issues/2 (at the end of the issue description). A Lua/C module may also use the Lua registry, but that is not our case.
  • Don't use the cache inside a transaction? That's slow, but how else can we ensure that the cache corresponds to the state visible in the transaction?
  • Perform loadstring() once when the function is defined as code (to avoid doing it on each call).
  • But don't cache the function itself if it is defined by name: a user may replace it. Maybe just parse 'dot.notation' into something like {'dot', 'notation'}.

Optimization trick:

We can use a trick with two implementations of the cache access function (see src/box/lua/load_cfg.lua in tarantool for an example). The first function does all the work: it checks whether the trigger is set (and the initial cache is generated), sets the trigger and generates the cache if necessary, accesses the cache, and replaces itself with the second function. The second function skips the extra checks and just accesses the cache.

I'll note that the first function must not set the trigger unconditionally, because it may be called after a hot reload. See tarantool/tarantool#5826.
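In isolation (Tarantool specifics stripped, all names hypothetical), the self-replacement trick looks like this:

```lua
local M = {}
local cache

-- Fast path: the cache is known to exist, just read it.
local function fast_get(key)
    return cache[key]
end

-- Slow path: build the cache on first use, then replace itself.
local function slow_get(key)
    if cache == nil then
        cache = { answer = 42 } -- expensive initialization goes here
    end
    M.get = fast_get -- subsequent callers skip the checks
    return cache[key]
end

M.get = slow_get
```

Callers always go through `M.get`, so after the first call they hit `fast_get` directly.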


Looks a bit tricky, but doable. Opinions?

Footnotes

  1. I filed https://github.com/tarantool/tarantool/issues/6544 and https://github.com/tarantool/tarantool/issues/6545 to track this using the database schema version in the future. I think it may simplify some future code.

  2. box may be unconfigured (or not fully loaded) when the module is required for the first time.

Collect test coverage

It's not usually critical, but the DDL module has tons of conditions. It'd be nice to check that the tests cover them all.

Explain how the module is intended to be used

I guess the module simplifies keeping a consistent database schema on sharded storages, but there is nothing about usage in the README. It would be good to add some description and maybe even a usage example.

I guess it is used internally in cartridge, but since the module has been extracted and tends to be general (not specific to the cartridge framework), it should have a clear usage scenario.
