
datajoint-matlab's People

Contributors

a-baji, aecker, alvalunasan, austin-hilberg, cbroz1, dimitri-yatsenko, ernaldis, eywalker, fabiansinz, guzman-raphael, ixcat, jaeronga, jakereimer, kabilar, mahos, mjlm, saumilpatel, synicix, yambottle, zitrosolrac

datajoint-matlab's Issues

support of more fancy MATLAB objects

Hi,

I've been looking at DataJoint for a few days since my lab needs a data management solution. It seems that DataJoint is tightly coupled with MATLAB. I have two questions:

  1. In case my program generates very large files (say, using DataJoint to automate spike sorting of NEV files), and there is no practical way or reason to store them in MySQL, is there any other solution? I wonder if I can let DataJoint help me scp my files to some server.
  2. It seems that mym has its own way of serializing and deserializing MATLAB objects. I wonder how it handles fancier MATLAB objects, such as table, which is similar to a data frame in R and very useful for my research.

Thanks.

perform joins on primary key attributes only

DataJoint only supports natural joins. In the vast majority of cases, the joins are performed on the primary key of one of the operands.

Starting in version 2.9.0, I propose to designate the * (mtimes) operator as the primary join. The primary join performs the natural join using the common attributes that are included in the primary key of at least one of the operands. If a dependent attribute shares the same name in both relations, an error will be issued prompting the user to project out or rename the offending attribute.

This will allow users to have generic dependent attributes with identical names, such as 'comment' or 'timestamp', without causing hard-to-detect bugs.

The operator .* (times) will be turned into the full join, which is the regular natural join on all identically named attributes, as implemented by * now.

Warning to old DataJoint users: .* was used as the semijoin operator in DataJoint 1. Since DataJoint 2.0, the semijoin has been implemented as the & operator, but .* remained for backward compatibility. Several recent versions have shown an error message when .* is used, to allow reclaiming this operator for other purposes.

Explanation

Dependencies between data are implemented as foreign keys referencing primary keys of parent tables. In most cases, joins are performed between directly or indirectly dependent relations. The join is always performed on primary key attributes. This change will help avoid unintended inclusions of dependent attributes in the join.

Cases where a join needs to be performed on purely dependent attributes fall outside the design patterns prescribed by DataJoint. I cannot think of a case when such joins might be necessary. To still allow them, we implement the full natural join, which joins on all identically named attributes, as shown in the sketch below.
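
A sketch of the proposed semantics (class and attribute names are illustrative):

% ephys.Session and ephys.Unit share the primary key attribute session_id;
% both also carry a dependent attribute named `comment`.

ephys.Session * pro(ephys.Unit, '*', 'comment->unit_comment')
    % primary join: matches on session_id only; the clashing dependent attribute
    % must first be renamed or projected out, otherwise an error is raised

ephys.Session .* ephys.Unit
    % full join: the regular natural join on all identically named attributes,
    % i.e. session_id and comment (the current behavior of *)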

Computed (but not stored) fields

Apologies if this feature, or a reason for its absence, is documented somewhere - I couldn't find it.

For some large fields, you might consider a trade-off between storage space and the time to regenerate the value on demand. For example, with large images generated from a random seed, you might store only the seed in the DB and reconstruct the image each time it is needed during analysis. Of course, with DJ you can still do this by keeping the reconstruction code separate and storing only the seed, but it would be semantically "nice" to be able to fetch() a field from a table that is recomputed each time.
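
A small sketch of the workaround mentioned above, storing only the seed and regenerating the image at use time (the table, attribute, and image size are hypothetical):

% Hypothetical table stim.RandomImage with a `seed` attribute; the blob itself is never stored.
key = struct('stim_id', 1);
s = fetch(stim.RandomImage & key, 'seed');
rng(s.seed)                      % re-seed the generator
img = rand(512, 512);            % stand-in for the real reconstruction code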

aggregation should use the left join

Presently, aggregation uses the natural join.

For example if table A contains these entries:

subject_id 
----------
         1
         2
         3

and table B contains these entries

subject_id   cell_id
---------- ---------
         1         1
         1         3
         1         5
         2         2

Then

>> A.pro(B, 'count(*)->n', 'max(cell_id)->m')

subject_id      n      m
---------- ------ ------
         1      3      5
         2      1      2

After this fix, the output will become

>> A.pro(B, 'count(*)->n', 'max(cell_id)->m')

subject_id      n      m
---------- ------ ------
         1      3      5
         2      1      2
         3      0      NULL

The NULL will become a NaN upon fetching.

The latter behavior is more desirable since the projection promises not to change the cardinality of the original relation.

Here is more information on how aggregation functions handle NULLs, which is what the left join will fill in place of the missing values: https://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html

Add non-interactive options for interactive tools like `dj.new` and `dj.createSchema`

Interactive tools like dj.new and dj.createSchema offer convenient ways of creating new schemas and tables. However, there are no non-interactive equivalents, so there is no easy way to create a schema or table from an unattended script. It would be nice either to offer non-interactive modes for these functions or to come up with non-interactive alternatives altogether.
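
For illustration only, a non-interactive mode could accept everything as arguments; these signatures are purely hypothetical and do not exist today:

% Hypothetical, non-existent signatures -- shown only to illustrate the request.
dj.createSchema('lab', '/repo/schemas/+lab', 'lab_database')  % package, folder, database name
dj.new('lab.Session', 'manual')                               % class name and tier, no prompts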

Difficulties clearing classes

I started having problems when issuing clear classes:

>> hr = acq.HammerRecordings;
>> clear classes
Warning: Objects of 'acq.Events' class exist.  Cannot clear this class or any of its super-classes. 
Warning: Objects of 'acq.NeuroTimebases' class exist.  Cannot clear this class or any of its super-classes. 
Warning: Objects of 'dj.Table' class exist.  Cannot clear this class or any of its super-classes. 
>> whos
>> clear classes
Warning: Objects of 'acq.Events' class exist.  Cannot clear this class or any of its super-classes. 
Warning: Objects of 'acq.NeuroTimebases' class exist.  Cannot clear this class or any of its super-classes. 
Warning: Objects of 'dj.Table' class exist.  Cannot clear this class or any of its super-classes. 

There are no persistent variables in use and I do not remember encountering this issue a few versions back. I am concerned that we accidentally created circular references across RelVars, and Table objects. In particular, the changes to get.parents(), get.references(), get.children(), get.referencing() and get.descendants() could be responsible.

I'll bisect our commit history to figure out when this got introduced. In the meantime: does anybody else encounter this behavior? Just bring a few BaseRelvars into your workspace, preferably ones that reference a lot of other tables and are referenced by other tables, and then try clear classes.

Insert should execute in one query

Connecting to remote servers such as Amazon RDS incurs a 20-50 ms network latency. Therefore, it proved useful to execute inserts of multiple tuples as a single query rather than a sequence of separate insert queries.

This was already implemented on the python side in datajoint/datajoint-python#194
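
A simplified sketch of assembling one multi-row INSERT from a struct array of tuples (quoting, NULLs, and blob serialization are omitted; table and attribute names are illustrative):

% One statement for the whole tuple array instead of one statement per tuple.
tuples = struct('subject_id', {1, 2, 3}, 'species', {'"mouse"', '"mouse"', '"rat"'});
fields = fieldnames(tuples)';
rows = arrayfun(@(t) sprintf('(%s)', strjoin( ...
    cellfun(@(f) num2str(t.(f)), fields, 'UniformOutput', false), ',')), ...
    tuples, 'UniformOutput', false);
sql = sprintf('INSERT INTO `lab`.`subjects` (`%s`) VALUES %s', ...
    strjoin(fields, '`,`'), strjoin(rows, ','));
% conn.query(sql)   % a single round trip to the server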

create databases automatically

Upon the first invocation of a schema, if the database does not exist, create it automatically after a user prompt.

Also, perhaps, upon dropping the last table from a database, drop the database too.

Trigger `dj.createSchema` from `dj.new`

Currently, when you want to create a new table in a new schema, it is a two-step process: you have to call dj.createSchema first to create the schema (with the appropriate getSchema.m), and then you can call dj.new to create a new table in the existing schema. If you try to use dj.new on a non-existent schema, an error is thrown.

Although having a separate dj.createSchema makes sense and should be maintained, it would make sense for dj.new to trigger dj.createSchema if the user asks to create a table in a non-existent schema. This would greatly simplify the table creation process.

Formalize the Master-Part relationship.

On the Python side, we have formalized the concept of a master-part relationship. Let's mirror this formalism in the MATLAB implementation.

In a master-part relationship, there is a master table and one or more part tables. We have formerly referred to part tables as "subtables". Tuples in the master table always appear together with all their matching tuples in the part tables. This is accomplished by disabling independent populate and delete methods of the part tables. The part tables must be populated from the master table's makeTuples call as part of one transaction. The master-part relationship enables dependencies on the master tuples and their matching parts, together.

In Python, part tables subclass dj.Part and are declared as nested classes of their master. A part's internal table name has the form master_table__part_table with double underscores separating the master table's name from the part table's name.

MATLAB does not allow nested classes. Therefore, I propose the following solution that will keep the implementation compatible across the two languages:

  1. The tier of a part table will be indicated as part of schema.MasterClass instead of the conventional manual, lookup, imported, or computed.
  2. Just like in Python, the table name will be master_table__part_table, where part_table is derived from the part class name. To avoid name collisions between identically named part tables of different master tables, the part table class can be prefixed with the master table class; the prefix will be stripped internally.

Example (Python):

@schema
class Geometry(dj.Imported):
    definition = """  # probe geometry
        -> Design
        ----
        timestamp=CURRENT_TIMESTAMP : timestamp  # automatic timestamp
        """

    class EmitterLocation(dj.Part):
        definition = """  # 3D positions of emitters
            -> Geometry
            emitter :smallint
            ----
            emitter_x   :float  # (um)
            emitter_y   :float  # (um)
            emitter_z   :float  # (um)
            """

    class DetectorLocation(dj.Part):
        definition = """ # 3D position of detectors
            -> Geometry
            detector :smallint
            ----
            detector_x   :float   # (um)
            detector_y   :float   # (um)
            detector_z   :float   # (um)
            """

Equivalent MATLAB:

%{
probe.Geometry (imported)  # probe geometry
-> probe.Design
----
timestamp=CURRENT_TIMESTAMP : timestamp  # automatic timestamp
%}
classdef Geometry < dj.Relvar
end

%{
probe.EmitterLocation (part of probe.Geometry)  # 3D positions of emitters
-> probe.Geometry
emitter :smallint
----
emitter_x   :float  # (um)
emitter_y   :float  # (um)
emitter_z   :float  # (um)
%}
classdef EmitterLocation < dj.Relvar
end
%{
probe.DetectorLocation (part of probe.Geometry)  # 3D positions of detectors
-> probe.Geometry
detector :smallint
----
detector_x   :float  # (um)
detector_y   :float  # (um)
detector_z   :float  # (um)
%}
classdef DetectorLocation < dj.Relvar
end
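
A hedged sketch of what the master's makeTuples might look like under this proposal (helper function names are hypothetical; the populate machinery is assumed to wrap the call in a single transaction):

% The master's makeTuples inserts the master tuple and all of its matching part
% tuples; because the whole call runs in one transaction, parts never appear
% without their master.
function makeTuples(self, key)
    self.insert(key)                                 % master tuple first
    emitters = computeEmitterLocations(key);         % hypothetical helper
    insert(probe.EmitterLocation, emitters)
    detectors = computeDetectorLocations(key);       % hypothetical helper
    insert(probe.DetectorLocation, detectors)
end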

Support 'ORDER BY'

It would be nice to be able to order by a field and return a limited number of tuples. This could accelerate some operations over a slower network connection.
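
One possible (purely hypothetical) syntax, passing the clause through fetch; the table, restriction, and attribute names are illustrative:

% Hypothetical syntax, shown only to illustrate the request -- not currently supported.
topUnits = fetch(ephys.Unit & key, '*', 'ORDER BY snr DESC LIMIT 10');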

Better handling of Ctrl-C in del() and parpopulate() calls

Hitting Ctrl-C while del, parpopulate, or any other function that starts a transaction is executing will leave the user "trapped" inside that transaction.
Given that InnoDB enforces isolation, this can cause highly unexpected behavior for the user, including the loss of inserted tuples and seemingly inconsistent query results.

Proposal:
Add a call to cancelTransaction via an onCleanup handler. Functions registered via onCleanup execute even in the case of Ctrl-C, whereas statements in a catch block do not.
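
A minimal sketch of the proposal, assuming dj.Connection exposes startTransaction / commitTransaction / cancelTransaction; the actual delete work is elided:

% onCleanup guard: the cleanup function executes even when the user hits Ctrl-C.
self.schema.conn.startTransaction
guard = onCleanup(@() self.schema.conn.cancelTransaction);
% ... perform the deletes here ...
self.schema.conn.commitTransaction
% when `guard` is destroyed after a successful commit, the extra ROLLBACK it
% issues finds no open transaction and is a harmless no-op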

enable delayed inserts

R2013b introduced the parfeval function, which can now allow non-blocking inserts. This is important for time-critical applications such as stimulus display with database logging.

We should implement something like dj.BaseRelvar.queueInsert to provide non-blocking inserts with a subsequent check of success.

This feature will only function in R2013b and later.
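
A rough sketch of the idea (queueInsert itself does not exist yet; parfeval is the R2013b primitive, the relvar and tuple names are illustrative, and each worker would need its own database connection):

% Rough sketch of a non-blocking insert built on parfeval (R2013b+).
pool = gcp();                                           % obtain the parallel pool
f = parfeval(pool, @insert, 0, stim.TrialLog, tuple);   % returns immediately
% ... continue time-critical work such as stimulus presentation ...
wait(f)                                                 % later, block until done
if ~isempty(f.Error)
    rethrow(f.Error)                                    % surface a failed insert
end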

"MySQL server has gone away" when inserting many or large rows

I often get the mYm error "MySQL server has gone away" when I try to insert multiple tuples at once. It appears to be related to the overall amount of data (e.g. >100 MB per call to insert causes the error) or to the number of rows (e.g. >1000 rows per call to insert causes the error).

Can I perhaps fix this with the right settings for my MySQL server?

These are a few settings I have already played with. I thought they should allow inserts of many hundreds of MB, at least, but I get errors for 100 MB inserts. What are other important settings?

[mysqld]
max_allowed_packet = 2048M
innodb_buffer_pool_size = 2G
innodb_additional_mem_pool_size = 1024M
innodb_log_file_size = 2G

Thanks!
Matthias

EDIT: Slightly related: I get even worse errors (Matlab segfault/crash) when trying to make many small inserts in rapid succession (e.g. in a for loop that executes quickly).

foreign key properties

I propose to allow specification of the foreign key properties unique and optional.

I propose the following syntax.

-> [unique, optional] lab.Animal 

"Unique" means that each animal in lab.Animal can be referenced only once from the referencing table.

"Optional" means that the foreign key values can be null. When null, the unique constraint is not applicable.

By default, foreign keys are not unique and not optional.

This syntax is compatible with the syntax for renamed foreign keys:

(subject_id) -> [unique] lab.Animal

self joins throw an error

>> common.Animal*common.Animal

Object dj.GeneralRelvar

Primary key:  animal_id
Dependent attributes:  real_id, date_of_birth, sex, owner, line, animal_notes, animal_ts

 Contents: 
Error using mym
Not unique table/alias: 'animal'

Error in dj.Connection/query (line 348)
                ret=mym(self.connId, queryStr, v{:});

Error in dj.GeneralRelvar/exists (line 178)
            yes = self.conn.query(sprintf('SELECT EXISTS(SELECT 1 FROM %s LIMIT 1) as yes', sql));

Error in dj.GeneralRelvar/display (line 83)
            if self.exists

Mutation of floating-point values stored in the database

Let me preface this by saying that the issue I describe here is not specific to datajoint or mym. However, I think that anybody who uses MySQL as a storage backend for scientific data should be aware of certain issues regarding floating-point columns.

Expected behavior

When storing a value x in a database column of an appropriate datatype, we expect to be able to retrieve the same value later. In particular, we hope that a 32-bit float ("single") value with a certain binary representation in memory will have that same binary representation when retrieved from a "float" column in the database. The same should hold for 64-bit floats ("double") and "double" columns.

Potential problem sources

Data exchange between MySQL and datajoint or other connectors is entirely text based. Hence, all floating-point numbers passed to and from datajoint will go through repeated sprintf-style formatting into a string and will later be parsed back into a binary representation. Whether the original value is retained or not depends on the number of digits and algorithms used for conversion.

Observed behavior for 32-bit floats

32-bit floats are returned from the database with significant deviations from the original values. For this experiment, only values that had an exact representation as 32-bit floats were inserted into the table. MySQL formats the values in float columns very aggressively using no more than six digits in the output:

MariaDB [lfp]> SELECT fval FROM tests LIMIT 5;
+----------+
| fval     |
+----------+
| -9162.98 |
| -894.557 |
| -8368.18 |
|  3223.23 |
|  -464.57 |
+----------+
5 rows in set (0.00 sec)

This level of precision is not sufficient to recover the original float value. Most deviations are within 50 ULP(float), but there are outliers of up to 90 ULP(float).
As indicated in the datajoint documentation, the float datatype should not be used when high precision is required. Furthermore, the MySQL documentation is quite devoid of any guarantees regarding floats or their conversion from and to strings or the decimal data type. However, scrambling the last six bits of the mantissa of an already imprecise value just a bit more seems unnecessary.

Observed behavior for 64-bit floats

MySQL uses the dtoa library for conversion of double-precision floating-point values to and from strings. However, "correct" retrieval from the database as specified above is only possible if the decimal representation uses 15 digits or less. Again, this causes slight mutations of values returned from database queries, most in the range of 3 ULP(double); a few reach 4 ULP(double).
While also quite unnecessary (a single additional digit could have made all the difference), this strikes me as acceptable for most scenarios.

Work-arounds

  • Use "decimal" columns whenever the value range of your data is predictable.
  • Avoid the "float" data type, unless precision is really not a concern
  • Float-values can be widened to double precision by multiplying with the numeric literal 1.0 in SELECT statements. This produces more digits in the output and completely abolishes the problem for 32-bit floats.
  • Use blobs to store data that can remain opaque to the database, i.e. won't be used in restrictions
  • NEVER restrict Relvars using equality with a floating-point value. This is a bad practice for floating point numbers in any setting. In MySQL, things get even worse because "float" columns are widened to double before performing comparisons. So you can insert the a tuple with the float value 0.3, then query for it and get zero hits.
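
A sketch of the widening workaround, reusing the `tests` table from the example above and assuming `conn` is an open dj.Connection:

% The literal 1.0 widens the float column to double, so MySQL prints enough
% digits for the value to round-trip exactly.
narrow = conn.query('SELECT fval FROM tests');              % mutated on the way back
wide   = conn.query('SELECT fval*1.0 AS fval FROM tests');  % recovers the stored bits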

Additional unique indexes in table definitions

I am proposing to enhance the table definition language in datajoint to allow for the specification of additional uniqueness constraints. As an example where this may be beneficial, let us consider a table that stores FIR bandpass filters:

%{
sigproc.BandpassFilters (manual) # FIR Bandpass filters
filter_transition1_low      : decimal(9,8)      # Start of lower transition band [0,1]
filter_transition1_high     : decimal(9,8)      # End of lower transition band [0,1]
filter_transition2_low      : decimal(9,8)      # Start of upper transition band [0,1]
filter_transition2_high     : decimal(9,8)      # End of upper transition band [0,1]
filter_ripple_pass_db       : decimal(9,5)      # Passband ripple in dB
filter_ripple_stop_db       : decimal(9,5)      # Stopband ripple in dB

---
filter_response             : mediumblob        # Filter response
%}

This natural key is fine as long as the table is used in isolation. But every time it is referenced by another table, it adds six fields to that table. This creates field bloat and increases the risk of accidental joins, especially if sigproc.BandpassFilters is referenced in the key section.
So it seems more reasonable to revert to a surrogate key for this table:

%{
sigproc.BandpassFilters (manual) # FIR Bandpass filters
bandpassfilter_id        : int unsigned AUTO_INCREMENT   # BP filter ID

---
filter_transition1_low   : decimal(9,8)      # Start of lower transition band [0,1]
filter_transition1_high  : decimal(9,8)      # End of lower transition band [0,1]
....

This comes with two major drawbacks:

  • We just lost descriptiveness regarding our data. The uniqueness constraint for primary keys prevented us from inserting filters with the same specification twice. This is gone now and we have to manually add logic to check for the existence of a filter before inserting it into the table.
  • The primary key is also used as an index to accelerate queries. The table with the surrogate key will perform significantly worse in queries that are based on the filter specification.

From a database design standpoint, this is not a problem at all. One would just create an additional index / uniqueness constraint for the table:

ALTER TABLE ... ADD CONSTRAINT uc_filter_spec UNIQUE (field1, field2, ...)

Both problems are solved. Unfortunately, there is currently no means to include such additional uniqueness constraints in the table definition. An enhanced definition could look like this:

%{
sigproc.BandpassFilters (manual) # FIR Bandpass filters
bandpassfilter_id        : int unsigned AUTO_INCREMENT   # BP filter ID

---
filter_transition1_low   : decimal(9,8)      # Start of lower transition band [0,1]
filter_transition1_high  : decimal(9,8)      # End of lower transition band [0,1]
filter_transition2_low   : decimal(9,8)      # Start of upper transition band [0,1]
filter_transition2_high  : decimal(9,8)      # End of upper transition band [0,1]
filter_ripple_pass_db    : decimal(9,5)      # Passband ripple in dB
filter_ripple_stop_db    : decimal(9,5)      # Stopband ripple in dB
filter_response          : mediumblob        # Filter response
+++
UNIQUE uc_filter_spec(filter_transition1_low,filter_transition1_high,
                      filter_transition2_low,filter_transition2_high,
                      filter_ripple_pass_db, filter_ripple_stop_db)
%}

We have been using such constraints in our lab-internal fork and I am trying to gauge upstream interest.

Lock wait timeouts

I'm getting a lot of "lock wait timeout" errors these days when using the cluster to parpopulate tables (happens already with 16 workers but increasingly often with larger numbers). I already researched it quite a bit and it is related to InnoDB's row level locking but I don't understand yet what causes the lock. I posted a question on stackexchange as well: http://dba.stackexchange.com/questions/31611/innodb-row-level-locks-without-selects

I was wondering if anyone else has the same issue and/or any ideas what causes it or how to fix it.

Error in restriction of projection of restriction

Restricting a projection of a restricted table yields an error due to incorrect MySQL syntax.

For example,

pro(acq.Sessions & 'subject_id = 1') & 'subject_id = 1'

produces an error: Error using mym You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WHERE (subject_id = 1) LIMIT 1) as yes' at line 1

Sure enough, checking the SQL query produced by the above operation yields:

`acq`.`sessions` WHERE (subject_id = 1) WHERE (subject_id = 1)

Package name as prefix in mysql

While in the current version of DataJoint having multiple schemas in most cases also means having a database for each of those schemas, it is impractical for our centralised multi-user environment to have a MySQL database for every user (mainly for administrative and backup reasons).

While it is possible to have multiple packages access the same database, it is impossible to have tables with the same name in both of those schemas.

Therefore, it would be nice if DataJoint could add the package name as a prefix to table names for such multi-user schemas. Once again, this should not be the default but an option somewhere.

I thought about adding the option to the dj.Schema constructor and the implementation to dj.Table. Any input is greatly appreciated.

support the new table datatype

R2013b introduced the table datatype, which provides pretty neat features like joins, outerjoins, sorting, easy column extractions, etc.

DataJoint should support the table datatype just as it currently does with structure arrays.

fetch commands should have the option of returning tables rather than structure arrays.

restrict should accept tables as restrictions.

In the future, when we can require R2013b or above, we can simplify many functions by using table instead of structure arrays.
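
In the meantime, the conversion is cheap where R2013b is available (the table and attribute names below are taken from examples elsewhere in this document):

% Converting today's fetch output (a struct array) into a table, R2013b+.
s = fetch(common.Animal, '*');        % struct array, the current return type
t = struct2table(s);                  % built-in conversion
sortrows(t, 'date_of_birth')          % table-native operations become available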

Populate all ancestors recursively.

In rev. 2.7.4, dj.AutoPopulate/populate and dj.AutoPopulate/parpopulate will have the option to populate all ancestors (referenced tables) recursively.

For example, if computed table 'C' has a foreign key reference to computed table B and table B in turn refers to table A, then the statement

populate(C,restriction)

will be equivalent to

populate(A,restriction)
populate(B,restriction)
populate(C,restriction)

The feature is activated by setting the flag populateAncestors to true:

dj.set('populateAncestors',true)

By default, populateAncestors is set to false to avoid the overhead associated with populating the ancestors.

Note that the restriction, if any, must then be applicable to all the ancestors.

Primary key depends on the order of join's operands

When joining on non-primary key attributes, the result may have more than one candidate key. The current implementation uses the primary key of the first operand plus all the primary key attributes of the second that are not already included. This rule is asymmetric, producing different primary keys depending on the order of the join, so the join is not strictly commutative. Boyce-Codd Normal Form may be helpful for defining a better rule.

Loading of dependencies prevents delete

I don't know when this was introduced, but basically the current mechanism of loading table dependencies effectively prevents me from deleting anything in our common schemas. The reason is that the recursive loading of dependencies will eventually hit a schema of some other user for which I don't have access to the code, i.e. cannot activate it.

It used to be implemented in a way that if there were no tuples in other schemas affected it would delete and cascade just fine. Does anyone know when and why this behavior changed? It's kind of crucial to restore the old behavior since we have a lot of people basing their analyses on a common schema and it's impossible to keep all of those people's code somewhere around just for the purpose of loading dependencies that are completely irrelevant.

master branch contains dangerous bugs

Today I realized that the master branch still contains issue #4, which causes our users to delete almost all of the data from our database when removing even a single electrode array.

There are a few more fixes that have never been merged into master.

Therefore I would recommend updating master to a newer version from the future branch, or providing some guidance on which versions you think should be used for stable operation. I always thought master represented the stable branch, but that does not seem to be the case anymore.

There is also a bugfix branch but it has not seen an update for some time, so I guess it is not used by anyone either.

provide control of datajoint's internal behaviors

I implemented the function dj.set, which controls the global state of DataJoint. Here is what it looks like.

>> dj.set
                  suppressPrompt: 0
    reconnectTimedoutTransaction: 1
                   populateCheck: 1
>> dj.set('suppressPrompt',true)
>> del(package.MyStuff)
>> dj.set('suppressPrompt',false)

This can be used to control debug level, etc.

I am going to merge this in unless this seems non-kosher.

Toolbox documentation in MATLAB missing

Back in 5279624, the Matlab documentation for DataJoint was removed. Eventually, it would be great to get it back, with content that is as close to the wiki pages as possible.

What was the original motivation to remove it? Did the format change such that it was impossible to write documentation that works in all supported versions of MATLAB?

"Escape backslash in dj.ask" breaks string formatting.

So I made pull request #56 recently to escape backslashes in dj.ask so that paths are displayed correctly. I didn't think this through enough, for which I apologize: some other functions (e.g. dj.Schema/makeClass) pass strings with special characters (e.g. \n) to dj.ask, and these now get escaped as well. I'm not sure what the best approach is. Damn Windows for using the backslash as a file separator, I guess.

Cascading delete fails with non-congruent non-pk foreign refs

The following set of tables leads to an incorrect cascading delete:

Image

%{
test.Image (manual) # collection of all image data
image_id          :int          #unique image id
-----
image_data=null   :longblob     #image data
%}
classdef Image < dj.Relvar
end

MonkeyImage

%{
test.MonkeyImage (manual) # collection of monkey images
subj_id         :int      #unique id of the monkey
-----
-> test.Image
%}
classdef MonkeyImage < dj.Relvar
end

ModifiedMonkeyImage

%{
test.ModifiedMonkeyImage (manual) # collection of modified monkey images
-> test.MonkeyImage
modification_type    :varchar(255)      #type of modification
-----
-> test.Image
%}
classdef ModifiedMonkeyImage < dj.Relvar
end

In the above three tables, Image is where all the images are actually stored as longblob, with each image given a unique image_id. MonkeyImage is a convenient collection of all images for monkeys, with a unique entry for each monkey identified by subj_id. Finally, ModifiedMonkeyImage is a collection of modified images for the monkey, uniquely identified by the monkey's subj_id and the modification_type descriptor. Note that although ModifiedMonkeyImage is a child of MonkeyImage and both have non-pk foreign references back to Image, they will point to different image entries!

With the three tables defined as above, the follow code illustrates the issue:

% insert two images, one for monkey unmodified, and one for cropped monkey image
insert(test.Image, struct('image_id', 1));
insert(test.Image, struct('image_id', 2));

% insert the image under MonkeyImage
insert(test.MonkeyImage, struct('subj_id', 100, 'image_id', 1));

% insert the modified image for "subj_id = 100", where "crop" has been applied
% note that here the image_id=2, not 1!
insert(test.ModifiedMonkeyImage, struct('subj_id', 100, 'modification_type', 'crop', 'image_id', 2));

% now try to delete "image_id=2" from "Image", this should cause cascading delete in 
% test.ModifiedMonkeyImage

del(test.Image & 'image_id =  2')

% OUTPUT BELOW
ABOUT TO DELETE:
       1 tuples from `edgar_sandbox`.`image` (manual)

Proceed to delete? (yes/no) > yes
Deleting from test.Image

 ** delete rolled back due to to error
Error using mym
Cannot delete or update a parent row: a foreign key constraint fails (`edgar_sandbox`.`modified_monkey_image`, CONSTRAINT
`modified_monkey_image_ibfk_2` FOREIGN KEY (`image_id`) REFERENCES `image` (`image_id`) ON UPDATE CASCADE)

This issue has been mentioned on datajoint/datajoint-python#15 and I am not sure whether it has been appropriately dealt with in datajoint-python either. It is caused by inappropriate cascading when multiple paths exist from a parent table (e.g. Image) to the target (ModifiedMonkeyImage; here the paths are Image -> ModifiedMonkeyImage and Image -> MonkeyImage -> ModifiedMonkeyImage).

implicit commits

MySQL commits the ongoing transaction when a data definition command is issued.
https://dev.mysql.com/doc/refman/5.7/en/implicit-commit.html

DataJoint declares tables on the fly, and a table may get declared during a transaction, triggering an implicit commit and causing anomalous behavior.

For example, when a table attempts to populate its subtable that has not yet been declared, the first insert into the subtable will trigger its declaration and an implicit commit of the incomplete transaction.

Loading table dependencies takes too long

Loading table dependencies typically takes around 10-15 seconds. Does anyone have an idea what's going on there? This should be just a few fairly trivial queries. Why is it taking so long?

remove queries to information_schema

Apparently, queries to information_schema are inefficient and should generally be avoided. I suggest that we replace all queries to information_schema. It may require more parsing but should improve overall performance.

Changes

  1. Get the table list (dj.Schema/reload) from SHOW TABLES FROM <schema>
  2. Load field information (dj.Schema/reload) by parsing SHOW CREATE TABLE ...
  3. Load foreign keys (dj.Schema/get.dependencies) by parsing SHOW CREATE TABLE ... (see the sketch below)
  4. Get foreign key names (dj.Table/dropForeignKey) by parsing SHOW CREATE TABLE ...
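
For item 3, a minimal sketch of the parsing approach; the CREATE TABLE text below is an illustrative literal standing in for the output that the connection's query would return:

% Parse foreign key references out of SHOW CREATE TABLE output with a regular expression.
createStmt = sprintf(['CREATE TABLE `sessions` (\n' ...
    '  `subject_id` int(11) NOT NULL,\n' ...
    '  `session_date` date NOT NULL,\n' ...
    '  PRIMARY KEY (`subject_id`,`session_date`),\n' ...
    '  CONSTRAINT `sessions_ibfk_1` FOREIGN KEY (`subject_id`) ' ...
    'REFERENCES `lab`.`subjects` (`subject_id`)\n' ...
    ') ENGINE=InnoDB']);
fk = regexp(createStmt, ...
    'FOREIGN KEY\s*\((?<attrs>[^)]+)\)\s*REFERENCES\s*`(?<db>[^`]+)`\.`(?<table>[^`]+)`', ...
    'names');
% fk.attrs  -> '`subject_id`'
% fk.db     -> 'lab'
% fk.table  -> 'subjects'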

Appropriately handle `CTRL-C` in the middle of `populate` and `parpopulate`

In general, any exception that occurs during populate and parpopulate is handled appropriately to ensure that any standing transactions are rolled back. However, when the user issues a keyboard interrupt via CTRL-C during populate/parpopulate, no exception is triggered, and the existing database transaction is left open. This can be really confusing: all of a sudden DataJoint appears inconsistent with the actual state of the database, yet superficially it seems to continue to function. CTRL-C in the middle of populate/parpopulate should be handled gracefully to ensure that any open transaction is closed before handing control back to the user.

Furthermore, when parpopulate has been terminated by CTRL-C, the status of the appropriate tuple in the jobs table must be updated. Currently, the interrupted tuple remains with status reserved, making it impossible to distinguish an interrupted process from a very slow or hanging computation.

Default popRel

The Python version of datajoint implements a default value for the populate relation, i.e. the product of the parent tables. The majority of current DataJoint classes use this default.

I propose to replicate this feature in MATLAB.
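
For illustration, here is how an auto-populated class declares its popRel today; under the proposal, omitting the property would default to exactly this product of the parent tables (the class names are illustrative):

% Illustrative classes; today popRel must be spelled out even in the default case.
classdef ProbeCheck < dj.Relvar & dj.AutoPopulate
    properties
        popRel = probe.Design * acq.Sessions   % the proposal would make this the default
    end
    methods
        function makeTuples(self, key)
            self.insert(key)                   % placeholder computation
        end
    end
end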

use the database toolbox instead of mym?

Let's switch to using Mathworks' database toolbox.

Does anyone have any objections? It's a quick change. But please test that you can connect to your databases from all your platforms using the database toolbox.

incorrect cascading deletes in cases of non-primary foreign keys

Currently, DataJoint implements its own cascading deletes instead of relying on MySQL's native cascading deletes.

This solution

  • allows reviewing the contents to be deleted
  • compensates for MySQL's multiple-path problem for cascading deletes

dj.BaseRelvar/del generates the list of all dependent tables and deletes from them, starting from the bottom. Each delete is restricted by the top relation. This rule works well when all foreign key fields are also primary key fields in the tables involved, but it can result in deleting non-dependent tuples if one of the tables along the cascade includes non-primary fields in its foreign key.

This issue may be fixed by simply not cascading down from any table that does not include all of its foreign key fields in its primary key.

colons : in enum values cause declaration errors

For example, in Jake's patch.Session, he declared

craniotomy_location=null    : enum('V1: 2.7-3 lateral, just anterior of lambda','S1: 3.2-3.5 lateral, 1.2-1.5 posterior (bregma)') # 

This throws an error because : is used to separate the attribute name from the datatype.

addForeignKey() on lookup tables leads to problems

Problem

I noticed that adding foreign keys via addForeignKey() on a table of type lookup leads to various problems. In the best case it results in a MySQL error and fails to add the foreign key; in the worst case it creates the foreign key, but any future changes will fail and the only solution is to drop the table and re-create it.

Reason

When MySQL automatically creates the name for the foreign key (usually <tablename>_ibfk_x), the # character that is prepended to the table name for lookup tables creates problems. The following seems to happen:

  1. When the foreign key constraint is declared during table declaration everything works fine.
  2. When adding a foreign key via ALTER TABLE (as is done in addForeignKey()), the # is translated to @0023. If table foo already has a foreign key named #foo_ibfk_1, the new foreign key gets named @0023foo_ibfk_1 (note the 1 instead of the expected 2).
  3. Now, depending on the MySQL server version, this either (a) creates an error 1050 because the two are interpreted as being named identically or (b) works fine, creates the foreign key, but prevents any future changes to the table, which will fail because apparently two identically named keys exist.

Bottom line: there seem to be situations in which MySQL thinks # is the same as @0023 and other situations where it doesn't. This is either a bug in MySQL or # is not supposed to be part of a table name (or both).

Suggested solution

Ideally we probably shouldn't use # as a prefix, but that will break backwards compatibility. As a workaround, we could manually name the foreign key in addForeignKey() as follows:

ALTER TABLE `foo`.`#bar` ADD CONSTRAINT `#bar_ibfk_2` FOREIGN KEY ...

I have tested this approach and it works fine. It would require some logic to read out the existing foreign keys and determine the next number; alternatively, we could just create a random name for each new key.
