py_experimenter's Introduction

PyExperimenter Logo: Python biting a database

PyExperimenter

PyExperimenter is a tool that facilitates the setup, documentation, execution, and subsequent evaluation of results from an empirical study of algorithms and, in particular, is designed to significantly reduce the manual effort involved. It is intended to be used by researchers in the field of artificial intelligence, but is not limited to this audience.

The empirical analysis of algorithms is often accompanied by the execution of algorithms for different inputs and variants of the algorithms (specified via parameters) and the measurement of non-functional properties. Since the individual evaluations are usually independent, the evaluation can be performed in a distributed manner on an HPC system. However, setting up, documenting, and evaluating the results of such a study is often file-based. Usually, this requires extensive manual work to create configuration files for the inputs or to read and aggregate measured results from a report file. In addition, monitoring and restarting individual executions is tedious and time-consuming.

These challenges are addressed by PyExperimenter by means of a single, well-defined configuration file and a central database for managing massively parallel evaluations, as well as for collecting and aggregating their results. Thereby, PyExperimenter alleviates the aforementioned overhead and allows experiment executions to be defined and monitored with ease.

General schema of PyExperimenter.

For more details, check out the PyExperimenter documentation.

Cite PyExperimenter

If you use PyExperimenter in a scientific publication, we would appreciate a citation in one of the following ways.

Citation String

Tornede et al., (2023). PyExperimenter: Easily distribute experiments and track results. Journal of Open Source Software, 8(84), 5149, https://doi.org/10.21105/joss.05149

BibTeX

@article{Tornede2023, 
    title = {{PyExperimenter}: Easily distribute experiments and track results}, 
    author = {Tanja Tornede and Alexander Tornede and Lukas Fehring and Lukas Gehring and Helena Graf and Jonas Hanselle and Felix Mohr and Marcel Wever}, 
    journal = {Journal of Open Source Software},
    publisher = {The Open Journal},  
    year = {2023}, 
    volume = {8}, 
    number = {84}, 
    pages = {5149}, 
    doi = {10.21105/joss.05149}, 
    url = {https://doi.org/10.21105/joss.05149}
}

py_experimenter's People

Contributors

jonashanselle, lukasfehring, lukasgehring, stheid, tornede


py_experimenter's Issues

Example for Singularity and SLURM

Create an example folder containing

  • Example experiment configuration
  • Python example code
  • Singularity recipe
  • SLURM script for execution

Create example general usage

With SQLite, no database credentials are needed. Therefore, everything could be executed without creating an experiment configuration file.

Example config too generic

For a new user, it will probably be very hard to set up the configuration file due to the very generic example given. It would be nice to have a small example config plus run.py, as sketched below, where only the database information has to be added and a complete run can be executed.
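
A minimal starting point could look like the following; the file names, database name, and the toy result field are made up for illustration, and the API calls follow the examples elsewhere in this document:

[PY_EXPERIMENTER]
provider = sqlite
database = example_database
table = example_table

keyfields = seed:int
resultfields = result:int

cpu.max = 1

seed = 1,2,3

from py_experimenter.experimenter import PyExperimenter
from py_experimenter.result_processor import ResultProcessor

def run_experiment(keyfields: dict, result_processor: ResultProcessor, custom_fields: dict):
    # Toy experiment: square the seed and store it as the result.
    result_processor.process_results({'result': keyfields['seed'] ** 2})

experimenter = PyExperimenter(config_file="config/example.cfg")
experimenter.fill_table_from_config()
experimenter.execute(run_experiment, max_experiments=-1)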

experimenter does not fetch experiments with float key fields

Really weird behavior: the experimenter correctly gets the table, but does not fetch any experiments.

config file

[PY_EXPERIMENTER]
provider = mysql 
database = experiments
table = test

keyfields = dataset:int, seed:int, epsilon:float
dataset = 61
seed = 0:2
epsilon = 0.1

cpu.max = 1 

resultfields = analysis:LONGTEXT
resultfields.timestamps = false

Code

from py_experimenter.experimenter import PyExperimenter
from py_experimenter.result_processor import ResultProcessor

experimenter = PyExperimenter(config_file="config/experiments-test.cfg")
experimenter.fill_table_from_config()

experimenter.get_table() # will correctly print a table with 3 entries

def f(keyfields: dict, result_processor: ResultProcessor, custom_config):
    print("GO")

experimenter.execute(f, max_experiments=-1, random_order=True) # will do nothing


Machine column not filled correctly

The "machine" column in the database should not be filled with the PID, but rather with the name of the machine the code was executed on.

add function to completely erase/reset setup

The experimenter currently has a function to fill in experiments.

However, specifically at the beginning of the workflow, one often experiments a bit. It would be great if there were a function that completely drops the table and recreates it with the new experiments. Currently, the only way to repair your setup when you change something in the table attributes is to delete the table manually.

Requirements too strict

The requirements are too strict, which is why we have not been able to install it. It would be better to specify requirements as ">= version".

Remove Random Order Parameter

For the parallel execution of experiments, we previously used the random_order parameter of execute. Since the execution functionality has been reworked, this parameter can be removed.

experimenter is slow in fetching experiments for 10k+ experiments

I have setups with more than 100k experiments. However, even for only 10k experiments, the time until the run_experiment function is called is unacceptably high (something like 30 seconds). This occurs independently of whether random_order is true or not.

This happens on my localhost, so it is no network issue.

The time for an experiment to start should be independent of the table size (and take less than a second on localhost).

Adjust requirements.txt to avoid utf-8 problem in mysql connector

With the latest version of mysql-connector-python, one might experience problems with the character set:

DatabaseConnectionError: Character set 'utf8' unsupported

or

ProgrammingError: Character set 'utf8' unsupported

This can be solved by downgrading the connector:

pip install mysql-connector-python==8.0.29

Reset Experiments

Provide a reset functionality of experiments having a given status.

Start date is not updated once an experiment runs

Observation: When running experiments, the start_date column is not updated once an experiment is started. In fact, the start_date column is NULL for experiments of any status (running, error, done).

Proposed fix: Update the newly assigned experiment with the actual start_date.

Fill table based on dictionary where all options are automatically combined to an entry

Currently there are only two options to fill the database. The first is by defining each entry individually:

experimenter.fill_table(own_parameters=[
    {'datasetName': '1', 'internal_performance_measure': '1', 'featureObjectiveMeasure': '1', 'seed': 1}])

The second option is to write a config file in which all options of column values are combined:

experimenter.fill_table()

It would be great to have the second option available not via a config file, but via a dictionary that is built like a config file, as sketched below.
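
A hedged sketch of what such an interface could look like; the method name fill_table_from_combination and its parameters argument are assumptions about the proposed API, not existing functionality:

experimenter.fill_table_from_combination(parameters={
    'datasetName': ['1', '2'],
    'internal_performance_measure': ['1'],
    'featureObjectiveMeasure': ['1'],
    'seed': [1, 2, 3]
})
# Would create the cross product of all listed values, analogous to fill_table_from_config().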

Improve experiment configuration file

Recombine all mandatory information of the PyExperimenter into a single block/section, to avoid errors.

[PY_EXPERIMENTER]
provider = sqlite
database = database_name
table = table_name

keyfields = seed:int
resultfields = sin, cos

cpu.max = 5

seed = 1,2,3,4,5

[CUSTOM]
data = path/to/folder 

Error in Poetry install

Currently, poetry install fails on the debugpy dependency, which we hotfixed by using poetry 1.4.0 or by disabling the modern installer.

However, once microsoft/debugpy#1246 is solved, we should be able to revert the documentation and workflow to the original state.

Fetching experiments does not work anymore on MySQL with certain versions

Hi,

with the new version of the experimenter (1.1.0), when running the execute function, I now receive the following error message (without a stack trace or anything):

ERROR:root:error 
 MySQL server version (5, 5, 5) does not support this feature raised. 
 Please check if fill_table() was called correctly.

Not sure where exactly the error is raised. My version is: Server version: 10.9.3-MariaDB Arch Linux

The error does not occur on plain MySQL. Here, I have one with version mysql Ver 8.0.20 for Linux on x86_64 (MySQL Community Server - GPL).

MariaDB is usually a largely compatible replacement for MySQL (on Arch Linux it is the mysql replacement, because there is no official mysql package). Do you have any idea which MySQL functionality might cause this behavior when using MariaDB instead?

Restructure SQL-Connector

Problem in the current code structure

As of now, database_connector.DatabaseConnector (and its subclasses) supports two main functionalities:

  1. Starting and maintaining connections to the database
  2. Generating rows from the given input parameters

Beyond that, there are two structural issues:

  3. Due to the current code structure, the table's name is present in all classes. However, this might be unnecessary.
  4. The naming of variables and functions with respect to leading underscores is inconsistent.

In addition to these functionalities, a DatabaseConnector object is held by all ResultProcessor instances as well as by the PyExperimenter object, and it creates the log tables.

Together this leads to the DatabaseConnector code being hard to understand and adapt.

How to fix the problem

This problem should be approached by splitting the connector into interface classes (for mysql, sqlite, ...) and classes that generate the entries.
In addition, it is a goal for an experimenter to only use one connection at a time (currently multiple threads can have connections simultaneously), which inserts, deletes, and modifies numerous table entries at once.
Another goal is that the functions used for creating tables, fetching data, etc. are adapted to no longer use execute. This is motivated by the goal of using prepared statements for escaping instead of a separate escaping function, as mentioned in #61.

Allow categorizing jobs into different categories/ buckets for runners to pull from

Currently, any runner can pull any open experiment from the database. However, not all experiments may have the same resource requirements (e.g. requiring or not requiring a GPU, different runtimes, etc., which on a cluster would be executed by scripts with different limits). It would be practical to assign jobs a category and, as an optional feature, set runners up so they can only pull jobs from that category (with a default category to keep the current functionality). Maybe this could be realized by a kind of tagging feature?

Allow configurations without resultfields

Problem

It might be useful to allow configurations without a resultfield, for example if we intend to only write into logtables (#84).

Solution

The solution is pretty easy. One just has to adapt the syntax check of configurations.

`reset_experiments` documentation

Hi, the reset_experiments documentation mentions states as a parameter, which is in disagreement with the general usage documentation that uses the status parameter. After installing PyExperimenter via pip and attempting to replay the commands, it seems that neither status nor states is accepted as a parameter; I keep getting a "got an unexpected keyword argument" exception. The current source code uses states, in disagreement with the documentation. It only works when providing the argument without the parameter name.

This is probably related to the use of *states, specifically the *, which makes it a variadic positional parameter instead of a keyword argument. So, if you precede states by *, you cannot pass it as a named argument as instructed in the documentation; see the illustration below. More about this in the Python documentation.
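
A minimal, self-contained illustration of the underlying Python behaviour, independent of PyExperimenter:

def reset_experiments(*states):
    # *states collects positional arguments only.
    print(states)

reset_experiments('error', 'running')   # works: states == ('error', 'running')
reset_experiments(states=('error',))    # TypeError: got an unexpected keyword argument 'states'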

mysql-connector handling of Longtext

In the database, LONGTEXT fields are encoded as utf8_bin by default. Up to version 8.0.24, the Python mysql-connector incorrectly converted fetched LONGTEXT values to strings. They should actually have been bytes all along, but that was not the case (https://dev.mysql.com/doc/relnotes/connector-python/en/news-8-0-24.html). Now that the bytes have to be processed, various problems arise further down the line.

This can be fixed, for example, by overriding the function DatabaseConnector.fetchall in DatabaseConnectorMYSQL so that it performs the corresponding decoding.
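
A minimal sketch of the suggested fix; the import path and the assumption that the inherited fetchall returns row tuples may differ from the actual code:

from py_experimenter.database_connector import DatabaseConnector  # assumed import path

class DatabaseConnectorMYSQL(DatabaseConnector):

    def fetchall(self):
        rows = super().fetchall()
        # Newer mysql-connector-python versions return LONGTEXT columns as bytes; decode them back to str.
        return [
            tuple(value.decode('utf-8') if isinstance(value, (bytes, bytearray)) else value
                  for value in row)
            for row in rows
        ]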

Improve Configuration Handling

So far, there is no single point of configuration handling, which should be improved in a later version to increase maintainability. This means encapsulating all accesses to and modifications of the config in one class, ConfigHandler.

PyExperimenter attaches handler to all loggers

The library attaches handlers to all loggers in the system, which makes it impossible to reasonably log applications during the experiments (with one's own logging configs). Here is an MWE:

from py_experimenter.experimenter import PyExperimenter
from py_experimenter.result_processor import ResultProcessor
import logging

logger1 = logging.getLogger("l1")
logger1.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger1.addHandler(ch)

logger2 = logging.getLogger("l2")
logger2.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger2.addHandler(ch)


logger1.info("pre 1")
logger2.info("pre 2")

experimenter = PyExperimenter(config_file="config/experiments.cfg")

logger1.info("post 1")
logger2.info("post 2")

And here the console output:

2022-10-25 15:05:01,787 - l1 - INFO - pre 1
2022-10-25 15:05:01,787 - l2 - INFO - pre 2
2022-10-25 15:05:01,792 - l1 - INFO - post 1
INFO:l1:post 1
2022-10-25 15:05:01,792 - l2 - INFO - post 2
INFO:l2:post 2

My suggestion is to add a parameter logger that can be a str or a Python logging.Logger, as sketched below. The library should not define any handlers itself, imho. It should just use the given logger (if any) for its messages. It is up to the user to set up handlers.
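
A sketch of the proposed interface (not the current API); the logger parameter is the suggestion made above:

import logging

from py_experimenter.experimenter import PyExperimenter

my_logger = logging.getLogger("my_experiments")

# Proposed behaviour: PyExperimenter reuses the given logger and attaches no handlers of its own.
experimenter = PyExperimenter(config_file="config/experiments.cfg", logger=my_logger)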

Allow for Interval Value Ranges

It should be possible to specify numeric keyfields via an interval, e.g. 1-30:1, meaning a range from 1 to 30 with a step size of 1 (format: start-end:stepsize). Additionally, multiple intervals should be allowed, like 1-30:2, 60-90:2.
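
A small sketch of how such interval specifications could be expanded into concrete keyfield values; the parse_intervals helper is purely illustrative and not part of PyExperimenter:

def parse_intervals(spec: str) -> list:
    # Expands e.g. "1-30:2, 60-90:2" into the corresponding list of integers.
    values = []
    for interval in spec.split(','):
        start_end, step = interval.strip().split(':')
        start, end = start_end.split('-')
        values.extend(range(int(start), int(end) + 1, int(step)))
    return values

print(parse_intervals("1-30:2, 60-90:2"))  # [1, 3, ..., 29, 60, 62, ..., 90]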

Stack traces are not properly logged into MySQL

I presume that if a stack trace contains single quotation marks ('), then it is not properly logged. Instead, the following error message is logged:

1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '<something>'])

Log Memory Usage

The maximum memory should be specified in the experiment configuration (memory.max) and recorded in the database in a corresponding column.

Renamings

PyExperimenter

  • init()
    • config_path -> config_file
    • experimenter_name -> name
  • get_results_table() -> get_table()

own_function()

  • parameters -> keyfields
  • custom_config -> custom_fields

Writing resultfields having '

Pipelines cannot be written to the database in the current version due to string issues with the single quote character ('), like in the following example:

result = {'final_pipeline': "Pipeline(steps=[('svc', SVC(gamma='auto'))])"}
result_processor.process_results(result)
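
The underlying problem is string interpolation into SQL; here is a minimal, self-contained sketch (using sqlite3 for illustration, independent of the PyExperimenter internals) of how parameterized statements avoid it:

import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE results (final_pipeline TEXT)")

pipeline = "Pipeline(steps=[('svc', SVC(gamma='auto'))])"
# The placeholder lets the driver escape the quotes instead of interpolating them into the query string.
connection.execute("INSERT INTO results (final_pipeline) VALUES (?)", (pipeline,))
connection.commit()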

CI Tests

The tests should be executed and checked with each push. Note that we need a workaround for the MySQL tests.

Additionally, passing the tests should be mandatory for a merge request.

Update Experiment Handling

The current implementation of the execution of multiple experiments does not scale well for a large number of experiments. The reason is how the given max_experiments (e.g. =5) are selected and executed. First the complete table is pulled, then it is shuffled locally and the top 5 experiments are selected for execution. But the execution is usually not parallel but sequential, possibly resulting in an overlap with another PyExperimenter instance. In the end, it may happen that not the full number of experiments is executed.

Solution

  • The experiment configuration field cpu.max will be renamed to number_parallel_experiments.
  • Rename max_experiments to max_number_experiments_to_execute.
  • PyExperimenter.execute() will spawn as many workers as defined by number_parallel_experiments.
  • The open experiments will not be pulled once in advance, but within each call of PyExperimenter._execution_wrapper(). This will be handled entirely by the SELECT call, including the randomization (if requested), and limits the result to 1. In the same transaction as the pull of an open experiment, its status will be set to running (see the sketch after this list).
  • An open experiment will only be pulled if max_number_experiments_to_execute has not been reached yet (except for -1).
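
A rough sketch of what such a single-pull transaction could look like, assuming a mysql-connector-style connection; the table name experiment_example (borrowed from the logging example below), the status value 'created', and the status/start_date columns are assumptions for illustration:

def pull_open_experiment(connection, randomize: bool = False):
    # Sketch only: reserve exactly one open experiment inside a single transaction.
    cursor = connection.cursor()
    order = "ORDER BY RAND() " if randomize else ""
    cursor.execute(
        "SELECT id FROM experiment_example WHERE status = 'created' "
        + order + "LIMIT 1 FOR UPDATE"
    )
    row = cursor.fetchone()
    if row is None:
        connection.rollback()
        return None
    cursor.execute(
        "UPDATE experiment_example SET status = 'running', start_date = NOW() WHERE id = %s",
        (row[0],),
    )
    connection.commit()
    return row[0]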

Impacted Issues

Identification of Experiments

While the table key 'id' was introduced at some point, we adapted neither the structure of the database_connector classes nor that of the result_processor. Therefore, we internally still identify experiments by their keyfield/keyfield_value pairs.

This can be seen, e.g. in the line
self._where = ' AND '.join([f"{str(key)}='{str(value)}'" for key, value in condition.items()])
of the result_processor class.
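
For comparison, a small illustrative sketch of the two identification styles; the condition and the id value are made up:

condition = {'dataset': '61', 'seed': '0'}

# Current: identify the experiment by its keyfield/value pairs.
where_by_keyfields = ' AND '.join([f"{str(key)}='{str(value)}'" for key, value in condition.items()])
# -> "dataset='61' AND seed='0'"

# Suggested direction: identify the experiment by its id column only.
experiment_id = 42  # illustrative value
where_by_id = f"id={experiment_id}"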

Timestamps for Result Fields

It should be optionally configurable that result fields are written into the experiment table together with a timestamp.

In the configuration file, this would be defined via resultfields.timestamps = true

Add job information

The PyExperimenter constructor should have an optional field for providing job information in terms of a string which is logged in a distinct database column. This functionality helps to associate DB rows with, e.g., log files on a cluster.

Add Database Logging Functionality

The general idea is to add the functionality to log values for a single field over time, like the new_best_performance observed, or information about epochs such as their runtime or performance. Therefore, logtables are introduced, each being able to log predefined information over time.

Example

Having a look at a concrete example, one would like to have the main table experiment_example with

  • dataset
  • seed (both keyfields), as well as
  • pipeline,
  • accuracy, and
  • f1 (all three resultfields).

Additionally, information logged within the run of each experiment will be stored in separate tables, each having multiple entries per experiment_id for different timestamps. Each entry of a logtable is automatically associated with the corresponding experiment_id and a timestamp.

  • experiment_example__new_best_performance will store
    • experiment_id
    • timestamp
    • new_best_performance
  • experiment_example__epochs will store
    • experiment_id
    • timestamp
    • runtime
    • performance

Configuration File

The configuration file could look like the following. logtables is the keyword for defining new tables for logging, supporting two types:

  • Usual SQL types, like FLOAT
  • Complex types, which have to have a unique name, like LogEpochs, and can consist of multiple SQL types like FLOAT or INT

Each item in logtables (be it a simple or complex type) will create an entirely new table. This needs to be kept in mind when aggregating the results, as needlessly splitting up logs over multiple tables will lead to costly joins when aggregating.

[PY_EXPERIMENTER]
provider = sqlite
database = experiment_database
table = experiment_example

keyfields = dataset, seed:int(1)
resultfields = pipeline:LONGTEXT, accuracy:FLOAT, f1:FLOAT
logtables = new_best_performance:INT, epochs:LogEpochs                    # <- This is new
LogEpochs = runtime:FLOAT, performance:FLOAT                              # <- This is new

[CUSTOM]
output_folder = output/

Usage

The ResultProcessor will get a new method process_logs() to handle the logging. It gets a dict with the logtable names as keys and, depending on the type, the values:

  • Usual SQL types: the value
  • Complex types: a dict of the complex type's field names and values
result_processor.process_logs({
    'new_best_performance': 0.86,
    'epochs': {
        'runtime': 42,
        'performance': 0.79
    }
}) 

Performance

To avoid DDoSing one's own database, it is necessary to buffer the log writes (dynamically).

Deadlock when fetching experiments simultaneously.

I frequently get an error when starting many jobs at the same time, but sometimes also when starting just a small number (64 jobs):

1205 (HY000): Lock wait timeout exceeded; try restarting transaction
 raised when executing sql statement.

The jobs won't start. I suppose it's a problem at the moment of fetching jobs from the table. Even restarting the MySQL server does not help immediately. Sometimes, after several dozen of these errors, it eventually grabs the experiments and runs through.

There is no stack trace, and I currently have no clue why exactly it is happening. But it is a rather severe problem, because valuable cluster compute time is burnt just waiting for the job to start.

I would suggest rethinking the way the experiments are fetched. You might want to check out the Java version in AILibs, in which we do a combined SELECT/INSERT statement to immediately reserve a job when it is fetched (in the same query, without the need for a transaction); a sketch follows below.
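
A hedged sketch of such a single-statement reservation (not the actual AILibs or PyExperimenter code); the table and column names are illustrative, and the machine column is reused as a worker token purely for this example:

import uuid

def reserve_experiment(connection):
    # Sketch only: a single UPDATE reserves one open experiment without an explicit transaction,
    # then the reserved row is looked up via a unique worker token.
    token = str(uuid.uuid4())
    cursor = connection.cursor()
    cursor.execute(
        "UPDATE experiment_example SET status = 'running', machine = %s "
        "WHERE status = 'created' LIMIT 1",
        (token,),
    )
    connection.commit()
    cursor.execute("SELECT id FROM experiment_example WHERE machine = %s", (token,))
    row = cursor.fetchone()
    return row[0] if row else None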

Create conditional parameter grid example (SVM)

An example shall be created in which the use of conditional parameter grids is showcased. An easy way is configuring a support vector machine, including the kernel parameters conditioned on the chosen kernel; a sketch follows below.
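
A hedged sketch of how such a conditional grid could be passed to the experimenter, reusing the fill_table call with explicit row definitions from earlier in this document; the keyfield names and value ranges are illustrative:

# Only the rbf kernel needs gamma, and only the poly kernel needs degree.
rows = []
for c in [0.1, 1.0, 10.0]:
    rows.append({'kernel': 'linear', 'C': c, 'gamma': 'NA', 'degree': 'NA'})
    for gamma in [0.01, 0.1, 1.0]:
        rows.append({'kernel': 'rbf', 'C': c, 'gamma': gamma, 'degree': 'NA'})
    for degree in [2, 3]:
        rows.append({'kernel': 'poly', 'C': c, 'gamma': 'NA', 'degree': degree})

experimenter.fill_table(own_parameters=rows)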
