A utility to help apply expert rules to Sonobat or Kaleidoscope auto-id output. However, the program is intended to be quite general, and can likely be applied to outputs from other automatic classifiers that produce spreadsheet-like outputs.
Getting started:
- openpyxl (version 2.5.14, not newer!)
- click
Install Python on your computer.
Make sure to check box to add Python to Path.
Install echoclean:
pip install echoclean
Verify that echoclean
installed correctly by typing the following in
your terminal of choice (CMD prompt on Windows):
echoclean
This should show:
Usage: echoclean [OPTIONS] COMMAND [ARGS]...
and not an error.
Install Python on your computer.
Make sure to:
- Check box to add Python to Path
- Use Customize Installation
- Be sure that box to install Pip is checked
- Advanced Options: be sure box to Add Python to Environment variables is checked
- Specify the directory to which you want Python installed so that you can find it if needed
(e.g., to
~\Programs\Python\Python310\
)
Run the CMD prompt by typing "CMD" from Windows search.
Verify that Python
is installed correctly by typing the following in the CMD
prompt:
python --version
If you get an error message about python not being installed, try the following:
py --version
Either python
or py
should
This should show the Python version and not an error.
Verify that pip
installed correctly by typing the following in the CMD prompt:
pip --version
This should display the version number and not an error.
Retrive the list of places where python and packages are installed by typing the following in the CMD prompt:
python -c "import site; print(site.getsitepackages())"
OR
py -c "import site; print(site.getsitepackages())"
This will return a list of directories.
Update the System PATH environment variable to include the following directories. Follow the instructions [here](https://www.wikihow.com/Change-the- PATH-Environment-Variable-on-Windows) to update environment variables.
- add each of the directories above, separated by a semicolon
- add the Scripts subdirectory of Python; if you installed Python to
~\Programs\Python\Python310
this will be~\Programs\Python\Python310\Scripts
Now install echoclean by typing the following in the CMD prompt:
pip install echoclean
Verify that echoclean
installed correctly by typing the following in the CMD
prompt:
echoclean
This should show
Usage: echoclean [OPTIONS] COMMAND [ARGS]...
and not an error.
Download or clone this repository, then run:
python setup.py install
Save your rules to a known location on your computer (e.g., C:\Users\YourName\Desktop\Sonobat\Rules
)
Open your command window and navigate to that directory. On windows, open your CMD prompt by typing "CMD" in the Windows search box, and then type "cd " to navigate to that directory (replace with the actual name of your directory above).
Then type the following:
echoclean apply "<rules>.xlsx" "<dataset>.csv"
where <rules>
and <dataset>
are the names of your rule spreadsheet and
Sonobat outputs respectively.
You will be prompted to select the worksheet in your rules workbook that contains the rules you want to apply.
This will create a new file called "_out.xlsx" in the same
directory as your rules (basically, replaces .csv
with _out.xlsx
).
TIP: Make sure that all the columns are correctly lined up in the Sonobat output
Echoclean works by applying expert rules from the RULES file to the DATA file. The files can be XLSX, CSV, or Tab-delimited. If XLSX and more than one sheet is found, you will be prompted to select to correct sheet.
The program first compares the columns between the DATA and RULES files. Any column that is present in both is used to construct a rule. Any column that is found only in the RULE file is used as the result set when that rule is applied to a row in DATA. Any column in the DATA file that is not found in the rules is copied to the OUTPUT. This flexibility allows you to develop rules for nearly any simple data that can be represented as row / column format.
Note: if you run the OUTPUT of this program back through as DATA, you must rename the columns from the result set, or they will be used for criteria and produce unexpected results.
This program does not care how you name your columns, but for a rule to apply to a column in DATA, the names must be exactly the same (including capitalization). This program does not expect any particular column, but if it finds one called 'Filename' it will use that to attempt to parse out a datestamp (using Sonobat format). These datestamps are then used to determine the night on which the file was recorded, and any time before noon is assumed to be part of the previous night.
Rules are applied in sequence, in the order they are read in from the file.
The OUTPUT file is always an XLSX spreadsheet with multiple sheets. One sheet includes the classification results, which are each row from the DATA file, preceded by the result set from the RULES file for the rule that classified that row. For each column in the result set, an additional sheet is created with the count of rows for each unique value of that result which was found in the classification results.
A rule is a collection of criteria that must be met to apply that rule. As soon as a rule is applied, no other rules are tested.
It is convenient to think of rules as forming a dichotomous key; as a row from DATA fails a given rule, it is tested against the next one, and thus does not qualify against ANY of the preceding rules. Thus your rules can be additive because you can safely assume that the row did not meet any of the prior criteria, and you should construct your rules from the most strict to the least strict.
Example:
Given the following rules:
Rule 1 Consensus: 'ANPA'
Rule 2 Consensus: 'MYLU'
These will be evaluated such that any row from DATA that is tested against Rule 2 is implicitly NOT any of the criteria in Rule 1. So in this case, it is equivalent to:
Consensus is 'MYLU' AND NOT 'ANPA'
.
We recommend that you number your rules to assist with debugging. Doing so makes it easier to visually compare the values in a given row to the rule applied or expected.
Note: this program does not check your rules for validity. You are responsible for making sure that your rules are logically correct and in the correct order.
The criteria must be matched against the data type of the column. Range or equality comparisons are only appropriate for numeric types.
any If a column is left blank in for a given rule, it allows any value for that column in the data to match. You can also provide the value 'any' but this is unnecessary.
blank If you specify 'blank' for a column in a given rule, then that column must be blank in the data for it to match.
numeric types You can specify equality comparisons like these:
> 10
< 12
>= 40
<= 20
= 2
4 - 10 (equivalent to >= 4 AND <= 10)
text types You can specify a list of possible values to compare against. The row in the data only needs to contain one of them.
Example:
Rule 1 Consensus: LANO, COTO, TABR, LACI, or Blank
would match COTO
but would not match EPFU
. This is also a special
case where a blank would match, but any value not in the list would fail.
You can also use a negatory condition, but only one value is allowed:
Consensus: Not EPFU
See examples/README.md for example rule files to start from.
If you need assistance installing or running echoclean, please contact [email protected].
For help developing and applying rulesets, please contact [email protected].