garfieldnate / algorithm-am

Perl module implementing Analogical Modeling

License: Other

Perl 79.50% XS 20.50%

algorithm-am's Introduction

Hi there! 👋

I'm Nathan, a freelance software engineer with over 10 years of experience specializing in natural language processing. I'm familiar with a wide range of computer technologies as well as human languages. If you'd like to learn more about me or my work, you can visit my website at nateglenn.com. If you'd like a free consultation on NLP or general engineering work, send me an email at [email protected]. Talk to you soon!

algorithm-am's Issues

allow comments in data files

This is long overdue; don't know how I did without it. Just use # or something. Or let the user set it.
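A minimal sketch of how this could work, assuming a comment runs from the marker to the end of the line. The helpers `strip_comment` and `data_lines` are hypothetical and not part of the module's API, and the sample data lines are made up.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Algorithm::AM): remove a trailing
# comment from one data line. The marker defaults to '#' but the
# caller can pass a different one.
sub strip_comment {
    my ($line, $marker) = @_;
    $marker = '#' unless defined $marker;
    # \Q...\E quotes the marker so characters like '*' are taken literally
    $line =~ s/\s*\Q$marker\E.*\z//s;
    return $line;
}

# Return only the lines that still have content after stripping
sub data_lines {
    my ($text, $marker) = @_;
    return grep { /\S/ } map { strip_comment($_, $marker) } split /\n/, $text;
}

my @lines = data_lines("# header\nA , 3 1 2 , first\n\nB , 3 1 0 , second # note\n");
print "$_\n" for @lines;
```

One caveat with this approach: a data set that uses the marker character as a feature value would be truncated, which is an argument for letting the user set the marker.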

add pure perl version

AM 2.0 was pure Perl, and it is possible that some people will still not like or be able to compile XS code. Add back that old Perl code as a pp option and let users decide on pp or xs versions at install time. This would make it at least a little more future-proof, since there seem to be so many problems trying to get the XS to not crash Perl on some systems.

investigate alternatives to AM_BIG_INT

AM_BIG_INT holds 2 bytes in each element and has 8 elements, meaning it is 16 bytes long, or 128-bit. Are there any alternatives (e.g. libraries) that give this precision? It sure would be easier to work with something more natural than arrays representing single numbers.

use Travis-CI

So we don't have to wait for CPAN testers, although those are great, too.

functionality for reading old data sets

Having to convert data sets will be a barrier to entry for many AM::Parallel users, so add a function for reading data stored in the old format.

handle features that evaluate to false better

I was wondering why more than one gang was being printed with the same context label; the actual problem was feature vectors like K Y 0 = 0 = 0 = T E. The code checked whether a feature was false and, if so, printed it as -. But 0 is also false in Perl, and here it is being used as a real feature value.
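The difference between a truthiness test and a definedness test can be sketched as follows. The helper `display_feature` is hypothetical, and treating undef or the empty string as "missing" is an assumption about how absent features are stored.

```perl
use strict;
use warnings;

# Hypothetical display helper: show '-' only for features that are
# truly absent (undef or empty string), not for the valid feature "0".
sub display_feature {
    my ($feature) = @_;
    return (defined $feature && length $feature) ? $feature : '-';
}

my @features = ('K', 'Y', '0', '=', '0', '=', '0', '=', 'T', 'E');

# Truthiness test: every "0" is wrongly shown as missing
print join(' ', map { $_ ? $_ : '-' } @features), "\n";    # K Y - = - = - = T E

# Definedness/length test: "0" survives
print join(' ', map { display_feature($_) } @features), "\n";    # K Y 0 = 0 = 0 = T E
```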

strange 0.000000% classification errors

Even in the latest dev release, many platforms are failing badly, with all of the classification results giving 0.00000% for each possible outcome. All of *nix still fails, and for whatever reason 5.18.0 on Windows also fails with the same problem.

Prereqs not getting installed

'perl Makefile.PL && make' or 'dzil build' does not install prereqs (like Text::Table, Crypt::PRNG, Exporter::Easy, Test::LongString)

After forking the project I had to manually install these, even though for instance Text::Table is in the Makefile.

AM::classify should die if test_item is undef or the wrong type

There are probably lots of places where better checking would be useful, but I was just bitten by this. The error instead happens when classify tries to call cardinality on undef.
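A sketch of the kind of up-front check meant here, using Carp so the error points at the caller. The helper `check_test_item` and the stand-in class `My::Item` are hypothetical; the real item class and where the check would live are not specified in this issue.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Scalar::Util qw(blessed);

# Hypothetical validation helper; the class name passed as $class is
# an assumption made for illustration.
sub check_test_item {
    my ($item, $class) = @_;
    croak 'test_item is required' unless defined $item;
    croak "test_item must be a $class object"
        unless blessed($item) && $item->isa($class);
    return $item;
}

# Stand-in item class used only for this demonstration
{
    package My::Item;
    sub new         { my $class = shift; return bless {}, $class }
    sub cardinality { return 3 }
}

print check_test_item(My::Item->new, 'My::Item')->cardinality, "\n";
for my $bad (undef, 'a string', {}) {
    eval { check_test_item($bad, 'My::Item'); 1 } or print "error: $@";
}
```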

make giant eval editing less fragile

Right now the giant code string in the __DATA__ section is marked at edit points with innocent-looking comments. If those are removed, then creating the eval string fails silently, which causes the program to die loudly and strangely.

If the large code string cannot be avoided, use less fragile editing techniques, like a templating system of some kind.

printing gangs can cause fatal error

Running finnverb with analogize.pl causes a failure around line 406 because $gang->{data}->{$class} is undefined.

You show up twice under contributors

I imagine you just have to do some git-fu to rebuild from a prior point as you are using the contributors plugin.

If you fix this could you change my author info to "Nick Logan [email protected]"?

report ties

Currently, AM reports Correct outcome predicted if the correct outcome has the highest number of pointers. But several outcomes might share the highest number of pointers! Handle this case separately and report ties. This may actually be very bad for previously published results, depending on whether they simply grepped amcresults for Correct or did their own calculating. The manner of calculation used by the finnverb example is similarly flawed.
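One way to separate the three cases is sketched below. The helpers `top_outcomes` and `judge` are hypothetical, and the sketch assumes pointer counts small enough for native numeric comparison, which is not true of the big integers AM can produce.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: given outcome => pointer count, return every
# outcome that shares the highest count (sorted for stable output).
sub top_outcomes {
    my ($pointers) = @_;
    my $best = max(values %$pointers);
    my @top  = sort(grep { $pointers->{$_} == $best } keys %$pointers);
    return @top;
}

# Classify a prediction as 'correct', 'tie' or 'incorrect'
sub judge {
    my ($pointers, $expected) = @_;
    my @top = top_outcomes($pointers);
    return 'incorrect' unless grep { $_ eq $expected } @top;
    return @top > 1 ? 'tie' : 'correct';
}

print judge({ r => 9, e => 9, x => 2 }, 'r'), "\n";    # tie
print judge({ r => 9, e => 4 }, 'r'), "\n";            # correct
print judge({ r => 3, e => 4 }, 'r'), "\n";            # incorrect
```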

give a warning for commas in a nocommas format file

Give a nice error for this so the user sees where their mistake is immediately.
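A rough sketch of such a check, assuming that a comma is never legitimate anywhere in a nocommas file. The helper `find_comma_lines` is hypothetical, and the sample lines are made up.

```perl
use strict;
use warnings;

# Hypothetical check: return one message per line of a "nocommas"
# data file that contains a comma, including the line number.
sub find_comma_lines {
    my ($text) = @_;
    my @problems;
    my $line_no = 0;
    for my $line (split /\n/, $text) {
        $line_no++;
        push @problems, "line $line_no contains a comma: '$line'"
            if index($line, ',') >= 0;
    }
    return @problems;
}

warn "$_\n" for find_comma_lines("A 312 first\nB , 3 1 0 , second\n");
```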

document how big integers work in AM guts

Document how the long arrays are used as big integers. Reference another work if possible, like Java's BigInteger class code or Hacker's Delight or whatever can be found. Related questions:

Why do we know that char outspace[55]; is long enough? 55 is magic there.
Why do we set the NV and PV in normalize, when Perl can very well make the PV itself?
Why use counthi and countlo?
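As a starting point for that documentation, here is a sketch of how an array of 16-bit elements can stand for one large number, using Math::BigInt for the arithmetic. The limb order (least significant first) and both helper names are assumptions for illustration; this is not the code in AM.xs.

```perl
use strict;
use warnings;
use Math::BigInt;

# Assumption for illustration: 8 limbs of 16 bits each, least
# significant limb first. The real AM.xs layout may differ.
sub limbs_to_bigint {
    my @limbs = @_;
    my $n = Math::BigInt->new(0);
    # Walk from the most significant limb down: n = n * 2**16 + limb
    for my $limb (reverse @limbs) {
        $n = $n * 65536 + $limb;
    }
    return $n;
}

# Split a decimal string into 8 limbs, least significant first
sub bigint_to_limbs {
    my ($decimal) = @_;
    my $n = Math::BigInt->new("$decimal");
    my @limbs;
    for (1 .. 8) {
        push @limbs, ($n % 65536)->numify;
        $n = $n / 65536;    # overloaded '/' is integer division here
    }
    return @limbs;
}

my @limbs = bigint_to_limbs('337300810464');
print "@limbs\n";
print limbs_to_bigint(@limbs), "\n";    # 337300810464

# The largest value 8 such limbs can hold is 2**128 - 1
print length(limbs_to_bigint((65535) x 8)->bstr), "\n";    # 39
```

The largest 128-bit value has 39 decimal digits, so a decimal rendering plus a terminator fits in 55 bytes with room to spare. Whether outspace is used only for that purpose is not confirmed here.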

remove note about identical code

AM.xs, line 768. This was removed a long time ago.

move batch processing into a separate package

All possible AM functionality, including batch-related features such as hooks and probabilities, is baked into AM.pm. Move the batch processing into a separate package for greater flexibility and organization.

fix indents in AM.xs

Change all tabs to spaces and make sure indenting is done properly. It's a bit hard to read at some places right now.

add detailed tests of statistical results

Data-driven tests would be best for this. #42 hurt, and unfortunately I don't know how to unit test XS code.

don't print classification results

The most annoying thing about the software at this point is that the only way to get classification data is to read what it prints out. Instead of printing all of that data to stdout, create some structure or class and return that to represent the results.

May have to wait for #7.

Makefile.PL version incorrect due to missing git tags for previous release versions

Generated Makefile.PL still has version at 3.03 (will lead to object linking problems). Need to tag versions 3.03 and 3.04.

sort results by order of probability

Currently the statistical outcomes seem to be printed alphabetically, which is less helpful than in order of decreasing probability. This would of course also make #42 easy to look at.
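The ordering itself is a one-line sort. In this sketch the helper `by_decreasing_score` is hypothetical, the sample scores are taken from the AM::Parallel output quoted in the "bad stats" issue, and native numeric comparison is assumed to be adequate for them.

```perl
use strict;
use warnings;

# Hypothetical helper: order outcomes by decreasing score, breaking
# ties alphabetically so the output is deterministic.
sub by_decreasing_score {
    my ($scores) = @_;
    my @ordered = sort { $scores->{$b} <=> $scores->{$a} or $a cmp $b }
        keys %$scores;
    return @ordered;
}

my %scores = (
    'anthracnose'    => 5358272,
    'brown-stem-rot' => 976826156,
    'charcoal-rot'   => 337300810464,
    'downy-mildew'   => 50688,
);
printf "%-16s %s\n", $_, $scores{$_} for by_decreasing_score(\%scores);
```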

output version with help in analogize.pl

Need to figure out how to do that with Pod::Usage, first.

document data sets

Now that we can have comments in data sets (#43), add documentation comments to all of them. It might be good to download the other available AM data sets and comment them, as well.

require c99 in Makefile.PL

If #2 is solved by using C99 features, then the Makefile should be made to check for C99 and warn/die if it is not available.

https://metacpan.org/module/Module::Install::XSUtil and https://metacpan.org/source/GFUJI/Data-MessagePack-0.33/Makefile.PL

change project organization

The data file has all of the exemplars and their outcomes, and the outcomes file has a longer name for those outcomes, but instead of being a simple mapping of short to long outcomes it is required to have the same number of lines as the data file and contain a long outcome string for each data item. Was this considered ideal at the time of writing it, or is it something that could be done away with? I would rather accept a simple mapping file, or just allow it to be placed in the data file optionally.

replace Text::Table with Text::Table::Tiny

Since it's a lot faster.

Win32 5.16.3 Perl crash

Christian Walde's smoke tester has consistently failed all tests, with Perl crashing for every single one. Here's an example report.

Christian's smoketester is available on GitHub. Add a .bat file containing the following lines to the root in order to use this install easily (assuming you placed it in c:\Perl1664\):

set PATH=c:\Perl1664\site\bin;c:\Perl1664\bin;%PATH%
c:
cd c:\Perl1664\
cmd

Use d8bc46 as a reference.

write a porting guide for existing AM users

AM has changed and will change quite a bit before it is stable. Write a porting guide for those who have datasets or programs in the old AM system and would like to update to the new one.

investigate reporting of dzil plugins in report-versions-tiny

The report-versions-tiny.t script always reports all of the Dist::Zilla plugin versions. Those are only authoring dependencies, and don't matter when just grabbing the code and installing. We don't need all of those reports! Investigate how to make it stop.

allow traditional AM terminology

Royal asked if the regular AM terminology could be used instead of class, training item, etc. Alternative method names and alternative parameter names would probably be fine. Printing, though, is problematic. Maybe it would be better to change all of the reports to AM terminology, as the people who use this functionality may be the ones who care more.

dzil build fails because of duplicate files

Because of a bug in the CopyFilesFromBuild dzil plugin, dzil build currently fails (three duplicate files: cpanfile, Makefile.PL, LICENSE).

You can either wait for RT 92828 (https://rt.cpan.org/Ticket/Display.html?id=92828), get rid of that plugin, or delete those three files from the root before building.

Document Perl macros/functions in AM.xs

There are some Perl macros/functions used in AM.xs quite commonly, but it's easy to forget what they all are. Document them at the top of AM.xs to make maintenance easier. Here are some important ones:

HeKEY
HeVAL
hv_fetch
SvUPGRADE
sv_setpvn
SvPVX
SvUVX
SvUV
SvIVX
SvIOK
Safefree
Copy
Zero
SvCUR_set
SvPOK_on
SvGROW

add gang_list report

Add a report for gang_list from the original AM, which lists all gangs and each of their items.

add test for largest order big int

AM.xs currently handles numbers up to 128 bits. I believe that 1a38c19 would have broken that, which is why I reversed it in 5d297e8. There should be a test with a dataset large enough to exercise those big integers so that if the upper bits were clobbered it would be known immediately.

add a LatticeCombiner interface

Add a LatticeCombiner interface whose role is to combine sublattices. Then we can have the original algorithm from AM::Parallel beside the current intermediate combination one. This will make comparing them possible, but will also make switching possible if the intermediate combination one turns out to be too memory-hungry for a given data set. I plan on there being more combiners in the future as well, after reading the literature for fast set intersection.

document bigint

Right now the big integer stuff is hard to understand. It might be useful to give it its own file with a well-documented API. As it is, I don't know what exactly happens when we do 100 * $n / $grandtotal. I don't see special handling for arithmetic operators anywhere.

Not factoring the code into a separate file would be fine too, but it just needs to be researched and documented.

reduce lattice combination steps

The Java version manages to combine lattices much more quickly by consolidating common supracontexts after each combination, not just after the final combination. That would be a good optimization here.

make AM_LONG longer on 64-bit machines

Currently all size-sensitive calculations are done with 32-bit integers. Couldn't we use 64-bit ones on a 64-bit machine, which is almost all machines these days?

add to_str method to Item

Add a method for stringifying Item objects. Will probably need to provide format as a parameter. Hmm, or add some functionality to print these. I don't want the format to get locked in.

improve analogize documentation

User feedback was that

it was not clear how to print multiple reports (comma with no space)
it was not clear how to print to a file
there need to be more usage examples

document all unknown variables

Document all of the mysterious variables in AM's guts. Specifically, I think it would be beneficial to document all of those passed into the _initialize method, as well as others such as datatocontext and anything involved with pack or unpack.

lower required Perl version

Find out why Perl 5.14 is required and remove whatever features are being used to get the required version down to around 5.10.

Dataset/Item creation API request

A user requested that it be made possible to create an Item from a single data line and the format options (or a DataSet which shares the same options). They also mentioned it could be nice to be able to get a dataset from an array of lines. The reason for this is that the user keeps the data file in a GUI and does not want to parse it himself or reload it from disk.

I think it might be good to allow passing handles to be read from (could use with string refs).
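The string-ref idea relies on Perl's in-memory filehandles: opening a reference to a scalar yields an ordinary read handle. In this sketch `read_data_lines` is a hypothetical reader and the sample lines are made up; no existing DataSet API is implied.

```perl
use strict;
use warnings;

# Hypothetical reader: accepts any filehandle and returns the
# non-blank lines with surrounding whitespace removed.
sub read_data_lines {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;
        push @lines, $line if length $line;
    }
    return @lines;
}

# Data held in memory, e.g. by a GUI; opening a reference to the
# string gives a read handle without touching the disk.
my $text = "A , 3 1 2 , first\nB , 3 1 0 , second\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my @lines = read_data_lines($fh);
close $fh;
print scalar(@lines), " lines read\n";    # 2 lines read
```

A reader written against handles would serve files, strings and arrays of lines (joined with newlines) through the same code path.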

bad stats

The example used to demonstrate #34 gives weird output (but didn't with AM::Parallel):

    +-----------------------------+-------------+------------+
    | Class                       | Score       | Percentage |
    +-----------------------------+-------------+------------+
    | alternarialeaf-spot         |        2180 |   0.000%   |
    | anthracnose                 |       49856 |   0.000%   |
    | bacterial-blight            |       61952 |   0.000%   |
    | brown-spot                  |      364608 |   0.001%   |
    | brown-stem-rot              |     1060652 |   0.002%   |
    | charcoal-rot                |  8964860640 |  18.824%   |
    | diaporthe-pod-&-stem-blight |         140 |   0.000%   |
    | diaporthe-stem-canker       |       51840 |   0.000%   |
    | downy-mildew                |       50688 |   0.000%   |
    | frog-eye-leaf-spot          |      104448 |   0.000%   |
    | phytophthora-rot            |       62912 |   0.000%   |
    | powdery-mildew              |        7008 |   0.000%   |
    | purple-seed-stain           |       21664 |   0.000%   |
    +-----------------------------+-------------+------------+
    | Total                       | 47623698012 |            |
    +-----------------------------+-------------+------------+

It does get basically the right idea, with charcoal-rot outdoing everything else, and the count for diaporthe-pod-&-stem-blight is correct, but everything else is wacky. Did I screw up the big integer calculations? This is all awful. Here's the AM::Parallel output, which matches that of Weka AM:

alternarialeaf-spot               3016836    0.001%
anthracnose                       5358272    0.002%
bacterial-blight                  2880000    0.001%
brown-spot                        2134080    0.001%
brown-stem-rot                  976826156    0.289%
charcoal-rot                 337300810464   99.700%
diaporthe-pod-&-stem-blight           140    0.000%
diaporthe-stem-canker            10013312    0.003%
downy-mildew                        50688    0.000%
frog-eye-leaf-spot                 890880    0.000%
phytophthora-rot                  2028992    0.001%
powdery-mildew                   11869024    0.004%
purple-seed-stain                 1463456    0.000%
                             ------------
                             338317342300
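A quick cross-check of both tables with arbitrary-precision arithmetic, as a sketch for narrowing the problem down. The helper names are hypothetical; the numbers are copied from the two outputs above. It confirms that the AM::Parallel scores add up to the printed total, while the scores in the first table add up to 8966698588 rather than the printed 47623698012.

```perl
use strict;
use warnings;
use Math::BigInt;

# Sum a list of decimal strings without relying on native integer width
sub grand_total {
    my $total = Math::BigInt->new(0);
    $total->badd($_) for @_;
    return $total;
}

# 100 * score / total, rounded half up to three decimal places,
# computed entirely in integer arithmetic
sub percentage {
    my ($n, $total) = @_;
    my $t = Math::BigInt->new("$total");
    my $scaled = (Math::BigInt->new("$n") * 100_000 + $t / 2) / $t;
    my $s = $scaled->numify;    # thousandths of a percent
    return sprintf '%d.%03d%%', int($s / 1000), $s % 1000;
}

# Scores from the AM::Parallel output
my %parallel = (
    'alternarialeaf-spot'         => '3016836',
    'anthracnose'                 => '5358272',
    'bacterial-blight'            => '2880000',
    'brown-spot'                  => '2134080',
    'brown-stem-rot'              => '976826156',
    'charcoal-rot'                => '337300810464',
    'diaporthe-pod-&-stem-blight' => '140',
    'diaporthe-stem-canker'       => '10013312',
    'downy-mildew'                => '50688',
    'frog-eye-leaf-spot'          => '890880',
    'phytophthora-rot'            => '2028992',
    'powdery-mildew'              => '11869024',
    'purple-seed-stain'           => '1463456',
);

# Score column of the first (suspect) table, in row order
my @new_scores = qw(
    2180 49856 61952 364608 1060652 8964860640 140
    51840 50688 104448 62912 7008 21664
);

my $total = grand_total(values %parallel);
print "AM::Parallel total: $total\n";                       # 338317342300
print "charcoal-rot: ",
    percentage($parallel{'charcoal-rot'}, $total), "\n";    # 99.700%
print "sum of first table: ", grand_total(@new_scores), "\n";    # 8966698588
```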

support exclude nulls/known in analogize.pl

These features are important for classification, so they should be supported on the command line.

add a standalone script

Add a standalone script for running AM classification on data sets:

--train/--data/--exemplars: training set
--test: test set; leave-one-out if omitted
--project: classic AM project
--{gangs/summary/etc}: print given report type
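The option list above could be parsed with Getopt::Long roughly as follows. This is a sketch only: `parse_args`, the keys of the returned hash, the file name in the usage example, and the rule that either a training set or a project is required are all assumptions, and only two report types are wired up.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical option parser for the proposed script; option names
# follow the list above, everything else is an assumption.
sub parse_args {
    my @argv = @_;
    my %opt = (reports => []);
    GetOptionsFromArray(
        \@argv,
        'train|data|exemplars=s' => \$opt{train},
        'test=s'                 => \$opt{test},
        'project=s'              => \$opt{project},
        'gangs'                  => sub { push @{ $opt{reports} }, 'gangs' },
        'summary'                => sub { push @{ $opt{reports} }, 'summary' },
    ) or die "bad command-line arguments\n";
    die "need --train or --project\n"
        unless defined $opt{train} || defined $opt{project};
    # No --test means leave-one-out over the training set
    $opt{leave_one_out} = defined $opt{test} ? 0 : 1;
    return \%opt;
}

my $opt = parse_args(qw(--data my_data.txt --gangs));
print "training set: $opt->{train}\n";
print "leave-one-out: $opt->{leave_one_out}\n";
print "reports: @{ $opt->{reports} }\n";
```

Taking the argument list as a parameter instead of reading @ARGV directly keeps the parser easy to test.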