garfieldnate / algorithm-am
Perl module implementing Analogical Modeling
License: Other
https://code.activestate.com/ppm/Algorithm-AM/
You will notice that the last good ActiveState Perl build was 3.02. This is because they build with Visual C, which is only C89-compliant.
Add a standalone script for running AM classification on data sets:
--train/--data/--exemplars: training set
--test: test set; leave-one-out if omitted
--project: classic AM project
--{gangs/summary/etc}: print the given report type
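Assuming the script ends up named analogize (the name and exact flag spellings here are hypothetical), usage might look like:

```
analogize --train finnverb.txt --test heldout.txt --gangs
analogize --train finnverb.txt --summary    # no --test: leave-one-out
```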
Currently, AM reports "Correct outcome predicted" if the correct outcome has the highest number of pointers. But several outcomes might tie for the highest number of pointers! Handle this case separately and report ties. This may actually be very bad for previously published results, depending on whether the authors simply grepped amcresults for "Correct" or did their own calculating. The manner of calculation used by the finnverb example is similarly flawed.
These features are important for classification, so they should be supported on the command line.
AM 2.0 was pure Perl, and it is possible that some people will still not like, or not be able to compile, XS code. Add that old pure-Perl code back as a PP option and let users choose between the PP and XS versions at install time. This would make the module at least a little more future-proof, since there seem to be so many problems getting the XS code not to crash Perl on some systems.
There are some Perl macros/functions used quite commonly in AM.xs, but it's easy to forget what they all are. Document them at the top of AM.xs to make maintenance easier. Here are some important ones: HeKEY, HeVAL, hv_fetch, SvUPGRADE, sv_setpvn, SvPVX, SvUVX, SvUV, SvIVX, SvIOK, Safefree, Copy, Zero, SvCUR_set, SvPOK_on, SvGROW.
Document all of the mysterious variables in AM's guts. Specifically, I think it would be beneficial to document all of those passed into the _initialize method, as well as others such as datatocontext and anything involved with pack or unpack.
A user requested that it be made possible to create an Item from a single data line plus the format options (or from a DataSet, which shares the same options). They also mentioned it would be nice to be able to get a DataSet from an array of lines. The reason for this is that the user keeps the data in a GUI and does not want to do the parsing themselves or reload from disk.
I think it might be good to allow passing in handles to be read from (these could be used with string refs).
I was wondering why more than one gang was being printed with the same context label; the actual problem was feature vectors like "K Y 0 = 0 = 0 = T E". I would check whether a feature was false, and if so it would be printed as "-". But "0" is also false in Perl, and here it is being used as a feature value.
AM_BIG_INT holds 2 bytes in each element and has 8 elements, meaning it is 16 bytes long, or 128-bit. Are there any alternatives (e.g. libraries) that give this precision? It sure would be easier to work with something more natural than arrays representing single numbers.
Add a LatticeCombiner interface whose role is to combine sublattices. Then we can have the original algorithm from AM::Parallel beside the current intermediate combination one. This will make comparing them possible, but will also make switching possible if the intermediate combination one turns out to be too memory-hungry for a given data set. I plan on there being more combiners in the future as well, after reading the literature for fast set intersection.
This is long overdue; I don't know how I did without it. Just use # or something, or let the user set the comment character.
Now that we can have comments in data sets (#43), add documentation comments to all of them. It might be good to download the other available AM data sets and comment them, as well.
Add a method for stringifying Item objects. It will probably need to take the format as a parameter. Hmm, or add some functionality for printing them instead; I don't want the format to get locked in.
So we don't have to wait for CPAN testers, although those are great, too.
The report-versions-tiny.t script always reports all of the Dist::Zilla plugin versions. Those are only authoring dependencies, and don't matter when just grabbing the code and installing. We don't need all of those reports! Investigate how to make it stop.
Having to convert data sets will be a barrier to entry for many AM::Parallel users, so add a function for reading data stored in the old format.
Royal asked if the regular AM terminology could be used instead of class, training item, etc. Alternative method names and alternative parameter names would probably be fine. Printing, though, is problematic. Maybe it would be better to change all of the reports to AM terminology, as the people who use this functionality may be the ones who care more.
Change all tabs to spaces and make sure indenting is done properly. It's a bit hard to read at some places right now.
AM has changed and will change quite a bit before it is stable. Write a porting guide for those who have datasets or programs in the old AM system and would like to update to the new one.
User feedback: currently all size-sensitive calculations are done with 32-bit integers. Couldn't we use 64-bit integers on a 64-bit machine, which is almost all machines these days?
The data file has all of the exemplars and their outcomes, and the outcomes file has a longer name for those outcomes, but instead of being a simple mapping of short to long outcomes it is required to have the same number of lines as the data file and contain a long outcome string for each data item. Was this considered ideal at the time of writing it, or is it something that could be done away with? I would rather accept a simple mapping file, or just allow it to be placed in the data file optionally.
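The simple mapping file suggested above might look like this (a hypothetical two-column format, short outcome then long outcome, one pair per line rather than one line per data item):

```
c-rot    charcoal-rot
b-spot   brown-spot
d-mildew downy-mildew
```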
'perl Makefile.PL && make' or 'dzil build' does not install prereqs (like Text::Table, Crypt::PRNG, Exporter::Easy, Test::LongString). After forking the project I had to install these manually, even though Text::Table, for instance, is listed in the Makefile.
The most annoying thing about the software at this point is that the only way to get classification data is to read what it prints out. Instead of printing all of that data to stdout, create some structure or class and return that to represent the results.
May have to wait for #7.
Currently the statistical outcomes seem to be printed alphabetically, which is less helpful than in order of decreasing probability. This would of course also make #42 easy to look at.
AM.xs, line 768. This was removed a long time ago.
The Java version manages to combine lattices much more quickly by consolidating common supracontexts after each combination, not just after the final combination. That would be a good optimization here.
Give a nice error for this so the user sees where their mistake is immediately.
There are probably lots of places where better checking would be useful, but I was just bitten by this one. The error instead happens when classify tries to call cardinality on undef.
Even in the latest dev release, many platforms are failing badly, with all of the classification results giving 0.00000% for each possible outcome. All of *nix still fails, and for whatever reason 5.18.0 on Windows also fails with the same problem.
Find out why Perl 5.14 is required and remove whatever features are being used to get the required version down to around 5.10.
Christian Walde's smoke tester has consistently failed all tests, with Perl crashing for every single one. Here's an example report.
Christian's smoketester is available on GitHub. Add a .bat file containing the following lines to the root in order to use this install easily (assuming you placed it in c:\Perl1664\):
set PATH=c:\Perl1664\site\bin;c:\Perl1664\bin;%PATH%
c:
cd c:\Perl1664\
cmd
Use d8bc46 as a reference.
Right now the giant code string in the __DATA__ section is marked at edit points with innocent-looking comments. If those are removed, then creating the eval string fails silently, which causes the program to die loudly and strangely. If the large code string cannot be avoided, use less fragile editing techniques, like a templating system of some kind.
Running finnverb with analogize.pl causes a failure around line 406 because $gang->{data}->{$class} is undefined.
Need to figure out how to do that with Pod::Usage, first.
Because of a bug in the CopyFilesFromBuild dzil plugin, dzil build currently fails (three duplicate files: cpanfile, Makefile.PL, LICENSE). You can either wait for [rt 92828](https://rt.cpan.org/Ticket/Display.html?id=92828), get rid of that plugin, or delete those three files from the root before building.
Generated Makefile.PL still has version at 3.03 (will lead to object linking problems). Need to tag versions 3.03 and 3.04.
If #2 is solved by using C99 features, then the Makefile should be made to check for C99 and warn/die if it is not available.
https://metacpan.org/module/Module::Install::XSUtil and https://metacpan.org/source/GFUJI/Data-MessagePack-0.33/Makefile.PL
I imagine you just have to do some git-fu to rebuild from a prior point as you are using the contributors plugin.
If you fix this could you change my author info to "Nick Logan [email protected]"?
Add a report for gang_list from the original AM, which lists all gangs and each of their items.
Right now the big-integer stuff is hard to understand. It might be useful to give it its own file with a well-documented API. As it is, I don't know what exactly happens when we do 100 * $n / $grandtotal. I don't see special handling for arithmetic operators anywhere.
Not factoring the code into a separate file would be fine too, but it just needs to be researched and documented.
Document how the long arrays are used as big integers. Reference another work if possible, like Java's BigInteger class code or Hacker's Delight or whatever can be found. Related questions:
Is char outspace[55]; long enough? The 55 is magic there.
Why both NV and PV in normalize, when Perl can very well make the PV itself?
What are counthi and countlo?
Since it's a lot faster.
Data-driven tests would be best for this. #42 hurt, and unfortunately I don't know how to unit test XS code.
The example used to demonstrate #34 gives weird output (but didn't with AM::Parallel):
+-----------------------------+-------------+------------+
| Class | Score | Percentage |
+-----------------------------+-------------+------------+
| alternarialeaf-spot | 2180 | 0.000% |
| anthracnose | 49856 | 0.000% |
| bacterial-blight | 61952 | 0.000% |
| brown-spot | 364608 | 0.001% |
| brown-stem-rot | 1060652 | 0.002% |
| charcoal-rot | 8964860640 | 18.824% |
| diaporthe-pod-&-stem-blight | 140 | 0.000% |
| diaporthe-stem-canker | 51840 | 0.000% |
| downy-mildew | 50688 | 0.000% |
| frog-eye-leaf-spot | 104448 | 0.000% |
| phytophthora-rot | 62912 | 0.000% |
| powdery-mildew | 7008 | 0.000% |
| purple-seed-stain | 21664 | 0.000% |
+-----------------------------+-------------+------------+
| Total | 47623698012 | |
+-----------------------------+-------------+------------+
It does get basically the right idea, with charcoal-rot outdoing everything else, and the count for diaporthe-pod-&-stem-blight is correct, but everything else is wacky. Did I screw up the big-integer calculations? This is all awful. Here's the AM::Parallel output, which matches that of Weka AM:
alternarialeaf-spot 3016836 0.001%
anthracnose 5358272 0.002%
bacterial-blight 2880000 0.001%
brown-spot 2134080 0.001%
brown-stem-rot 976826156 0.289%
charcoal-rot 337300810464 99.700%
diaporthe-pod-&-stem-blight 140 0.000%
diaporthe-stem-canker 10013312 0.003%
downy-mildew 50688 0.000%
frog-eye-leaf-spot 890880 0.000%
phytophthora-rot 2028992 0.001%
powdery-mildew 11869024 0.004%
purple-seed-stain 1463456 0.000%
------------
338317342300
An add and a multiply function/macro would be nice and would reduce a lot of code.
All possible AM functionality, including batch-related features such as hooks and probabilities, is baked into AM.pm. Move it into a separate package for greater flexibility and organization.
Currently there's no quick way to download, build, test, and install this; you have to use dzil. Add some plugins to dist.ini either to commit the build or to copy over the Makefile.PL.