
algorithm-am's Introduction

Hi there! 👋

I'm Nathan, a freelance software engineer with over 10 years of experience specializing in natural language processing. I'm familiar with a wide range of computer technologies as well as human languages. If you'd like to learn more about me or my work, you can visit my website at nateglenn.com. If you'd like a free consultation on NLP or general engineering work, send me an email at [email protected]. Talk to you soon!

algorithm-am's People

Contributors

garfieldnate, ugexe


algorithm-am's Issues

add pure perl version

AM 2.0 was pure Perl, and it is possible that some people will still not like or be able to compile XS code. Add back that old Perl code as a pp option and let users decide on pp or xs versions at install time. This would make it at least a little more future-proof, since there seem to be so many problems trying to get the XS to not crash Perl on some systems.

investigate alternatives to AM_BIG_INT

AM_BIG_INT holds 2 bytes in each element and has 8 elements, meaning it is 16 bytes long, or 128-bit. Are there any alternatives (e.g. libraries) that give this precision? It sure would be easier to work with something more natural than arrays representing single numbers.
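For comparison, here is a minimal Perl sketch of the same representation built on the core Math::BigInt module: eight 16-bit elements combined into a single 128-bit number. The element order (least significant first) and the helper name limbs_to_bigint are assumptions made for illustration; they are not taken from AM.xs, and a C-level replacement would need a different library.

```perl
use strict;
use warnings;
use Math::BigInt;

# Combine 16-bit elements into one arbitrary-precision integer.
# ASSUMPTION: element 0 is the least significant; AM.xs may order them differently.
sub limbs_to_bigint {
    my @limbs = @_;
    my $value = Math::BigInt->bzero;
    for my $limb (reverse @limbs) {
        $value->blsft(16);    # shift left by 16 bits to make room for the next element
        $value->badd($limb);  # add the element into the low 16 bits
    }
    return $value;
}

# All eight elements at their maximum give 2**128 - 1.
my $max = limbs_to_bigint((0xFFFF) x 8);
print "$max\n";
```

With every element set to 0xFFFF the result is the largest value an 8 x 16-bit array can hold.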

use Travis-CI

So we don't have to wait for CPAN testers, although those are great, too.

handle features that evaluate to false better

I was wondering why there was more than one gang being printed with the same context label; the actual problem was feature vectors like K Y 0 = 0 = 0 = T E. I would check to see if a feature was false, and if so it would be printed as -. But 0 is also false in Perl, and here it is being used as a feature.
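The difference between "false" and "missing" can be sketched in a few lines of Perl. The helper name feature_label is hypothetical, and the sketch assumes that only undef or the empty string should be printed as -; the real code may mark missing features differently.

```perl
use strict;
use warnings;

# Render one feature for display. A feature counts as missing only when it is
# undefined or the empty string, so the string "0" is kept as a real value.
# ASSUMPTION: "missing" means undef or empty; AM may use another marker.
sub feature_label {
    my ($feature) = @_;
    return '-' unless defined $feature && length $feature;
    return $feature;
}

# The feature vector from the report above keeps its zeros.
my @features = ('K', 'Y', '0', '=', '0', '=', '0', '=', 'T', 'E');
print join(' ', map { feature_label($_) } @features), "\n";
```

A plain truth test such as `$feature ? $feature : '-'` would turn each 0 into -, which is the bug described above.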

strange 0.000000% classification errors

Even in the latest dev release, many platforms are failing badly, with all of the classification results giving 0.00000% for each possible outcome. All of *nix still fails, and for whatever reason 5.18.0 on Windows also fails with the same problem.

Prereqs not getting installed

'perl Makefile.PL && make' or 'dzil build' does not install prereqs (like Text::Table, Crypt::PRNG, Exporter::Easy, Test::LongString)

After forking the project I had to manually install these, even though for instance Text::Table is in the Makefile.

make giant eval editing less fragile

Right now the giant code string in the __DATA__ section is marked at edit points with innocent-looking comments. If those are removed, then creating the eval string fails silently, which causes the program to die loudly and strangely.

If the large code string cannot be avoided, use less fragile editing techniques, like a templating system of some kind.
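As one possible shape for this, the sketch below fills named placeholders and dies immediately when a placeholder has no value or a value has no placeholder. The {{name}} syntax and the function fill_template are assumptions made for illustration, not something the project uses today.

```perl
use strict;
use warnings;

# Fill {{name}} placeholders in a code template, dying on any mismatch.
# ASSUMPTION: the {{name}} syntax and fill_template are illustrative only.
sub fill_template {
    my ($template, %values) = @_;
    my %used;
    my $lookup = sub {
        my ($name) = @_;
        die "No value supplied for placeholder '$name'\n"
            unless exists $values{$name};
        $used{$name} = 1;
        return $values{$name};
    };
    (my $code = $template) =~ s/\{\{(\w+)\}\}/$lookup->($1)/ge;

    # A value that was never used means its edit point is gone from the template.
    my @unused = grep { !$used{$_} } sort keys %values;
    die "Value(s) with no placeholder in template: @unused\n" if @unused;
    return $code;
}

my $code = fill_template('my $limit = {{limit}};', limit => 3);
print "$code\n";
```

Because a removed edit point raises an error at substitution time, the failure shows up where the template is built rather than later inside the eval.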

report ties

Currently, AM reports Correct outcome predicted if the correct outcome has the highest number of pointers. But several outcomes might have the highest number of pointers! Handle this case separately and report ties. This may actually be very bad for previously published results, depending on whether they simply grepped amcresults for Correct or did their own calculating. The manner of calculation used by the finnverb example is similarly flawed.
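A minimal sketch of tie-aware judging follows. The function names and the plain-number pointer counts are assumptions for illustration; the real counts are big integers and would need a big-integer comparison instead of ==.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Return every outcome that shares the highest pointer count.
# ASSUMPTION: %pointers maps outcome name => pointer count as plain numbers.
sub top_outcomes {
    my (%pointers) = @_;
    return () unless %pointers;
    my $best = max values %pointers;
    my @top = grep { $pointers{$_} == $best } keys %pointers;
    @top = sort @top;
    return @top;
}

# Classify a prediction as 'correct', 'tie', or 'incorrect'.
sub judge {
    my ($expected, %pointers) = @_;
    my @top = top_outcomes(%pointers);
    return 'incorrect' unless grep { $_ eq $expected } @top;
    return @top > 1 ? 'tie' : 'correct';
}

print judge('a', a => 5, b => 5), "\n";    # the expected outcome shares first place
```

Reporting 'tie' as its own category keeps it from being counted silently as either a hit or a miss.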

document how big integers work in AM guts

Document how the long arrays are used as big integers. Reference another work if possible, like Java's BigInteger class code or Hacker's Delight or whatever can be found. Related questions:

  • Why do we know that char outspace[55]; is long enough? 55 is magic there.
  • Why do we set the NV and PV in normalize, when Perl can very well make the PV itself?
  • Why use counthi and countlo?
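One fact that bears on the first question: an unsigned 128-bit value needs at most 39 decimal digits, so 55 bytes leaves room to spare if outspace only has to hold one such number in decimal plus a terminator. Whether that is the actual reason for choosing 55 is not established here. The digit count can be checked with the core Math::BigInt module:

```perl
use strict;
use warnings;
use Math::BigInt;

# Largest unsigned 128-bit value: 2**128 - 1.
my $max_128 = Math::BigInt->new(2)->bpow(128)->bsub(1);

# Number of decimal digits needed to print it.
my $digits = length $max_128->bstr;
print "$digits\n";    # prints 39
```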

move batch processing into a separate package

All possible AM functionality, including batch-related features such as hooks and probabilities, is baked into AM.pm. Move the batch processing into a separate package for greater flexibility and organization.

fix indents in AM.xs

Change all tabs to spaces and make sure indenting is done properly. It's a bit hard to read at some places right now.

don't print classification results

The most annoying thing about the software at this point is that the only way to get classification data is to read what it prints out. Instead of printing all of that data to stdout, create some structure or class and return that to represent the results.

May have to wait for #7.

sort results by order of probability

Currently the statistical outcomes seem to be printed alphabetically, which is less helpful than in order of decreasing probability. This would of course also make #42 easy to look at.
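A minimal sketch of the ordering, assuming the scores are available as a hash of outcome name to plain number (the real scores are big integers and would need a matching comparison). The helper name by_score_desc is hypothetical; the sample scores are taken from the AM::Parallel output quoted in the "bad stats" issue.

```perl
use strict;
use warnings;

# Order outcome names by decreasing score, breaking ties alphabetically.
# ASSUMPTION: %score maps outcome name => plain numeric score.
sub by_score_desc {
    my (%score) = @_;
    return sort { $score{$b} <=> $score{$a} or $a cmp $b } keys %score;
}

my %score = (
    'brown-stem-rot' => 976826156,
    'charcoal-rot'   => 337300810464,
    'powdery-mildew' => 11869024,
);
print "$_\n" for by_score_desc(%score);
```

The alphabetical tie-break keeps the output stable between runs when two outcomes have the same score.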

document data sets

Now that we can have comments in data sets (#43), add documentation comments to all of them. It might be good to download the other available AM data sets and comment them, as well.

change project organization

The data file has all of the exemplars and their outcomes, and the outcomes file has a longer name for those outcomes, but instead of being a simple mapping of short to long outcomes it is required to have the same number of lines as the data file and contain a long outcome string for each data item. Was this considered ideal at the time of writing it, or is it something that could be done away with? I would rather accept a simple mapping file, or just allow it to be placed in the data file optionally.

Win32 5.16.3 Perl crash

Christian Walde's smoke tester has consistently failed all tests, with Perl crashing for every single one. Here's an example report.

Christian's smoketester is available on GitHub. Add a .bat file containing the following lines to the root in order to use this install easily (assuming you placed it in c:\Perl1664\):

set PATH=c:\Perl1664\site\bin;c:\Perl1664\bin;%PATH%
c:
cd c:\Perl1664\
cmd

Use d8bc46 as a reference.

write a porting guide for existing AM users

AM has changed and will change quite a bit before it is stable. Write a porting guide for those who have datasets or programs in the old AM system and would like to update to the new one.

investigate reporting of dzil plugins in report-versions-tiny

The report-versions-tiny.t script always reports all of the Dist::Zilla plugin versions. Those are only authoring dependencies, and don't matter when just grabbing the code and installing. We don't need all of those reports! Investigate how to make it stop.

allow traditional AM terminology

Royal asked if the regular AM terminology could be used instead of class, training item, etc. Alternative method names and alternative parameter names would probably be fine. Printing, though, is problematic. Maybe it would be better to change all of the reports to AM terminology, as the people who use this functionality may be the ones who care more.

dzil build fails because of duplicate files

Because of a bug in the CopyFilesFromBuild dzil plugin, dzil build currently fails (three duplicate files: cpanfile, Makefile.PL, LICENSE).

You can either wait for RT #92828 (https://rt.cpan.org/Ticket/Display.html?id=92828), get rid of that plugin, or delete those three files from the root before building.

Document Perl macros/functions in AM.xs

There are some Perl macros/functions used in AM.xs quite commonly, but it's easy to forget what they all are. Document them at the top of AM.xs to make maintenance easier. Here are some important ones:

  • HeKEY
  • HeVAL
  • hv_fetch
  • SvUPGRADE
  • sv_setpvn
  • SvPVX
  • SvUVX
  • SvUV
  • SvIVX
  • SvIOK
  • Safefree
  • Copy
  • Zero
  • SvCUR_set
  • SvPOK_on
  • SvGROW

add gang_list report

Add a report for gang_list from the original AM, which lists all gangs and each of their items.

add test for largest order big int

AM.xs currently handles numbers up to 128 bits. I believe that 1a38c19 would have broken that, which is why I reversed it in 5d297e8. There should be a test with a dataset large enough to exercise those big integers so that if the upper bits were clobbered it would be known immediately.

add a LatticeCombiner interface

Add a LatticeCombiner interface whose role is to combine sublattices. Then we can have the original algorithm from AM::Parallel beside the current intermediate combination one. This will make comparing them possible, but will also make switching possible if the intermediate combination one turns out to be too memory-hungry for a given data set. I plan on there being more combiners in the future as well, after reading the literature for fast set intersection.

document bigint

Right now the big integer stuff is hard to understand. It might be useful to give it its own file with a well-documented API. As it is, I don't know what exactly happens when we do 100 * $n / $grandtotal. I don't see special handling for arithmetic operators anywhere.

Not factoring the code into a separate file would be fine too, but it just needs to be researched and documented.
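For reference, this is how the percentage could be computed exactly with the core Math::BigInt module, using integer arithmetic only. It is a sketch of one well-defined behaviour, not a description of what AM currently does; percent_of is a hypothetical helper, and the sample figures are the charcoal-rot score and total from the AM::Parallel output in the "bad stats" issue.

```perl
use strict;
use warnings;
use Math::BigInt;

# Percentage of $n in $total with three decimal places, rounded half up,
# using only integer arithmetic so no precision is lost on large counts.
# ASSUMPTION: percent_of is an illustrative helper, not part of AM.
sub percent_of {
    my ($n, $total) = @_;
    $n     = Math::BigInt->new("$n");
    $total = Math::BigInt->new("$total");
    # Work in thousandths of a percent: n * 100 * 1000 / total.
    # Adding total / 2 before dividing rounds to the nearest unit.
    my $milli = (($n * 100_000 + $total / 2) / $total)->numify;
    return sprintf '%d.%03d%%', int($milli / 1000), $milli % 1000;
}

print percent_of('337300810464', '338317342300'), "\n";    # prints 99.700%
```

Passing the counts as strings avoids any loss of precision before they reach Math::BigInt.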

reduce lattice combination steps

The Java version manages to combine lattices much more quickly by consolidating common supracontexts after each combination, not just after the final combination. That would be a good optimization here.

make AM_LONG longer on 64-bit machines

Currently all size-sensitive calculations are done with 32-bit integers. Couldn't we use 64-bit ones on a 64-bit machine, which is almost all machines these days?

add to_str method to Item

Add a method for stringifying Item objects. Will probably need to provide format as a parameter. Hmm, or add some functionality to print these. I don't want the format to get locked in.

improve analogize documentation

User feedback was that

  1. it was not clear how to print multiple reports (comma with no space)
  2. it was not clear how to print to a file
  3. there need to be more usage examples

document all unknown variables

Document all of the mysterious variables in AM's guts. Specifically, I think it would be beneficial to document all of those passed into the _initialize method, as well as others such as datatocontext and anything involved with pack or unpack.

lower required Perl version

Find out why Perl 5.14 is required and remove whatever features are being used to get the required version down to around 5.10.

Dataset/Item creation API request

A user requested that it be made possible to create an Item from a single data line and the format options (or a DataSet which shares the same options). They also mentioned it could be nice to be able to get a dataset from an array of lines. The reason for this is that the user keeps the data file in a GUI and does not want to parse it themselves or reload it from disk.

I think it might be good to allow passing handles to be read from (could use with string refs).
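The handle idea can be sketched as follows. The function read_lines is a hypothetical helper that only collects non-blank lines; turning each line into an Item is left out because it depends on the format options.

```perl
use strict;
use warnings;

# Read data lines from any filehandle, skipping blank lines.
# ASSUMPTION: read_lines is a hypothetical helper, not part of the AM API.
sub read_lines {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;
        push @lines, $line;
    }
    return @lines;
}

# An in-memory string can be opened like a file by passing a reference to it.
my $data = "line one\n\nline two\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my @lines = read_lines($fh);
close $fh;
print scalar(@lines), " lines read\n";
```

The same function works unchanged with a handle opened on a real file, so a GUI holding the data in memory and a script reading from disk can share one code path.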

bad stats

The example used to demonstrate #34 gives weird output (but didn't with AM::Parallel):

    +-----------------------------+-------------+------------+
    | Class                       | Score       | Percentage |
    +-----------------------------+-------------+------------+
    | alternarialeaf-spot         |        2180 |   0.000%   |
    | anthracnose                 |       49856 |   0.000%   |
    | bacterial-blight            |       61952 |   0.000%   |
    | brown-spot                  |      364608 |   0.001%   |
    | brown-stem-rot              |     1060652 |   0.002%   |
    | charcoal-rot                |  8964860640 |  18.824%   |
    | diaporthe-pod-&-stem-blight |         140 |   0.000%   |
    | diaporthe-stem-canker       |       51840 |   0.000%   |
    | downy-mildew                |       50688 |   0.000%   |
    | frog-eye-leaf-spot          |      104448 |   0.000%   |
    | phytophthora-rot            |       62912 |   0.000%   |
    | powdery-mildew              |        7008 |   0.000%   |
    | purple-seed-stain           |       21664 |   0.000%   |
    +-----------------------------+-------------+------------+
    | Total                       | 47623698012 |            |
    +-----------------------------+-------------+------------+

It does get basically the right idea, with charcoal-rot outdoing everything else, and the count for diaporthe-pod-&-stem-blight is correct, but everything else is wacky. Did I screw up the biginteger calculations? This is all awful. Here's the AM::Parallel output, which matches that of Weka AM:

alternarialeaf-spot               3016836    0.001%
anthracnose                       5358272    0.002%
bacterial-blight                  2880000    0.001%
brown-spot                        2134080    0.001%
brown-stem-rot                  976826156    0.289%
charcoal-rot                 337300810464   99.700%
diaporthe-pod-&-stem-blight           140    0.000%
diaporthe-stem-canker            10013312    0.003%
downy-mildew                        50688    0.000%
frog-eye-leaf-spot                 890880    0.000%
phytophthora-rot                  2028992    0.001%
powdery-mildew                   11869024    0.004%
purple-seed-stain                 1463456    0.000%
                             ------------
                             338317342300
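One quick consistency check on the two tables: the scores in the first table do not add up to its reported total, while the AM::Parallel scores do. The sketch below sums both columns, with the numbers copied from the tables above.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Scores copied from the two tables above, in row order.
my @bad_scores = (2180, 49856, 61952, 364608, 1060652, 8964860640, 140,
                  51840, 50688, 104448, 62912, 7008, 21664);
my @parallel_scores = (3016836, 5358272, 2880000, 2134080, 976826156,
                       337300810464, 140, 10013312, 50688, 890880, 2028992,
                       11869024, 1463456);

my $bad_sum      = sum @bad_scores;
my $parallel_sum = sum @parallel_scores;

printf "first table:  rows sum to %.0f, reported total 47623698012\n", $bad_sum;
printf "AM::Parallel: rows sum to %.0f, reported total 338317342300\n", $parallel_sum;
```

The first table's rows sum to 8966698588, well short of the printed total, so at least some of the printed scores and the total disagree with each other.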

add a standalone script

Add a standalone script for running AM classification on data sets:

--train/--data/--exemplars: training set
--test: test set; leave-one-out if omitted
--project: classic AM project
--{gangs/summary/etc}: print given report type
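The option handling for such a script could start out like this sketch, which uses the core Getopt::Long module. Only option parsing is shown. The function name parse_args, the choice to give --project a value, and the restriction to the gangs and summary report flags are assumptions made for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse the options proposed above from an argument list.
# ASSUMPTION: parsing only; running the classification is omitted, --project
# is assumed to take a value, and only two report flags are shown.
sub parse_args {
    my @args = @_;
    my %opt = (reports => []);
    GetOptionsFromArray(
        \@args,
        'train|data|exemplars=s' => \$opt{train},
        'test=s'                 => \$opt{test},
        'project=s'              => \$opt{project},
        'gangs'                  => sub { push @{ $opt{reports} }, 'gangs' },
        'summary'                => sub { push @{ $opt{reports} }, 'summary' },
    ) or die "Invalid options\n";

    # No test set means leave-one-out classification of the training set.
    $opt{leave_one_out} = defined $opt{test} ? 0 : 1;
    return \%opt;
}

my $opt = parse_args('--data', 'train.txt', '--gangs');
print "training set: $opt->{train}\n";
```

Declaring --train, --data and --exemplars as aliases of one option keeps the three names from drifting apart in behaviour.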

commit build or at least makefile

Currently there's no quick way to download, build, test and install this. You have to dzil! Add some plugins to dist.ini either to commit the build or to copy over the Makefile.PL.
