kaplan's Introduction

Kaplan

Project for CAS741.

Developer Name: Jen Garner

This project implements conformer searching using evolutionary computation.

The main folders in the project are:

  • docs (documentation)
  • kaplan (source code)

There is also a test folder within the kaplan directory that contains:

  • testfiles
  • jupyter-notebooks

Dependencies

Kaplan has the following dependencies. Note sublists indicate that Kaplan does not directly import the dependency, but it is needed by its parent in the list.

How to install

Installing Dependencies

The recommended installation process involves installing conda or miniconda.

Generate a new conda environment:
$ conda create -n kenv python=3.6 numpy

Turn on the environment:
$ source activate kenv

Your prompt should now reflect the environment name:
(kenv) $

With the environment active, install the dependencies:

  1. psi4
    (kenv) $ conda install -c psi4 psi4
  2. openbabel
    (kenv) $ conda install -c openbabel openbabel
  3. pubchempy
    (kenv) $ conda install -c mcs07 pubchempy
  4. rmsd
    (kenv) $ pip install rmsd
  5. vetee
    (kenv) $ pip install -i https://pypi.anaconda.org/kumrud/simple vetee

Installing Kaplan

Do a git clone of the kaplan repo:
(kenv) $ git clone https://github.com/PeaWagon/Kaplan.git

Go into the Kaplan directory:
(kenv) $ cd Kaplan

Then do a pip install (development):
(kenv) $ pip install -e ./

Working with the environment

To see installed packages:
(kenv) $ conda list

Turn off the environment:
(kenv) $ source deactivate kenv
$

Uninstall Kaplan

(kenv) $ pip uninstall kaplan

How to run Kaplan

Please see the README.md in the kaplan subdirectory for more information on how to run this program.

Running tests

Please see the README.md in the kaplan/test subdirectory for information on how to run the tests for this program.

Special Thanks

The following people have contributed in some way to the making of this project.

  • Kumru Dikmenli
  • Xiaomin Huang
  • Laura Bickley
  • Brooks MacLachlan
  • Vajiheh Motamer
  • Oluwaseun Owojaiye
  • Karol Sekis
  • Malavika Srinivasan
  • Robert White
  • Hanane Zlitni
  • Dr. Spencer Smith
  • Dr. Paul Ayers

kaplan's Issues

Module Guide feedback #3

I think M5 and M6 should somehow be linked to one or more of your functional requirements. Please disregard this feedback if it's not the case.

Great job Jenn, well documented!

Module Guide feedback #2

@PeaWagon, AC8 made me a bit curious. If the programming language is a likely change, I assume this impacts most of the modules, if not all, which contradicts the statement in the screenshot below. Or maybe I am wrong.
image

I am looking at this from my own project point of view as well and want to understand more...how changing the programming language impacts the design and should this be in the AC list?

Maybe @smiths has some opinion about this.

MG issue #3 - M2

@PeaWagon

Hi Jenn,

This is my last issue. :)
It's a well done document.

In the services for M2, you say that M2 'communicates ... to M3'. From your description, I think M3 uses M2, so you don't have to mention it the reverse way (M2 communicates to M3). Just for clarity, can you please rewrite this sentence?

Malavika

Problem statement review

My problem statement document is located in docs/ProblemStatement/ProblemStatement.pdf (or tex). Please let me know if I should be raising this issue on the cas741 gitlab webpage instead. Instructions state to choose the group's repository. I wasn't quite sure if that was the course repo or not (no one else seems to have made an issue there for their problem statements).

Thanks,
Jen

Ring as program input

Allow user to provide a ring structure as input (if continuing to run a conformer search). Right now, the ring is saved after a job has completed, but there is no ability to use it as input.

MG issue #1 - Anticipated changes

@PeaWagon

Hi Jenn,

Your AC6 says that the software can be changed into a library by incorporating it into other software packages. In AC7, you mention using it through a phone. I can see a connection between them, but I think it is better to say that it may become a 'mobile application'.

Malavika

update ring/tournament selection

The ring module should have the following update rule:

  1. pick pmem1
  2. sort pmems within range of pmem1 by fitness
  3. assign pmem2 to the best fitness of the pmems surrounding pmem1
  4. generate 2 children from pmem1 and pmem2
  5. replace the worst 2 pmems (within range of pmem1) with children 1 and 2
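
The five steps above can be sketched in Python as follows. This is a minimal sketch, not Kaplan's ring module: the names `ring_update`, `spread` and `one_point_crossover`, the list-based ring, and the convention that higher fitness is better are all assumptions made for illustration. It also assumes that "within range of pmem1" means the `spread` slots on either side of pmem1, excluding pmem1 itself.

```python
import random


def one_point_crossover(a, b):
    """Hypothetical crossover: swap the tails of two equal-length lists."""
    cut = len(a) // 2
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def ring_update(ring, fitness, crossover, spread=2, rng=random):
    """Apply one update step to a ring of population members (pmems).

    ring      -- list of pmems; the ring wraps around at the ends
    fitness   -- callable returning a number (higher is assumed better)
    crossover -- callable (parent1, parent2) -> (child1, child2)
    spread    -- slots on either side of pmem1 that count as "in range"
    Returns the ring indices chosen for pmem1 and pmem2.
    """
    size = len(ring)
    if spread < 1 or size < 2 * spread + 1:
        raise ValueError("ring is too small for the requested spread")
    # 1. pick pmem1
    i1 = rng.randrange(size)
    # 2. sort pmems within range of pmem1 by fitness (best first)
    nearby = [(i1 + k) % size for k in range(-spread, spread + 1) if k != 0]
    nearby.sort(key=lambda i: fitness(ring[i]), reverse=True)
    # 3. pmem2 is the fittest of the pmems surrounding pmem1
    i2 = nearby[0]
    # 4. generate 2 children from pmem1 and pmem2
    child1, child2 = crossover(ring[i1], ring[i2])
    # 5. replace the worst 2 pmems within range with the children
    ring[nearby[-1]] = child1
    ring[nearby[-2]] = child2
    return i1, i2


population = [[n] * 4 for n in range(6)]
ring_update(population, fitness=sum, crossover=one_point_crossover,
            rng=random.Random(0))
```

With `spread=1` the neighbourhood holds only two pmems, so pmem2 itself is overwritten by a child. Whether that is desirable is a design decision the rule above leaves open.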

UnitVnVPlan Tables Issue

@PeaWagon Please list the tables used in the document, as the template does. Also, for clarity, please give each table a caption.

Sorry

@PeaWagon @smiths

Hi Jen,

I am writing this here because I don't have your email. I posted issues in your MG instead of Vajiheh's MG. I referred to the Excel sheet but interpreted it wrongly. Sorry again.

Malavika

SRS Tables of Symbols/Acronyms missing and unnecessary entries

I noticed some symbols and acronyms that were used in the body of the SRS but not defined in the Table of Symbols or Table of Acronyms.

Missing from Table of Symbols

  • α
  • β
  • ci
  • c (This one and the above one are both used in the text following T1, though they seem to refer to two different things. If this is correct, the symbol c should not be used for both of them)
  • k (used in GD1 and in the text below T1, each time representing something different. k should only be used for one of these, and the other should be replaced with a new symbol, otherwise it is ambiguous)
  • ζ
  • Gn (I believe this is the same as nG, which does appear in the table, so maybe all instances of Gn in the body should be replaced by nG)
  • Zα
  • j (used as an index for a conformer along with i. Maybe give i and j each their own entry in the table of symbols? If you do this, the Ei entry should be removed as it is already covered by E and i)
  • x
  • y
  • z
  • I would argue that the constants which appear in the Table of Units belong instead in the Table of Symbols

Missing from Table of Acronyms

  • VSEPR
  • VB
  • 3D

I also found a couple entries that were not necessary in these tables:

  • Gi is not used anywhere in the body of the SRS as far as I can tell, so it should not appear in the table
  • Since you refer to your program as "Kaplan", which is neither an acronym nor an abbreviation (I think), it does not need to appear in the Table of Acronyms

MIS Issue

Hello @PeaWagon

I apologize for the lateness of my issues. Perhaps they can be useful for the final submission at this point.

9.4.4 Access Routine Semantics: get fitness(xyz coords, method, basis, fit form, coef energy, coef rmsd, charge, multip): why is the transition None and the exception None?
Shouldn't there at least be an exception, as calc fitness has?

[MIS.pdf](https://github.com/PeaWagon/Kaplan/blob/master/docs/Design/MIS/MIS.pdf)

UnitVnVPlan test perform

@PeaWagon In some tests you mention using "numpy.testing.assert" to perform the test. I could not make sense of it; it would be better to include a blurb explaining it.
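
For context, `numpy.testing` is NumPy's module of assertion helpers for comparing arrays in unit tests. The sketch below shows two of them, `assert_allclose` and `assert_array_equal`. The helper `centroid` and the test name are hypothetical, chosen only for illustration, and are not taken from Kaplan's test suite.

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal


def centroid(coords):
    """Return the mean position of a set of xyz coordinates (hypothetical helper)."""
    return np.asarray(coords, dtype=float).mean(axis=0)


def test_centroid():
    coords = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
    # assert_allclose passes when values agree within a tolerance,
    # which avoids spurious failures from floating-point round-off.
    assert_allclose(centroid(coords), [0.5, 1.0, 1.5], rtol=1e-7)
    # assert_array_equal demands exact element-wise equality,
    # which suits integer data such as array shapes.
    assert_array_equal(np.shape(centroid(coords)), (3,))


test_centroid()
```

Both helpers raise `AssertionError` with a report of the mismatching elements, which is usually more informative than a bare `assert` on arrays.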

SRS Traceability issues

Some traceability information is missing from the SRS:

  • In the Assumptions section, Likely Changes should be referenced in square brackets next to their respective assumptions
  • In the Likely Changes section, the assumption reference should be more clear (wrap it in square brackets or explicitly reference it in the sentence)
  • In the traceability matrix Table 5, A3 should be referenced by IM1
  • In the traceability matrix Table 7, R4 and R5 should be referenced by something (can include the data constraints section as a column in order to achieve this, see SWHS for example of this)
  • In the traceability matrix Table 6, I do not think there should be an X in the DD1 column, T1 row (If DD1 were changed, would Schrödinger's equation also need to change? I'm not entirely sure about this one)

Writing xyz files to the same location has interesting results...

... by interesting, I mean it doesn't write the correct output. May have to find a bug in Kaplan or Vetee.

What happened:
When the output was directed to the kaplan_output directory of the git repository, the file was overwritten (in the sense that the time stamp was changed), but the new molecule was not added to the xyz file (it was still butane and not ethanol).

What is puzzling:
This problem doesn't occur when running outside of the github repository. Perhaps that is the problem (if version control doesn't allow for the update to run).

SysVnVPlan FT-1-TC3 issue

@PeaWagon
In the FT1-Test 3: Test Output Generation, “Initial State: an optimized input molecule versus an unoptimized input molecule.” I am not sure this state could be considered as Initial state. Please correct me if I am wrong.

SRS Requirements questions

For the Requirements section, I have the following comments/questions:

  • R1 - A lot of the information in the block of text is echoed in the subsequent table and can be removed to avoid redundancy. Also, the sentence beginning with "For example," seems unnecessary (I'm not sure how it relates to Kaplan)
  • R2 - My interpretation of this sentence is that the number of dihedral angles is equal to the number of conformers, but my understanding of the science leads me to believe that my interpretation is wrong, because a single conformer would have many dihedral angles. If this is correct, then this requirement should be rephrased to be less ambiguous.
  • R5 - What if the input format was not an xyz geometry? If the input was, for example, Z-matrix format, would the output still be in xyz coordinates? Also, if this is a requirement for the output, I think there should be a verb to make that action clear (ex. convert to xyz coordinates and display them)
  • Kaplan outputs more than just the xyz geometry of the conformers, it also outputs the energy of each conformer. There should be a requirement for this.
  • Could new functional requirements be added to address any nonfunctional requirements? An example that pops out to me is the nonfunctional requirement "The program should have suggestions for inputs to help the user get started" which could be easily mapped to a functional requirement for providing those suggestions.

SysVnVPlan FT1 general issue

@PeaWagon For the three tests in section FT1, please consider the following item:

  • Under “How test will be performed”, please mention the test framework you intend to use.

Feedback on SRS

  1. The Global minimum is almost always meaningless because it is not a conformational isomer but a true isomer (where bonds are broken). So we want to restrict our conformational search to rotations around bonds. (One could go further, including things like H-atom transfers across hydrogen bonds and other similar low-barrier processes, but that would make things significantly more complicated.)

  2. You can be a bit more clear about what you are doing. You are searching for the best choice of bond torsions/dihedral angles. (Of course there will be small changes in bond angles/lengths associated with these changes, but they are induced by the torsions.) Realistically, you may wish to construct the whole ensemble (all the thermally accessible structures) not just the lowest-energy conformer. That may be beyond the mandate of the program, but need not be....usually you will find many (if not all) of the other low-energy conformers in the process of seeking the lowest-energy conformer.

  3. Is it essential that the user specify (guess) the number of conformers? Can this be guessed by some heuristic? I would think that the number of rotatable bonds could be identified, and then multiplied by ~3, to get an upper bound on the number of conformers. (The number of low-energy conformers will, in general, be much smaller than this....something of the order of N_atoms or N_atoms^2 I would guess, except (again always an exception) in painfully "glassy" molecules (e.g., long floppy polymers), where the problem is more-or-less intractable (and also uninteresting, as many conformers are chemically equivalent).)

  4. 5.1 The RMSD is tricky. Sometimes there are, due to symmetry, several equivalent conformers. I am not sure it is practical (it might be very hard) but technically while only one of these conformers needs to be stored, the "degeneracy" (number of equivalent instances of this conformer) should be also stored, so that one can compute thermodynamic properties. Again, this is if the mandate is to allow thermal averages over conformers (in which case the degeneracy of each conformer is needed) to compute thermodynamic quantities. If the goal is merely to identify the lowest-energy conformer(s) for subsequent studies (qualitative reactivity and/or quantitative electronic/mechanical properties studies) then this is not essential. It isn't clear to me exactly what the mandate is....just find some low-energy conformers that are inequivalent? Is the temperature just a screening criterion or is it intended to be used to guide thermal averaging later? The Procrustes package (Fanwang) has RMSD and a lot of other things like that already there, though the RMSD between molecules is pretty straightforward as long as you don't worry about matching atom types (and only atomic positions).

  5. Conformational isomer. Rotations around triple and quintuple bonds are also free; basically all rotations around bonds where there are an even number of occupied pi and/or delta orbitals. That mostly happens in transition-metal compounds, though, as conformations around triple (or higher) bonds require atoms that form at least 5 bonds. Still, what you say isn't quite precisely correct... It is probably better to merely say that single bonds are rotatable and then the user can specify (by hand) other bonds they wish to consider rotatable, because there are cases where there are (weak) double bonds that are also rotatable.

  6. I'm not sure I trust A8. Usually you need to converge the bond lengths/angles at least loosely. You may not need to do this in the very early stages but, for example, the rotational barrier in simple hydrocarbons (especially if there are bulky-ish substituents) is quite far off if you freeze bond lengths and angles. Something similar would be true of chair/boat/twisted-boat/etc. geometries for ring compounds. Probably it is safe to do initial screening without optimization and after that a coarse optimization with early stopping once you have converged the geometry to an energy that is small enough (say 1/2 of the conformer energy gap you are trying to assess) is adequate.

  7. Page 9. The nuclei are much more massive than the electrons. They aren't much larger...in fact they are about 100,000 times smaller (in terms of the spatial extent of their wavefunctions).

  8. The AOs are pretty muddled. You normally write something like cexp(-ar^{2})*r^(some power)*spherical harmonic. (Obviously not r^{2} in the exponential for Slater orbitals.) Slater orbitals do not diverge at the nucleus (no -> infinity) but they have a derivative discontinuity/cusp at the nucleus, which is "correct" if we assume that the nucleus is a point charge. You are right that it is easier to do the integrals for Gaussian orbitals and that while the near-nucleus and far-from-nucleus behavior of Gaussians is generally poorer than for Slaters, taking a linear combination of Gaussians to approximate a Slater type function is a pragmatic compromise. In practice, doing the integrals should not (but this is not always true) be the bottleneck in your code anyway.....

  9. Page 11. The RMSD cannot be computed until the optimal alignment (align centers of mass of the two conformers and then rotate the conformers to minimize the RMSD; that is the thing that Fanwang's code does though again you can safely implement it in situ) is constructed. So the formula you give is a RMSD but not what we normally mean by the RMSD between two molecules/conformers.

  10. During the process of the exploration S_{RMSD} should be reduced. Early in the optimization it is favorable to force yourself to explore very diverse structures. Later on you want to take the best...even if they are not very far from each other. Alternatively make the RMSD criterion nonlinear, something like (1-exp(-RMSD^2)), so that as soon as two structures are "different enough" they contribute a "big enough" amount to the diversity criterion.

  11. Appendix. Misspelled "Planck" and also should maybe (not clear to me from the text what you intend here) include the nuclear-nuclear repulsion energy in the Hamiltonian.

Overall: Looks very good. As a specification I like it, though I'm not sure algorithmically....achieving the specification may be tricky.
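
Points 9 and 10 above can be illustrated with a short NumPy sketch. It is not Kaplan's implementation, nor that of the Procrustes package: the function names `aligned_rmsd` and `diversity_term` are hypothetical. It assumes both conformers list their atoms in the same order, and it aligns unweighted centroids rather than mass-weighted centres of mass. The rotation is found with the Kabsch algorithm (an SVD of the covariance matrix).

```python
import numpy as np


def aligned_rmsd(p, q):
    """RMSD between two conformers after optimal superposition.

    p, q -- (n_atoms, 3) coordinate arrays with atoms in the same order.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Remove translation by moving both centroids to the origin
    # (unweighted; a mass-weighted mean would give the centre of mass).
    p = p - p.mean(axis=0)
    q = q - q.mean(axis=0)
    # Kabsch algorithm: SVD of the 3x3 covariance matrix.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Flip the last axis if needed so the result is a proper rotation,
    # not a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = (rot @ p.T).T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))


def diversity_term(rmsd):
    """Nonlinear RMSD criterion from point 10: 1 - exp(-RMSD^2)."""
    return 1.0 - np.exp(-rmsd ** 2)


reference = [[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]]
# Same structure rotated 90 degrees about z and shifted by (5, 5, 5).
moved = [[5, 5, 5], [5, 6, 5], [3, 5, 5], [5, 5, 8]]
print(aligned_rmsd(reference, moved))
```

Because `moved` is a rigid-body copy of `reference`, the aligned RMSD is zero up to round-off, whereas the unaligned formula would report a large value. `diversity_term` saturates towards 1 once two structures are "different enough", which is the behaviour point 10 asks for.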

SRS typos and other minor consistency errors

The following is a list of typos and other minor consistency issues with the current draft of the SRS:

  • In section 3, in the 2nd sentence, "moleule" should be "molecule"
  • In section 3, 3rd sentence, "abbreivated" should be "abbreviated"
  • In section 3, 4th sentence, "geometies" should be "geometries"
  • In section 3, 4th sentence, "procurring" should be "procuring"
  • In section 3.2, 3rd sentence, "diehdral" should be "dihedral"
  • In section 3.3, in the itemized list, "hamiltonian" should be capitalized
  • In section 3.4, 2nd sentence, there is a reference to section 1 but this should be a reference to section 2
  • In section 4.1, the first paragraph should not be indented (to be consistent with the rest of the document)
  • In section 4.1, 2nd sentence, "includes" should be "include"
  • In section 4.1, the 2nd item of the User Responsibilities list should not end in a period (for consistency)
  • In section 4.2, "first year" should be "first-year" (for consistency with rest of document)
  • In section 4.2, "3rd year" should be "third-year" (for consistency with rest of document)
  • In section 5.1.2, item PS1, first sentence, "molecule" should be capitalized
  • In section 5.1.2, Figure 3 appears without any corresponding reference in the text to give it context
  • In section 5.2.1, A5, the word "a" is missing before "linear combination"
  • In section 5.2.5, 2nd paragraph, 3rd sentence is written imperatively (should be rephrased with a clear subject of the sentence -- what is manipulating the dihedral angles to solve for Fit_G?)
  • In section 5.2.5, IM1 description, I do not think the words "water temperature" were intended to be there
  • In section 5.2.5, IM1 description, "1" should be "one" (for consistency)
  • In section 5.2.6, Table 2, an asterisk should appear next to the table entry for G_n (as it is referenced by the asterisk note below the table)
  • In section 6.1, R1, in the last entry in the table "to specify to connectivity" should be "to specify the connectivity"
  • In section 9, T2, "hamiltonian" should be capitalized

UnitVnVPlan test Derivation

@PeaWagon In the test "FitG" in “test sum energies” you give the output but no derivation. Please include a blurb there to explain the test derivation. Also, in some modules, such as the tournament module, you do not have the

UnitVnVPlan RNS

@PeaWagon As shown in the figure I have attached, in section 5.1.6 Crossover & Mutation Module you mention the RNS. I could not tell whether it has an effect on the test. If yes, should it not be considered an input or initial state?

SRS Goal Statements questions

With regards to the Goal Statements, I have the following questions/comments:

  • In GS1, I do not think the "Alternatively," sentence is necessary. That information will be made clear with the refinement to the instance model.
  • I would argue that there is a 2nd goal here, which is to determine the energy of each conformer. The energy values are one of Kaplan's outputs so it makes sense to me that their determination must be one of the goals.

SysVnVPlan FT1-TC1 issue

@PeaWagon
1- In “FT1 - Data Input and Output in the input” you mention “Initial State: empty, no molecule input”. If you mean “NA”, please use the same term for the same condition, as you did with "NA" in “Test Input Geometry”.

2- The “FT1 - Data Input and Output in the input” part should refer to Table 1 for the input values.

3- To describe the test, please complete the sentences that clarify the purpose of the test. It would also be better to specify who inputs the values. It is a bit confusing (I know from prior sections that the user must input the values, but it is a bit ambiguous in this part as a test description).

Feedback on Verification Plan

  1. Being able to write the molecule in z-matrix format is not enough to identify rotatable bonds. In a ring compound, there are "missing dihedrals" in the rings. Derrick's software will automatically generate all dihedrals if you like. The number of rotatable bonds is very often (much!) different from the number of dihedrals in the z matrix -- it can be bigger but also can be (much) smaller. E.g., in a linear hydrocarbon C_{n} H_{2n+2} you have n-1 rotatable bonds but 3n dihedrals in the z-matrix. For a branched isomer of the same hydrocarbon the mismatch can be even worse.

  2. If you keep a list of dihedral angles, then for molecules without symmetry, different lists of dihedrals are different conformers. For molecules with symmetry you could label "equivalent dihedrals" either automatically (hard) or manually (easy) or use the RMSD trick to find two conformers that are equivalent by symmetry.

  3. Page 3. As before, your list of dihedrals from a z-matrix will not be adequate. Certainly different atom permutations gives different (and terrible!) dihedrals in a z-matrix. But that's irrelevant....you just can't make a dihedral angle list in this way. Derrick has an automated way to do this though.....based on the (standard) approach for finding (redundant) internal coordinates in Dalton and Gaussian.

  4. 4.4 Can link to cardboardlint (or whatever it is called now) repo.

  5. 5.1.1.2 For a lot of the Dunning (cc-pVXZ) basis sets K is missing. So that makes this "basis set missing test" easy. There is not only the Basis Set Exchange but a (nicer) porcelain available through MolSSI.

  6. 5.1.1.2 With very bad geometries a conformer may still be findable, even the best. If the rotatable bonds are located correctly things should still work....no matter how bad it is. On the other hand, if you enter cubane (no rotatable bonds) when you want a different isomer, then all hell will break loose....

  7. Probably these tests should give very specific molecules/molecular geometries to test on. It won't be too hard, I think.....

Overall the issue I see is algorithmic and specificity. Simple tests using (substituted) hydrocarbons or small di/tri-peptides would suffice. For amino acids, actually, we have an exhaustive list of conformers (and their relative stability) in hand (work with Chunying, Farnaz, Ramon, etc.).

MG issue #2 - Unlikely changes

@PeaWagon

Hi Jenn,

I am not sure if UC3 is needed. If the goal changes, then it's not the same software anymore. Can you please elaborate if you see a change in the goal?

Malavika

SysVnVPlan FT1-TC3 issue

@PeaWagon
In “Test Input Geometry”, please state the geometry values that you consider purposefully bad or invalid. The example is not sufficient; you should specify a set of inputs as well as the output expected for each given input.

SRS Table of Symbols "Array of..." descriptions

In the Table of Symbols, the descriptions for some symbols (examples below) include phrases such as "array of length 3", which sounds like an implementation detail.

image

These could be rephrased to something along the lines of "position of electron in Cartesian coordinates" to be more abstract.

SysVnVPlan SRS Verification Plan issue

@PeaWagon In the section SRS Verification Plan, there is a lot of information and many open questions of yours relating to the Likely Changes. I think these questions are not required in this section; for me as a reviewer, they caused considerable confusion.

SysVnVPlan comments issue

@PeaWagon (mostly minor issue) Please remove Dr. Smith's comments from your document, as Dr. Smith put them in the blank template just to explain each section.

SRS Problem Description confusion

After reading the SRS, I'm not entirely sure about the purpose of Kaplan. I understand that it finds a set of conformers for a molecule, but what makes one potential set of conformers a "better fit" than another? My best guess is that the "best fit" set of conformers are the conformers that are most likely to actually occur in nature. If this is correct, it should be explicitly stated, and I think the Problem Description section 5.1 is the place to do that.

It may also be that if I had all of the characteristics of the intended reader, I would have understood this even without it being explicitly mentioned, in which case it might not be necessary to add anything (though I still think it couldn't hurt).

SRS Completeness issues

The following (mostly minor issues) were missing from the SRS:

  • Content in section 5.2.7 Properties of a Correct Solution
  • Exhaustive symbol descriptions in GD1 (for example, include a definition for z.)
  • The left-hand sides of the equations in DD1 and DD2
  • In section 3.4 Organization of Document, descriptions for sections after section 4

SRS Use of "SI Units" even when units are not SI

In the Table of Units there is an "SI" column, and the text claims that SI is used throughout.

image

Also in the GDs and DDs there is a row heading "SI Units".

image

Angstroms are not SI units, so in this case I think these heading names should be changed to not specify "SI".

UnitVnVPlan Path issue

@PeaWagon Since you use file names in the test tables, they should be referenced by their exact path in the “src” folder of your project on GitHub, in the relevant place such as section 5 or 5.1.
