
Comments (7)

feiranl avatar feiranl commented on August 26, 2024

Hi, thanks for asking! Could you tell us in which file you saw the duplicate datapoints?
Yes. We trained the model on log2-transformed values. Most other papers use log10, so to make comparisons easier we converted to log10 when plotting the figures in the paper.
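
For reference, the base change described above can be sketched as follows. This is only an illustration of the identity log10(x) = log2(x) * log10(2); the function name `log2_to_log10` is hypothetical and is not taken from the DLKcat code.

```python
import math


def log2_to_log10(log2_value: float) -> float:
    """Rescale a log2-scale value to log10 using the change-of-base identity."""
    return log2_value * math.log10(2)


kcat = 0.43  # example kcat in s^-1
log2_kcat = math.log2(kcat)            # scale used for training, per the comment above
log10_kcat = log2_to_log10(log2_kcat)  # scale used for plotting

# The rescaled value agrees with taking log10 of kcat directly.
assert math.isclose(log10_kcat, math.log10(kcat))
```

Because the two scales differ only by a constant factor, correlations are unchanged; only absolute errors such as RMSE are rescaled.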

from dlkcat.

jiejiangtao avatar jiejiangtao commented on August 26, 2024

The code predicts the kcat value from the sequence and the substrate. My understanding is that the data should therefore be deduplicated by sequence and substrate. As I said above, I found that some entries in the data you provided have identical SMILES and sequence.


jiejiangtao avatar jiejiangtao commented on August 26, 2024

That's my understanding, though I don't know whether it's right. Thanks.


le-yuan avatar le-yuan commented on August 26, 2024

Hi, thanks for your question. I think I can answer it.

I checked the original data in this repo. The very few duplicated datapoints, that is, separate entries with the same protein sequence and substrate SMILES, arose mainly when different kinds of data from different resources were combined. At that step we did not yet have the protein sequence for every entry, so we used the EC number, organism, and substrate information to decide whether entries were duplicates.

For example, I checked that there are two entries which have the same protein sequence (MKRINALTIAGTDPSGGAGIQADLKTFSALGAYGCSVITALVAQNTRGVQSVYRIEPDFVAAQLDSVFSDVRIDTTKIGMLAETDIVEAVAERLQRYQIQNVVLDTVMLAKSGDPLLSPSAVATLRSRLLPQVSLITPNLPEAAALLDAPHARTEQEMLEQGRSLLAMGCGAVLMKGGHLDDEQSPDWLFTREGEQRFTAPRIMTKNTHGTGCTLSAALAALRPRHTNWADTVQEAKSWLSSALAQADTLEVGHGIGPVHHFHAWW) and substrate SMILES (CC1=NC=C(C(=N1)N)CO).

The two entries are shown below:
{'ECNumber': '2.7.1.49', 'Organism': 'Escherichia coli', 'Smiles': 'CC1=NC=C(C(=N1)N)CO', 'Substrate': '2-methyl-4-amino-5-hydroxymethylpyrimidine', 'Sequence': 'MKRINALTIAGTDPSGGAGIQADLKTFSALGAYGCSVITALVAQNTRGVQSVYRIEPDFVAAQLDSVFSDVRIDTTKIGMLAETDIVEAVAERLQRYQIQNVVLDTVMLAKSGDPLLSPSAVATLRSRLLPQVSLITPNLPEAAALLDAPHARTEQEMLEQGRSLLAMGCGAVLMKGGHLDDEQSPDWLFTREGEQRFTAPRIMTKNTHGTGCTLSAALAALRPRHTNWADTVQEAKSWLSSALAQADTLEVGHGIGPVHHFHAWW', 'Value': '0.43', 'Unit': 's^(-1)'}
{'ECNumber': '2.7.4.7', 'Organism': 'Escherichia coli', 'Smiles': 'CC1=NC=C(C(=N1)N)CO', 'Substrate': '2-methyl-4-amino-5-hydroxymethylpyrimidine', 'Sequence': 'MKRINALTIAGTDPSGGAGIQADLKTFSALGAYGCSVITALVAQNTRGVQSVYRIEPDFVAAQLDSVFSDVRIDTTKIGMLAETDIVEAVAERLQRYQIQNVVLDTVMLAKSGDPLLSPSAVATLRSRLLPQVSLITPNLPEAAALLDAPHARTEQEMLEQGRSLLAMGCGAVLMKGGHLDDEQSPDWLFTREGEQRFTAPRIMTKNTHGTGCTLSAALAALRPRHTNWADTVQEAKSWLSSALAQADTLEVGHGIGPVHHFHAWW', 'Value': '0.0867', 'Unit': 's^(-1)'}

If you compare these two entries, you will see that they have different EC numbers and different kcat values. I guess this comes from different measurements by different researchers, and we are not sure which value is the correct one. There are only very few data points like this, though.
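
To make the distinction concrete, here is a minimal sketch of how the two entries above look distinct under one key and identical under another. The helper names `group_by` and `find_duplicates` are hypothetical and are not part of the DLKcat preprocessing scripts, and the sequence is truncated for brevity (both real entries share the same full sequence).

```python
from collections import defaultdict

# Truncated for brevity; this is an assumption for illustration only.
SEQ = "MKRINALTIAGTDPSGG"

entries = [
    {"ECNumber": "2.7.1.49", "Organism": "Escherichia coli",
     "Smiles": "CC1=NC=C(C(=N1)N)CO",
     "Substrate": "2-methyl-4-amino-5-hydroxymethylpyrimidine",
     "Sequence": SEQ, "Value": "0.43", "Unit": "s^(-1)"},
    {"ECNumber": "2.7.4.7", "Organism": "Escherichia coli",
     "Smiles": "CC1=NC=C(C(=N1)N)CO",
     "Substrate": "2-methyl-4-amino-5-hydroxymethylpyrimidine",
     "Sequence": SEQ, "Value": "0.0867", "Unit": "s^(-1)"},
]


def group_by(records, fields):
    """Group records by the tuple of values found in the given fields."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[f] for f in fields)].append(record)
    return groups


def find_duplicates(records, fields):
    """Return only the groups that contain more than one record."""
    return {k: v for k, v in group_by(records, fields).items() if len(v) > 1}


# Keyed by annotation (EC number, organism, substrate), the entries differ.
by_annotation = find_duplicates(entries, ("ECNumber", "Organism", "Substrate"))
# Keyed by sequence and SMILES, they collide.
by_sequence_smiles = find_duplicates(entries, ("Sequence", "Smiles"))

print(len(by_annotation))       # 0
print(len(by_sequence_smiles))  # 1
```

A check keyed on EC number, organism, and substrate therefore keeps both entries, while a check keyed on sequence and SMILES flags them as one group with two different kcat values.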

For details on how we obtained and preprocessed the dataset from different resources, please see the Methods section and the related Supplementary Figure 2. Thanks for your attention!


jiejiangtao avatar jiejiangtao commented on August 26, 2024

Thank you for your reply. I have read your data processing section, and it is exactly as you said. Although some entries have the same sequence and SMILES, their EC number or organism differs. That is, although the sequence and SMILES are the same, other fields are different, so they are not duplicate data. However, I think the code uses only the sequence and SMILES, not the EC number, organism, etc.
{
"ECNumber": "5.3.1.5",
"Organism": "Streptomyces sp.",
"Smiles": "C1C(C(C(C(O1)(CO)O)O)O)O",
"Substrate": "D-Fructose",
"Sequence": "RHAGSAHTF",
"Value": "4.2",
"Unit": "s^(-1)"
},
{
"ECNumber": "5.3.1.5",
"Organism": "Streptomyces sp.",
"Smiles": "C(C1C(C(C(O1)(CO)O)O)O)O",
"Substrate": "alpha-D-Fructofuranose",
"Sequence": "RHAGSAHTF",
"Value": "4.2",
"Unit": "s^(-1)"
},
However, there are also cases like the one above, where the kcat value is the same but the substrate is different.


le-yuan avatar le-yuan commented on August 26, 2024

In this case, where the same protein sequence has different substrates, it is biologically possible for the kcat values to be the same.


feiranl avatar feiranl commented on August 26, 2024

We have gone through the process of removing duplicates. Each combination of sequence and SMILES is unique, but a single sequence may be matched to several SMILES.
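
The property described here can be checked with a short sketch like the one below, using the two D-fructose entries quoted earlier in the thread. The helper name `smiles_per_sequence` is hypothetical and is not taken from the DLKcat repository.

```python
from collections import defaultdict

entries = [
    {"ECNumber": "5.3.1.5", "Organism": "Streptomyces sp.",
     "Smiles": "C1C(C(C(C(O1)(CO)O)O)O)O", "Substrate": "D-Fructose",
     "Sequence": "RHAGSAHTF", "Value": "4.2", "Unit": "s^(-1)"},
    {"ECNumber": "5.3.1.5", "Organism": "Streptomyces sp.",
     "Smiles": "C(C1C(C(C(O1)(CO)O)O)O)O", "Substrate": "alpha-D-Fructofuranose",
     "Sequence": "RHAGSAHTF", "Value": "4.2", "Unit": "s^(-1)"},
]


def smiles_per_sequence(records):
    """Map each protein sequence to the set of substrate SMILES it appears with."""
    mapping = defaultdict(set)
    for record in records:
        mapping[record["Sequence"]].add(record["Smiles"])
    return mapping


# Every (sequence, SMILES) pair occurs exactly once in this sample ...
pairs = [(e["Sequence"], e["Smiles"]) for e in entries]
assert len(pairs) == len(set(pairs))

# ... yet one sequence can still be paired with several SMILES.
multi = {seq: s for seq, s in smiles_per_sequence(entries).items() if len(s) > 1}
print(len(multi["RHAGSAHTF"]))  # 2
```

The comparison is on raw SMILES strings; whether two strings denote the same molecule would need a canonicalization step that this sketch does not attempt.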
