Giter Club home page Giter Club logo

Comments (7)

wooorm avatar wooorm commented on June 1, 2024

Hey!

If the issue is: make things faster, then that'd typically accepted to be worked on.
If you have particular code examples that make things faster, preferably in a backwards compatible way, then a PR is I think the best way to discuss them?

Closing, likely open to a PR though!

from dice-coefficient.

sdegutis avatar sdegutis commented on June 1, 2024

This may not be backwards compatible, based on what you said in another issue. There, you basically said that the official interface is:

interface Sliceable {
  slice(i: number, j: number): any;
}
function diceCoefficient(a: Sliceable, b: Sliceable): number;

If that's the case then yeah, this breaks compatibility, because the new interface is basically:

interface Indexable {
  [key: number]: any;
}
function diceCoefficient(a: Indexable , b: Indexable): number;

In which case, I guess, if you don't want to merge this change, then I can publish it as a new lib.

from dice-coefficient.

sdegutis avatar sdegutis commented on June 1, 2024

Ah, I see. Just looked at this in your readme:

Instead of strings you can also pass lists of bigrams.
This can improve performance when processing the same strings repeatedly.

diceCoefficient(['ab', 'bc'], ['xy', 'yz']) // => 0
diceCoefficient(['ab', 'bc'], ['ab', 'bc']) // => 1
diceCoefficient(['ab', 'bc'], ['AB', 'BC']) // => 1

So the ability to take a string[] instead of a string was just a speed optimization, right?

Well I couldn't use that one because I'd have to cache them in memory and the sheer number of bigrams made Node run out of memory. That's why I came up with this trick instead.

So if arrays are only there for a speed optimization, and this actually has basically the same performance, and you don't mind breaking compatibility with a new major version bump, then yeah I'd recommend merging my changes.

But separately from that, there might be copyright issues, since my modifications to your library are, I think, copyrighted by my client.

from dice-coefficient.

wooorm avatar wooorm commented on June 1, 2024

there might be copyright issues

This is quite important. It’s something that you should decide on with the people that pay you. If they do not see modifications as open source, then this discussion is tainting this project with copyrighted material.

based on what you said in another issue.

That’s a different project.

Well I couldn't use that one because I'd have to cache them in memory and the sheer number of bigrams made Node run out of memory

It’s hard for me to evaluate this statement because you have tried some code out and I have no clue what that code was.

and this actually has basically the same performance, and you don't mind breaking compatibility with a new major version bump, then yeah I'd recommend merging my changes.

When they both have the same performance, why not stick with the existing code?

from dice-coefficient.

sdegutis avatar sdegutis commented on June 1, 2024

This is quite important. It’s something that you should decide on with the people that pay you. If they do not see modifications as open source, then this discussion is tainting this project with copyrighted material.

I'm pretty sure the way software copyright law works is that the code within any PR, whether by an individual or company, is owned by whoever sent the PR, even after you merge it in. This is why big projects have you sign an agreement giving them copyright over your code when you submit PRs. Otherwise, they have a hard time enacting certain decisions. But since I didn't submit a PR, there's just an idea which is in itself not copyrighted. I got the greenlight to submit a PR to you from my client (but see the above for why I didn't), so if you just implement the same idea, you have the copyright over it.

The idea is simple: don't use bigrams, but just compare two string characters individually using simple indexing to get them.

Well I couldn't use that one because I'd have to cache them in memory and the sheer number of bigrams made Node run out of memory

It’s hard for me to evaluate this statement because you have tried some code out and I have no clue what that code was.

and this actually has basically the same performance, and you don't mind breaking compatibility with a new major version bump, then yeah I'd recommend merging my changes.

When they both have the same performance, why not stick with the existing code?

Using pre-cached bigrams didn't work for me because there were so many that they overwhelmed Node's memory, and the code I linked you to has a performance increase instead.

So I should clarify that I'm not sure that my index-based modification of your library is as fast as just pre-caching bigrams, since I wasn't able to complete my test using pre-cached bigrams. But I did cut the runtime in half by ~50% when running on I think millions (billions?) of strings. It was a non-trivial test, and the results are repeatable, so I'm pretty confident in it.

Let me know if you're not going to use the index-based idea, and I'll create a new lib for it.

from dice-coefficient.

sdegutis avatar sdegutis commented on June 1, 2024

Oops my old link broke. Here's an updated one:

https://github.com/Atlas-Authority/marketing-automation/blob/f1749c423d3ee041e6e27b45fdadadd4e3489e9f/src/lib/engine/license-matching/similarity-scorer.ts#L40-L63

from dice-coefficient.

sdegutis avatar sdegutis commented on June 1, 2024

My client says we're happy to sign a CAA and send a PR.

from dice-coefficient.

Related Issues (5)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.