Giter Club home page Giter Club logo

Comments (16)

nicolasferraro avatar nicolasferraro commented on July 30, 2024 1

Sure!, I'll handle it in #73.

from metric_units_reddit_bot.

nalinbhardwaj avatar nalinbhardwaj commented on July 30, 2024 1

I really like the idea of using numeral.js for parsing/handling preprocessing. But this would probably require a major refactor for preprocessing. We can assume locale based on many things(subreddit, general number of active users, etc.) and parse numbers accordingly, remove commas and push it forward. I can look into handling this once #73 is merged. If we plan to do this however, I suggest moving preprocess even outside conversion.js, for the sake of minimising complexity, since it would probably have lots of corner cases and stuff in itself.

from metric_units_reddit_bot.

cannawen avatar cannawen commented on July 30, 2024

Numeral.js looks like it has some locale features, but I'm not sure if it is as flexible as we need

from metric_units_reddit_bot.

nicolasferraro avatar nicolasferraro commented on July 30, 2024

Since we are looking to add more code to preprocessing, should we factor out the preprocessing code into a separate module and then export preprocessComment to avoid polluting conversion_helper?

from metric_units_reddit_bot.

cannawen avatar cannawen commented on July 30, 2024

I like that idea! Perhaps "conversion.js" can use some "preprocess" module before passing the result to "conversion_helper"? Do you want to handle this as part of PR #73, or should we create a new issue?

from metric_units_reddit_bot.

cannawen avatar cannawen commented on July 30, 2024

Ohhh yes, maybe it's a good idea move it out of anything related to conversions. Assigning issue to @nalinbhardwaj and marking issue as blocked on #73

from metric_units_reddit_bot.

nalinbhardwaj avatar nalinbhardwaj commented on July 30, 2024

Some thoughts after further analysis.

numeral.js is actually very lacking in it's locale handling. For example, for the Arabic/Indian system, the 2nd most popular after american system, it doesn't contain a locale. And locales are mainly just better(language translated) handlers for thousand, million, billion.

Also, some manual stats gathering on r/india shows that a lot more people use American/none(not comma seperate) their comments than those that do use the Indian system. False positives in such cases might lead to very bad results(it's better not to answer than answer wrongly IMO).

We probably need to find better heuristics, and/or a library that's much better at parsing automatically.

from metric_units_reddit_bot.

cannawen avatar cannawen commented on July 30, 2024

Instead of looking at the subreddit to determine locale, maybe we can just handle some basic types of numbers for all subreddits.

So the following is a list of what a Reddit user writes, and what the bot parses in "standard" notation:

  1. x.xx -> x.xx
  2. x.xxx.xxx -> x,xxx,xxx
  3. x.xxx,xxx -> x,xxx.xxx
  4. x'xxx -> x,xxx

So these are just some rough ideas/guidelines. We can break down the different types of formats into different issues if you think that makes sense? Or just tackle it all in one go. Whatever we do, we should keep this section of code very self-contained. I think we can create our own "locale" files in Numeral, if we feel like they are lacking something we need, but I think it is also possible to do this without using numeral and just using regular expressions. Lots and lots of regular expressions.

from metric_units_reddit_bot.

nalinbhardwaj avatar nalinbhardwaj commented on July 30, 2024

Yes @cannawen, actually, Numeral.js doesn't allow for a different number of distance in commas in a number(at least not trivially), which is the case with Arabic numeral system, where-in numbers are 105 = 1 lakh, 107 = 1 crore and so on, so basically, we use(1,00,000 instead of 100,000). Doing this in-house would be much easier.

We can break down the different types of formats into different issues if you think that makes sense?

This is optimal since that'll allow us expertise from individual user groups.

Whatever we do, we should keep this section of code very self-contained.

Agreed. Best to have a separate folder for keeping the parsers, and parse in a main preprocess file.

Also, I wanted to know, what are your thoughts on what we should do for multiple acceptable formats? As in user types 100,1000 this can represent a valid Arabic system 105 or 100 and 1000 in standard.

from metric_units_reddit_bot.

cannawen avatar cannawen commented on July 30, 2024

OK. Let's first create an issue for "refactor number parsing" and first we will just parse standard numbers, like:

000.00
0,000.00
0.00
.00

And then we can talk about adding support for different number systems on a case-by-case basis after the refactor is in.

@nalinbhardwaj Do you still want to work on this?

from metric_units_reddit_bot.

nalinbhardwaj avatar nalinbhardwaj commented on July 30, 2024

@cannawen yes, that can be done, but again more or less first we need to figure out how to deal with this as I mentioned:

Also, I wanted to know, what are your thoughts on what we should do for multiple acceptable formats? As in user types 100,1000 this can represent a valid Arabic system 105 or 100 and 1000 in standard.

I can make a PR soon that would add all suffix multipliers(thousand, million etc.) from numeral.js . I think it's a good start. We can work on the number parsing regex side by side. I can also make a PR adding the new formats, but that can be kept open if someone else wants a stab at it.

from metric_units_reddit_bot.

nalinbhardwaj avatar nalinbhardwaj commented on July 30, 2024

I thought about this a bit, and I now feel like the downside to making a much plainer regex, something like this would be extremely low. I mean, we know it is a number, and followed by a unit/following the format, isn't it almost guaranteed to be a number of units? @cannawen thoughts?

from metric_units_reddit_bot.

cannawen avatar cannawen commented on July 30, 2024

We can add a parsing for "100,1000" (or 50e-5) but I think it makes sense to first focus on creating a good architecture for "standard" numbers so we can easily add new ones in the future. I think we can leave the "1k -> 1000" parsing where it is for now, and perhaps refactor later on if we feel like it makes sense in the number parsing part of the code.

I created issue #100 for how I am envisioning the initial refactor, does that issue look clear enough/make sense to you? Anything you think should be changed?

Also, I think it makes sense to keep this issue the discussion area, so the refactor thread itself is cleaner

from metric_units_reddit_bot.

nalinbhardwaj avatar nalinbhardwaj commented on July 30, 2024

@cannawen did you see that regex I posted? My basic idea was that if we just allow for any numbers with any amount of separation using , or ., I don’t think we'll run into many false positives, since we are already checking if it’s surrounded by a unit etc. Can you think of a case where a valid sentence is written, assuming proper punctuation(commas in sentence followed by space) where this parser would give sub optimal results?
If not, this would easily give extremely good results in parsing(in any and every format)

from metric_units_reddit_bot.

cannawen avatar cannawen commented on July 30, 2024

If we "capture" every number with periods and commas, we still need to "figure out" what it means (ex. the string "1.234,56" means 1234.56) so I don't see the benefit in capturing the string "1.234,56" alone, without parsing it to a javascript float as well. But perhaps I am missing something?

I also think it would be best if the number-parsing was completely separate form the unit-parsing (so perhaps we could release it as a separate library for others to use), but this is open for discussion.

from metric_units_reddit_bot.

cannawen avatar cannawen commented on July 30, 2024

Closing for inactivity - Perhaps a future enhancement to look into

from metric_units_reddit_bot.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.