
id-translation's People

Contributors

dependabot[bot], rsundqvist

id-translation's Issues

Conditional source based on ID value

Name splitting

Add callbacks to choose a source (override-style only) based on ID value. Maybe something like:

def func(source, ids, available_sources):
    # Returning None means "use the regular mapper" for this source.
    if source != "numbers":
        return None
    return {
        "positive": [i for i in ids if i >= 0],
        "negative": [i for i in ids if i < 0],
    }


t = Translator(...).translate({"numbers": [-1, 1]}, custom_mapping_function=func)

Add a UserMappingError(MappingError) or similar to report strange return values.
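
A minimal sketch of how return values from such a callback might be checked. Everything here is hypothetical: `MappingError` is a local stand-in for the library's base class, `validate_user_mapping` is an invented helper, and the rule that every ID must be assigned to exactly one known source is an assumption rather than something the issue specifies.

```python
from typing import Optional


class MappingError(ValueError):
    """Local stand-in for the library's mapping error base class."""


class UserMappingError(MappingError):
    """Raised when a user-defined mapping function returns a strange value."""


def split_by_sign(source: str, ids: list, available_sources: list) -> Optional[dict]:
    # None means "fall back to the regular mapper" for this source.
    if source != "numbers":
        return None
    return {
        "positive": [i for i in ids if i >= 0],
        "negative": [i for i in ids if i < 0],
    }


def validate_user_mapping(result, ids: list, available_sources: list) -> Optional[dict]:
    """Check a callback result before it is used (assumed rules, see lead-in)."""
    if result is None:
        return None
    if not isinstance(result, dict):
        raise UserMappingError(f"Expected dict or None, got {type(result).__name__}.")

    unknown = set(result) - set(available_sources)
    if unknown:
        raise UserMappingError(f"Unknown sources: {sorted(unknown)}.")

    # Assumption: each ID is assigned to exactly one source.
    assigned = [i for chunk in result.values() for i in chunk]
    if sorted(assigned) != sorted(ids):
        raise UserMappingError("Every ID must be assigned to exactly one source.")
    return result
```

Keeping validation separate from the callback means user code stays small, and all strange return values surface as a single exception type.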

Adopt Higher-Kinded TypeVars

There's an old issue for this: python/typing/issues/548, which would solve a few typing issues if it were ever implemented.

Translating as `pandas.Categories`

The function below works but is limited.

import pandas as pd
from id_translation.offline import TranslationMap

def translate_as_categories(df: pd.DataFrame, tmap: TranslationMap) -> pd.DataFrame:
    from id_translation.dio import resolve_io

    dtypes = {
        # sort_index() to ensure ordering by ID
        column: pd.CategoricalDtype(pd.Series(tmap[column]).sort_index(), ordered=True) 
        for column in df
    }
    return resolve_io(df).insert(df, names=list(df), tmap=tmap, copy=False).astype(dtypes)

Not very convenient though, and requires some knowledge of internal id_translation types.

  1. Setup

    >>> data = {1999: "Sofia", 1991: "Richard"}
    
    >>> from id_translation import Translator
    >>> translator = Translator({"people":  data})
    >>> translator
    Translator(online=False: cache=TranslationMap('people': 2 IDs))
  2. Create data

    >>> df = pd.Series(list(data)).to_frame("people")
    >>> df = df.sample(4, replace=True).reset_index(drop=True)
    >>> df.T
    people  1999  1999  1991  1999
  3. Apply

    >>> df = translate_as_categories(df, translator.cache)
    
  4. Result

    >>> df.T
    people  1999:Sofia  1999:Sofia  1991:Richard  1999:Sofia
    
    >>> df["people"].dtype
    CategoricalDtype(categories=['1991:Richard', '1999:Sofia'], ordered=True, categories_dtype=object)

Maybe it's enough to add this to the documentation/examples.

Repeated names

Does not currently work. Running

from rics.translation import Translator
left, right = Translator().translate([1, 1], names=list("aa"))
assert left == right

gives:

ValueError: Number of names 2 must be 1 or equal to the length of the data (2) to translate, but got names=['b', 'a'].

SqlFetcher: More efficient table size estimation

Currently uses count(*), which gives an exact number in most cases. An exact count isn't needed here, so the estimation strategy should be configurable to avoid heavy queries.

Already somewhat configurable since the function may be overridden:

def get_approximate_table_size(
    self,
    table: sqlalchemy.sql.schema.Table,
    id_column: sqlalchemy.sql.schema.Column,
) -> int:
    return self._engine.execute(sqlalchemy.func.count(id_column)).scalar()
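
A rough sketch of what a configurable strategy could look like, using the standard-library `sqlite3` module instead of SQLAlchemy so that it runs on its own. The function names and the `exact` flag are made up for illustration. The range-based estimate (largest ID minus smallest ID, plus one) is a swapped-in technique that only makes sense for reasonably dense integer IDs; such a query needs only the two extreme values, which an index on the ID column can often provide without a full scan.

```python
import sqlite3


def exact_size(conn: sqlite3.Connection, table: str, id_column: str) -> int:
    # Counts every row; table and column names are assumed to be trusted.
    query = f'SELECT count("{id_column}") FROM "{table}"'
    return conn.execute(query).fetchone()[0]


def approximate_size(conn: sqlite3.Connection, table: str, id_column: str) -> int:
    # Upper bound for integer IDs: the span of the ID range.
    query = f'SELECT min("{id_column}"), max("{id_column}") FROM "{table}"'
    low, high = conn.execute(query).fetchone()
    return 0 if low is None else high - low + 1


def get_table_size(conn: sqlite3.Connection, table: str, id_column: str, exact: bool = False) -> int:
    # Hypothetical switch between the two strategies.
    estimator = exact_size if exact else approximate_size
    return estimator(conn, table, id_column)


# Small in-memory table with gaps in the ID range.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [(i, f"name-{i}") for i in range(1, 101)])
conn.execute("DELETE FROM people WHERE id % 10 = 0")

exact = get_table_size(conn, "people", "id", exact=True)
approximate = get_table_size(conn, "people", "id")
```

The estimate overshoots when the ID range has gaps, which is usually acceptable when the number is only used to pick a fetching strategy.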

Extend `Translator.translated_names()`

Include an option to get the full name-to-source mapping, not just the names. Especially useful for callers that want to act based on the sources used.
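
One possible shape for this, sketched with a hypothetical stand-in class. Neither `NameTracker` nor the `with_source` flag is taken from the library; they only illustrate returning either the names or the full mapping from one method.

```python
from typing import Union


class NameTracker:
    """Hypothetical stand-in that remembers the most recent name-to-source mapping."""

    def __init__(self, name_to_source: dict) -> None:
        self._name_to_source = dict(name_to_source)

    def translated_names(self, with_source: bool = False) -> Union[list, dict]:
        # Copies are returned so callers cannot mutate internal state.
        if with_source:
            return dict(self._name_to_source)
        return list(self._name_to_source)


tracker = NameTracker({"owner_id": "people", "pet_id": "animals"})
names_only = tracker.translated_names()
names_and_sources = tracker.translated_names(with_source=True)
```

A flag keeps the existing call signature working, at the cost of a return type that depends on an argument.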

Typing issues

  • Class-level IdType generics
    The IdType generic arg is in many cases defined on the class-level, when it should really be defined on the function level. In reality both Translator and Fetcher instances can handle mixed ID types.
  • Bad NameType / SourceType split.
    These are treated as separate types, though in practice they are often assumed to be the same.
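
The first point can be illustrated with two toy classes. These are not the library's classes; `ClassLevelFetcher` and `FunctionLevelFetcher` are invented names, and the `fetch` bodies are placeholders. The runtime behaviour is identical, so the difference only shows up in a static type checker.

```python
from typing import Generic, TypeVar

IdType = TypeVar("IdType", int, str)


class ClassLevelFetcher(Generic[IdType]):
    """The ID type is fixed per instance, e.g. ClassLevelFetcher[int]."""

    def fetch(self, ids: list[IdType]) -> dict[IdType, str]:
        return {i: f"name-of-{i}" for i in ids}


class FunctionLevelFetcher:
    """The ID type is bound per call, so one instance can serve mixed ID types."""

    def fetch(self, ids: list[IdType]) -> dict[IdType, str]:
        return {i: f"name-of-{i}" for i in ids}


fetcher = FunctionLevelFetcher()
by_int = fetcher.fetch([1, 2])  # IdType resolves to int for this call
by_str = fetcher.fetch(["a"])  # IdType resolves to str for this call
```

With the class-level variant, a checker would reject the second call on an instance annotated as `ClassLevelFetcher[int]`, even though the instance could handle it.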

Fetcher-defined transformers in TOML

Add preprocessing to the TOML, inject into Translator.

[fetching.'<fetcher-type>'.transform.'<source>'.BitmaskTransformer]

Alternatively, just allow another top-level key in extra_fetchers-files.

[transform.'<source>'.BitmaskTransformer]

In either case, raise if <source> is claimed more than once.
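
A sketch of the duplicate check, operating on an already-parsed configuration dict (the nested structure a TOML parser would produce for the two layouts above). The function name `collect_transformers`, the choice of `ValueError`, and the example fetcher and source names are assumptions for illustration.

```python
def collect_transformers(config: dict) -> dict:
    """Return {source: transformer config}, raising if a source is claimed twice."""
    claimed: dict = {}
    origins: dict = {}

    def claim(origin: str, section: dict) -> None:
        for source, transformers in section.items():
            if source in claimed:
                raise ValueError(
                    f"Source {source!r} claimed by both {origins[source]!r} and {origin!r}."
                )
            claimed[source] = transformers
            origins[source] = origin

    # Top-level [transform.'<source>'.<Transformer>] sections.
    claim("transform", config.get("transform", {}))
    # Per-fetcher [fetching.'<fetcher-type>'.transform.'<source>'.<Transformer>] sections.
    for fetcher_type, fetcher_config in config.get("fetching", {}).items():
        claim(f"fetching.{fetcher_type}.transform", fetcher_config.get("transform", {}))
    return claimed


config = {
    "fetching": {"SqlFetcher": {"transform": {"people": {"BitmaskTransformer": {}}}}},
    "transform": {"places": {"BitmaskTransformer": {}}},
}
transformers = collect_transformers(config)
```

Recording where each source was first claimed makes the error message point at both conflicting sections.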

Inplace translation of `pandas.Series` after PDEP-6

Description

After PDEP-6, assigning values incompatible with the current dtype will require an explicit type conversion.

This will break inplace translation, since Series.astype returns a new object. It (should?) keep working for DataFrame, as in that case we simply replace the entire column.

Possible solution

Detect if pandas.Series is compatible with strings. If not, raise NotInplaceTranslatableError if inplace=True.

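def series_inplace_translatable() -> bool:
    # The caller raises NotInplaceTranslatableError if this returns False and inplace=True.
    try:
        pd.Series(dtype=int)[:] = ""
    except ExceptionType:  # Placeholder name for the exception type.
        return False
    return True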

The exact exception type that will be raised is not known yet.

Improved UUID/GUID handling

The only officially supported types are int and str. Add uuid.UUID as well.

UUIDs are implicitly treated as strings, which may cause issues when they're improperly stored (e.g. a bad type in a SQL database) or read from a source that forces implicit types (CSV).
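
A small sketch of explicit normalization using the standard-library `uuid` module. The helper name `normalize_id` is hypothetical, and converting every UUID-like string is an assumption about the desired behaviour: a plain 32-character hexadecimal string would also be converted, which may or may not be wanted.

```python
import uuid
from typing import Union

IdType = Union[int, str, uuid.UUID]


def normalize_id(value: IdType) -> IdType:
    """Convert UUID-like strings to uuid.UUID and leave all other values untouched."""
    if isinstance(value, str):
        try:
            return uuid.UUID(value)
        except ValueError:
            return value  # Not a UUID; keep the regular string ID.
    return value


canonical = normalize_id("12345678-1234-5678-1234-567812345678")
from_csv = normalize_id("{12345678-1234-5678-1234-567812345678}".upper())
```

Since `uuid.UUID` accepts several textual forms (braces, upper case, no hyphens), values that differ as strings compare equal after normalization.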

Configurable `ID` placeholder

The id_translation.typing.ID-attribute can be modified, but this will just break translation.

Allow users to set the ID field in order to use something other than 'id' as the ID-placeholder.
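
As a rough illustration of the idea, independent of the library's format handling: the `render` helper and its `id_key` parameter below are invented, and records are assumed to store their ID under the key `"id"`.

```python
def render(template: str, record: dict, id_key: str = "id") -> str:
    """Fill `template`, exposing the record's ID under the placeholder name `id_key`."""
    fields = {key: value for key, value in record.items() if key != "id"}
    fields[id_key] = record["id"]
    return template.format(**fields)


record = {"id": 1999, "name": "Sofia"}
default_placeholder = render("{id}:{name}", record)
custom_placeholder = render("{person_id}:{name}", record, id_key="person_id")
```

Passing the placeholder name explicitly avoids mutating a module-level constant, which is what breaks translation today.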

Cannot depickle SQL-based Translator on SQLAlchemy>=2

Traceback (most recent call last):
  File "/home/dev/git/id-translation/src/id_translation/_translator.py", line 581, in restore
    ans = pickle.load(f)  # noqa: S301
          ^^^^^^^^^^^^^^
  File "/home/dev/git/id-translation/.venv/lib/python3.11/site-packages/sqlalchemy/util/langhelpers.py", line 1636, in __new__
    raise TypeError(
TypeError: Can't replace canonical symbol for unpickled with new int value -4533261097786850201

Works for sqlalchemy<2.

User-defined ID transformations

Possible use cases:

  • Blacklisted IDs
  • Remove aliases/known duplicates
  • Remove IDs which have been removed in the source (why is the user translating these?)
  • Convert between UUID/ID (does anyone actually need this?)

Possible signature:

def user_id_transformation(ids: list[IdType], names: list[NameType], source: SourceType) -> list[IdType]:
    """Map IDs based on the name(s) being translated and the source to which they've been mapped.

    Args:
        ids: Unique IDs for which translation data will be fetched.
        names: The different names from which the `ids` have been extracted.
        source: The source from which translation data will be fetched.

    Returns:
        A transformed subset of `ids`.
    """
    raise NotImplementedError

What level should this be defined on? Probably individual fetcher.
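
A sketch covering the first two use cases (blacklisted IDs and known aliases) with the proposed signature. The factory `make_id_transformation` and its arguments are hypothetical; `names` and `source` are accepted but unused here, since this particular transformation does not depend on them.

```python
def make_id_transformation(blacklist: set, aliases: dict):
    """Build a transformation that drops blacklisted IDs and collapses aliases."""

    def user_id_transformation(ids: list, names: list, source: str) -> list:
        seen = set()
        result = []
        for i in ids:
            canonical = aliases.get(i, i)  # Replace known aliases by their canonical ID.
            if canonical in blacklist or canonical in seen:
                continue
            seen.add(canonical)
            result.append(canonical)
        return result

    return user_id_transformation


transform = make_id_transformation(blacklist={0}, aliases={7: 1})
kept = transform([0, 1, 7, 2], names=["owner_id"], source="people")
```

Building the function from a factory keeps the per-fetcher configuration (blacklist, aliases) out of the call signature.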
