rsundqvist / id-translation

Turn meaningless IDs into human-readable labels.

License: MIT License

Python 85.57% Shell 0.21% Dockerfile 0.39% TSQL 0.09% Jupyter Notebook 13.75%

id-translation's Issues

Translating bitmasks.

Should this be supported? What would the result look like?
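
One possible answer, as a rough sketch only: split the mask into its set bits, translate each bit as an ordinary ID, and join the labels. The helpers `decompose_bitmask` and `translate_bitmask`, the `flags` mapping and the `" & "` separator are all hypothetical and not part of id-translation.

```python
def decompose_bitmask(mask: int) -> list[int]:
    """Split a non-negative bitmask into its set bits, lowest bit first."""
    if mask < 0:
        raise ValueError(f"negative bitmask: {mask}")
    bits = []
    bit = 1
    while bit <= mask:
        if mask & bit:
            bits.append(bit)
        bit <<= 1
    return bits


def translate_bitmask(mask: int, labels: dict[int, str], sep: str = " & ") -> str:
    """Hypothetical helper: render every set bit as 'id:name' and join them."""
    parts = [f"{bit}:{labels.get(bit, '<unknown>')}" for bit in decompose_bitmask(mask)]
    return sep.join(parts) if parts else f"{mask}:<no flags>"


flags = {1: "read", 2: "write", 4: "execute"}  # made-up translation data
print(translate_bitmask(5, flags))  # 1:read & 4:execute
```

Whether an empty mask or an unknown bit should produce a placeholder or an error is left open here.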

Conditional source based on ID value

Name splitting

Add callbacks to choose a source (override-style only) based on ID value. Maybe something like:

def func(source, ids, available_sources):
    return {
        "positive": list(filter(lambda i: i >= 0, ids)),
        "negative": list(filter(lambda i: i < 0, ids)),
    } if source == "numbers" else None  # None = use the regular mapper


t = Translator(...).translate({"numbers": [-1, 1]}, custom_mapping_function=func)

Add a UserMappingError(MappingError) or similar to report strange return values.
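
A minimal sketch of what such validation might look like. `MappingError` is redefined locally as a stand-in for the library class, and `check_callback_result` with its rules (known sources only, every ID assigned exactly once) is an assumption about what counts as a "strange" return value.

```python
from collections import Counter


class MappingError(ValueError):
    """Local stand-in for the library's mapping error base class."""


class UserMappingError(MappingError):
    """Raised when a user-supplied mapping callback returns something unusable."""


def check_callback_result(result, ids, available_sources):
    """Validate the return value of a (hypothetical) source-selection callback."""
    if result is None:  # None means: fall back to the regular mapper.
        return None
    if not isinstance(result, dict):
        raise UserMappingError(f"expected dict or None, got {type(result).__name__}")
    unknown = set(result) - set(available_sources)
    if unknown:
        raise UserMappingError(f"unknown sources: {sorted(unknown)}")
    assigned = Counter(i for chunk in result.values() for i in chunk)
    if assigned != Counter(ids):
        raise UserMappingError("every ID must be assigned to exactly one source")
    return result
```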

Support `polars`

Must be an optional dependency.

Adopt Higher-Kinded TypeVars

There's an old issue for this: python/typing/issues/548, which would solve a few typing issues if it were ever implemented.

SqlFetcher: Pre-defined table size

Configured table sizes

`Element.make` erases conversion and format spec

Elements such as

Element.make("{id!r:.8}")

become equivalent to

Element.make("{id}")

when using Element.positional_part.
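
For reference, the standard library's `string.Formatter.parse` exposes both the conversion and the format spec, so they can be carried along when a placeholder is rebuilt. The sketch below only illustrates that round trip; `rebuild_placeholders` is a hypothetical helper and says nothing about how `Element` is actually implemented.

```python
from string import Formatter


def rebuild_placeholders(fstring: str) -> str:
    """Round-trip a format string, keeping conversion and format spec."""
    out = []
    for literal, field, spec, conversion in Formatter().parse(fstring):
        # Re-escape literal braces that parse() has already unescaped.
        out.append(literal.replace("{", "{{").replace("}", "}}"))
        if field is not None:
            conv = f"!{conversion}" if conversion else ""
            fmt = f":{spec}" if spec else ""
            out.append("{" + field + conv + fmt + "}")
    return "".join(out)


print(list(Formatter().parse("{id!r:.8}")))  # [('', 'id', '.8', 'r')]
print(rebuild_placeholders("{id!r:.8}"))     # {id!r:.8}
```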

Translating as `pandas.Categories`

The function below works but is limited.

import pandas as pd
from id_translation.offline import TranslationMap

def translate_as_categories(df: pd.DataFrame, tmap: TranslationMap) -> pd.DataFrame:
    from id_translation.dio import resolve_io

    dtypes = {
        # sort_index() to ensure ordering by ID
        column: pd.CategoricalDtype(pd.Series(tmap[column]).sort_index(), ordered=True) 
        for column in df
    }
    return resolve_io(df).insert(df, names=list(df), tmap=tmap, copy=False).astype(dtypes)

Not very convenient though, and requires some knowledge of internal id_translation types.

Setup

>>> data = {1999: "Sofia", 1991: "Richard"}

>>> from id_translation import Translator
>>> translator = Translator({"people":  data})
>>> translator
Translator(online=False: cache=TranslationMap('people': 2 IDs))

Create data

>>> df = pd.Series(list(data)).to_frame("people")
>>> df = df.sample(4, replace=True).reset_index(drop=True)
>>> df.T
people  1999  1999  1991  1999

Apply

>>> df = translate_as_categories(df, translator.cache)

Result

>>> df.T
people  1999:Sofia  1999:Sofia  1991:Richard  1999:Sofia

>>> df["people"].dtype
CategoricalDtype(categories=['1991:Richard', '1999:Sofia'], ordered=True, categories_dtype=object)

Maybe it's enough to add this to the documentation as an example.

Repeated names

Does not currently work. Running

from rics.translation import Translator
left, right = Translator().translate([1, 1], names=list("aa"))
assert left == right

gives:

ValueError: Number of names 2 must be 1 or equal to the length of the data (2) to translate, but got names=['b', 'a'].

SqlFetcher: More efficient table size estimation

Currently uses count(*), which gives an exact number in most cases. That precision isn't needed, so the estimate should be configurable to avoid heavy queries.

Already somewhat configurable since the function may be overridden:

def get_approximate_table_size(
    self,
    table: sqlalchemy.sql.schema.Table,
    id_column: sqlalchemy.sql.schema.Column,
) -> int:
    return self._engine.execute(sqlalchemy.func.count(id_column)).scalar()

Extend `Translator.translated_names()`

Include an option to get the full name-to-source mapping, not just the names. Especially useful for callers that want to act based on the sources used.

Heuristic function examples

Add to docstrings.

Typing issues

Class-level IdType generics
The IdType generic arg is in many cases defined on the class-level, when it should really be defined on the function level. In reality both Translator and Fetcher instances can handle mixed ID types.
Bad NameType / SourceType split.
Treated as separate things though in reality these types are often assumed to be the same.
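
The difference between the two placements of the generic can be sketched as follows. Both classes are hypothetical toys, not the real Fetcher, and the unconstrained `IdType` TypeVar is an assumption.

```python
from typing import Generic, Iterable, TypeVar

IdType = TypeVar("IdType")  # assumption: unconstrained for this illustration


class ClassLevelFetcher(Generic[IdType]):
    """ID type is fixed per instance: ClassLevelFetcher[int] rejects str IDs."""

    def fetch(self, ids: Iterable[IdType]) -> dict[IdType, str]:
        return {i: f"name-of-{i}" for i in ids}


class FunctionLevelFetcher:
    """ID type is bound per call, so one instance can serve mixed ID types."""

    def fetch(self, ids: Iterable[IdType]) -> dict[IdType, str]:
        return {i: f"name-of-{i}" for i in ids}


fetcher = FunctionLevelFetcher()
ints = fetcher.fetch([1, 2])   # inferred as dict[int, str]
strs = fetcher.fetch(["a"])    # inferred as dict[str, str], same instance
```

At runtime the two behave identically; the distinction only matters to a type checker.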

Fetcher-data only `Translator.store` option

At the moment, the entire Translator-object is serialized along with the data. This generally isn't really what the user needs.

Rename `Translator.translate(maximal_untranslated_fraction)`

Options?

max_untranslated
max_fail
fail_limit
untranslated_limit

Should probably be a stand-alone breaking release given the API it's part of.

Fetcher-defined transformers in TOML

Add preprocessing to the TOML, inject into Translator.

[fetching.'<fetcher-type>'.transform.'<source>'.BitmaskTransformer]

Alternatively, just allow another top-level key in extra_fetchers-files.

[transform.'<source>'.BitmaskTransformer]

In either case, raise if <source> is claimed more than once.
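
The duplicate check could be sketched like this, operating on already-parsed TOML dictionaries that use the top-level `transform` key variant. `DuplicateSourceError`, `collect_transformers` and the file names are hypothetical.

```python
class DuplicateSourceError(ValueError):
    """Hypothetical error: a source has transformers defined in several places."""


def collect_transformers(configs: dict[str, dict]) -> dict[str, dict]:
    """Merge 'transform' tables; `configs` maps file name -> parsed TOML dict."""
    owners: dict[str, str] = {}
    merged: dict[str, dict] = {}
    for origin, config in configs.items():
        for source, transformers in config.get("transform", {}).items():
            if source in owners:
                raise DuplicateSourceError(
                    f"source {source!r} claimed by both {owners[source]!r} and {origin!r}"
                )
            owners[source] = origin
            merged[source] = transformers
    return merged


configs = {
    "main.toml": {"transform": {"flags": {"BitmaskTransformer": {}}}},
    "extra.toml": {"transform": {"people": {"BitmaskTransformer": {}}}},
}
print(sorted(collect_transformers(configs)))  # ['flags', 'people']
```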

Move read-from-ENV logic from `SqlFetcher` to the factory

Inplace translation of `pandas.Series` after PDEP-6

Description

After PDEP-6, assigning values incompatible with the current dtype will require an explicit type conversion.

This will break inplace translation, since Series.astype returns a new object. It (should?) keep working for DataFrame, as in that case we simply replace the entire column.

Possible solution

Detect if pandas.Series is compatible with strings. If not, raise NotInplaceTranslatableError if inplace=True.

def series_inplace_translatable() -> bool:
    try:
        pd.Series(dtype=int)[:] = ""
    except ExceptionType:  # placeholder, see below
        return False
    return True

The exact exception type that will be raised is not known yet.

Traceback (most recent call last):
  File "/home/dev/git/id-translation/src/id_translation/_translator.py", line 581, in restore
    ans = pickle.load(f)  # noqa: S301
          ^^^^^^^^^^^^^^
  File "/home/dev/git/id-translation/.venv/lib/python3.11/site-packages/sqlalchemy/util/langhelpers.py", line 1636, in __new__
    raise TypeError(
TypeError: Can't replace canonical symbol for unpickled with new int value -4533261097786850201

Works for sqlalchemy<2.

Blacklisted IDs
Remove aliases/known duplicates
Remove IDs which have been removed in the source (why is the user translating these?)
Convert between UUID/ID (does anyone actually need this?)

Possible signature:

def user_id_transformation(ids: list[IdType], names: list[NameType], source: SourceType) -> list[IdType]:
    """Map IDs based on the name(s) being translated and the source to which they've been mapped.

    Args:
        ids: Unique IDs for which translation data will be fetched.
        names: The different names from which the `ids` have been extracted.
        source: The source from which translation data will be fetched.

    Returns:
        A transformed subset of `ids`.
    """
    raise NotImplementedError

What level should this be defined on? Probably individual fetcher.
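
As an illustration of the proposed signature, a transformation covering the first two use cases (blacklisting and alias removal) might look roughly like this. The `BLACKLIST` and `ALIASES` tables and their contents are invented for the example.

```python
# Hypothetical per-source configuration.
BLACKLIST = {"people": {-1}}       # IDs that should never be fetched
ALIASES = {"people": {3: 1}}       # known duplicate -> canonical ID


def user_id_transformation(ids, names, source):
    """Drop blacklisted IDs and collapse aliases, preserving input order."""
    blacklist = BLACKLIST.get(source, set())
    aliases = ALIASES.get(source, {})
    seen = set()
    result = []
    for i in ids:
        i = aliases.get(i, i)  # replace an alias with its canonical ID
        if i in blacklist or i in seen:
            continue
        seen.add(i)
        result.append(i)
    return result


print(user_id_transformation([1, 2, -1, 3], ["person_id"], "people"))  # [1, 2]
```

Note that `names` is unused in this sketch; it is kept only to match the proposed signature.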

Overloads for `Translator.translate`

Would be nice to have an overload translate(data, inplace=True) -> None, at least.
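
A sketch of how this could be typed with `typing.overload` and `Literal`. The toy `translate` function below only handles lists and uses a hard-coded label table; the real `Translator.translate` has many more parameters and supported types.

```python
from typing import Literal, overload


@overload
def translate(data: list, inplace: Literal[True]) -> None: ...
@overload
def translate(data: list, inplace: Literal[False] = False) -> list: ...


def translate(data, inplace=False):
    """Toy stand-in: returns a new list, or mutates `data` and returns None."""
    labels = {1991: "Richard", 1999: "Sofia"}
    translated = [f"{i}:{labels.get(i, '<unknown>')}" for i in data]
    if inplace:
        data[:] = translated  # overwrite the contents of the caller's list
        return None
    return translated


ids = [1999, 1991]
print(translate(ids))                # ['1999:Sofia', '1991:Richard']
print(translate(ids, inplace=True))  # None
```

With these overloads a type checker reports `None` for `inplace=True` calls, so accidentally using the return value is flagged.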