rsundqvist / id-translation
Turn meaningless IDs into human-readable labels.
License: MIT License
Should this be supported? What would the result look like?
Add callbacks to choose a source (override-style only) based on ID value. Maybe something like:
def func(source, ids, available_sources):
    return {
        "positive": list(filter(lambda i: i >= 0, ids)),
        "negative": list(filter(lambda i: i < 0, ids)),
    } if source == "numbers" else None  # None=use regular mapper

t = Translator(...).translate({"numbers": [-1, 1]}, custom_mapping_function=func)
Add a UserMappingError(MappingError) or similar to report strange return values.
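A minimal sketch of how the return value of such a callback might be validated. The MappingError class below is a local stand-in, and both UserMappingError and validate_user_mapping are hypothetical names. This is not the library's implementation.

```python
class MappingError(ValueError):
    """Local stand-in for the library's MappingError (assumption)."""


class UserMappingError(MappingError):
    """Raised when a user-defined mapping function returns something strange."""


def validate_user_mapping(result, ids, available_sources):
    """Check the return value of a user mapping callback (hypothetical helper)."""
    if result is None:
        return None  # None = use the regular mapper
    if not isinstance(result, dict):
        raise UserMappingError(f"expected dict or None, got {type(result).__name__}")

    unknown = set(result) - set(available_sources)
    if unknown:
        raise UserMappingError(f"unknown sources: {sorted(unknown)}")

    known_ids = set(ids)
    for source, source_ids in result.items():
        extra = set(source_ids) - known_ids
        if extra:
            raise UserMappingError(f"{source!r} received IDs that were never given: {sorted(extra)}")
    return result
```

Raising a dedicated subclass would let callers tell problems in their own callback apart from regular mapping failures.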
Must be an optional dependency.
There's an old issue for this: python/typing/issues/548, which would solve a few typing issues if it were ever implemented.
Configured table sizes
Elements such as Element.make("{id!r:.8}") become equivalent to Element.make("{id}") when using Element.positional_part.
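To illustrate what is lost, the sketch below uses only the standard library's string.Formatter. It does not reproduce how Element is implemented. It only shows that rebuilding a placeholder from the field name alone drops the conversion (!r) and the format spec (.8).

```python
from string import Formatter

fmt = "{id!r:.8}"

# Formatter.parse yields (literal_text, field_name, format_spec, conversion).
parsed = list(Formatter().parse(fmt))

# Rebuild a positional format string from the literal text only,
# discarding "!r" and ".8" along the way.
positional = "".join(
    literal + ("{}" if name is not None else "")
    for literal, name, spec, conversion in parsed
)

value = "abcdefghijkl"
full = fmt.format(id=value)          # repr() of the value, cut to 8 characters
stripped = positional.format(value)  # plain str(), no truncation
```

Here full and stripped differ, which is the behaviour described above.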
The function below works but is limited.
import pandas as pd
from id_translation.offline import TranslationMap

def translate_as_categories(df: pd.DataFrame, tmap: TranslationMap) -> pd.DataFrame:
    from id_translation.dio import resolve_io

    dtypes = {
        # sort_index() to ensure ordering by ID
        column: pd.CategoricalDtype(pd.Series(tmap[column]).sort_index(), ordered=True)
        for column in df
    }
    return resolve_io(df).insert(df, names=list(df), tmap=tmap, copy=False).astype(dtypes)
Not very convenient though, and requires some knowledge of internal id_translation types.
Setup
>>> data = {1999: "Sofia", 1991: "Richard"}
>>> from id_translation import Translator
>>> translator = Translator({"people": data})
>>> translator
Translator(online=False: cache=TranslationMap('people': 2 IDs))
Create data
>>> df = pd.Series(list(data)).to_frame("people")
>>> df = df.sample(4, replace=True).reset_index(drop=True)
>>> df.T
people 1999 1999 1991 1999
Apply
>>> df = translate_as_categories(df, translator.cache)
Result
>>> df.T
people 1999:Sofia 1999:Sofia 1991:Richard 1999:Sofia
>>> df["people"].dtype
CategoricalDtype(categories=['1991:Richard', '1999:Sofia'], ordered=True, categories_dtype=object)
Maybe it's enough to put this up in the documentation/examples.
Does not currently work. Running
from rics.translation import Translator
left, right = Translator().translate([1, 1], names=list("aa"))
assert left == right
gives:
ValueError: Number of names 2 must be 1 or equal to the length of the data (2) to translate, but got names=['b', 'a'].
Currently uses count(*) that will give an exact number in most cases, which isn't needed. Should be configurable to avoid heavy queries.
Already somewhat configurable since the function may be overridden:
def get_approximate_table_size(
    self,
    table: sqlalchemy.sql.schema.Table,
    id_column: sqlalchemy.sql.schema.Column,
) -> int:
    return self._engine.execute(sqlalchemy.func.count(id_column)).scalar()
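One way a configurable size estimate might look, sketched with the standard library's sqlite3 instead of SQLAlchemy so that it runs on its own. The get_table_size helper and its exact flag are hypothetical. The approximation (max - min + 1) is a swapped-in technique that assumes integer IDs with few gaps and an index on the ID column.

```python
import sqlite3


def get_table_size(connection, table, id_column, exact=True):
    """Return the table size, exactly or approximately (hypothetical helper).

    Identifiers are assumed to come from trusted configuration, not user input.
    """
    if exact:
        query = f'SELECT count("{id_column}") FROM "{table}"'
    else:
        # Cheap on an indexed column, but overestimates when IDs have gaps.
        query = f'SELECT max("{id_column}") - min("{id_column}") + 1 FROM "{table}"'
    size = connection.execute(query).fetchone()[0]
    return size or 0  # max/min are NULL for an empty table


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (5, "d")],  # note the gap at id=4
)
```

With the gap at id=4, the exact count and the estimate differ by one, which is usually acceptable when the size is only used as a rough guide.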
Include an option to get the full name-to-source mapping, not just the names. Especially useful for callers that want to act based on the sources used.
Add to docstrings.
IdType generics
IdType generic arg is in many cases defined on the class-level, when it should really be defined on the function level. In reality both Translator and Fetcher instances can handle mixed ID types.
NameType / SourceType split.
At the moment, the entire Translator-object is serialized along with the data. This generally isn't really what the user needs.
Options?
max_untranslated
max_fail
fail_limit
untranslated_limit
Should probably be a stand-alone breaking release given the API it's part of.
Add preprocessing to the TOML, inject into Translator.
[fetching.'<fetcher-type>'.transform.'<source>'.BitmaskTransformer]
Alternatively, just allow another top-level key in extra_fetchers-files.
[transform.'<source>'.BitmaskTransformer]
In either case, raise if <source> is claimed more than once.
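The duplicate check could be sketched as below. The function operates on already-parsed TOML (plain dicts keyed by file name) using the top-level transform layout. The name collect_transformers and the error type are assumptions, not part of the library.

```python
def collect_transformers(configs):
    """Merge `transform` tables from several parsed TOML files (hypothetical helper).

    Args:
        configs: Mapping of file name to parsed TOML, for example
            {"main.toml": {"transform": {"<source>": {"BitmaskTransformer": {}}}}}.

    Raises:
        ValueError: If any source is claimed by more than one file.
    """
    claimed = {}  # source -> (file name, transformer config)
    for origin, config in configs.items():
        for source, transformer in config.get("transform", {}).items():
            if source in claimed:
                first = claimed[source][0]
                raise ValueError(f"source {source!r} claimed by both {first!r} and {origin!r}")
            claimed[source] = (origin, transformer)
    return {source: transformer for source, (_, transformer) in claimed.items()}
```

Recording which file made the first claim keeps the error message actionable when several extra_fetchers-files are in play.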
After PDEP-6, assigning values incompatible with the current dtype will require an explicit type conversion.
This will break inplace translation, since Series.astype returns a new object. It (should?) keep working for DataFrame, as in that case we simply replace the entire column.
Detect if pandas.Series is compatible with strings. If not, raise NotInplaceTranslatableError if inplace=True.
def series_inplace_translatable() -> bool:
    try:
        pd.Series(dtype=int)[:] = ""
    except ExceptionType:  # placeholder, see note below
        raise NotInplaceTranslatableError
    return True
The exact exception type that will be raised is not known yet.
The only officially supported types are int and str. Add uuid.UUID as well.
UUIDs are implicitly treated as strings, which may cause issues when they're improperly stored (e.g. a bad type in a SQL database) or read from a source that forces implicit types (CSV).
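A small illustration of why string-typed UUIDs are fragile, using only the standard library. The as_uuid helper is hypothetical. It shows one possible normalization step, not what the library does.

```python
import uuid


def as_uuid(value):
    """Coerce a UUID-like value to uuid.UUID (hypothetical helper)."""
    if isinstance(value, uuid.UUID):
        return value
    return uuid.UUID(str(value))  # raises ValueError for malformed input


stored = "550E8400-E29B-41D4-A716-446655440000"  # e.g. as read from a CSV file
wanted = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")

same_as_strings = stored == str(wanted)   # string comparison is case-sensitive
same_as_uuids = as_uuid(stored) == wanted  # UUID comparison is not
```

Comparing as strings fails on casing alone, while comparing as uuid.UUID objects succeeds.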
The id_translation.typing.ID-attribute can be modified, but this will just break translation.
Allow users to set the ID field in order to use something other than 'id' as the ID-placeholder.
Using explicit or derived names.
Traceback (most recent call last):
File "/home/dev/git/id-translation/src/id_translation/_translator.py", line 581, in restore
ans = pickle.load(f) # noqa: S301
^^^^^^^^^^^^^^
File "/home/dev/git/id-translation/.venv/lib/python3.11/site-packages/sqlalchemy/util/langhelpers.py", line 1636, in __new__
raise TypeError(
TypeError: Can't replace canonical symbol for unpickled with new int value -4533261097786850201
Works for sqlalchemy<2.
Sizes are currently computed for all tables when sources are requested, not when sizes are actually needed.
Must be an optional dependency. Build on whatever comes out of #226.
Currently part of Translator, for historical reasons.
This (probably?) won't work for any non-local filesystem.
What will this break?
https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html
Possible use cases:
Possible signature:
def user_id_transformation(ids: list[IdType], names: list[NameType], source: SourceType) -> list[IdType]:
    """Map IDs based on the name(s) being translated and the source to which they've been mapped.

    Args:
        ids: Unique IDs for which translation data will be fetched.
        names: The different names from which the `ids` have been extracted.
        source: The source from which translation data will be fetched.

    Returns:
        A transformed subset of `ids`.
    """
    raise NotImplementedError
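As a concrete but purely illustrative instance of the signature above, the function below skips negative placeholder IDs for a single source. The source name "numbers" and the negative-means-placeholder rule are assumptions made for the example.

```python
def drop_placeholder_ids(ids: list[int], names: list[str], source: str) -> list[int]:
    """Skip negative placeholder IDs when fetching from the "numbers" source.

    Follows the proposed signature: the return value is a subset of `ids`.
    """
    if source != "numbers":
        return ids  # leave other sources untouched
    return [i for i in ids if i >= 0]
```

A call such as drop_placeholder_ids([-1, 1, 2], ["positive"], "numbers") would then fetch translations only for 1 and 2.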
What level should this be defined on? Probably individual fetcher.
Would be nice to have an overload translate(data, inplace=True) -> None, at least.
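Such an overload might be declared with typing.overload and Literal, as sketched below. The translate function here is a toy stand-in operating on lists with made-up labels. It is not Translator.translate.

```python
from typing import Literal, overload


@overload
def translate(data: list, inplace: Literal[True]) -> None: ...
@overload
def translate(data: list, inplace: Literal[False] = False) -> list: ...
def translate(data, inplace=False):
    """Toy stand-in: replace known IDs by 'id:name' labels."""
    labels = {1991: "1991:Richard", 1999: "1999:Sofia"}
    translated = [labels.get(i, i) for i in data]
    if inplace:
        data[:] = translated  # mutate the caller's list
        return None
    return translated
```

With these declarations a type checker infers None for translate(data, inplace=True) and list otherwise, so callers no longer have to handle an optional return value.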