
srsly's Introduction

srsly: Modern high-performance serialization utilities for Python

This package bundles some of the best Python serialization libraries into one standalone package, with a high-level API that makes it easy to write code that's correct across platforms and Pythons. This allows us to provide all the serialization utilities we need in a single binary wheel. Currently supports JSON, JSONL, MessagePack, Pickle and YAML.


Motivation

Serialization is hard, especially across Python versions and multiple platforms. After dealing with many subtle bugs over the years (encodings, locales, large files) our libraries like spaCy and Prodigy had steadily grown a number of utility functions to wrap the multiple serialization formats we need to support (especially json, msgpack and pickle). These wrapping functions ended up duplicated across our codebases, so we wanted to put them in one place.

At the same time, we noticed that having a lot of small dependencies was making maintenance harder, and making installation slower. To solve this, we've made srsly standalone, by including the component packages directly within it. This way we can provide all the serialization utilities we need in a single binary wheel.

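srsly currently includes forks of the following packages, each exposed as a submodule:

  • ujson (srsly.ujson)
  • msgpack (srsly.msgpack)
  • cloudpickle (srsly.cloudpickle)
  • ruamel.yaml (srsly.ruamel_yaml)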

Installation

⚠️ Note that v2.x is only compatible with Python 3.6+. For 2.7+ compatibility, use v1.x.

srsly can be installed from pip. Before installing, make sure that your pip, setuptools and wheel are up to date.

python -m pip install -U pip setuptools wheel
python -m pip install srsly

Or from conda via conda-forge:

conda install -c conda-forge srsly

Alternatively, you can compile the library from source. You'll need a development environment with a Python distribution including header files, a compiler (Xcode command-line tools on macOS or Visual C++ build tools on Windows), pip and git.

Install from source:

# clone the repo
git clone https://github.com/explosion/srsly
cd srsly

# create a virtual environment
python -m venv .env
source .env/bin/activate

# update pip
python -m pip install -U pip setuptools wheel

# compile and install from source
python -m pip install .

For developers, install requirements separately and then install in editable mode without build isolation:

# install in editable mode
python -m pip install -r requirements.txt
python -m pip install --no-build-isolation --editable .

# run test suite
python -m pytest --pyargs srsly

API

JSON

📦 The underlying module is exposed via srsly.ujson. However, we normally interact with it via the utility functions only.

function srsly.json_dumps

Serialize an object to a JSON string. Falls back to json if sort_keys=True is used (until it's fixed in ujson).

data = {"foo": "bar", "baz": 123}
json_string = srsly.json_dumps(data)
| Argument | Type | Description |
| --- | --- | --- |
| data | - | The JSON-serializable data to output. |
| indent | int | Number of spaces used to indent JSON. Defaults to 0. |
| sort_keys | bool | Sort dictionary keys. Defaults to False. |
| RETURNS | str | The serialized string. |

function srsly.json_loads

Deserialize unicode or bytes to a Python object.

data = '{"foo": "bar", "baz": 123}'
obj = srsly.json_loads(data)
| Argument | Type | Description |
| --- | --- | --- |
| data | str / bytes | The data to deserialize. |
| RETURNS | - | The deserialized Python object. |

function srsly.write_json

Create a JSON file and dump contents or write to standard output.

data = {"foo": "bar", "baz": 123}
srsly.write_json("/path/to/file.json", data)
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path or "-" to write to stdout. |
| data | - | The JSON-serializable data to output. |
| indent | int | Number of spaces used to indent JSON. Defaults to 2. |

function srsly.read_json

Load JSON from a file or standard input.

data = srsly.read_json("/path/to/file.json")
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path or "-" to read from stdin. |
| RETURNS | dict / list | The loaded JSON content. |

function srsly.write_gzip_json

Create a gzipped JSON file and dump contents.

data = {"foo": "bar", "baz": 123}
srsly.write_gzip_json("/path/to/file.json.gz", data)
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path. |
| data | - | The JSON-serializable data to output. |
| indent | int | Number of spaces used to indent JSON. Defaults to 2. |

function srsly.write_gzip_jsonl

Create a gzipped JSONL file and dump contents.

data = [{"foo": "bar"}, {"baz": 123}]
srsly.write_gzip_jsonl("/path/to/file.jsonl.gz", data)
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path. |
| lines | - | The JSON-serializable contents of each line. |
| append | bool | Whether or not to append to the location. Appending to .gz files is generally not recommended, as it doesn't allow the algorithm to take advantage of all data when compressing, so files may be poorly compressed. |
| append_new_line | bool | Whether or not to write a new line before appending to the file. |

function srsly.read_gzip_json

Load gzipped JSON from a file.

data = srsly.read_gzip_json("/path/to/file.json.gz")
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path. |
| RETURNS | dict / list | The loaded JSON content. |

function srsly.read_gzip_jsonl

Load gzipped JSONL from a file.

data = srsly.read_gzip_jsonl("/path/to/file.jsonl.gz")
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path. |
| RETURNS | dict / list | The loaded JSONL content. |
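
To make the file format concrete, here is a minimal sketch of a gzipped JSONL round trip using only the standard library. It is not srsly's implementation; the helper names write_gz_jsonl and read_gz_jsonl are hypothetical and exist only for this illustration.

```python
import gzip
import json
import os
import tempfile


def write_gz_jsonl(path, lines):
    # Hypothetical helper: one JSON document per line, gzip-compressed.
    with gzip.open(path, "wt", encoding="utf8") as f:
        for line in lines:
            f.write(json.dumps(line) + "\n")


def read_gz_jsonl(path):
    # Hypothetical helper: yield one parsed object per non-empty line.
    with gzip.open(path, "rt", encoding="utf8") as f:
        for raw in f:
            if raw.strip():
                yield json.loads(raw)


path = os.path.join(tempfile.mkdtemp(), "file.jsonl.gz")
records = [{"foo": "bar"}, {"baz": 123}]
write_gz_jsonl(path, records)
loaded = list(read_gz_jsonl(path))
```

Each record occupies exactly one line of the decompressed stream, which is why the reader can process the file lazily.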

function srsly.write_jsonl

Create a JSONL file (newline-delimited JSON) and dump contents line by line, or write to standard output.

data = [{"foo": "bar"}, {"baz": 123}]
srsly.write_jsonl("/path/to/file.jsonl", data)
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path or "-" to write to stdout. |
| lines | iterable | The JSON-serializable lines. |
| append | bool | Append to an existing file. Will open it in "a" mode and insert a newline before writing lines. Defaults to False. |
| append_new_line | bool | Defines whether a new line should first be written when appending to an existing file. Defaults to True. |
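
The interaction between append and append_new_line can be sketched with the standard library alone. This is an illustration of the documented behavior, not srsly's code; append_jsonl is a hypothetical helper.

```python
import json
import os
import tempfile


def append_jsonl(path, lines, append_new_line=True):
    # Hypothetical helper mirroring the documented options: open in "a" mode
    # and optionally write a newline before the appended lines.
    with open(path, "a", encoding="utf8") as f:
        if append_new_line:
            f.write("\n")
        for line in lines:
            f.write(json.dumps(line) + "\n")


path = os.path.join(tempfile.mkdtemp(), "file.jsonl")
with open(path, "w", encoding="utf8") as f:
    f.write(json.dumps({"foo": "bar"}) + "\n")  # file already ends with "\n"

# The file already ends with a newline, so skip the extra one.
append_jsonl(path, [{"baz": 123}], append_new_line=False)
with open(path, encoding="utf8") as f:
    content = f.read()
```

In this sketch, passing append_new_line=True for a file that already ends with a newline would leave a blank line between the old and new records.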

function srsly.read_jsonl

Read a JSONL file (newline-delimited JSON) or from JSONL data from standard input and yield contents line by line. Blank lines will always be skipped.

data = srsly.read_jsonl("/path/to/file.jsonl")
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path or "-" to read from stdin. |
| skip | bool | Skip broken lines and don't raise ValueError. Defaults to False. |
| YIELDS | - | The loaded JSON contents of each line. |
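
The blank-line and skip semantics described above can be sketched as a small generator built on the standard json module. The name iter_jsonl is hypothetical and the code is an approximation of the documented behavior, not the library's implementation.

```python
import json


def iter_jsonl(raw_lines, skip=False):
    # Hypothetical helper: parse an iterable of text lines as JSONL.
    for raw in raw_lines:
        raw = raw.strip()
        if not raw:
            continue  # blank lines are always skipped
        try:
            yield json.loads(raw)
        except ValueError:
            if not skip:
                raise  # broken line and skip=False: propagate the error


lines = ['{"foo": "bar"}', "", "{broken", '{"baz": 123}']
parsed = list(iter_jsonl(lines, skip=True))
```

Because the function is a generator, nothing is parsed until the result is iterated, so errors from broken lines surface during iteration rather than at call time.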

function srsly.is_json_serializable

Check if a Python object is JSON-serializable.

assert srsly.is_json_serializable({"hello": "world"}) is True
assert srsly.is_json_serializable(lambda x: x) is False
| Argument | Type | Description |
| --- | --- | --- |
| obj | - | The object to check. |
| RETURNS | bool | Whether the object is JSON-serializable. |

msgpack

📦 The underlying module is exposed via srsly.msgpack. However, we normally interact with it via the utility functions only.

function srsly.msgpack_dumps

Serialize an object to a msgpack byte string.

data = {"foo": "bar", "baz": 123}
msg = srsly.msgpack_dumps(data)
| Argument | Type | Description |
| --- | --- | --- |
| data | - | The data to serialize. |
| RETURNS | bytes | The serialized bytes. |

function srsly.msgpack_loads

Deserialize msgpack bytes to a Python object.

msg = b"\x82\xa3foo\xa3bar\xa3baz{"
data = srsly.msgpack_loads(msg)
| Argument | Type | Description |
| --- | --- | --- |
| data | bytes | The data to deserialize. |
| use_list | bool | Return lists instead of tuples. Can make deserialization slower. Defaults to True. |
| RETURNS | - | The deserialized Python object. |

function srsly.write_msgpack

Create a msgpack file and dump contents.

data = {"foo": "bar", "baz": 123}
srsly.write_msgpack("/path/to/file.msg", data)
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path. |
| data | - | The data to serialize. |

function srsly.read_msgpack

Load a msgpack file.

data = srsly.read_msgpack("/path/to/file.msg")
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path. |
| use_list | bool | Return lists instead of tuples. Can make deserialization slower. Defaults to True. |
| RETURNS | - | The loaded and deserialized content. |

pickle

📦 The underlying module is exposed via srsly.cloudpickle. However, we normally interact with it via the utility functions only.

function srsly.pickle_dumps

Serialize a Python object with pickle.

data = {"foo": "bar", "baz": 123}
pickled_data = srsly.pickle_dumps(data)
| Argument | Type | Description |
| --- | --- | --- |
| data | - | The object to serialize. |
| protocol | int | Protocol to use. -1 for highest. Defaults to None. |
| RETURNS | bytes | The serialized object. |

function srsly.pickle_loads

Deserialize bytes with pickle.

pickled_data = b"\x80\x04\x95\x19\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x03foo\x94\x8c\x03bar\x94\x8c\x03baz\x94K{u."
data = srsly.pickle_loads(pickled_data)
| Argument | Type | Description |
| --- | --- | --- |
| data | bytes | The data to deserialize. |
| RETURNS | - | The deserialized Python object. |

YAML

📦 The underlying module is exposed via srsly.ruamel_yaml. However, we normally interact with it via the utility functions only.

function srsly.yaml_dumps

Serialize an object to a YAML string. See the ruamel.yaml docs for details on the indentation format.

data = {"foo": "bar", "baz": 123}
yaml_string = srsly.yaml_dumps(data)
| Argument | Type | Description |
| --- | --- | --- |
| data | - | The JSON-serializable data to output. |
| indent_mapping | int | Mapping indentation. Defaults to 2. |
| indent_sequence | int | Sequence indentation. Defaults to 4. |
| indent_offset | int | Indentation offset. Defaults to 2. |
| sort_keys | bool | Sort dictionary keys. Defaults to False. |
| RETURNS | str | The serialized string. |

function srsly.yaml_loads

Deserialize unicode or a file object to a Python object.

data = 'foo: bar\nbaz: 123'
obj = srsly.yaml_loads(data)
| Argument | Type | Description |
| --- | --- | --- |
| data | str / file | The data to deserialize. |
| RETURNS | - | The deserialized Python object. |

function srsly.write_yaml

Create a YAML file and dump contents or write to standard output.

data = {"foo": "bar", "baz": 123}
srsly.write_yaml("/path/to/file.yml", data)
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path or "-" to write to stdout. |
| data | - | The JSON-serializable data to output. |
| indent_mapping | int | Mapping indentation. Defaults to 2. |
| indent_sequence | int | Sequence indentation. Defaults to 4. |
| indent_offset | int | Indentation offset. Defaults to 2. |
| sort_keys | bool | Sort dictionary keys. Defaults to False. |

function srsly.read_yaml

Load YAML from a file or standard input.

data = srsly.read_yaml("/path/to/file.yml")
| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The file path or "-" to read from stdin. |
| RETURNS | dict / list | The loaded YAML content. |

function srsly.is_yaml_serializable

Check if a Python object is YAML-serializable.

assert srsly.is_yaml_serializable({"hello": "world"}) is True
assert srsly.is_yaml_serializable(lambda x: x) is False
| Argument | Type | Description |
| --- | --- | --- |
| obj | - | The object to check. |
| RETURNS | bool | Whether the object is YAML-serializable. |

srsly's People

Contributors

adrianeboyd, erjanmx, honnibal, ines, koaning, musicinmybrain, nyejon, pfvosi, polm, rmitsch, sadovnychyi, shadchin, svlandeg, willfrey


srsly's Issues

Update Ultrajson

Hi,

Ultrajson was recently updated. I'm not sure how this works with forks and with srsly being standalone, but would it be possible to update the version bundled in srsly to the latest release?

//Tim

srsly.write_json segfaults trying to write numpy types

Hi,

I was using srsly for json serialization and it killed my jupyter kernels without producing any errors. It turns out it's due to ujson segmentation faults trying to serialize types it doesn't understand. It's quite easy to reproduce though:

import numpy as np
import srsly
f = np.float32()
srsly.write_json("x.json", f)

I believe this has now been fixed in ujson as part of ultrajson/ultrajson@53f85b1 (issue ultrajson/ultrajson#294). Running an equivalent test with the latest ujson (version 3.0) gives a sensible error.

Prior to that fix, it looked like some people were abandoning ujson (noirbizarre/flask-restplus#589) and there was a Pandas fork (https://github.com/pandas-dev/pandas/tree/master/pandas/_libs/src/ujson/python) before it started being maintained again.

Are there any plans to update the version used by srsly?
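
Until the bundled version changes, one possible workaround is to convert array-like values to built-in Python types before serializing. The sketch below is an assumption-laden illustration rather than anything confirmed in this thread: to_builtin is a hypothetical helper, it relies on values exposing tolist() as NumPy arrays and scalars do, and FakeScalar is a stand-in so the example runs without NumPy.

```python
import json


def to_builtin(obj):
    # Hypothetical helper: recursively replace array-like values with plain
    # Python types. Assumes such values expose tolist(), as NumPy arrays and
    # scalars do.
    if isinstance(obj, dict):
        return {key: to_builtin(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_builtin(value) for value in obj]
    if hasattr(obj, "tolist"):
        return obj.tolist()
    return obj


class FakeScalar:
    # Stand-in for a NumPy scalar so the sketch runs without NumPy installed.
    def __init__(self, value):
        self.value = value

    def tolist(self):
        return self.value


clean = to_builtin({"score": FakeScalar(0.5), "ids": [FakeScalar(1), 2]})
json_string = json.dumps(clean)
```

The standard json module is used here only to show that the converted structure is serializable; the same converted object could then be handed to whichever writer is in use.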

Add numbin as another serialization library

Hi, I'm the author of numbin (an efficient binary serialization format for NumPy data). I just found this repo and wonder whether you would accept numbin as another serialization library.

Check the benchmark here.

BTW, I didn't know there was a library called msgpack-numpy before I developed numbin. After exploring its source code, I think numbin would provide better performance and flexibility.

Is there any interest in supporting CSV?

We get some strange issues with the pandas’ read_csv function that are similar to the issues for which srsly seems to have been formed. Is there interest in creating high-performance support for CSV files that better handles issues as compared to pandas?

Not support ensure_ascii=False for write_json

Dear,

json.dump has a very useful ensure_ascii option for non-English text.
As a Chinese speaker, I always choose ensure_ascii=False.

Could srsly support ensure_ascii feature in write_json?
thanks!!
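
For reference, this is how the flag behaves in the standard library's json module. The snippet only illustrates the requested behavior; it does not show anything about srsly.write_json itself.

```python
import json

data = {"greeting": "你好"}

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as-is, which is easier to read.
readable = json.dumps(data, ensure_ascii=False)
```

Both strings decode to the same object, so the option affects only how readable the output is, not what it contains.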

Release v1.0.0 was removed from PyPI.org

The release package for 1.0.0 was removed from PyPI.org. Doing this causes problems for many automated build systems and pipelines, and removing packages from repos is frowned upon. Please restore the package on PyPI. If it is broken in some way, please release a v1.0.1 instead of removing 1.0.0.


pip3 install srsly==1.0.0
Collecting srsly==1.0.0
  Could not find a version that satisfies the requirement srsly==1.0.0 (from versions: 0.0.1, 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.0.6, 0.0.7, 0.1.0, 0.2.0, 2.0.0.dev0, 2.0.0.dev1, 2.0.0)
No matching distribution found for srsly==1.0.0


Bug: Cannot install if Cython is not already installed

Summary

The latest version 1 tag (v1.0.5) altered setup.py to import directly from Cython. This means that installation fails with No module named 'Cython' unless you have installed Cython manually.

Projects that use srsly as a dependency will now need to be edited to install Cython manually, whereas before setuptools handled this install.

This also affects version 2 and the master branch.

Edit

I've looked into this, and I think the expectation is that pyproject.toml will handle the Cython pre-install. I am using Python 3.6.5, and I suspect this Python version (or its setuptools) is not reading that file.

Reproduce

$ docker run -w /home/circleci circleci/python:3.6.5 bash -c "python3 -m venv venv; . venv/bin/activate; pip install srsly==1.0.5" 
Collecting srsly==1.0.5
  Downloading https://files.pythonhosted.org/packages/c7/08/abe935f33b69a08d365b95e62b47ef48f93a69ab734e623248a8a4079ecb/srsly-1.0.5.tar.gz (86kB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-vjnjnimo/srsly/setup.py", line 11, in <module>
        from Cython.Build import cythonize
    ModuleNotFoundError: No module named 'Cython'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-vjnjnimo/srsly/

I've provided a Docker command for reproduction so that the environment matches exactly.

Entities

#44
https://github.com/explosion/srsly/blob/v1.0.5/setup.py#L11

How do I say “srsly”?

It takes too long to verbally spell out s-r-s-l-y! How is it meant to be spoken?

  • seriously
  • serially
  • sirs-ly
  • …?

TypeError: 'escape_forward_slashes' is an invalid keyword argument for this function, when loading SpaCy 3.0 model (Python 3.10)

When using Python 3.10.8, I get the following error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/confection/__init__.py", line 495, in try_dump_json
    return srsly.json_dumps(value)
  File "/opt/conda/lib/python3.10/site-packages/srsly/_json_api.py", line 26, in json_dumps
    result = ujson.dumps(data, indent=indent, escape_forward_slashes=False)
TypeError: 'escape_forward_slashes' is an invalid keyword argument for this function

The above exception was the direct cause of the following exception:

 File "/app/ML/util/nlputil.py", line 35, in load_spacy_tokenizer
    nlp = spacy.load("en_core_web_sm", disable=disable)
  File "/opt/conda/lib/python3.10/site-packages/spacy/__init__.py", line 54, in load
    return util.load_model(
  File "/opt/conda/lib/python3.10/site-packages/spacy/util.py", line 432, in load_model
    return load_model_from_package(name, **kwargs)  # type: ignore[arg-type]
  File "/opt/conda/lib/python3.10/site-packages/spacy/util.py", line 468, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, enable=enable, exclude=exclude, config=config)  # type: ignore[attr-defined]
  File "/opt/conda/lib/python3.10/site-packages/en_core_web_sm/__init__.py", line 10, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/opt/conda/lib/python3.10/site-packages/spacy/util.py", line 649, in load_model_from_init_py
    return load_model_from_path(
  File "/opt/conda/lib/python3.10/site-packages/spacy/util.py", line 506, in load_model_from_path
    nlp = load_model_from_config(
  File "/opt/conda/lib/python3.10/site-packages/spacy/util.py", line 554, in load_model_from_config
    nlp = lang_cls.from_config(
  File "/opt/conda/lib/python3.10/site-packages/spacy/language.py", line 1781, in from_config
    interpolated = filled.interpolate() if not filled.is_interpolated else filled
  File "/opt/conda/lib/python3.10/site-packages/confection/__init__.py", line 196, in interpolate
    return Config().from_str(self.to_str())
  File "/opt/conda/lib/python3.10/site-packages/confection/__init__.py", line 419, in to_str
    flattened.set(section_name, key, try_dump_json(value, node))
  File "/opt/conda/lib/python3.10/site-packages/confection/__init__.py", line 503, in try_dump_json
    raise ConfigValidationError(config=data, desc=err_msg) from e
confection.ConfigValidationError:

Config validation error
Couldn't serialize config value of type <class 'NoneType'>: 'escape_forward_slashes' is an invalid keyword argument for this function. Make sure all values in your config are JSON-serializable. If you want to include Python objects, use a registered function that returns the object instead.
{'train': None, 'dev': None, 'vectors': None, 'init_tok2vec': None}

Has anyone seen this error before?

TypeError: 'escape_forward_slashes' is an invalid keyword argument for this function, when loading SpaCy 3.5 model (Python 3.8)

Since #83 is closed, I'm opening it again.

  • I'm using Spacy 3.5.1 on Python 3.8.15 and experiencing the same issue. srsly version is 2.4.6.
  • OS is Debian 10 Buster. My system had ujson installed, I tried uninstalling it but it didn't solve the issue.
  • Spacy 3.3.1 with srsly 2.4.3 in the same environment also gives the same error.

The code to reproduce:

bash:
python3 -m spacy download it_core_news_sm
python3:
import spacy

spacy.load("it_core_news_sm")

The full trace

nlp = spacy.load(model_name, exclude=excluded_pipes)
  File "/opt/conda/default/lib/python3.8/site-packages/spacy/__init__.py", line 54, in load
    return util.load_model(
  File "/opt/conda/default/lib/python3.8/site-packages/spacy/util.py", line 442, in load_model
    return load_model_from_package(name, **kwargs)  # type: ignore[arg-type]
  File "/opt/conda/default/lib/python3.8/site-packages/spacy/util.py", line 478, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, enable=enable, exclude=exclude, config=config)  # type: ignore[attr-defined]
  File "/opt/conda/default/lib/python3.8/site-packages/it_core_news_sm/__init__.py", line 10, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/opt/conda/default/lib/python3.8/site-packages/spacy/util.py", line 659, in load_model_from_init_py
    return load_model_from_path(
  File "/opt/conda/default/lib/python3.8/site-packages/spacy/util.py", line 516, in load_model_from_path
    nlp = load_model_from_config(
  File "/opt/conda/default/lib/python3.8/site-packages/spacy/util.py", line 564, in load_model_from_config
    nlp = lang_cls.from_config(
  File "/opt/conda/default/lib/python3.8/site-packages/spacy/language.py", line 1781, in from_config
    interpolated = filled.interpolate() if not filled.is_interpolated else filled
  File "/opt/conda/default/lib/python3.8/site-packages/confection/__init__.py", line 196, in interpolate
    return Config().from_str(self.to_str())
  File "/opt/conda/default/lib/python3.8/site-packages/confection/__init__.py", line 419, in to_str
    flattened.set(section_name, key, try_dump_json(value, node))
  File "/opt/conda/default/lib/python3.8/site-packages/confection/__init__.py", line 503, in try_dump_json
    raise ConfigValidationError(config=data, desc=err_msg) from e
confection.ConfigValidationError: 

Config validation error
Couldn't serialize config value of type <class 'NoneType'>: 'escape_forward_slashes' is an invalid keyword argument for this function. Make sure all values in your config are JSON-serializable. If you want to include Python objects, use a registered function that returns the object instead.

Rounding error in tests for x86 and aarch64

Compiling srsly 0.2.0 on Arch Alpine Linux, I got the following error in the tests for the x86 and aarch64 architectures (https://cloud.drone.io/alpinelinux/aports/13161/1/1 , https://cloud.drone.io/alpinelinux/aports/13161/3/1):


=================================== FAILURES ===================================
____________ UltraJSONTests.test_decodeFloatingPointAdditionalTests ____________

self = <srsly.tests.ujson.test_ujson.UltraJSONTests testMethod=test_decodeFloatingPointAdditionalTests>

    def test_decodeFloatingPointAdditionalTests(self):
        self.assertEqual(-1.1234567893, ujson.loads("-1.1234567893"))
        self.assertEqual(-1.234567893, ujson.loads("-1.234567893"))
        self.assertEqual(-1.34567893, ujson.loads("-1.34567893"))
        self.assertEqual(-1.4567893, ujson.loads("-1.4567893"))
        self.assertEqual(-1.567893, ujson.loads("-1.567893"))
        self.assertEqual(-1.67893, ujson.loads("-1.67893"))
>       self.assertEqual(-1.7893, ujson.loads("-1.7893"))
E       AssertionError: -1.7893 != -1.7893000000000001

srsly/tests/ujson/test_ujson.py:761: AssertionError
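
The two values in the failing assertion differ only in the last bits of the floating-point representation. As a hedged illustration of what a tolerance-based comparison would look like (this is not a change proposed or confirmed in this thread), the standard library can compare them like this:

```python
import math

expected = -1.7893
decoded = -1.7893000000000001  # value reported in the failure above

# Exact equality fails because the two literals map to different doubles.
exactly_equal = expected == decoded

# A relative tolerance treats them as equal for practical purposes.
close_enough = math.isclose(expected, decoded, rel_tol=1e-12)
```

The tolerance of 1e-12 is an arbitrary choice for this sketch; unittest's assertAlmostEqual offers a similar comparison inside a test case.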

Relaxing type annotations a bit?

Hi there!

I noticed a couple of places where type annotations could be relaxed.

  1. write_jsonl could be changed from Sequence to Iterable
  2. FilePath could be written as FilePath = Union[str, "os.PathLike[str]"]

The second change with os.PathLike could also come with some changes to force_path and force_string by removing some unnecessary isinstance(...) checks and using os.fspath, respectively.
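
A rough sketch of what the second proposal could look like is shown below. The bodies of force_path and force_string are assumptions made for illustration and may not match the library's actual signatures or behavior.

```python
import os
from pathlib import Path
from typing import Union

# Alias from the proposal; the quoted form is only evaluated by type checkers.
FilePath = Union[str, "os.PathLike[str]"]


def force_path(location: FilePath) -> Path:
    # Path() accepts both str and os.PathLike, so no isinstance check is needed.
    return Path(location)


def force_string(location: FilePath) -> str:
    # os.fspath returns a str unchanged and calls __fspath__ on path-like objects.
    return os.fspath(location)
```

With this alias, any object implementing __fspath__ is accepted, not only str and pathlib.Path.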

Would you be open to a PR with these changes? If so, I'd be happy to make the changes and open one.

Thanks!

Duplicate uint64_t ctypedef when building with Cython 3.0.8 and Python 3.12

Error compiling Cython file:
------------------------------------------------------------
...
    object PyMemoryView_GetContiguous(object obj, int buffertype, char order)

from libc.stdlib cimport *
from libc.string cimport *
from libc.limits cimport *
ctypedef unsigned long long uint64_t
^
------------------------------------------------------------

srsly/msgpack/_unpacker.pyx:13:0: 'uint64_t' redeclared 

Error compiling Cython file:
------------------------------------------------------------
...
cdef extern from "Python.h":
    ctypedef int int32_t
    ctypedef int int64_t
    ctypedef unsigned int uint32_t
    ctypedef unsigned int uint64_t
    ^
------------------------------------------------------------

/usr/lib/python3.12/site-packages/Cython/Includes/cpython/pyport.pxd:5:4: Previous declaration is here

Build fails on PyPy

I am trying to build spacy 2.1.3 on an alpine PyPy docker image, everything seems to build without an issue except srsly.

Dockerfile to reproduce:

FROM jamiehewland/alpine-pypy:3.6-7.0

RUN apk add --no-cache --virtual .build-deps musl-dev g++ \
    && apk add --no-cache openblas-dev \
    && pip install --no-cache-dir spacy==2.1.3 \
    && rm -rf ~/.cache \
    && apk del .build-deps

Error output:

    ----------------------------------------
    Failed building wheel for srsly
    Running setup.py clean for srsly
  Successfully built cymem preshed murmurhash thinc blis thinc-gpu-ops numpy wrapt
  Failed to build srsly
  Installing collected packages: setuptools, wheel, Cython, cymem, preshed, murmurhash, numpy, blis, thinc-gpu-ops, wrapt, plac, tqdm, six, wasabi, srsly, thinc
    Running setup.py install for srsly: started
      Running setup.py install for srsly: finished with status 'error'
      Complete output from command /usr/local/bin/pypy3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-cxeja_a8/srsly/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-c7x55i7s/install-record.txt --single-version-externally-managed --prefix /tmp/pip-build-env-8tt8bidn/overlay --compile:
      running install
      running build
      running build_py
      running build_ext
      building 'srsly.msgpack._unpacker' extension
      gcc -pthread -DNDEBUG -O2 -fPIC -D__LITTLE_ENDIAN__=1 -I/usr/local/include -I. -I/tmp/pip-install-cxeja_a8/srsly/include -I/usr/local/include -c srsly/msgpack/_unpacker.cpp -o build/temp.linux-x86_64-3.6/srsly/msgpack/_unpacker.o -O2 -Wno-strict-prototypes -Wno-unused-function
      gcc: error: srsly/msgpack/_unpacker.cpp: No such file or directory
      gcc: fatal error: no input files
      compilation terminated.
      error: command 'gcc' failed with exit status 1

      ----------------------------------------

FYI build works OK on a standard python alpine docker image

srsly.read_json is broken

code:

import srsly

print(srsly.read_json("foo.json"))

outputs:

Traceback (most recent call last):
  File "/tmp/foo.py", line 4, in <module>
    print(srsly.read_json("foo.json"))
  File "/Users/yohei_tamura/work/lilly/.venv/lib/python3.8/site-packages/srsly/_json_api.py", line 52, in read_json
    return ujson.load(f)
ValueError: Expected object or value

srsly version: 2.2.0
python version: 3.8.1

add an option to write_jsonl to define whether a new line is written in append mode

Linking to the discussion in this thread:

https://support.prodi.gy/t/db-in-command-cant-handle-blank-lines-in-jsonl/2126/3

It would be useful to be able to define whether or not there should be a blank line appended to a file when using write_jsonl.

I have the scenario where I want to append to a single file using srsly, and it already adds a new line at the end of the file, so there does not need to be a new line written first when appending to the file.

There is an option in read_jsonl to ignore blank lines, which could also be an option for db-in in prodigy, but it would be a good idea anyway to write a single continuous file with multiple appends.

How to use `srsly.msgpack_dumps` with my custom class?

I want to serialize my custom class with srsly.msgpack_dumps, because it is stored in spacy.Doc.
In other words, doc.to_disk fails because my custom class cannot be serialized with srsly.msgpack_dumps.
How can I make my custom class serializable so that it can be saved?
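
One common pattern, offered here as a sketch and not as an answer confirmed by the maintainers, is to convert the object to a dictionary of built-in types before serializing and to rebuild it afterwards. The Annotation class and its to_dict and from_dict methods are hypothetical, and the standard json module stands in for the serializer so the example runs on its own.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Annotation:
    # Hypothetical custom class standing in for the one stored on the Doc.
    label: str
    start: int
    end: int

    def to_dict(self):
        # Only built-in types (str, int) end up in the dictionary.
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        return cls(**data)


original = Annotation(label="ORG", start=0, end=5)
payload = json.dumps(original.to_dict())  # stand-in for the serializer call
restored = Annotation.from_dict(json.loads(payload))
```

The assumption is that a dictionary like the one returned by to_dict() contains only types that the msgpack utilities documented above already handle.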

is_yaml_serializable giving alternating answers for same object

I encountered this problem while writing a custom saving loop that utilises srsly. Here is an example:

import srsly
import numpy as np

data = {
    1: np.ones([10, 10]),
    2: np.zeros([50, 50])
}

assert srsly.is_yaml_serializable(data) == srsly.is_yaml_serializable(data)

will trigger the assertion error.

Even successively running srsly.is_yaml_serializable(data) in the console will result in the answer flipping between True/False with every iteration.

Move to `orjson`

UltraJSON has been put into "maintenance-only" status, and the maintainers recommend upgrading to orjson. orjson is quite a bit faster than ujson, but it does have some API incompatibilities.

I'm working on a project right now where we are caching lots of SpaCy objects, and the numerous calls to json_dumps can slow things down. orjson looks to be about twice as fast as ujson on average, and would make a big difference in our case!

The latest patch release of major version 1 does not have the security fix for CVE-2022-31116

From your release notes:

v2.4.4

Port https://github.com/ultrajson/ultrajson/pull/550 and https://github.com/ultrajson/ultrajson/pull/555 to fix incorrect handling of invalid surrogate pair characters (https://github.com/advisories/GHSA-wpqr-jcpx-745r)

However, the same fix for ultrajson does not seem to have been applied to the latest patch release of major version 1. The version of SpaCy in our code base requires srsly = ">=0.0.6,<1.1.0", and our security scan flagged the vulnerability CVE-2022-31116.

Would it be possible to put through a major version 1 patch release including the security fix to address CVE-2022-31116?

Kind regards,

David Griffiths

Vendored version of cloudpickle does not support Python 3.8

The version of cloudpickle currently vendored in srsly fails to import under Python 3.8 as a result of PEP 570-related changes to types.CodeType's signature:

  File "...../lib/python3.8/site-packages/srsly/__init__.py", line 7, in <module>
    from ._pickle_api import pickle_dumps, pickle_loads
  File "...../lib/python3.8/site-packages/srsly/_pickle_api.py", line 4, in <module>
    from . import cloudpickle
  File "...../lib/python3.8/site-packages/srsly/cloudpickle/__init__.py", line 1, in <module>
    from .cloudpickle import *
  File "...../lib/python3.8/site-packages/srsly/cloudpickle/cloudpickle.py", line 167, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "...../lib/python3.8/site-packages/srsly/cloudpickle/cloudpickle.py", line 148, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

See also:
cloudpipe/cloudpickle#266
cloudpipe/cloudpickle#267

Updating to the latest version of cloudpickle should resolve this.

No module named 'srsly.ujson.ujson'

I'm installing srsly on an AWS Lambda, but I keep getting this error:
No module named 'srsly.ujson.ujson'

The module is installed inside the srsly package but can't be accessed.
I tried changing all the declarations to a global import of ujson, which I downloaded separately, without any success.

Any ideas?

PS: I'm installing SpaCy on the Lambda, which requires srsly.

Unable to import srsly.ujson.ujson for SpaCy -- AWS Lambda

I am trying to add spaCy as a dependency to my Python Lambda. I am doing this by installing spaCy as a standalone dependency inside a directory named dependencies using pip3 install spacy --no-deps -t . This is because I can't load the entire spaCy dependency inside the /tmp directory of my Lambda.

I am able to successfully upload the folder to S3 and download it during the Lambda invocation. When I try to import spacy, I get this error: [ERROR] Runtime.ImportModuleError: Unable to import module : No module named 'srsly.ujson.ujson'.

I manually installed srsly inside dependencies/ and I have all the files that are listed as per this link. This was referenced by this link. One of the responses says, "it seems like Python can't load it, because it's not compiled?". How would I compile a dependency which has a .c file in it?

Another question I found on SO is this question, but I have already manually installed srsly. How do I import the module? Thanks.


I manually check in my code for the presence of ujson before importing spacy like this:

if os.path.exists('/tmp/dependencies/srsly/ujson/ujson.c'):
    print('ujson exists')

and the print statement gets printed.
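
Note that the check above looks for ujson.c, which is the C source. Python imports srsly.ujson.ujson from a compiled extension file (.so on Linux and macOS, .pyd on Windows), so the presence of the .c file does not show that the module can be loaded. As a rough sketch, the following looks for a compiled file whose suffix the running interpreter accepts; find_compiled_extension is a hypothetical helper name, and the path is the one from the post.

```python
import importlib.machinery
from pathlib import Path


def find_compiled_extension(package_dir, module_name="ujson"):
    """List compiled extension files for `module_name` in `package_dir`
    that the current interpreter would accept (hypothetical helper)."""
    package_dir = Path(package_dir)
    matches = []
    # EXTENSION_SUFFIXES holds the file suffixes this interpreter can import,
    # for example a platform-tagged ".so" name on Linux or ".pyd" on Windows.
    for suffix in importlib.machinery.EXTENSION_SUFFIXES:
        candidate = package_dir / (module_name + suffix)
        if candidate.exists():
            matches.append(candidate)
    return matches


# An empty list means no loadable binary for this interpreter was found.
print(find_compiled_extension("/tmp/dependencies/srsly/ujson"))
```

If this prints an empty list inside the Lambda, one likely explanation (an assumption, not confirmed in this thread) is that the package was installed on a machine whose platform or Python version differs from the Lambda runtime, so the binary was either built for another target or never built.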

Memory leaks in ujson

We replaced the usage of json with srsly.ujson a while ago for the free performance boost, since we do lots of JSON encoding/decoding and already have it installed as part of spaCy. Now we've had to move back because of some terrible memory leaks:

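import json
import random
import string

import psutil

# Comment out the next line to run the same loop against the stdlib json.
from srsly import ujson as json


def sample(length):
    # Random string of uppercase letters and digits.
    return ''.join(
        random.choice(string.ascii_uppercase + string.digits) for _ in range(length))


process = psutil.Process()

for i in range(10):
    # Encode and decode about 5 MB of JSON, then report process memory.
    data = json.dumps({sample(99): sample(100000) for _ in range(50)})
    json.loads(data)
    print(process.memory_info())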

Output with ujson:

pmem(rss=24203264, vms=4400664576, pfaults=19049, pageins=0)
pmem(rss=29409280, vms=4414173184, pfaults=33898, pageins=0)
pmem(rss=34557952, vms=4419309568, pfaults=47960, pageins=0)
pmem(rss=39714816, vms=4424429568, pfaults=62855, pageins=0)
pmem(rss=44838912, vms=4429549568, pfaults=77571, pageins=0)
pmem(rss=50081792, vms=4434751488, pfaults=92307, pageins=0)
pmem(rss=55312384, vms=4439973888, pfaults=107288, pageins=0)
pmem(rss=60440576, vms=4445093888, pfaults=122711, pageins=0)
pmem(rss=65806336, vms=4451422208, pfaults=137578, pageins=0)
pmem(rss=70934528, vms=4456542208, pfaults=151382, pageins=0)

Output with stdlib json:

pmem(rss=17154048, vms=4385366016, pfaults=17692, pageins=0)
pmem(rss=17317888, vms=4403191808, pfaults=32047, pageins=0)
pmem(rss=17317888, vms=4403191808, pfaults=46541, pageins=0)
pmem(rss=17317888, vms=4403191808, pfaults=61035, pageins=0)
pmem(rss=17358848, vms=4403191808, pfaults=75539, pageins=0)
pmem(rss=17383424, vms=4403191808, pfaults=90039, pageins=0)
pmem(rss=17383424, vms=4403191808, pfaults=104533, pageins=0)
pmem(rss=17420288, vms=4403191808, pfaults=119036, pageins=0)
pmem(rss=17420288, vms=4403191808, pfaults=133530, pageins=0)
pmem(rss=17420288, vms=4403191808, pfaults=148024, pageins=0)

The benchmark was run on Python 3.7 on macOS, but the same leak exists on Debian with Python 2.7.
If you increase the range beyond 10, you will eventually run out of memory.

Days were spent on this issue, because I would never have suspected the JSON library to be at fault, but it is. I don't know if it affects spaCy in any way.

Could be related: ultrajson/ultrajson#270
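
For anyone who wants to rerun the comparison without editing imports, here is a minimal sketch that takes the dumps/loads pair as arguments. It is not the script from the report: it swaps psutil for the stdlib resource module (Unix only), and peak_rss_per_round is a hypothetical helper name. ru_maxrss is the peak resident set size, reported in kilobytes on Linux and in bytes on macOS, so only the trend across rounds is meaningful.

```python
import json
import random
import resource
import string


def sample(length):
    # Random string of uppercase letters and digits, as in the report.
    alphabet = string.ascii_uppercase + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def peak_rss_per_round(dumps, loads, rounds=10, keys=50, value_len=100000):
    """Encode and decode a fresh payload each round and record the
    process's peak RSS after every round (hypothetical helper)."""
    readings = []
    for _ in range(rounds):
        data = dumps({sample(99): sample(value_len) for _ in range(keys)})
        loads(data)
        readings.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return readings


# Baseline with the standard library.
print("stdlib json:", peak_rss_per_round(json.dumps, json.loads, rounds=3))

# To test srsly's bundled ujson instead (requires srsly to be installed):
# from srsly import ujson
# print("srsly.ujson:", peak_rss_per_round(ujson.dumps, ujson.loads))
```

Because the reading is a process-wide peak, each library should be measured in a separate interpreter run; otherwise the second measurement starts from the first one's high-water mark.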

Wheel support for linux aarch64 [arm64]

Summary
Installing srsly on aarch64 via pip with the command "pip3 install srsly" builds the wheel from source instead of downloading a prebuilt one.

Problem description
srsly doesn't have a wheel for aarch64 on PyPI. As a result, when installing srsly via pip on aarch64, pip builds the wheel locally, which makes the installation take longer. Making a wheel available for aarch64 would benefit aarch64 users by reducing srsly's installation time.

Expected Output
pip should be able to download an srsly wheel from PyPI rather than building it from source.

@srsly-team, please let me know if I can help with building the wheel or uploading it to PyPI. I would be glad to help make an srsly wheel available for aarch64, and it would be a great opportunity for me to work with you.

Wordwises works, X-Ray doesn't

Wordwise seems to run fine, but X-Ray fails with this error:

Traceback (most recent call last):
File "calibre\gui2\threaded_jobs.py", line 83, in start_work
File "calibre_plugins.worddumb.parse_job", line 37, in do_job
File "C:\Users\USERNAME\AppData\Roaming\calibre\plugins\worddumb-libs\spacy_3.1.1_3.8\spacy\__init__.py", line 11, in <module>
File "C:\Users\USERNAME\AppData\Roaming\calibre\plugins\worddumb-libs\thinc_8.0.8_3.8\thinc\__init__.py", line 5, in <module>
File "C:\Users\USERNAME\AppData\Roaming\calibre\plugins\worddumb-libs\thinc_8.0.8_3.8\thinc\config.py", line 14, in <module>
File "C:\Users\USERNAME\AppData\Roaming\calibre\plugins\worddumb-libs\srsly_2.4.1_3.8\srsly\__init__.py", line 1, in <module>
File "C:\Users\USERNAME\AppData\Roaming\calibre\plugins\worddumb-libs\srsly_2.4.1_3.8\srsly\_json_api.py", line 6, in <module>
File "C:\Users\USERNAME\AppData\Roaming\calibre\plugins\worddumb-libs\srsly_2.4.1_3.8\srsly\ujson\__init__.py", line 1, in <module>
ModuleNotFoundError: No module named 'srsly.ujson.ujson'

pip says srsly is installed correctly, and C:\Users\USERNAME\AppData\Roaming\calibre\plugins\worddumb-libs\srsly_2.4.1_3.8\srsly\ujson exists and has ujson.c inside, among other files.

(Unrelated? I had to manually add blis_0.7.4_3.8 to worddumb-libs on Windows 10 to clear another error.)

Line 1 of C:\Users\USERNAME\AppData\Roaming\calibre\plugins\worddumb-libs\srsly_2.4.1_3.8\srsly\ujson\__init__.py is:

from .ujson import decode, encode, dump, dumps, load, loads # noqa: F401

Called with args: ((1587, 'MOBI', 'BBGL0Z779A', 'book.mobi', <calibre.ebooks.metadata.book.base.Metadata object at 0x082C44C0>, {'spacy': 'en_core_web_', 'wiki': 'en'}), False, True) {'notifications': <queue.Queue object at 0x082C46B8>, 'abort': <threading.Event object at 0x082C45F8>, 'log': <calibre.utils.logging.GUILog object at 0x082C45C8>}

Windows 10
Python 3.9
Calibre - 5.24

Installed the plugin today.
