
liac-arff's Introduction

LIAC-ARFF

The liac-arff module implements functions to read and write ARFF files in Python. It was created in the Connectionist Artificial Intelligence Laboratory (LIAC) at the Federal University of Rio Grande do Sul (UFRGS), in Brazil.

ARFF (Attribute-Relation File Format) is a file format created to describe datasets, which are commonly used in machine learning experiments and software. This file format was created for Weka, the best-known software for automated machine learning experiments.

You can clone the arff-datasets repository for a large set of ARFF files.

Features

  • Read and write ARFF files using Python built-in structures, such as dictionaries and lists;
  • Supports scipy.sparse.coo and lists of dictionaries as used by SVMLight;
  • Supports the following attribute types: NUMERIC, REAL, INTEGER, STRING, and NOMINAL;
  • Has an interface similar to other built-in modules such as json, or zipfile;
  • Supports reading and writing file descriptions;
  • Supports missing values and names with spaces;
  • Supports unicode values and names;
  • Fully compatible with Python 3.6+;
  • Released under the MIT License.

How To Install

Via pip:

$ pip install liac-arff

Via conda:

$ conda install -c conda-forge liac-arff

Manually:

$ pip install .

Documentation

For a complete description of the module, consult the official documentation at http://packages.python.org/liac-arff/ or the mirror at http://inf.ufrgs.br/~rppereira/docs/liac-arff/index.html.

Usage

You can read an ARFF file as follows:

>>> import arff
>>> data = arff.load(open('weather.arff', 'r'))

Which results in:

>>> data
{
    'attributes': [
        ('outlook', ['sunny', 'overcast', 'rainy']),
        ('temperature', 'REAL'),
        ('humidity', 'REAL'),
        ('windy', ['TRUE', 'FALSE']),
        ('play', ['yes', 'no'])],
    'data': [
        ['sunny', 85.0, 85.0, 'FALSE', 'no'],
        ['sunny', 80.0, 90.0, 'TRUE', 'no'],
        ['overcast', 83.0, 86.0, 'FALSE', 'yes'],
        ['rainy', 70.0, 96.0, 'FALSE', 'yes'],
        ['rainy', 68.0, 80.0, 'FALSE', 'yes'],
        ['rainy', 65.0, 70.0, 'TRUE', 'no'],
        ['overcast', 64.0, 65.0, 'TRUE', 'yes'],
        ['sunny', 72.0, 95.0, 'FALSE', 'no'],
        ['sunny', 69.0, 70.0, 'FALSE', 'yes'],
        ['rainy', 75.0, 80.0, 'FALSE', 'yes'],
        ['sunny', 75.0, 70.0, 'TRUE', 'yes'],
        ['overcast', 72.0, 90.0, 'TRUE', 'yes'],
        ['overcast', 81.0, 75.0, 'FALSE', 'yes'],
        ['rainy', 71.0, 91.0, 'TRUE', 'no']
    ],
    'description': '',
    'relation': 'weather'
}

You can write an ARFF file with this structure:

>>> print(arff.dumps(data))
@RELATION weather

@ATTRIBUTE outlook {sunny, overcast, rainy}
@ATTRIBUTE temperature REAL
@ATTRIBUTE humidity REAL
@ATTRIBUTE windy {TRUE, FALSE}
@ATTRIBUTE play {yes, no}

@DATA
sunny,85.0,85.0,FALSE,no
sunny,80.0,90.0,TRUE,no
overcast,83.0,86.0,FALSE,yes
rainy,70.0,96.0,FALSE,yes
rainy,68.0,80.0,FALSE,yes
rainy,65.0,70.0,TRUE,no
overcast,64.0,65.0,TRUE,yes
sunny,72.0,95.0,FALSE,no
sunny,69.0,70.0,FALSE,yes
rainy,75.0,80.0,FALSE,yes
sunny,75.0,70.0,TRUE,yes
overcast,72.0,90.0,TRUE,yes
overcast,81.0,75.0,FALSE,yes
rainy,71.0,91.0,TRUE,no
%
%
%
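
You can also write the structure directly to a file with arff.dump (the output filename here is only illustrative):

>>> with open('weather_out.arff', 'w') as fp:
...     arff.dump(data, fp)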

Contributors

adrinjalali, arlindkadra, bastianzim, calvin, flecox, glemaitre, gr33ndata, hugovk, jnothman, lazywei, m3t0r, mfeurer, midnightradio, niedakh, renatopp, rth, sheidbrink, sunpoet, theresama, tirkarthi, yudaishimbo

Project Page

https://github.com/renatopp/liac-arff


liac-arff's Issues

Drop Python2 support?

Hey everybody,

Given that Python 2 has reached the end of its life, maybe it is also time to finally drop Python 2 support in liac-arff. This would make maintenance easier and would also allow us to use new features such as type annotations. Are there any opinions on this?

_RE_TYPE_NOMINAL

The current version of _RE_TYPE_NOMINAL raises a BadAttributeType error when decoding an ARFF file that contains an attribute such as

@attribute 'family' {  GB , GK , GS , TN , ZA , ZF , ZH , ZM , ZS }

I think this regex solves the problem:

r'^\{\s*((\".*\"|\'.*\'|\S*)\s*,\s*)*(\".*\"|\'.*\'|\S*)\s*\}$'
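
For illustration, a quick self-contained check of the proposed pattern against the failing attribute specification (a sketch; the constant name mirrors the one in arff.py, but this is not the shipped code):

import re

# Proposed relaxed pattern that tolerates whitespace around nominal values.
_RE_TYPE_NOMINAL = re.compile(
    r'^\{\s*((\".*\"|\'.*\'|\S*)\s*,\s*)*(\".*\"|\'.*\'|\S*)\s*\}$'
)

spec = '{  GB , GK , GS , TN , ZA , ZF , ZH , ZM , ZS }'
print(bool(_RE_TYPE_NOMINAL.match(spec)))  # True with the relaxed pattern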

Added to conda

Hi,

this is more of an FYI than an issue but I just wanted to let you guys know that I added liac-arff to conda-forge so that it can be downloaded with conda.

The repo is here: https://github.com/conda-forge/liac-arff-feedstock

Let me know if any of the maintainers here would like to be added as maintainers there, or if I should make a PR here to add a badge or install instructions. Otherwise, feel free to just close the issue.

Thanks! 😃

Improve string feature support

By allowing arff files like this:

@relation reviewSpamData-string-noSentiment
@attribute text string
@attribute spam {N,Y}
@data
'Doctor Aboolian is a board-certified, plastic surgeon.',Y

as reported in #15.

Issue reading timestamp attributes

I forward a limitation pointed out by @zuliani99 in scikit-learn: scikit-learn/scikit-learn#19944
It is more appropriate to solve the issue upstream than in the vendor version in scikit-learn.

Describe the bug

I am trying to fetch a dataset with the fetch_openml API and I noticed that it can't handle date-type features such as timestamps.

Steps/Code to Reproduce

id = 41889
X, y = fetch_openml(data_id=id, as_frame=True, return_X_y=True, cache=False)
y = y.to_frame()
X[y.columns[0]] = y
df = X

Expected Results

I expected it to return the usual X and y.

Actual Results

Traceback (most recent call last):
  File "start.py", line 29, in <module>
    main()
  File "start.py", line 25, in main
    test()
  File "/home/riccardo/Desktop/AutoML-Benchmark/functions/test.py", line 10, in test
    X, y = fetch_openml(data_id=id, as_frame=True, return_X_y=True, cache=False)
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/datasets/_openml.py", line 915, in fetch_openml
    bunch = _download_data_to_bunch(url, return_sparse, data_home,
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/datasets/_openml.py", line 633, in _download_data_to_bunch
    out = _retry_with_clean_cache(url, data_home)(
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/datasets/_openml.py", line 59, in wrapper
    return f(*args, **kw)
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/datasets/_openml.py", line 514, in _load_arff_response
    arff = _arff.load(stream,
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/externals/_arff.py", line 1078, in load
    return decoder.decode(fp, encode_nominal=encode_nominal,
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/externals/_arff.py", line 915, in decode
    raise e
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/externals/_arff.py", line 911, in decode
    return self._decode(s, encode_nominal=encode_nominal,
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/externals/_arff.py", line 842, in _decode
    attr = self._decode_attribute(row)
  File "/home/riccardo/.local/lib/python3.8/site-packages/sklearn/externals/_arff.py", line 784, in _decode_attribute
    raise BadAttributeType()
sklearn.externals._arff.BadAttributeType: Bad @ATTRIBUTE type, at line 2.

Versions

System:
python: 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0]
executable: /usr/bin/python3
machine: Linux-5.8.0-50-generic-x86_64-with-glibc2.29

Python dependencies:
pip: 21.0.1
setuptools: 56.0.0
sklearn: 0.24.1
numpy: 1.19.5
scipy: 1.5.4
Cython: 0.29.22
pandas: 1.1.4
matplotlib: 3.4.1
joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True

Encode nominal when decoding failed

In [74]: decoder.decode(open('./pcc_data/genbase-train.arff'), encode_nominal=True)

---------------------------------------------------------------------------
BadDataFormat                             Traceback (most recent call last)
<ipython-input-74-be634aa47315> in <module>()
----> 1 train = decoder.decode(open('./pcc_data/genbase-train.arff'), encode_nominal=True)

/usr/local/lib/python2.7/site-packages/arff.pyc in decode(self, s, encode_nominal)
    556             # print e
    557             e.line = self._current_line
--> 558             raise e
    559
    560

BadDataFormat: Bad @DATA instance format, at line 2898.

Quoted questionmarks are converted to null

@RELATION XOR

@attribute 'bc' {'?','Y'}

@DATA
'Y'
'?'
'Y'
'?'
%
%
%

This currently does not parse correctly: the '?' values are represented as null objects, while they should be treated as the literal string '?'.

dump, load and then dump fails

The attribute types are being turned into quoted strings when using dump() to write ARFF:
'attributes': [
('local_hour', 'INTEGER'),
]
-->
@ATTRIBUTE 'local_hour' 'INTEGER'

When loading this back using load(), the type will be the quoted string 'INTEGER' and will fail the check "if type_values.upper() in DECODE_ARFF_TYPES" in the __decode_attribute(type_values) function: it effectively compares the string 'INTEGER' (quotes included) with the string INTEGER (same for the other types).

The type will thus always be treated as a list of nominal options. This in turn causes a new dump() of the data (without any modifications) to fail with "... was not listed as a valid nominal value", since such types are not actually nominal.
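
A minimal sketch of the round trip described above (assuming a liac-arff version that exhibits the bug):

import arff

obj = {
    'relation': 'demo',
    'attributes': [('local_hour', 'INTEGER')],
    'data': [[1], [2]],
}

text = arff.dumps(obj)       # per this report, emits @ATTRIBUTE 'local_hour' 'INTEGER'
reloaded = arff.loads(text)  # the type comes back as the quoted string 'INTEGER'
arff.dumps(reloaded)         # raises: values were not listed as valid nominal values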

"Nominal" is not part of _SIMPLE_TYPES

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/arff.py in iter_encode(self, obj)
    694                 # Verify for invalid types
    695                 if attr[1] not in _SIMPLE_TYPES:
--> 696                     raise BadObject('Invalid attribute type "%s"'%str(attr))
    697 
    698             # Verify for bad object format

BadObject: Invalid attribute type "('attribute_1', 'NOMINAL')"

"Nominal" should be allowed no?

Numeric categorical values can not be written to disk

If a categorical attribute has only numeric values (which is valid, if not somewhat unspecified, in the original arff definition), the package raises an error when writing the data to disk:

import arff
data = dict(
  relation='dataset name',
  description='dataset description',
  attributes=[("categorical_with_numeric_values", [1, 2, 3])],
  data=[[1], [2], [3]]
)
with open("test.arff", "w") as fh:
  arff.dump(data, fh)

Expected behavior: Produce a valid arff file with attribute:

@ATTRIBUTE categorical_with_numeric_values {1, 2, 3}.

Actual behavior: Treats the categories as strings, leading to an error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 1091, in dump
    for row in generator:
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 1028, in iter_encode
    yield self._encode_attribute(attr[0], attr[1])
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 964, in _encode_attribute
    type_tmp = [u'%s' % encode_string(type_k) for type_k in type_]
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 964, in <listcomp>
    type_tmp = [u'%s' % encode_string(type_k) for type_k in type_]
  File "/Users/pietergijsbers/repositories/automlbenchmark/venv/lib/python3.9/site-packages/arff.py", line 420, in encode_string
    if _RE_QUOTE_CHARS.search(s):
TypeError: expected string or bytes-like object

Possible workaround by stringifying the categories (they are unquoted in the resulting arff header):

-  attributes=[("categorical_with_numeric_values", [1, 2, 3])],
+  attributes=[("categorical_with_numeric_values", ['1', '2', '3'])],

Python 3.11.3, liac-arff 2.5.0

I understand there is currently no work being done on the package, but I figured I would document the bug and workaround.

Partial support for date and datetime formats

ARFF supports dates and timestamps. liac-arff currently doesn't support it, nor does scipy or the unmaintained arff package.

java.text.SimpleDateFormat looks awful to fully support, so I intend to implement partial support for now. In my application, I need to understand the following formats (a sketch of the mapping follows the list):

  • yyyy-MM-dd'T'HH:mm:ss (the file format default)
  • yyyy-MM-dd' 'HH:mm:ss (R's foreign::write.arff for timestamps)
  • Any of the above with 'Z' appended (for importing/exporting localized times; a zero offset suits my needs)
  • yyyy-MM-dd (R's foreign::write.arff for dates)
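
For illustration, a sketch of such a mapping from the Java SimpleDateFormat patterns above to Python strptime directives (the names here are hypothetical, not liac-arff API):

from datetime import datetime

_DATE_FORMATS = {
    "yyyy-MM-dd'T'HH:mm:ss": '%Y-%m-%dT%H:%M:%S',
    "yyyy-MM-dd' 'HH:mm:ss": '%Y-%m-%d %H:%M:%S',
    "yyyy-MM-dd'T'HH:mm:ss'Z'": '%Y-%m-%dT%H:%M:%SZ',
    "yyyy-MM-dd' 'HH:mm:ss'Z'": '%Y-%m-%d %H:%M:%SZ',
    "yyyy-MM-dd": '%Y-%m-%d',
}

def parse_arff_date(value, java_format):
    """Parse a DATE value given one of the supported Java-style formats."""
    try:
        fmt = _DATE_FORMATS[java_format]
    except KeyError:
        raise ValueError('Unsupported date format: %r' % java_format)
    return datetime.strptime(value, fmt)

print(parse_arff_date('2014-04-01T15:30:00', "yyyy-MM-dd'T'HH:mm:ss"))
# 2014-04-01 15:30:00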

TypeError: a bytes-like object is required, not 'str'

I can't load ARFF files; the error log is below:


>>> import arff
>>> data = arff.load(open('sample.arff', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/usr/.pyenv/versions/anaconda3-2.5.0/lib/python3.5/site-packages/arff.py", line 879, in load
    return_type=return_type)
  File "/Users/usr/.pyenv/versions/anaconda3-2.5.0/lib/python3.5/site-packages/arff.py", line 722, in decode
    matrix_type=return_type)
  File "/Users/usr/.pyenv/versions/anaconda3-2.5.0/lib/python3.5/site-packages/arff.py", line 636, in _decode
    row = row.strip(' \r\n')
TypeError: a bytes-like object is required, not 'str'


sample.arff is from https://github.com/renatopp/arff-datasets.

I think a Python version difference causes this error, but I'm sorry, I don't know how to solve it.
Please help me. Thank you.
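
For reference, a likely fix: under Python 3, liac-arff expects a text-mode file object, so open the file without the 'b' flag (assuming sample.arff is plain text):

import arff

# Open in text mode ('r'), not binary ('rb'): the decoder calls string
# methods such as row.strip(' \r\n'), which fail on bytes objects.
with open('sample.arff', 'r') as fp:
    data = arff.load(fp)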

Efficiency improvements: what API needs to be stable?

If I am to offer changes improving loader efficiency, what guidelines do I need to adhere to in ensuring API stability? Is load/loads/ArffDecoder.decode the only real public interface for loading? Would restructuring the type conversion utilities be welcome?

Issue with separator space other than space

We had a dataset containing the character \u3000, which is one of the Unicode whitespace (separator space) characters. The regular expression in the encoder only matches a literal space, while it should use \s to match all possible whitespace separators and add quotes around the string to be compliant with the ARFF format.
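
For illustration, a minimal sketch of the difference (U+3000 is the ideographic space; in Python 3, re patterns on str are Unicode-aware by default):

import re

value = 'foo\u3000bar'                # contains the ideographic space U+3000
print(bool(re.search(' ', value)))    # False: no literal ASCII space present
print(bool(re.search(r'\s', value)))  # True: \s also matches U+3000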

I will open a PR to solve this issue.

use liac-arff unsuccessfully in python 3.6

code:

import arff
data=arff.load(open('weather.arff', 'rb'))

error:
Traceback (most recent call last):
  File "c:\Users\yhy\Desktop\test\weather.py", line 6, in <module>
    data=arff.load(open('weather.arff', 'rb'))
  File "C:\Users\yhy\Anaconda3\lib\site-packages\arff.py", line 968, in load
    return_type=return_type)
  File "C:\Users\yhy\Anaconda3\lib\site-packages\arff.py", line 810, in decode
    matrix_type=return_type)
  File "C:\Users\yhy\Anaconda3\lib\site-packages\arff.py", line 724, in _decode
    row = row.strip(' \r\n')
TypeError: a bytes-like object is required, not 'str'

Allow tab separated instance data

I have got a lot of ARFF files with tab separated instance data, and Weka can read them.
Could liac-arff please also read them?
(I know that officially "Attribute values for each instance are delimited by commas.")

Sparse with quoted values not supported

The ARFF spec includes quoted nominals in their sparse ARFF examples. liac-arff appears to raise an error in this case.

What I did

import arff
arff.loads('''
@relation foobar

@attribute a real
@attribute b {X,Y,Z}
@attribute c real
@attribute d real
@attribute cls {"class A", "class B"}
@data
{1 X, 3 .5, 4 "class A"}
''')

What I expected

{'data': [[0, 'X', 0, .5, 'class A']], ...}

What I got

ValueError                                Traceback (most recent call last)
<ipython-input-2-65109511f99c> in <module>()
      9 @data
     10 {1 X, 3 .5, 4 "class A"}
---> 11 '''
     12 )

/Users/joel/repos/liac-arff/arff.py in loads(s, encode_nominal, return_type)
   1010     decoder = ArffDecoder()
   1011     return decoder.decode(s, encode_nominal=encode_nominal,
-> 1012                           return_type=return_type)
   1013
   1014 def dump(obj, fp):

/Users/joel/repos/liac-arff/arff.py in decode(self, s, encode_nominal, return_type)
    829         try:
    830             return self._decode(s, encode_nominal=encode_nominal,
--> 831                                 matrix_type=return_type)
    832         except ArffException as e:
    833             # print e

/Users/joel/repos/liac-arff/arff.py in _decode(self, s, encode_nominal, matrix_type)
    798             # DATA INSTANCES --------------------------------------------------
    799             elif STATE == _TK_DATA:
--> 800                 data.decode_data(row, self._conversors)
    801             # -----------------------------------------------------------------
    802

/Users/joel/repos/liac-arff/arff.py in decode_data(self, s, conversors)
    350
    351     def decode_data(self, s, conversors):
--> 352         values = self._get_values(s)
    353
    354         if values[0][0].strip(" ") == '{':

/Users/joel/repos/liac-arff/arff.py in _get_values(self, s)
    369         '''(INTERNAL) Split a line into a list of values'''
    370         if _RE_QUOTATION_MARKS.search(s):
--> 371             return _read_csv(s.strip(' '))
    372         else:
    373             return next(csv.reader([s.strip(' ')]))

/Users/joel/repos/liac-arff/arff.py in _read_csv(line)
    575                 raise ValueError(
    576                     'Only whitespace allowed before quoting character at '
--> 577                     'index %d in line: %s' % (i, line))
    578             if quoted is False:
    579                 token = ''

ValueError: Only whitespace allowed before quoting character at index 14 in line: {1 X, 3 .5, 4 "class A"}

Versions

0e17414

Multi-line strings

What is the format for multiline strings? Is:

"one line
another line","another attribute"

correct?

Or

"one line\
another line","another attribute"

?

Or

"one line\nanother line","another attribute"

?

Or any of the above??

Drop python2.6

Support for Python 2.6 seems to be declining; for example, sklearn decided to drop 2.6 support in the upcoming 0.19.X release. I therefore propose to drop Python 2.6 from the unit tests and not support it any longer.

categorical values represented as '?' are saved to arff file as missing values

Hi, I understand that this may look like expected behaviour, but it can lead to unexpected results in the following scenario:

  • arff file with quoted question marks as categorical values and data: e.g. @attribute feat1 {'?', 'A', 'B', 'C'}
  • arff.load() reads those '?' as strings.
  • arff.dump() (for example after sampling the original data) then writes the '?' from the loaded data without quotes: @attribute feat1 {?, A, B, C}
  • arff.load() on that last file interprets ? as a missing value (None).

see openml/automlbenchmark#209 for a hack implemented locally to prevent this, but this hack also means that it would not be possible anymore to represent missing values as ? in arff files saved with the library.

Suggesting to add a parameter to the arff.dump signature, for example:

def dump(obj, fp, missing_values=[None, '?']):
    pass

allowing the user to call arff.dump(o, f, missing_values=[None]) when ? should not be considered a missing value and should therefore be quoted.

Option to check data format without loading

We want to use liac-arff to check the validity of ARFF files. These can be quite big, however, and loading them can take quite some time. It would be nice if an option could be added that checks whether the file is valid ARFF, and returns a useful error message if not, without actually loading the data.
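
Until such an option exists, a possible workaround is sketched below; it leans on the generator return types added in newer liac-arff releases (arff.DENSE_GEN, from 2.4 onwards) so rows are parsed one at a time but never accumulated in memory:

import arff

def check_arff(path):
    """Return None if the file parses as ARFF, else the first ArffException."""
    try:
        with open(path, 'r') as fp:
            decoded = arff.load(fp, return_type=arff.DENSE_GEN)
            for _ in decoded['data']:  # consume rows without storing them
                pass
    except arff.ArffException as exc:
        return exc
    return None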

Trailing comment lines

It seems that, whenever I use arff.dump(), it ends the output with three empty comment lines, like this:

125915,1,6250_weka.DecisionTable,0.97963,ok
125915,1,6378_weka.LogitBoost_DecisionStump,1.0,ok
% 
% 
% 

Is that expected behavior or am I doing something wrong?

Possible bug in test_loads.py

Hi,
I think I stumbled upon a bug in test_loads.py, in the method test_format_correct. This test should not pass, in my opinion, because the only line of data is on the same line that contains @data.

This also raises the question of whether a line denoting @data should be checked with u_row.startswith(). I propose to use u_row.startswith() followed by a check whether there is anything else on the same line.

Support returning a generator (at least for DENSE and LOD)

To allow liac-arff to be memory efficient without committing to a particular in-memory representation or library, it could have an option to return the data as an iterator/generator (using return_type, or alongside it). It could either generate one row at a time, or use a pandas-style chunksize to deliver multiple rows at a time. Either way, together with the returned metadata about attributes, this could be used to store data in a compact or out-of-memory format on the fly (e.g. using numpy.fromiter).
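
As a sketch of how such an iterator could be consumed, here is one way to stream a purely numeric dataset into a compact NumPy array (big_numeric.arff is a hypothetical file; DENSE_GEN is the generator return type that newer liac-arff releases provide):

import arff
import numpy as np

with open('big_numeric.arff') as fp:
    decoded = arff.load(fp, return_type=arff.DENSE_GEN)
    n_attributes = len(decoded['attributes'])
    # Stream values row by row into a flat, compactly typed array.
    flat = (value for row in decoded['data'] for value in row)
    X = np.fromiter(flat, dtype=np.float64).reshape(-1, n_attributes)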

ARFF attribute naming specification changed

Hi Renato,

The specification of ARFF has recently changed to allow a wider range of attribute names. More specifically, attributes are now allowed to start with numbers (in fact, with any character except control characters and '{', '}', ',', or '%'): http://weka.wikispaces.com/ARFF+%28stable+version%29

It would be great if liac-arff could be updated to reflect this. We are using liac-arff for the Python clients of OpenML.org, and some of our datasets (those that have attributes starting with numbers) cannot be parsed by liac-arff for this reason.

Many thanks!

Sincerely,
Joaquin

ValueError: Matrix type arff.COO not supported.

Prof. Renato Pereira,

How are you?

My name is Tiago and I am taking a specialization course in AI at IESB, here in Brasília.

In the machine learning course, taught by Prof. Letícia, I was given the challenge of replicating the Naive Bayes algorithm on Weka's credit-g.arff dataset in Python.

To meet this challenge, I found the module https://pythonhosted.org/liac-arff/#, developed and kindly published by you.

However, I ran into this small hiccup when using it with Google Colab:

[screenshot showing the error: ValueError: Matrix type arff.COO not supported.]

So I decided to ask for your support in understanding how I can overcome this using the arff data structure.

If necessary, I also offer to collaborate on implementing a possible solution.

Best regards,

Tiago Rocha de Almeida

Update documentation

It seems that the documentation stalled at 2.1.0 while it should be updated continuously. It would be good to use Travis CI to build the documentation and push it to GitHub, hosting it at renatopp.github.io/liac-arff.

Issue- "BadLayout: Invalid layout of the ARFF file, at line ..."

Hello,

I tried to convert a regular ARFF file to the sparse ARFF format with liac-arff, using the code below:

import arff
fp = open('est1.arff')
data = arff.load(fp)

X = arff.dumps(data)

from scipy import sparse
decoder = arff.ArffDecoder()
d = decoder.decode(X, encode_nominal=True, return_type=arff.COO)
data = d['data'][0]
row = d['data'][1]
col = d['data'][2]
matrix = sparse.coo_matrix((data, (row, col)), shape=(max(row)+1, max(col)+1))

However, it did not work, and the message in the console was "BadLayout: Invalid layout of the ARFF file, at line ...".

I am not sure what the problem is. Would you please help me to solve this issue?
Thank you.

Future of liac-arff

CC @jnothman @ogrisel @renatopp @PGijsbers

Dear all,

I'm opening this issue to start the discussion on the future of liac-arff, as the reason why I started working on this project is soon going away. I needed an arff parser to communicate with OpenML.org and started working on liac-arff as it was the best arff parser around (and probably still is). As OpenML will support formats other than arff (1, 2, 3), we will drop arff support in the OpenML-Python API, which removes my motivation to maintain this package.

I was therefore wondering how to continue with the package and see the following ways forward:

  1. Moving the work into scikit-learn. As scikit-learn will be the power user of liac-arff once OpenML-Python drops arff support, @jnothman made most of the recent contributions, and scikit-learn also implements other data readers, this looks like the best fit to me.
  2. Someone else takes over the package and @renatopp gives access to that person. However, I'm not sure if he still receives these emails and so I'm not sure if that's a possible way forward.
  3. Moving the work into scipy. The liac-arff parser is more feature-complete than the one in scipy and, according to the benchmarks from @jnothman, just as fast.
  4. Create a new github organization.

Looking forward to your opinion, Matthias

Bad @DATA instance

Hi, I'm trying to load the medical dataset from http://mulan.sourceforge.net/datasets-mlc.html, but I got:


BadDataFormat                             Traceback (most recent call last)
<ipython-input-17-6ee92362123e> in <module>()
----> 1 data = arff.load(open(filename, 'rb'))

/usr/local/lib/python2.7/site-packages/arff.pyc in load(fp)
    695      '''
    696     decoder = ArffDecoder()
--> 697     return decoder.decode(fp)
    698
    699 def loads(s):

/usr/local/lib/python2.7/site-packages/arff.pyc in decode(self, s)
    532             # print e
    533             e.line = self._current_line
--> 534             raise e
    535
    536

BadDataFormat: Bad @DATA instance format, at line 1499.

Issues when string data has a character `\n` or `\r\n`

I am not sure that this is something easy to handle.
Currently, if the data contains a string with the escape character \n or \r\n, the parser will fail. I assume this is because those characters are also used to split the data into rows.

dump() is broken (python3.4)

I haven't tested on Python 2 yet, but with Python 3.4.1 (Arch Linux default) I get the following error when calling arff.dump(data, file):

  File "/usr/lib/python3.4/site-packages/arff.py", line 719, in dump
    last_row = generator.next()
AttributeError: 'generator' object has no attribute 'next'

arff.dumps() works, though.
This happens both with the (old) liac-arff version 2.0.1 from PyPI and with the current HEAD.
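
For reference, the offending call only exists in Python 2; the portable spelling uses the next() builtin:

# Python 2 only:
last_row = generator.next()

# Works on both Python 2 (2.6+) and Python 3:
last_row = next(generator)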

ImportError: No module named arff

I'm getting this error with Python 2.7 or 3.6 on OS X, installed via pip or easy_install.
I've never had this issue with any other Python module (I tried the arff module too, but hit the same problem).

$ pip install liac-arff
Requirement already satisfied: liac-arff in /Library/Python/2.7/site-packages
$ python
Python 2.7.13 (default, Dec 18 2016, 07:03:34)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import arff
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named arff

Issue- "BadLayout: Invalid layout of the ARFF file, at line 3

file = open("data/final-dataset.arff", 'r')

# Togglable Options
regenerate_model = False
regenerate_data = False
generate_graphs = True
save_model = True
create_model_image = False

def generate_model(shape):
    # define the model
    model = Sequential()
    model.add(Dense(30, input_dim=shape, kernel_initializer='uniform', activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(10, activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(10, activation='relu'))
    model.add(Dropout(0.4))
    # model.add(Dense(64, activation='relu'))
    # model.add(Dropout(0.4))
    model.add(Dense(5, activation='softmax'))
    print(model.summary())
    return model

def scrape_data():
    # decode the .arff data and change text labels into numerical
    decoder = arff.ArffDecoder()
    data = decoder.decode(file, encode_nominal=True)

    # split the raw data into data and labels
    vals = [val[0: -1] for val in data['data']]
    labels = [label[-1] for label in data['data']]

    for val in labels:
        if labels[val] != 0:
            labels[val] = 1

    # split the labels and data into training and validation sets
    training_data = vals[0: int(.9 * len(vals))]
    training_labels = labels[0: int(.9 * len(vals))]
    validation_data = vals[int(.9 * len(vals)):]
    validation_labels = labels[int(.9 * len(vals)):]

    print(training_labels)

    # flatten labels with one hot encoding
    training_labels = to_categorical(training_labels, 5)
    validation_labels = to_categorical(validation_labels, 5)

    # save all arrays with numpy
    np.save('saved-files/vals', np.asarray(vals))
    np.save('saved-files/labels', np.asarray(labels))
    np.save('saved-files/training_data', np.asarray(training_data))
    np.save('saved-files/validation_data', np.asarray(validation_data))
    np.save('saved-files/training_labels', np.asarray(training_labels))
    np.save('saved-files/validation_labels', np.asarray(validation_labels))

# check to see if saved data exists, if not then create the data
if not os.path.exists('saved-files/training_data.npy') or not os.path.exists(
        'saved-files/training_labels.npy') or not os.path.exists(
        'saved-files/validation_data.npy') or not os.path.exists('saved-files/validation_labels.npy'):
    print('creating')
    if not os.path.exists('saved-files'):
        os.mkdir('saved-files')
    scrape_data()

scrape_data()

# load the saved data
data_train = np.load('saved-files/training_data.npy')
label_train = np.load('saved-files/training_labels.npy')
data_eval = np.load('saved-files/validation_data.npy')
label_eval = np.load('saved-files/validation_labels.npy')

# generate and compile the model
model = generate_model(len(data_train[0]))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# initialize tensorboard
tensorboard = TensorBoard(log_dir='logs/', histogram_freq=0, write_graph=True, write_images=True)

# only using 3 epochs otherwise the model would overfit to the data
history = model.fit(data_train, label_train, validation_data=(data_eval, label_eval), epochs=2, callbacks=[tensorboard])
loss_history = history.history["loss"]

numpy_loss_history = np.array(loss_history)
np.savetxt("saved-files/loss_history.txt", numpy_loss_history, delimiter=",")

model = load_model('saved-files/model.h5')

# evaluating the model's performance
print(model.evaluate(data_eval, label_eval))
print(model.evaluate(data_train, label_train))

# if create_model_image:
plot_model(model, to_file='model.png', show_shapes=True)

plt.figure(1)

# summarize history for accuracy
plt.subplot(211)
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')

# summarize history for loss
plt.subplot(212)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# save the model for later so no retraining is needed
model.save('saved-files/model.h5')

# play sound when done with code to alert me
os.system('afplay /System/Library/Sounds/Ping.aiff')
os.system('afplay /System/Library/Sounds/Ping.aiff')

Deprecation warning due to invalid escape sequences in Python 3.8

Deprecation warnings are raised due to invalid escape sequences in Python 3.8. Below is a log of the warnings raised while compiling all the Python files. Using raw strings or escaping the backslashes will fix this issue.

find . -iname '*.py'  | xargs -P 4 -I{} python -Walways -m py_compile {}

./tests/test_decode.py:328: DeprecationWarning: invalid escape sequence \{
  '\{2 a\}'):
./tests/test_decode.py:335: DeprecationWarning: invalid escape sequence \{
  '\{2 a\}'):
./tests/test_decode.py:342: DeprecationWarning: invalid escape sequence \{
  '\{2 a\}'):
./tests/test_decode.py:400: DeprecationWarning: invalid escape sequence \}
  "',1 'c d'\}."):

loads / dumps functionality not complementary

When loading and dumping an arff object that contains string values and escaped characters, it keeps on adding escape characters without removing them.

The test case that tries to enforce correct behavior is a bit simplistic (it only considers numeric values). I reckon this is because there is quite extensive 'encode_string' functionality, but no 'decode_string', leaving the decoded string with all the escape characters.

I have composed a tiny example that can serve as unit test.

import unittest
import arff
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

OBJ = {
    'description': '\nDummy Dataset\n\n\n',
    'relation': 'Dummy',
    'attributes': [
        ('input1', 'STRING'),
        ('y', 'REAL'),
    ],
    'data': [
        ['"mean"', 0.0],
        ['"median"', 1.0],
        ['"most_frequent"', 1.0],
        ['"mean"', 0.0]
    ]
}

ARFF = '''%
% Dummy Dataset
%
%
%
@RELATION Dummy

@ATTRIBUTE input1 STRING
@ATTRIBUTE y REAL

@DATA
'\\"mean\\"',0.0
'\\"median\\"',1.0
'\\"most_frequent\\"',1.0
'\\"mean\\"',0.0
'''

class TestLoadDump(unittest.TestCase):
    def get_dumps(self):
        dumps = arff.dumps
        return dumps

    def get_loads(self):
        loads = arff.loads
        return loads

    def test_simple(self):
        dumps = self.get_dumps()
        loads = self.get_loads()

        arff = ARFF
        obj = OBJ

        count = 0
        while count < 10:
            count += 1

            arff = dumps(obj)
            self.assertEqual(ARFF, arff)
            obj = loads(arff)
            self.assertEqual(OBJ, obj)

Bad @DATA instance format for UCI chronic kidney disease data

Hi, I was trying to load the .arff file of the UCI chronic kidney disease dataset into Python using
data = arff.loads(open('./Data/chronic_kidney_disease_full.arff'))
But I got the 'BadDataFormat' exception as a result
BadDataFormat: Bad @DATA instance format in line 215: 26,70,1.015,0,4,?,normal,notpresent,notpresent,250,20,1.1,?,?,15.6,52,6900,6.0,no,yes,no,good,no,no,ckd,
Is there a way to fix this?

Support for sparse data

Hi,

I think some datasets, e.g. tmc2007, would be better loaded in a sparse format. What do you think?

Bad @DATA instance format, cannot handle data instances ending with a comma

Hi,
Happy new year!

I recently used Weka to generate some .arff files.
The files look like:

@relation 'SEA'

@Attribute attrib1 numeric
@Attribute attrib2 numeric
@Attribute attrib3 numeric
@Attribute class {groupA,groupB}

@DaTa

7.30967787376657,2.4053641567148585,6.374174253501082,groupB,
1.1700660880722513,7.815346320453048,2.5277616657598587,groupB,
9.84841540199809,8.791825178724801,9.412491794821143,groupB,
3.1293596519376554,3.6797575871052812,7.051747444754559,groupA,

which has a comma at the end of each row.
These files can be read by Weka correctly, but cannot be loaded by liac-arff, which reports:
"Bad @DaTa instance format in line 10: 7.30967787376657,2.4053641567148585,6.374174253501082,groupB,"

After removing the commas, it works fine.

So I think this might be an inconsistency with Weka, and I am submitting this issue.
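
Until the parser tolerates this, one possible preprocessing workaround is to strip the trailing commas before parsing (a sketch; it assumes the trailing comma is spurious rather than an empty final value):

import arff

with open('SEA.arff') as fp:
    cleaned = '\n'.join(line.rstrip().rstrip(',') for line in fp)

data = arff.loads(cleaned)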
