
Quantipy

Python for people data

Quantipy is an open-source data processing, analysis and reporting software project that builds on the excellent pandas and numpy libraries. Aimed at people data, Quantipy offers support for native handling of special data types like multiple choice variables, statistical analysis using case or observation weights, DataFrame metadata and pretty data exports.

Key features

  • Reads plain .csv, converts from Dimensions, SPSS, Decipher, or Ascribe
  • Open metadata format to describe and manage datasets
  • Powerful, metadata-driven cleaning, editing, recoding and transformation of datasets
  • Computation and assessment of data weights
  • Easy-to-use analysis interface
  • Automated data aggregation using Batch definitions
  • Structured analysis and reporting via Chain and Cluster containers
  • Exports to SPSS, Dimensions ddf/mdd, MS Excel and PowerPoint with flexible layouts and various options

Contributors

alasdaire, alexbuchhammer, alextanski, andersfreund, biggihs, g3org3mb, geirfreysson, jamesrkg, majeed-sahebzadha, nitsrek, roxanamarianeagu, roxananeagu

Python 3 compatibility

Efforts are underway to port Quantipy to Python 3 in a separate repository.

Docs

View the documentation at readthedocs.org

Required libraries before installation

We recommend installing Anaconda for Python 2.7 which will provide most of the required libraries and an easy means of keeping them up-to-date over time.

  • Python 2.7.8
  • Numpy 1.11.3
  • Pandas 0.19.2

Developing

Windows

Dependencies numpy and scipy are handled by conda. Create a virtual environment:

conda create -n envqp python=2.7 numpy==1.11.3 scipy==0.18.1
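
Activate the environment before installing (assuming a conda version of the Python 2.7 era, which used the activate script on Windows):

activate envqp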

Install in editable mode:

pip install -r requirements_dev.txt

Linux

Dependencies numpy and scipy are handled in the installation.

Create a virtual environment:

conda create -n envqp python=2.7
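
Activate the environment before installing (again assuming an era-appropriate conda, which used source activate on Linux):

source activate envqp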

Install in editable mode:

pip install -r requirements_dev.txt

5-minutes to Quantipy

Get started

Start a new folder called 'Quantipy-5' and add a subfolder called 'data'.

You can find an example dataset in quantipy/tests:

  • Example Data (A).csv
  • Example Data (A).json

Put these files into your 'data' folder.

Start with some import statements:

import pandas as pd
import quantipy as qp

from quantipy.core.tools.dp.prep import frange

# This is a handy bit of pandas code to let you display your dataframes
# without having them split to fit a vertical column.
pd.set_option('display.expand_frame_repr', False)

Load, inspect and edit your data

Load the input files into a qp.DataSet instance and inspect the metadata with methods like .variables(), .meta() or .crosstab():

# Define the paths of your input files
path_json = './data/Example Data (A).json'
path_csv = './data/Example Data (A).csv'

dataset = qp.DataSet('Example Data (A)')
dataset.read_quantipy(path_json, path_csv)

dataset.crosstab('q2', text=True)
Question                                                           q2. Which, if any, of these other sports have you ever participated in?
Values                                                                                                                                   @
Question                                           Values
q2. Which, if any, of these other sports have y... All                                                         2999.0
                                                   Sky diving                                                  1127.0
                                                   Base jumping                                                1366.0
                                                   Mountain biking                                             1721.0
                                                   Kite boarding                                                649.0
                                                   Snowboarding                                                 458.0
                                                   Parachuting                                                  428.0
                                                   Other                                                        492.0
                                                   None of these                                                 53.0

Variables can be created, recoded or edited with DataSet methods, e.g. derive():

mapper = [(1,  'Any sports', {'q2': frange('1-6, 97')}),
          (98, 'None of these', {'q2': 98})]

dataset.derive('q2_rc', 'single', dataset.text('q2'), mapper)
dataset.meta('q2_rc')
single                                              codes          texts missing
q2_rc: Which, if any, of these other sports hav...
1                                                       1     Any sports    None
2                                                      98  None of these    None

The DataSet case data component can be inspected with the []-indexer, just as with a pd.DataFrame:

dataset[['q2', 'q2_rc']].head(5)
        q2  q2_rc
0  1;2;3;5;    1.0
1      3;6;    1.0
2       NaN    NaN
3       NaN    NaN
4       NaN    NaN

Analyse and create aggregations batchwise

A qp.Batch, a subclass of qp.DataSet, is a container for structuring data analysis and aggregation specifications:

batch = dataset.add_batch('batch1')
batch.add_x(['q2', 'q2b', 'q5'])
batch.add_y(['gender', 'q2_rc'])

The batch definitions are stored in dataset._meta['sets']['batches']['batch1']. A qp.Stack can be created and populated based on the available qp.Batch definitions stored in the qp.DataSet:

stack = dataset.populate()
stack.describe()
                data     filter     x       y  view  #
0   Example Data (A)  no_filter   q2b       @   NaN  1
1   Example Data (A)  no_filter   q2b   q2_rc   NaN  1
2   Example Data (A)  no_filter   q2b  gender   NaN  1
3   Example Data (A)  no_filter    q2       @   NaN  1
4   Example Data (A)  no_filter    q2   q2_rc   NaN  1
5   Example Data (A)  no_filter    q2  gender   NaN  1
6   Example Data (A)  no_filter    q5       @   NaN  1
7   Example Data (A)  no_filter  q5_3       @   NaN  1
8   Example Data (A)  no_filter  q5_3   q2_rc   NaN  1
9   Example Data (A)  no_filter  q5_3  gender   NaN  1
10  Example Data (A)  no_filter  q5_2       @   NaN  1
11  Example Data (A)  no_filter  q5_2   q2_rc   NaN  1
12  Example Data (A)  no_filter  q5_2  gender   NaN  1
13  Example Data (A)  no_filter  q5_1       @   NaN  1
14  Example Data (A)  no_filter  q5_1   q2_rc   NaN  1
15  Example Data (A)  no_filter  q5_1  gender   NaN  1
16  Example Data (A)  no_filter  q5_6       @   NaN  1
17  Example Data (A)  no_filter  q5_6   q2_rc   NaN  1
18  Example Data (A)  no_filter  q5_6  gender   NaN  1
19  Example Data (A)  no_filter  q5_5       @   NaN  1
20  Example Data (A)  no_filter  q5_5   q2_rc   NaN  1
21  Example Data (A)  no_filter  q5_5  gender   NaN  1
22  Example Data (A)  no_filter  q5_4       @   NaN  1
23  Example Data (A)  no_filter  q5_4   q2_rc   NaN  1
24  Example Data (A)  no_filter  q5_4  gender   NaN  1

Each of these definitions is a qp.Link. Links can be analyzed in various ways; for example, grouped categories can be calculated using the computation engine qp.Quantity:

link = stack[dataset.name]['no_filter']['q2']['q2_rc']
q = qp.Quantity(link)
q.group(frange('1-6, 97'), axis='x', expand='after')
q.count()
Question          q2_rc
Values              All       1    98
Question Values
q2       All     2999.0  2946.0  53.0
         net     2946.0  2946.0   0.0
         1       1127.0  1127.0   0.0
         2       1366.0  1366.0   0.0
         3       1721.0  1721.0   0.0
         4        649.0   649.0   0.0
         5        458.0   458.0   0.0
         6        428.0   428.0   0.0
         97       492.0   492.0   0.0

We can also simply add so-called qp.Views to the whole qp.Stack:

stack.aggregate(['counts', 'c%'], False, verbose=False)
stack.add_stats('q2b', stats=['mean'], rescale={1: 100, 2:50, 3:0}, verbose=False)

stack.describe('view', 'x')
x                                q2  q2b   q5  q5_1  q5_2  q5_3  q5_4  q5_5  q5_6
view
x|d.mean|x[{100,50,0}]:|||stat  NaN  3.0  NaN   NaN   NaN   NaN   NaN   NaN   NaN
x|f|:|y||c%                     3.0  3.0  1.0   3.0   3.0   3.0   3.0   3.0   3.0
x|f|:|||counts                  3.0  3.0  1.0   3.0   3.0   3.0   3.0   3.0   3.0
link = stack[dataset.name]['no_filter']['q2b']['q2_rc']
link['x|d.mean|x[{100,50,0}]:|||stat']
Question             q2_rc
Values                  1          98
Question Values
q2b      mean    52.354167  43.421053

More examples

There is so much more you can do with Quantipy... why don't you explore the docs to find out!


Issues

Stack cache matrix overwrite

We need the stack to ignore the cache matrices when a link already exists but the shape is different. This will happen if you remove rows from the index of the stack data and you want to "refresh" a link's view aggregations.

Alternatively, the matrices could be cleaned from the dropped rows and re-saved under a new cache key.

read_ascribe fails if the source file only has one question

When an Ascribe file only includes one question, its metadata format is slightly different and needs to be handled accordingly.

Multi-question files have the following format:

{
    "CodedQuestions": {
        "ProjectID": "3458",
        "MultiForm": [
            {...},
            {...},
            {...}
        ]
    }
}

Single-question files have the following format:

{
    "CodedQuestions": {
        "ProjectID": "3458",
        "MultiForm": {...}
    }
}
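
A minimal sketch of how the reader could normalize both shapes before parsing (hypothetical variable names):

forms = coded_questions['CodedQuestions']['MultiForm']
if not isinstance(forms, list):
    # Single-question files hold a dict here; wrap it so the
    # multi-question code path can be reused unchanged.
    forms = [forms]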

Tips/tricks section in the Docs

We'll eventually want a tips/tricks section in the docs to help users get to an advanced level using the lib.

This issue is just to compile a list of things that can end up in that part of the Docs.

get_index_mapper for general data management
Use mappers, normally designed for use with recode, for general data management purposes such as dropping or splitting dataframes.

Expanded has_any logic

We need a way to generate a has_any() logical net that also includes the values contributing to that net underneath it.

So, instead of:

1
2
3
4
net-A
net-B

The option to create:

net-A
1
2
net-B
3
4

Initial thoughts about this are to add an expand kwarg.

net_views.add_method(
    name='Likely any(4,5,6)',
    kwargs={
        'text': {'en-GB': 'Net likely'},
        'logic': [4, 5, 6],
        'expand': True
    }
)

But this would also have to be implemented for block-logic (here I am using the new style proposed in #75):

net_views.add_method(
    name='Top_Middle_Bottom',
    kwargs={
        'block-logic': {
            'expand': True,
            'items': [
                {'top2': frange('9,10'), 'text': {'main': 'Top 2 Box'}},
                {'mid3': frange('6-8'), 'text': {'main': 'Middle 3 Box'}},
                {'bot5': frange('1-5'), 'text': {'main': 'Bottom 5 Box'}},
                {'others': frange('11-12'), 'text': {'main': 'Others'}, 'expand': False}
            ]
        }
    }
)

Dependencies

Auto-dependency retrieval and documentation for those that need to be installed manually are urgently required. This information is incomplete or missing at this stage.

request_views() should optionally produce x-def version of the input

request_views() should optionally (or by default?) return an x_def version of the requested keys in its get_chain object, such that it can be used with a more granular loop-on-x approach to generating chains. Perhaps this should even be the standard, given the recent addition of the webeditor (#76).

The x_def approach to get_chain is a dict of x-keys, inside which are only the views relevant for that x-key, and the chains are generated like this:

chains = []
for xk in x_vars:
    chains.append(stack.get_chain(x=xk, y=y_vars, views=views_def['get_chain'][xk]['c']))

dataset class

Can we wrap meta and data into a new class, suggested name 'dataset', which simply has the existing data and meta as attributes, and can benefit from a slew of class methods capable of treating the meta and the data as two sides of the same coin?

class Dataset(object):
    """
    Container for treating Quantipy meta and data as a unified whole.

    A Dataset provides a pathway for intrinsic methods capable of
    editing meta and data by leveraging everything known about both.
    """

    def __init__(self, meta=None, data=None):
        self.meta = meta
        self.data = data

    def load(self, path_json, path_csv):
        self.meta = load_json(path_json)
        self.data = pd.DataFrame.from_csv(path_csv)

There are a practically endless number of uses for a class like this. Some basic examples:

qpd = qp.Dataset(meta, data)

qpd.variables()                 # return a list of all known variables
qpd.variables('int')            # return a list of all int variables

qpd.add_variable(...)           # add a new variable (meta and data)
qpd.del_variable(...)           # delete a variable (meta and data)

qpd.save(path_json, path_csv)   # save meta and data to disk (as 2 files)
qpd.load(path_json, path_csv)   # load meta and data from disk (as 2 files)

And so on...

This class would provide space for a lot of currently free-form functions to become class methods.

For example:

data['QMP_radio_stations_xb'] = recode(
    meta, data,
    target='QMP_radio_stations_xb',
    mapper={
        901: {'radio_stations': frange('1-23, 92, 94, 141')}
    }, 
    append=False
)

Could become the much more intuitive:

qpd['QMP_radio_stations_xb'].recode(
    mapper={
        901: {'radio_stations': frange('1-23, 92, 94, 141')}
    }, 
    append=False
)

Implement logic with Classes?

If logic were implemented in classes, then we could overload the operators to do the following. Internally an intersection list in the same form we have it now would be generated, but it would be much cleaner for the user.

# If (1 or 2 or 3) and not (4 or 5 or 6)
logic = {
    1: Has_any([1, 2, 3]) & ~Has_any([4, 5, 6])
}

Presumably there'd be a way to implement this with a parent Logic class definition.
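
A minimal sketch of how the overloads could hang off a parent Logic class (hypothetical class names; ~ stands in for negation, since Python has no overloadable unary ^):

class Logic(object):
    """Parent class providing the operator overloads."""
    def __and__(self, other):
        return Intersection([self, other])
    def __or__(self, other):
        return Union([self, other])
    def __invert__(self):
        return Not(self)

class Has_any(Logic):
    def __init__(self, values):
        self.values = values

class Intersection(Logic):
    def __init__(self, parts):
        self.parts = parts

class Union(Logic):
    def __init__(self, parts):
        self.parts = parts

class Not(Logic):
    def __init__(self, part):
        self.part = part

# If (1 or 2 or 3) and not (4 or 5 or 6)
logic = {1: Has_any([1, 2, 3]) & ~Has_any([4, 5, 6])}

Each composed object would then be compiled down to the intersection-list form we have now.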

Tests against total

Currently we can only perform intra-variable tests. We've brainstormed how we might eventually get to inter-variable testing (using an array mask), but we need to put all of those complications to the side and instead focus on the much more frequently required test-against-total.

Assuming this could be implemented without also trying to solve properly custom test groups between arbitrary variables, this could potentially be done as an enhancement to our existing testing approach, where tests against total could become an additional opt-in, such as:

test_views = qp.ViewMapper(
    template={
        'method': qp.QuantipyViews().coltests,
        'kwargs': {
            'mimic': 'askia',
            'test_total': True,
            'iterators': {
                'metric': ['props', 'means'],
                'level': ['low', 'mid', 'high']
            }
        }
    }
)

This obviously presents at least one fundamental problem, in that it implies the generation of two views instead of one, since it will be generating results in the x/@ link as well as the x/y link.

One potential avenue to explore here would be to always show the results for tests against the total in the x/y link instead of trying to also put them into the x/@ link (based on which is significantly higher).

If necessary, a build could pull the result apart again but, given the increased complexity of trying to implement something more conventional, I don't think it should be attempted at this time (consider, especially, the fact that what appears in the total column is contextually dependent on the structure of a chain, which cannot be known at the time of aggregation).

I'm open to suggestions, but @H could indicate total is significantly higher and @L could indicate total is significantly lower.

Alternatively, @ could indicate total is significantly higher (consistent with existing results) and @L could be a single exceptional case.

Whatever the notation used to describe it, any significance against total should appear first.

Move Quantity() methods to Link

Everything that currently happens inside the Quantity() class should in fact be based on the Link object itself to gain flexibility and get rid of the middle man.

f.sum view method

We need an f.sum view method that sums the frequency of x by y along the given axis, such that f.sum:f|x: would create a row of column sums, and f:f.sum|:y a column of row sums.

For consistency these example view key notations use the proposed new-style notation rules explained in issue #67. In this way f.sum would be a specialized version of f.math where all the operators were assumed to be +.

However, the initial fix for this issue may use an older style if necessary, in which case the notation will migrate along with everything else when the new rules have been agreed.

condense_dichotomous_set() is retaining 0s

The following call:

cols = [
    'pdl_pension_type_0',
    'pdl_pension_type_2',
    'pdl_pension_type_3',
    'pdl_pension_type_4',
    'pdl_pension_type_5',
    'pdl_pension_type_6',
    'pdl_pension_type_7',
    'pdl_pension_type_8',
    'pdl_pension_type_9',
    'pdl_pension_type_95',
    'pdl_pension_type_96',
    'pdl_pension_type_97',
    'pdl_pension_type_99'
]

data['pension_type'] = condense_dichotomous_set(data[cols], values_from_labels=True)

Should return this:

pmxid
20896       3;5;
20898         7;
20902        97;
20906        95;
20907        97;
22463        97;
22501        97;
22506         5;
22509       5;6;
22519        97;
22524        97;
22540         5;
22542        97;
22546        97;
22549         7;
22560        99;
22563         0;
22568        97;
22570        97;
22571        96;
22581         0;
22584       0;5;
22592        99;
22596        97;
22599        97;
22603        97;
22608        97;
22615         0;
22617       0;5;
22620        96;

Instead it is returning this:

pmxid
20896        0;0;3;0;5;0;0;0;0;0;0;0;0;
20898        0;0;0;0;0;0;7;0;0;0;0;0;0;
20902       0;0;0;0;0;0;0;0;0;0;0;97;0;
20906       0;0;0;0;0;0;0;0;0;95;0;0;0;
20907       0;0;0;0;0;0;0;0;0;0;0;97;0;
22463       0;0;0;0;0;0;0;0;0;0;0;97;0;
22501       0;0;0;0;0;0;0;0;0;0;0;97;0;
22506        0;0;0;0;5;0;0;0;0;0;0;0;0;
22509        0;0;0;0;5;6;0;0;0;0;0;0;0;
22519       0;0;0;0;0;0;0;0;0;0;0;97;0;
22524       0;0;0;0;0;0;0;0;0;0;0;97;0;
22540        0;0;0;0;5;0;0;0;0;0;0;0;0;
22542       0;0;0;0;0;0;0;0;0;0;0;97;0;
22546       0;0;0;0;0;0;0;0;0;0;0;97;0;
22549        0;0;0;0;0;0;7;0;0;0;0;0;0;
22560       0;0;0;0;0;0;0;0;0;0;0;0;99;
22563        0;0;0;0;0;0;0;0;0;0;0;0;0;
22568       0;0;0;0;0;0;0;0;0;0;0;97;0;
22570       0;0;0;0;0;0;0;0;0;0;0;97;0;
22571       0;0;0;0;0;0;0;0;0;0;96;0;0;
22581        0;0;0;0;0;0;0;0;0;0;0;0;0;
22584        0;0;0;0;5;0;0;0;0;0;0;0;0;
22592       0;0;0;0;0;0;0;0;0;0;0;0;99;
22596       0;0;0;0;0;0;0;0;0;0;0;97;0;
22599       0;0;0;0;0;0;0;0;0;0;0;97;0;
22603       0;0;0;0;0;0;0;0;0;0;0;97;0;
22608       0;0;0;0;0;0;0;0;0;0;0;97;0;
22615        0;0;0;0;0;0;0;0;0;0;0;0;0;
22617        0;0;0;0;5;0;0;0;0;0;0;0;0;
22620       0;0;0;0;0;0;0;0;0;0;96;0;0;
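
For reference, a sketch of the intended behaviour (hypothetical helper, not the actual implementation): only columns flagged as selected contribute their label-derived code, so unselected columns never emit a literal 0.

import numpy as np

def condense(df, codes):
    # df: dichotomous columns (1 = selected); codes: the per-column values
    # derived from the column labels (values_from_labels=True).
    def row_to_set(row):
        picked = [str(c) for c, flag in zip(codes, row) if flag == 1]
        return ';'.join(picked) + ';' if picked else np.NaN
    return df.apply(row_to_set, axis=1)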

Update required libraries in README

There are a few required libraries that aren't in the list yet, including savReaderWriter. I'll compile a complete list of them here before updating the README file.

Applying custom base conditions to frequency views

We need a way to apply conditions onto the implied base of frequency views so as to allow for fine control over proportional (%) results.

In effect, this is customization of the kwarg rel_to:

'rel_to': [
    None, 
    'y', 
    {'y': {'x': has_any(frange('1-8'))}}
]

In this case, three sets of views are created - counts, natural-c% and custom-c% using the additional logical condition {'x': has_any(frange('1-8'))}.

Custom rel_to instructions follow the convention {axis_of_orientation: wildcard_logical_condition}, where wildcard_logical_condition can be any wildcard custom logic with the added option of using the shorthand keys x or y to refer to the given link's x or y variables. This shorthand would only be available for non-nested link axes!

The view keys, respectively, would be (I'm using the view key notation proposed in #67):

rel_to: None --> x|f|:|||my_net
rel_to: 'y' --> x|f|:|y||my_net
rel_to: {'y': {'x': has_any(frange('1-8'))}} --> x|f|:|y:x[{1,2,3,4,5,6,7,8}]||my_net

stack.add_link() should check that the source data index hasn't changed

As discovered while investigating the behaviour in #97, when the index of the data in a stack changes between stack.add_data() and stack.add_link() some obscure failures occur, such as index 2184784 is out of bounds for axis 0 with size 2111, which is actually the matrix trying to use the new index as a lookup in the old index.

We need a check when using stack.add_link() to make sure that the index hasn't changed since data was added to the stack so that the user is given a useful warning about what has happened.
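
A sketch of the check (hypothetical names throughout: index_snapshot would be stored by add_data(), and get_data() stands in for however the stack exposes its source dataframe):

import warnings

def verify_data_index(stack, data_key):
    current = stack.get_data(data_key).index      # hypothetical accessor
    original = stack.index_snapshot[data_key]     # hypothetical, set by add_data()
    if not current.equals(original):
        warnings.warn(
            "The index of '%s' has changed since add_data(); cached "
            "matrices may produce out-of-bounds lookups." % data_key)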

Complex logic function is_le() returning incorrect result.

Complex logic function is_le() returns the same result as is_lt().

In the example below column 4 gives a correct result, column 2 does not (19 is ignored).

import copy

column = 'age'
target = 'age_rc'

# Meta
meta['columns'][target] = copy.deepcopy(meta['columns'][column])
meta['columns'][target]['name'] = target
meta['columns'][target]['type'] = 'delimited set'
meta['columns'][target]['values'] = [
  {
    'text': {'en-GB': ''},
    'value': 1 
  },  
  {
    'text': {'en-GB': ''},
    'value': 2 
  },  
  {
    'text': {'en-GB': ''},
    'value': 3 
  },
  {
    'text': {'en-GB': ''},
    'value': 4
  }
]
# Data
data[target] = recode(
    meta, data,
    target=target,
    mapper={
        1: [18, 19],
        2: intersection([is_ge(18), is_le(19)]),
        3: intersection([is_gt(17), is_lt(20)]),
        4: intersection([is_ge(18), is_lt(19)])
    },
    default=column
)

# Crosstab result
print crosstab(meta, data, column, target).head(5)


# Output
Question        age_rc                
Values             All   1   2   3   4
Question Values                       
age      All        23  23  11  23  11
         18         11  11  11  11  11
         19         12  12   0  12   0
         20          0   0   0   0   0
         21          0   0   0   0   0

nets on y

This does not depend on #94: the branch "quantity_axis_support_I" contains a solution for Quantity().combine() and the frequency view method to support y-axis calculations.

We should aim to release this first and sort the broadcasting of view results after that.

request_views() retrieving significance levels

When using request_views() the way users need to name significance levels for retrieval is too exact. Currently we have to say exactly .10 and cannot say simply .1, which is arguably what many people would more intuitively expect to work.

Similarly, we should allow low, mid and high as names for .1, .05 and .01 respectively.
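
A sketch of the kind of normalization request_views() could apply (hypothetical helper; the aliases follow the convention named above):

LEVEL_ALIASES = {'low': 0.10, 'mid': 0.05, 'high': 0.01}

def normalize_sig_level(level):
    # Accept 'low'/'mid'/'high' as well as .1, .10, '.10', etc.
    level = LEVEL_ALIASES.get(level, level)
    return round(float(level), 2)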

Start a ViewMapper with named view methods from QuantipyViews

It would be handy if a new ViewMapper instance had the option of initializing with named methods drawn from QuantipyViews.

This would avoid the need to do this:

stack.add_link(x=['q1', 'q2'], y=y_vars, views=qp.QuantipyViews(['cbase']))
stack.add_link(x=['q1', 'q2'], y=y_vars, views=mean_views.subset(['mean1-5']))

Usage would probably be something like:

mean_views = qp.ViewMapper(init_views=['cbase'])

It would be handy for the init_views method(s) to always be called whenever the view mapper is used, regardless of calls to the subset() method. If that were the case you would know each time you asked for your means to be calculated that your column bases would be refreshed as well.

Continuing with the above example, the following line would compute 'cbase' and 'mean1-5', even though 'cbase' hasn't been named explicitly in subset(), because it is one of the view mapper's initialized view methods.

stack.add_link(x=['q1', 'q2'], y=y_vars, views=mean_views.subset(['mean1-5']))

This means that subset() should receive an additional init_views parameter defaulted to True. If you didn't want the init_views to be computed you would instead write:

stack.add_link(x=['q1', 'q2'], y=y_vars, views=mean_views.subset(['mean1-5'], init_views=False))

Parent class for Stack and Chain

There are a lot of ways stack and chain objects are similar, and a lot of utility/convenience functions that are useful for both (and not useful to much else).

Examples include the following methods:

stack.describe()
stack.verify_key_types()
stack.verify_multiple_key_types()
stack.verify_key_exists()
stack.force_key_as_list()
stack.yield_links()
stack.yield_views()
stack.reduce()
stack.save()
stack.load()

... and the following function (there are more...):

core.tools.view.query.get_dataframe

... and over time, no doubt many more.

What would be better is a new Class that defines all of these similarities and from which both Stack and Chain (and probably other 'shape' variants in the future) inherit.
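
A minimal sketch of the idea (hypothetical class name; the method bodies would be lifted from the existing Stack implementations):

class AggregationObject(object):
    """Behaviour shared by Stack, Chain and future 'shape' variants."""

    @staticmethod
    def force_key_as_list(key):
        # Let callers pass a single key or a list of keys interchangeably.
        return key if isinstance(key, list) else [key]

    def verify_key_types(self, name, keys):
        for key in self.force_key_as_list(keys):
            if not isinstance(key, basestring):  # Python 2
                raise TypeError("All '%s' keys must be strings." % name)

class Stack(AggregationObject):
    pass

class Chain(AggregationObject):
    pass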

Unattended lists in wildcard logic

In simple logic statements unattended lists are assumed to be has_any() queries. Currently this isn't true for unattended lists in wildcard logic (they fail).

We want to be able to do this:

{q1_1: [1, 2]}

Instead of this:

{q1_1: has_any([1, 2])}
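
A sketch of the normalization wildcard logic could apply before evaluation (assuming has_any can be imported from quantipy.core.tools.view.logic):

from quantipy.core.tools.view.logic import has_any

def normalize_wildcard(logic):
    # Wrap unattended lists in has_any() so both spellings behave the same.
    return {col: has_any(cond) if isinstance(cond, list) else cond
            for col, cond in logic.items()}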

Need a convenient way of exploding set/mask items for use with stack.add_link()

It would be useful to have an easy, and preferably internal, way of using stack.add_link() when a list of items found in the definition of a set or mask is what's needed, rather than the named set or mask given to the x or y parameters.

A stack-external approach would be to provide a convenience function that takes a list of variable names and inspects it for sets/masks, replacing set/mask references with their exploded items lists instead.

A stack-internal approach would be to somehow tell the method whether to explode sets/masks or keep their type untouched. A second option here would be to have different rules for how stack.add_link() handles sets vs masks (exploding one and not the other), but that may result in more user-management of the meta than desired (you don't really want to have to generate a set-version of a mask, if it doesn't exist already, just so you can use the set in stack.add_link(); in that situation it would be quicker to list-comprehend the exploded items out yourself).

Multi-purpose "missing" concept needed

We need to implement a "missing" concept that can handle multiple reasons for something being "missing".

The original design was something like:

meta['columns']['q1']['values'] = [
    {'value':'1',  'text':''},
    {'value':'2',  'text':''},
    {'value':'3',  'text':''},
    {'value':'99',  'text':'', 'missing': True}
]

But this doesn't cater for more than 1 type of missing. Something like this would be better:

meta['columns']['q1']['values'] = [
    {'value':'1',  'text':''},
    {'value':'2',  'text':'', 'missing': ['skipped']},
    {'value':'3',  'text':'', 'missing': ['not asked']},
    {'value':'99',  'text':'', 'missing': ['skipped', 'not asked']}
]

This would allow space for multiple reasons for a value being missing, and for those reasons to be overlapping with each other on any given value.
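
With that structure in place, collecting the codes flagged for a given reason would be trivial; a sketch (hypothetical helper name):

def missing_codes(meta, col, reason):
    # Return all values of a column flagged as missing for the given reason.
    return [v['value'] for v in meta['columns'][col]['values']
            if reason in v.get('missing', [])]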

UserWarning needed if trying to use descriptives on a delimited set

Descriptive stats (mean, median, etc.) are not supported on a delimited set. At the moment this triggers an exception in the view method but no feedback is given to the user.

Can we add a UserWarning so that the user is made aware of why their request hasn't produced a view?
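
A sketch of the guard the view method could run first (hypothetical helper; the column type lookup follows the meta structure shown in other issues):

import warnings

def verify_stats_allowed(meta, col):
    if meta['columns'][col]['type'] == 'delimited set':
        warnings.warn(
            "Descriptive statistics are not supported on the delimited "
            "set '%s'; no view will be created." % col)
        return False
    return True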

Block nets need text_keys and textual representation

Block nets are currently done like this:

net_views.add_method(
    name='Top_Middle_Bottom',
    kwargs={
        'logic': [
            {'Top 2 Box': frange('9,10')},
            {'Middle 3 Box': frange('6-8')},
            {'Bottom 5 Box': frange('1-5')}]})

But the problem here is the lack of a conventional text object that supports multiple labels via text_key. All labels in Quantipy must provide multi-label support.

To that end we should implement a block- kwarg for defining multiple uses of the same method that are returned in a single group.

The following example shows the first of these we need to implement: block-logic.

net_views.add_method(
    name='Top_Middle_Bottom',
    kwargs={
        'block-logic': {
            'items': [
                {'top2': frange('9,10'), 'text': {'main': 'Top 2 Box'}},
                {'mid3': frange('6-8'), 'text': {'main': 'Middle 3 Box'}},
                {'bot5': frange('1-5'), 'text': {'main': 'Bottom 5 Box'}}]}})

The -logic part of the block- kwarg informs how the content of the items object should be interpreted. This allows the textual representation of each item to be given in place of where 'logic' would normally appear, so that 'top2': frange('9,10') is read as 'logic': frange('9,10') with the textual representation top2.

The unpainted version of the resulting dataframe will show an index with the values 'top2', 'mid3' and 'bot5'. When painted those representations will be translated by the given text_key as per any normal data value from a single or delimited set column.

The block- kwarg needs to itself be a dict with an items object to provide space for additional instructions, such as rules and additional custom properties.

View key clean-up

There are some problems with the current view key notation that need to be cleaned up.

method colon-delimiting

The method-part of the view key needs to be colon-delimit-able so that it can describe the effect of different methods acting on x and y. Where only one method is named and both x and y are present, the same method should be assumed to be working on both.

The general rule should be that method_a:method_b|x:y means the intersection of method_a(x) by method_b(y), a more concrete example being frequency:mean|x:y means the intersection of frequency(x) by mean(y).

By extension, though, this renders what is currently frequency|x:y incorrect as the key for a column base row because this should mean the intersection of frequency(x) by frequency(y), or in plain speak where the row and column bases intersect (e.g. the number of cases in x and y).

As a consequence, the correct key for a column base row should be simply frequency|x and for a row base column frequency|y. Incidentally this is perfectly in keeping with the fundamental meaning of frequency| as basic counts, since the mention of either x or y is an implied collapse of all their values, respectively.

More examples (assuming x and y each have 3 possible values):

x|frequency||||counts                      
x|frequency||y||c%                          
x|frequency||x||r%                  
x|frequency|x|||cbase                      # same as x|frequency|x[(1,2,3)]|||net1-3
x|frequency|y|||rbase                      # same as x|frequency|y[(1,2,3)]|||net1-3
x|frequency|x[(1,2)]|||xnet1-2
x|frequency|y[(1,2)]|||ynet1-2
x|frequency|x:y|||cbase*rbase              # same as x|frequency|x[(1,2,3)]:y[(1,2,3)]|||cbase*rbase

... and so on.

Another important change that should be made is to use the conventional curly brace for set notation, so logic descriptors should be written as x[{1,2}]: instead of x[(1,2)]:. Currently the curly brace is used for answer count, but the two uses should be swapped. In this way one answer from codes 1 or 2 would be written as x[{1,2}(1)]:.

Due to the required delimitable nature of the method-part of the view key, it may be prudent to put in place some truncation rules that method names must adhere to. For example, instead of frequency perhaps simply f will suffice, especially given that it's so common. For other methods, a 6-character limit per sub/method-part (to allow for needed abbreviations like stddev, stderr and so on) would help condense the overall key length and improve readability.

To avoid ambiguity, what is currently the relation part of the view key must always include a colon.

|:| means no conditions placed on either x or y
|x:| collapsed x, no conditions placed on y
|:y| collapsed y, no conditions placed on x
|x:y| collapsed x and y

The new convention means you should never see something like |y:x| because the left-hand side will always describe x and the right-hand side will always describe y.

In accordance with all of these proposed changes, the above view keys would become:

x|f|:|||counts                      
x|f|:|y||c%                          
x|f|:|x||r%                  
x|f|x:|||cbase                      # same as x|f|x[{1,2,3}]|||net1-3
x|f|:y|||rbase                      # same as x|f|y[{1,2,3}]|||net1-3
x|f|x:y|||cbase*rbase               # same as x|f|x[{1,2,3}]:y[{1,2,3}]|||cbase*rbase

However, all of these examples use the same method on x and y, which will often not be the case. Where a different method is used on each, both methods must be named and must be colon-delimited.

In conjunction with the need for descriptive stats to be named using sub-methods, this leads to:

x|d.mean:f|x:|||cmean                # column mean
x|f:d.mean|:y|||rmean                # row mean

Including the change for set notation, block nets also need to appear in discrete x/y-blocks delimited with a comma, meaning they will change from |x[(1,2),(3,4),(5,6)]:y to |x[{1,2}],x[{3,4}],x[{5,6}]:. This both corrects for ambiguity compared to complex logic and provides for a comma-delimited relationship between the multiple methods and x/y.

Given the likely eventuality of other block methods the conventions should be similarly lazy, where f|x[{1,2}],x[{3,4}],x[{5,6}]: is effectively shorthand for f,f,f:f|x[{1,2}],x[{3,4}],x[{5,6}]:.

This is more relevant when imagining the needs of a block of descriptive stats, in which case d.mean,d.stddev,d.stderr:f|x: is more meaningful. In any case, parts that are not mentioned explicitly imply uniform application, so as to prevent the need for something like d.mean,d.stddev,d.stderr:f|x,x,x:.


effective base

Effective base view keys should indicate a sub-method of frequency and must name a weight-part. What is currently x|frequency|x:y|||ebase should become x|f.eff:f|x:||weight|ecbase. Similarly, an effective row base would be x|f:f.eff|:y||weight|erbase.

Stack.refresh(): add support for old cache/matrix definitions

I needed to remove cache support from refresh since it was giving incorrect results during the creation of the updated weight vectors. For now I simply use add_data() to re-establish the data references for key/data/meta (which effectively results in a clean cache).

To increase the processing time when refreshing an existing Stack I need a new mechanic to connect the new data with the old matrices.

Reordering values in view results

We need a way to reorder values in the result of views.

For example, given the following view:

Question            gender  
Values              @
Question    Values  
q1          1       7.201447
            2       9.64925
            3       55.662665
            4       72.663656
            5       4.704566
            6       11.496034
            7       21.69883
            8       3.178551
            9       0.099151
            96      2.199429
            98      2.497416
            99      8.949005

We need a way to specify a new values order in the index. For example, in this case the values 1-9 have been reversed.

Question            gender       
Values              @
Question    Values  
q1          9       0.099151
            8       3.178551
            7       21.69883
            6       11.496034
            5       4.704566
            4       72.663656
            3       55.662665
            2       9.64925
            1       7.201447
            96      2.199429
            98      2.497416
            99      8.949005

I'll implement this using column meta tags that are picked up by paint_dataframe.
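
A sketch of what such a tag could look like (the rules/slicex names are hypothetical at this stage):

# Reverse values 1-9, keeping the non-response codes at the bottom.
meta['columns']['q1']['rules'] = {
    'x': {'slicex': {'values': [9, 8, 7, 6, 5, 4, 3, 2, 1, 96, 98, 99]}}
}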

Sorting block-logic

We need a way to apply sortx rules to expanded logic (#86) and block-logic. When sorting expanded logic ideally sorting would happen on two levels (net and expanded values) but if the pathway to 2-level sorting for expanded logic is too much to deal with right now, single-level sorting (only on the nets) will suffice.

broadcasting

We need to put the broadcasting mechanism that exists in branch "quantity_axis_support_I" to full effect for stack generation. This will need to include the computation of intersecting views of:

  • nets and block-nets
  • bases of intersecting nets and block-nets
  • descriptives like mean, stddev,... that intersect with nets
  • view arithmetics like NPS that intersect with nets AND other-axis arithmetics
  • bases of intersecting arithmetics

Update README

The README is currently out of date; it needs updating with general information and a 5-minute example of using the library.

The SAV reader should pass ioLocale=None by default.

This is a matter of architecture, but as most users of the lib currently work on Windows, where ioLocale=None is a requirement for this to work, we will change this to be the default.

In a Mac environment the option used should be ioLocale="en_US.UTF-8".
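
If the default were ever made platform-aware rather than hard-coded, it could be chosen roughly like this (a sketch, not the planned fix, which simply defaults to None):

import sys

def default_io_locale():
    # None is required on Windows; Mac needs an explicit locale.
    return None if sys.platform.startswith('win') else 'en_US.UTF-8'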

Hiding values in view results

We need a way to hide specific values in the result of views.

For example, given the following view:

Question            gender  
Values              @
Question    Values  
q1          1       7.201447
            2       9.64925
            3       55.662665
            4       72.663656
            5       4.704566
            6       11.496034
            7       21.69883
            8       3.178551
            9       0.099151
            96      2.199429
            98      2.497416
            99      8.949005

We need a way to hide selected values in the index. For example, in this case the values 96, 98 and 99, which represent some form of non-response like "Don't know", "Prefer not to say" or "None of these", have been hidden.

Question            gender          
Values              @
Question    Values          
q1          4       72.663656
            3       55.662665
            7       21.69883
            6       11.496034
            2       9.64925
            1       7.201447
            5       4.704566
            8       3.178551
            9       0.099151

I'll implement this using column meta tags that are picked up by paint_dataframe.
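
Analogous to the reordering issue, a sketch of the tag (hypothetical rules/dropx names):

# Hide the non-response codes 96, 98 and 99 on q1.
meta['columns']['q1']['rules'] = {
    'x': {'dropx': {'values': [96, 98, 99]}}
}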

Sorting view results

We need a way to sort the results of views.

For example, given the following view:

Question            gender  
Values              @
Question    Values  
q1          1       7.201447
            2       9.64925
            3       55.662665
            4       72.663656
            5       4.704566
            6       11.496034
            7       21.69883
            8       3.178551
            9       0.099151
            96      2.199429
            98      2.497416
            99      8.949005

We need a way to sort the index, including the ability to exclude or fix some values (for example, in this case the values 96, 98 and 99, which represent some form of non-response like "Don't know", "Prefer not to say", "None of these" and so on).

Question            gender          
Values              @
Question    Values          
q1          4       72.663656
            3       55.662665
            7       21.69883
            6       11.496034
            2       9.64925
            1       7.201447
            5       4.704566
            8       3.178551
            9       0.099151
            96      2.199429
            98      2.497416
            99      8.949005

We also need a way to apply the sorted order of one view onto other views using the same x-variable. For example, in this case you would probably want all other links using q1 in the x-position to be similarly sorted.

Question            gender          
Values              1           2
Question    Values          
q1          4       36.38664    36.277016
            3       27.682186   27.980479
            7       11.310729   10.388101
            6       5.035425    6.460609
            2       5.187247    4.462003
            1       3.669028    3.532419
            5       2.403846    2.30072
            8       1.644737    1.533814
            9       0.075911    0.02324
            96      1.037449    1.16198
            98      0.986842    1.510574
            99      4.57996     4.369045

There are a few possible approaches here. A quick solution (the focus of this ticket) will be to take a stack and sort links permanently (until re-sorted). A more sophisticated approach would be to decide on sorting when calling a build (this will not be the approach at this time).
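
Continuing the hypothetical rules sketches from the reordering and hiding issues, a sorting tag could fix the non-response codes while sorting everything else:

# Sort q1 descending by the '@' (total) column, keeping 96/98/99 fixed last.
meta['columns']['q1']['rules'] = {
    'x': {'sortx': {'sort_on': '@', 'fixed': [96, 98, 99]}}
}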

Improvements/fixes for hmerging coding from Ascribe

hmerge could use a few improvements related to how it deals with the pandas.DataFrame.merge parameters left_on, right_on, left_index and right_index. Currently it's necessary to always set the left_on/right_on columns into the index of each dataframe and use left_index/right_index.

There is also some unusual behavior happening with the resulting data being fed into stack.add_link() that should be investigated, though this may end up being caused by the user.

index 2184784 is out of bounds for axis 0 with size 2111 is the fatal error being returned by stack.add_link() and in this case 2184784 is one of the unique ids used in the merge.

Compound/banked chains

In order to get really fine control over what chains can do, we need a 'compound' or 'banked' type of chain to force multiple chains that may have been drawn from different questions to be treated as a single entity. For example:

chains = [
    chain_1,
    chain_2,
    {
        'name': 'chain_3',
        'type': 'banked-chain',
        'text': {'en-GB': 'Top 2 summary q2a-q2c'},
        'views': ['x|mean|x[1,2,3,4,5]:y|||1-5 ^97,98'],
        'items': [
            {'chain': chain_3a, 'text': {'en-GB': 'Top 2 (q5a)'}},
            {'chain': chain_3b, 'text': {'en-GB': 'Top 2 (q5b)'}},
            {'chain': chain_3c, 'text': {'en-GB': 'Top 2 (q5c)'}}
        ]
    },
    chain_3,
    chain_4
]

In this case chain_3a, chain_3b and chain_3c are the "top 2" boxes from three different links. The expected result when sent to a build is that they all get "banked" together so as to appear as 1 compound table.

Quantity.to_df(): clean up needed.

This method handles all the into-df conversions of the .result np.array calculations. Due to the complexity of the code involved in making broadcasting happen for the different data manipulation and aggregation methods there is a lot of if-else spaghetti code inside this method right now.

This should be cleaned up properly if time allows.

write_spss should provide set variability

write_spss is currently hard-coded to use the 'data file' set when finding columns for export. This should instead be the default value of a new parameter, allowing users to specify any set instead.

The docs example for read_spss is incorrect as well. Currently it says:

>>> from quantipy.core.tools.dp.io import read_spss
>>> meta, data = read_spss(path_sav, meta, data)

But it should say:

>>> from quantipy.core.tools.dp.io import read_spss
>>> meta, data = read_spss(path_sav, **kwargs)
