Comments (57)

fractaledmind avatar fractaledmind commented on July 28, 2024

So, my current thinking is this:

Have a single filter.py module that handles all searching for Alfred-Workflow. This module has one primary API function (filter() or search(), I'd imagine), which takes all of Workflow.filter()'s current params plus a new protocol param. The protocol param determines which filtering backend you want to use (sqlite FTS or iterating). Each of these backends is a class with a run() (or whatever it should be called) method that actually performs the search. The APIs of the two run() methods are identical; that is, the classes handle only the searching, so they take just the query, key=lambda x: x, fold_diacritics=True and match_on=MATCH_ALL params. The module function gets the filtered data back and handles the non-search params (ascending=False, include_score=False, min_score=0, max_results=0).
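
Roughly, I picture the dispatch working something like this (just a sketch; the names and the placeholder matching logic aren't final):

ITERATE, SQLITE_FTS = 'iterate', 'fts'  # placeholder protocol names

class IterativeBackend(object):
    """Stand-in for the current pure-Python matching (sketch only)."""
    def run(self, query, items, key=lambda x: x, fold_diacritics=True,
            match_on=None):
        # Real scoring elided; just return (score, item) pairs
        return [(100.0, item) for item in items
                if query.lower() in key(item).lower()]

class FTSBackend(object):
    """Placeholder for the sqlite FTS backend."""
    def run(self, query, items, key=lambda x: x, fold_diacritics=True,
            match_on=None):
        raise NotImplementedError('sqlite FTS search goes here')

def search(query, items, key=lambda x: x, protocol=ITERATE,
           fold_diacritics=True, match_on=None, ascending=False,
           include_score=False, min_score=0, max_results=0):
    """Single entry point: pick a backend, then apply the non-search params."""
    backend = FTSBackend() if protocol == SQLITE_FTS else IterativeBackend()
    results = backend.run(query, items, key=key,
                          fold_diacritics=fold_diacritics, match_on=match_on)
    results.sort(key=lambda t: t[0], reverse=not ascending)
    if min_score:
        results = [r for r in results if r[0] >= min_score]
    if max_results:
        results = results[:max_results]
    return results if include_score else [item for score, item in results]

print(search('work', ['alfred workflow', 'python']))  # -> ['alfred workflow']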

Since the FTS search isn't iterated, there will also be the FTSDatabase class in the module which handles the creation and searching of the sqlite virtual table.

This seems sufficiently decoupled, accessible, and organized to me, but I'd love to hear any other thoughts. Again, you can see my working code in my repo here.

deanishe avatar deanishe commented on July 28, 2024

I would keep the existing filter() and the sqlite-based searches entirely separate. I'd rather make a "real" database with multiple, user-specified fields, so queries like "author:query" or "tag:query" work. It strikes me as a more useful model when dealing with many thousands of items.

Aping the filter() API has a few issues, imo:

  • They wouldn't work the same way, e.g. the options would be different, result ordering would be inconsistent.
  • It'd be a bit unclear what the difference is.
  • The power of sqlite's own queries is missing.

(That said, if the filter() API could be replicated with sqlite, I reckon it'd be faster to move that to an sqlite backend.)

I also want to make the creation/update of the search index explicit (haven't figured out the best way to do that yet).

I think filter.py is a bad name for the module, as import filter would override the built-in filter() function.

I haven't thought too much about how to organise them, but I think they should probably go in separate modules, as I'd sooner not import the sqlite module every time. Probably a search package with an index.py submodule.
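
i.e. something along these lines (just a sketch of the layout; nothing is decided):

workflow/
    __init__.py
    workflow.py          # existing module, including filter()
    search/
        __init__.py      # lightweight search helpers, no sqlite import
        index.py         # sqlite-backed index lives here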

I envisage an Index class with search() and update() methods (I'm not sure about the latter, as I haven't quite figured out the mechanics of creation/updating). I don't really like the FTSDatabase and FTSFilter names. FTS is more of an implementation detail. Index is a better name.

fractaledmind avatar fractaledmind commented on July 28, 2024

Let me put together some code examples tonight, but I'm addressing a lot of these concerns already. My basic philosophy is to have simple defaults but strong flexibility. My FTSDatabase class already allows for complex table creation as well as column specific queries. It merely defaults to a simple table and simple queries.

I also have an explicit create() method for table creation. My setup memoizes the databases based on a hash of the dataset, so a query on an identical dataset doesn't require recreating the table. If by updating you mean adding/deleting rows in an existing db, that should be doable, tho the creation method is hella fast already.
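
The memoisation is roughly this (a simplified sketch, not the actual code; the helper name and the pickle-based hash are just for illustration):

import hashlib
import os
import pickle

def db_path_for_data(data, cache_dir):
    """Return a database path derived from a hash of the dataset.

    Sketch only: if a database already exists at this path, its table was
    built from exactly this data and can be reused as-is.
    """
    digest = hashlib.md5(pickle.dumps(data, protocol=2)).hexdigest()
    return os.path.join(cache_dir, 'fts-{0}.sqlite'.format(digest))

path = db_path_for_data([('alfred', 'workflow')], '/tmp')
needs_creation = not os.path.exists(path)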

But, like I said, let me write some example code. The example code in the repo now demonstrates the simplest setup. I'll write code for ZotQuery to show you how the complex setup would work.

deanishe avatar deanishe commented on July 28, 2024

I find your code hard to follow, as there are no comments anywhere, no description of how it's supposed to be used, and the inappropriate method names are confusing: I see a call to memoize_database() and wonder why you'd want to memoise a database, but it turns out the function doesn't memoise anything, nor does it have much to do with any database. Why isn't it called db_path_from_data()? memoize_database() is misleading 😕 Also, the database file doesn't belong in the workflow directory. Use cachefile() or datafile(), not workflowfile(). Where are the old databases deleted when the dataset changes? Doesn't it just fill the workflow directory with (invisible) database files?

Why is data a required argument? If you're requiring the full dataset on each instantiation of the index, you're slowing things down unnecessarily.

Hashing an entire dataset also sounds unnecessarily expensive. That should be handled in explicit update()/create() methods, as the database should probably be created by a separate script in the background, while scripts running in Alfred only load the database, not the dataset.

Also, the search function should return a list of dicts, not lists. Much easier to work with.

Apart from that, the list of fields/table names should be a list, not a string. And there's still the problem of not escaping SQL queries properly.

fractaledmind avatar fractaledmind commented on July 28, 2024

You are a gloriously blunt individual, if a bit hyperbolic at times. However, I do appreciate your efficient feedback.

I am waiting to write full documentation until I have a clearer sense of the overall API.

As to the method naming, I won't dispute that I'm not good at it. Practice makes perfect, I guess.

The databases won't be in the workflow dir, that's simply there in the development stage as I am constantly testing and wanted them right on hand. I was going to cache them in the final version.

The data param is required because I am modeling my thinking on the original filter() method. My thinking was that the user would simply input a dataset and a query and get back filtered results. Thus, every instance of FTSFilter (as it is currently called) represents a unique search (not metaphysically, but you know what I mean). The reason I called the method memoize_database() is that it is the mechanism by which I allow for persistence in this philosophically singleton situation. The default is to create and then search a sqlite database for every search, but given the exact same dataset, the filter() can skip the creation. Perhaps this is a faulty default, but that's where I'm coming from right now.

The next point also seems to suggest that we are thinking about different goals here. I haven't found the hash expensive. My tests still show considerable speed gains over the original filter() method. But again, this depends on my thinking that every search (query/dataset pair) should be treated as unique, and that under-the-hood persistence which actually delivers speed gains is worth any minor cost.

To get to the heart of it: are you imagining a module aimed at more long-term, stable dataset querying? My point of difference between the sqlite filtering and the Python filtering was dataset size. The Python filtering is crazy slow on big datasets; even my current partially built system can get 100x faster results. It feels to me tho that you are working from a perspective where the main difference (why a workflow author would use one over the other) is the type of dataset, not necessarily its size; that is, a constantly changing dataset should use your filtering, whereas a more stable dataset should use the sqlite one. Being more stable, the dataset will require only updating after its initial creation. I think that perspective makes good sense (my own ZotQuery workflow has a dataset exactly like that), but I actually think we can create both. As you said in an earlier post, we may just look to replace all filtering with a sqlite backend. I'm currently working on a more 1-to-1 translation of your filtering method to sqlite, but we could also make one for the ZotQuery-like situation.

Am I on the right track here, in terms of getting where you are coming from?

deanishe avatar deanishe commented on July 28, 2024

You are a gloriously blunt individual, if a bit hyperbolic at times

Hyperbolic? With regards to what?

I am waiting to write full documentation until I have a clearer sense of the overall API.

I appreciate that, and I don't expect full documentation, but the code is hard to follow without any comments or descriptions, and you're asking me to look at it…

The databases won't be in the workflow dir, that's simply there in the development stage as I am constantly testing and wanted them right on hand. I was going to cache them in the final version.

Could you add a comment saying when something is just temporary? I mean, you asked for feedback, I gave it, and now I'm "gloriously blunt" for pointing out where it's crap because I have no idea what I'm supposed to look at and what I'm supposed to ignore because it's temporary placeholder code, or you already intend to change it.

The next point also seems to suggest that we are thinking about different goals here. I haven't found the hash expensive.

We are talking about different things.

To get to the heart, are you imagining a module aimed at more long-term, stable dataset querying?

No, just a database-backed search with a database-like API. The database should be saved to disk to squeeze maximum speed out of it. It should enable key/row-based queries, and AND and OR.

In particular, I want to make creation/update explicit and separate from searching. At least, it should not be necessary to have the dataset on hand if the index already exists.

I see a few problems with the implicit caching based on the hash of the dataset (how do you delete old caches? What if a workflow wants multiple indices?) I'd prefer to use named indices, like with cached_data().

My preferred implementation, which is an idea @shawnrice came up with, would be to pass in items and key arguments (as with filter()) to the creation/update methods, and pickle the actual objects in items in a separate database column:

class Book(object):
    """Some class-based item"""
    def __init__(self, title, author, pubdate):
        self.title = title
        self.author = author
        self.pubdate = pubdate  # datetime.datetime instance

def db_key(obj):
    return {'title': obj.title, 'author': obj.author, 'year': obj.pubdate.year}

weightings = [1.0, 1.0, 1.0]

# List of `Book` instances
items = [...]


# Creating and populating the index
# ------------------------------------------------------------------

idx = Index('my-index-name')
# These are also optional arguments to `Index.__init__()`
# The column names could also be auto-generated from the keys in the
# `dicts` returned by `db_key()` instead. Better? (`weightings` would also
# have to be a `dict` in this case)
idx.create(['title', 'author', 'year'], items, key=db_key, weightings=weightings)

idx.search('title:future AND author:smith')
# -> [Book(1), Book(2), Book(3) ...]


# Updating the index
# ------------------------------------------------------------------

# More `Book` instances
more_items = [...]

idx.update(more_items, key=db_key)  # column names, weightings cannot be changed here

Pickling is probably (much) more expensive than hashing a frozenset, but it's also much more flexible (you can't put anything in a set that doesn't have __hash__ and __eq__ methods) and simpler to use (you can pass almost any kind of object into the Index and get it back out again, not just lists/dicts of strings).

The performance hit isn't very important if the idea is to create and update the index in a background script.

As I said, re-implementing filter() on top of sqlite would be great if it can be done.

fractaledmind avatar fractaledmind commented on July 28, 2024

To be clear, I did not mean that to be a criticism or negative. I appreciate the bluntness. As I said, it is much more efficient. And I legitimately greatly appreciate your feedback. I am a much better Python programmer at the start of 2015 than I was in 2014, and your feedback is a large part of that. I am fully aware that I'm a stranger on the Internet who has no actual right to any of your time, and yet you have given that valuable resource to me.

Tone is hard in text, but I frequently read a snarky tone in your writing (here on GitHub or on the Alfred forums). I like snark. I felt comfortable responding with snark. The adverb "gloriously" was not ironic, and the only thing I meant by hyperbolic was that you used a grand universal ("there are no comments anywhere"). Your point was valid (I didn't have any functionally helpful comments that would let you see my intention or read the code and understand it on a first pass). It is greatly helpful feedback. I was being snarky by pointing out the weakest part of your response, as I do have plenty of in-line comments.

:apology over:

a database-backed search with a database-like API. The database should be saved to disk to squeeze maximum speed out of it. It should enable key/row-based queries, and AND and OR. In particular, I want to make creation/update explicit and separate from searching. At least, it should not be necessary to have the dataset on hand if the index already exists.

This makes perfect sense and the code example is crystal clear. I confess that I am still greatly lacking in top-down programming. I'm still very much a "hacker" in the sense that I program to solve specific problems, and work from the bottom-up. This is why I ask for your opinion, because I know that you are quite good at API design and seeing things from the top-down.

I love Alfred-Workflow, and I want to be of service to its advancement. But I'm also not in your league as a programmer. This is probably the main reason for any miscommunication; I have a very patchwork knowledge of things (depth in areas where I've tried to solve problems, but lots of holes when looking at the breadth).

With these points in hand, I will attempt to write an Index class that conforms to the API proposed above. I will document it better and not ask for feedback while the code is still in alpha stage. I also think that I can get 75% of the way to re-implementing Workflow.filter() in sqlite (this is what I have basically been aiming at so far). Hopefully, these attempts will become actual advancements and make Alfred-Workflow that little bit better.

shawnrice avatar shawnrice commented on July 28, 2024

The approach is basically what I tried to do with that FuzzyBooks workflow that I sent both of you. I spent some time last week re-tooling the indexer. If you want to see its current state, just look at the class file in Alphred. Note: it's designed to work with an object of the Data class. If you look around line 380, you'll see a method called create_index_tables, which is the core of the class, really. Basically, it'll take the data object and create any tables that aren't already accommodated by the index. If you look at the next method, create_table, you'll also see the current SQL that I'm using to keep everything tied to the index's internal data table. This version of the code also supports incremental indexing (and callbacks for progress!). Take note of the methods in there that work around SQLite's reserved words.

Here's a caveat: I don't have a generic search class written for this yet. That's still on the "todo".

deanishe avatar deanishe commented on July 28, 2024

@smargh

I frequently read a snarky tone in your writing.

Yeah, tone doesn't come over in text. I'm not a snarky or sarcastic person. I'm British, so think deadpan/ironic.

I program to solve specific problems, and work from the bottom-up

You do, but that's not a bad thing; we're not talking about some huge system here. I think it's more that you tend to focus too exclusively on the specific use case you have in mind. As a result, your ideas would often break in other situations, e.g. #51 and your initial implementation of #52, which would have "corrupted" any .tgz file downloaded.

I will document it better and not ask for feedback when the code is in alpha stage anyway

It'd probably be better to ask before it's finished, in case there are some fundamental issues. Perhaps just be a bit clearer about which bits I should/shouldn't look at.

@shawnrice

look at the class file in Alphred Note: it's designed to work with an object of the Data class.

That looks a bit more complicated than I think Alfred-Workflow's need be. I'd rather have users provide a key() function that returns a dict of the keys that go in the database than require that the data be an instance/subclass of a specific Data class. I think it would allow users to not have to think about the index except when they're using it.

There remains the question of how to handle non-strings. int is easy enough, but what about lists (many-to-many?) or datetime objects?

In my demo workflow, tags worked well enough in a text field without needing the many-to-many.
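
e.g. a key() function along these lines (a sketch; the attribute names are made up):

def db_key(obj):
    # Sketch: flatten everything to plain text for the FTS table.
    # Tags become a space-separated string, so a query like 'tags:python'
    # still works without any many-to-many table.
    return {
        'title': obj.title,
        'tags': ' '.join(obj.tags),
        'year': str(obj.pubdate.year),
    }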

shawnrice avatar shawnrice commented on July 28, 2024

The extra class adds a bit that is probably unnecessary for most use cases, and for things like indexing paths or something simple, it is way overkill, and it is particularly suited to the FuzzyBooks because that's the implementation that I've been developing it with. I'm sure it'll change when I redo the search part of WHW with it. I need a third, very simple use-case to make sure that it can accommodate that. Granted, I could just write a simple wrapper that would take care of the Data and Indexing part, but I'm not sure if I want to do that because it'll add weight. I can choose not to care because it is something to be run in the background, so performance isn't top, top priority, but that just feels like cheating right now.

Honestly, the conceptual problem that I've been running into is the difference between an Indexer and a Search. I mean, we're Indexing with FTS, but, again, searching is different, mostly because the Indexer doesn't prioritize and doesn't care about the "display" of the result. Right now, my Indexer is really well built to give you all the records where something is matched, but it doesn't have the priority functioning in there.

For the specific types (again, this is currently my approach, but you might be able to take something from it or make mine better): it deals with lists just fine, because they get their own table, but it doesn't do nested lists. I.e. it deals with:

{
  "field": "string",
  "list": [ 0, 1, 3, "a banana" ]
}

and

{
  "field": "string",
  "list": {
    "whatever": "is a banana",
    "second item": "is not a banana"
  }
}

but it chokes on

{
  "field": "string",
  "list": {
    "whatever": "is a banana",
    "second item": {
      "list2": "is not a banana"
    }
  }
}

My current solution is not to care, and so the Indexer would just skip "second item" in the last example.

For datetime: I'm not sure. Do you mean that you want to store them as datetime objects in the SQLite database? For me, I just sort of ignored them because (1) they hadn't occurred to me, and (2) there isn't a native is_date function in PHP. (Well, there kind of is, but I'm not confident in using it.) But, with Python, if you can tell that a value in the dict is a date object, then you could always store it as such. However, you'll probably run into problems when you have a date represented as a string. I'm pretty sure that you'd get too many false positives testing to see if each string/int was a date.

Note with regard to int: I treated them, in the indexer, as a number type rather than an integer proper. The main reason for this was that I couldn't be confident that even though I'd get an int the first time, I wouldn't get a float the second time. So, if I tried to push a float into an int column, then I'd get some sort of error. I could, of course, try to catch the error and then alter that table from int to numeric, but it just seemed a bit easier for me to use numeric from the outset. I did store bool as int because SQLite doesn't understand bool.

Lastly, if you want to see how my Indexer digests the data, then you can use the following code to play around with it:

<?php

// Testing fix. Since I'm running this outside of Alfred but want to use the bundler
// to get progress on long running operations, then I need to set these variables
$_SERVER['alfred_workflow_bundleid'] = 'com.spr.fuzzybooksearch';

$_SERVER['alfred_workflow_data'] =  $_SERVER['HOME']
    . "/Library/Application Support/Alfred 2/Workflow Data/"
    .  $_SERVER['alfred_workflow_bundleid'];
$_SERVER['alfred_workflow_cache'] = $_SERVER['HOME']
    . "/Library/Caches/com.runningwithcrayons.Alfred-2/Workflow Data/"
    . $_SERVER['alfred_workflow_bundleid'];

// Set to use head for tests... or to make sure that devel doesn't fail
$_ENV['AB_BRANCH']                  = 'devel';
$_ENV['AB_TESTING']                 = TRUE;

// Require the Alphred Library
require_once( __DIR__ . '/lib/Alphred/classes/Alphred.php' );
// Alternatively, I think you can just include the individual Data and Index files.

// Require the bundler for special sauce
require_once( __DIR__ . '/alfred.bundler.php' );
// Rev up the bundler
$bundler = new AlfredBundler;

if ( ! defined( "DATABASE" ) ) {
    define( "DATABASE", $_SERVER['alfred_workflow_data'] . '/data.sqlite3' );
}
if ( ! defined( "INDEX" ) ) {
    define( "INDEX", $_SERVER['alfred_workflow_data'] . '/index.sqlite3' );
}
if ( ! defined( "UNIFIED_DATA" ) ) {
    define( "UNIFIED_DATA", $_SERVER['alfred_workflow_data'] . '/data-files' );
}

// We're going to remove the databases and re-process them
// unlink( DATABASE );
// unlink( INDEX );

// Send a notification
$bundler->notify( 'Fuzzy Books',
                  "About to create `data` database.",
                  $bundler->icon( 'elusive', 'cogs', 'CDCDCD', false )
);

// We're going to use this variable to create the queue for the Indexer
$to_update = [];

$time = time();
$db = new \Alphred\Database\Data( DATABASE );
// Get the last update
$last_updated = $db->get_meta( 'update' );
foreach ( array_diff( scandir( UNIFIED_DATA ), ['.', '..', '.DS_Store'] ) as $file ):
        // Work only with JSON files
    if ( 'json' != pathinfo( UNIFIED_DATA . "/{$file}", PATHINFO_EXTENSION ) ) {
        continue;
    }
    // Just the check to see if the file was updated after the last update
    if ( filemtime( UNIFIED_DATA . "/{$file}" ) <= $last_updated ) {
        continue;
    }
    $data = json_decode( file_get_contents( UNIFIED_DATA . "/{$file}" ), true );
    // Here we're specifying the unique key, but if we leave that out, then it'll hash the data
    // to create the unique key
    $unique = $data['ean'];
    $db->add( $data, $unique, true );
    $to_update[] = $data['ean'];
endforeach;

$bundler->notify( 'Fuzzy Books',
                  "Took " . (time() - $time) . " seconds to create database.",
                  $bundler->icon( 'elusive', 'cogs', 'CDCDCD', false )
);

// Add a delay to allow the last notification to live for a second
sleep(1);

// Notify of a new action
$bundler->notify( 'Fuzzy Books',
                  "About to index `data`; this may take a while.",
                  $bundler->icon( 'elusive', 'cogs', 'CDCDCD', false )
);

// Create a new progress bar for the Indexer
$client = $bundler->wrapper('cocoadialog');
$progress = new ProgressBar($client, [
    'title'   => 'Fuzzy Books Progress',
    'percent' => 0,
    'text'    => "[0/?] Initializing...",
    'icon'    => 'download'
]);


$time = time();
$db = new \Alphred\Database\Index( DATABASE, INDEX, [ 'update' => true, 'progress_callback' => 'my_progress_bar' ] );
$db->sync_index( 'my_progress_bar' );
$bundler->notify( 'Fuzzy Books',
                  "Took " . (time() - $time) . " seconds to index data.",
                  $bundler->icon( 'elusive', 'cogs', 'CDCDCD', false )
);

// Manually re-index each record...; this might be unnecessary based on the above sync_index call,
// but I think it was in there because I wanted to double-check the `reindex_record` method
foreach( $to_update as $ean ) :
    $db->reindex_record( $ean );
endforeach;


////////
// Callback for the indexer to use to update the progress bar
function my_progress_bar( $count, $total ) {
    global $progress;
    $progress->update(
        $percent = round( ( $count / $total ) * 100 ),
        $text    = sprintf( '[%d/%d] Processing complete at %s %%...', $count, $total, (string) round( ( $count/$total ) * 100, 2 ) )
    );
}

A few things: this was lifted from the current FuzzyBooks indexer, so it's hacky. The progress functions are there because I was playing around with indexing 9.5k files and really wanted to be able to watch the progress while I did other things, so it's using the bundler for the progress bars. I added a few comments in there to help you through it.

So, you can create all sorts of different types of JSON files, drop them in a directory, and then run that to see how it works (or doesn't). Maybe it'll help, maybe not.

deanishe avatar deanishe commented on July 28, 2024

@shawnrice

I may be wide of the mark here, as I can't find a succinct description of how your search database works, and I can't be bothered familiarising myself with several hundred lines of code to figure it out. (Your example code above doesn't include any data, so it's not super enlightening wrt what the indexer is doing.)

At any rate, it seems way more complex than what I have in mind.

I was considering handling int and datetime/date objects (Python's sqlite bindings handle datetime and date instances natively). But doing so would require parsing the user's query in order to generate corresponding SQL (adding date() and datetime() and all that jazz). I just don't want to go down that road.
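
(To be clear about what "natively" means here, a quick sketch of the built-in date round-trip:)

import datetime
import sqlite3

# PARSE_DECLTYPES switches on the built-in date/datetime converters
con = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
cur = con.cursor()
cur.execute('CREATE TABLE books (title TEXT, pubdate DATE)')
cur.execute('INSERT INTO books VALUES (?, ?)',
            ('The Future', datetime.date(1999, 1, 1)))
row = cur.execute('SELECT pubdate FROM books').fetchone()
print(repr(row[0]))  # -> datetime.date(1999, 1, 1)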

So I think the simplest solution is to stick with pure text in an FTS virtual table. That means you can use a nice rank function to weight the various keys/columns and don't need to mess around with parsing queries.
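
For reference, the kind of rank function I mean looks roughly like this (a sketch, assuming the default 'pcx' matchinfo format):

import struct

def _parse_matchinfo(blob):
    # The blob from matchinfo() is an array of unsigned 32-bit ints
    return [struct.unpack('I', blob[i:i + 4])[0]
            for i in range(0, len(blob), 4)]

def rank(matchinfo, *weights):
    """Weight and score a row from FTS matchinfo data (sketch).

    With the default 'pcx' format: info[0] is the number of phrases,
    info[1] the number of columns, then 3 ints per phrase/column pair
    (hits in this row, hits in all rows, number of docs with a hit).
    """
    info = _parse_matchinfo(matchinfo)
    phrase_count, col_count = info[0], info[1]
    weights = weights or [1.0] * col_count
    score = 0.0
    for phrase in range(phrase_count):
        base = 2 + phrase * col_count * 3
        for col in range(col_count):
            hits_this_row = info[base + col * 3]
            hits_all_rows = info[base + col * 3 + 1]
            if hits_this_row:
                score += weights[col] * (float(hits_this_row) / hits_all_rows)
    return score

# Registered on the connection, then called as rank(matchinfo(search)):
# con.create_function('rank', -1, rank)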

the Indexer would just skip second item in last example.

Which is exactly why I want the user to provide a key() function to generate a nice, flat dict mapping of string keys to string values.

I honestly can't see how you could build a library that can handle arbitrary JSON in a generally useful way in an SQL database. SQL databases expect 2-dimensional datasets with explicit relationships between them. You just can't map n-dimensional JSON objects to an SQL database without extensive knowledge of the data structures.

Beyond a list of key->value mappings, the only additional complexity I think I'd consider adding is allowing the user to optionally specify which dict key should be used as a unique id. If no unique id is specified, the library will generate its own, but the update() function won't work: the index would have to be rebuilt every time.

I envisage 2 tables:

data to store the serialised data objects:

id   | serialised data
1    | pickled obj
2    | pickled obj
3    | pickled obj
etc. | etc.

and the FTS virtual search table:

id   | user_key1  | user_key2  | etc.
1    | key1_data1 | key2_data1 | etc.
2    | key1_data2 | key2_data2 | etc.
3    | key1_data3 | key2_data3 | etc.
etc. | etc.       | etc.       | etc.

fractaledmind avatar fractaledmind commented on July 28, 2024

@deanishe That makes sense. I'm going to move forward with that structure.

As to the issue of types, I have this afternoon been fiddling with handling custom Python types in SQLite. While I think @shawnrice's Data class might be overkill (especially as a user-facing class), I was testing using a Dataset class as an internal intermediary class for interfacing with SQLite. The point of using a custom class is to take advantage of sqlite3's converting capabilities. Below is a simplistic example (using the Gutenberg books data from your index demo as data), but it shows the skeleton clearly enough I think:

#!/usr/bin/env python
# encoding: utf-8
from __future__ import print_function, unicode_literals

import sqlite3
import pickle
import csv


def books_data():
    d = []
    with open('/Users/smargh/Code/alfred-index-demo/src/books.tsv', 'rb') as f:
        reader = csv.reader(f, delimiter=b'\t')
        for row in reader:
            keys = ['id', 'author', 'title', 'url']
            values = [v.decode('utf-8') for v in row]
            s = dict(zip(keys, values))
            d.append(s)

    return d


class Dataset(object):
    """Represents a workflow's searchable dataset.

    This object is used to interface with the SQLite index database.
    It is strictly internal, and should not be used directly by the workflow author

    """
    def __init__(self, data):
        # data can be of any type, or combination of types,
        # that can be pickled
        self.data = data

    def __repr__(self):
        return 'Dataset({})'.format(len(self.data))


def adapt_dataset(dataset):
    """Convert ``Dataset`` obj into ``buffer`` obj
    in order to input into SQLite database.

    `dataset` = ``Dataset()`` instance
    """
    return buffer(pickle.dumps(dataset.data))


def convert_dataset(s):
    """Convert ``s`` obj into ``Dataset`` obj

    `s` = a pickled str
    """
    data = pickle.loads(s)
    return Dataset(data)

# Register the adapter
sqlite3.register_adapter(Dataset, adapt_dataset)

# Register the converter
# This function hates Unicode strings, thus the conversion
sqlite3.register_converter(str("dataset"), convert_dataset)

# An instance of the ``Dataset`` class,
# where the ``data`` is a list of dicts
p = Dataset(books_data())

# Set `detect_types` to allow for Python to SQLite to Python roundtrip
con = sqlite3.connect(':memory:',
                      detect_types=sqlite3.PARSE_COLNAMES)
with con:
    cur = con.cursor()
    # create a simple test table
    cur.execute("create table test(p)")
    # Here, the insert will use the ``adapt_dataset`` function
    cur.execute("insert into test(p) values (?)", (p,))
    # Here, the select will use the ``convert_dataset`` function
    cur.execute('select p as "p [dataset]" from test')
    # The ``print()`` function will use the ``Dataset.__repr__`` method
    print(cur.fetchone())

We would then use this Dataset class when inserting the pickled objects into the data table. This way, the implementation details (how exactly the serialization is done (pickle vs cPickle, or some other serialization perhaps), how the deserialization is done, etc.) are decoupled from the dataset itself and from the SQLite table. Any changes in implementation would occur within the Dataset class or the adapt_* and convert_* functions.

shawnrice avatar shawnrice commented on July 28, 2024

@deanishe As a clarification: the Index class indexes any JSON by field. It might be overkill....

(Cutting out the Data class part of it.)
Basically, if you have a JSON file like:

{
  "field_1": "some text",
  "field_2": "some other text"
}

then the index will have two tables, the first named field_1 and the second named field_2. It would also have a data table called data where a copy of the JSON record is stored. Basically, this enables you to do an FTS query on field_1 and retrieve all the records.

It allows for the possibility of granular, FTS searching on fields. The Index expands itself as necessary to accommodate any other fields thrown in. If you add another record to the same index:

{
  "field_1": "this is another record",
  "field_3": "banana"
}

Then, you'll now have a third table called field_3. It doesn't matter that field_2 isn't in the second record.

Does that help explain it better? It might be more complex than what you have in mind.

deanishe avatar deanishe commented on July 28, 2024

@smargh: that seems like overkill. What's wrong with serialize() and deserialize() methods?

Also, I edited your comment so the code is parsed and highlighted as Python code.

@shawnrice: why create a separate table for each key? You can make each one a column in the same table and use field_1:word1 AND field_2:word2 as a query to query specific columns.
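
Something like this (quick sketch; note that space-separated terms are implicitly ANDed in the standard FTS query syntax):

import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE VIRTUAL TABLE search USING fts3(field_1, field_2)')
cur.execute('INSERT INTO search (field_1, field_2) VALUES (?, ?)',
            ('some text', 'some other text'))
# Column-qualified terms restrict each term to one column
cur.execute('SELECT * FROM search WHERE search MATCH ?',
            ('field_1:text field_2:other',))
print(cur.fetchall())  # -> [('some text', 'some other text')]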

fractaledmind avatar fractaledmind commented on July 28, 2024

What's wrong with serialize() and deserialize() methods?

Nothing, I was just trying to take advantage of the full power of the sqlite3 module. Plus, it was a learning exercise. But, serialize() and deserialize() methods could work just as well. Do you mean methods of an internal Dataset class tho? Or of something else?

deanishe avatar deanishe commented on July 28, 2024

@smargh: No, I meant as methods of the Index class. I don't see much point in wrapper classes and registering custom handlers for them when there's only one custom datatype. Especially as Dataset isn't actually a class we want to use, but rather a wrapper around the object we want.

@shawnrice: I've tried to figure out how your "Fuzzy Books" workflow works, but I can't. There are about 100 files doing all kinds of stuff, no high-level description of how it actually works (there appears to be a server process in there?), and the comments are either of the rather obvious kind:

// Create the record with the title
$result = new \Alphred\Result( $record['title'] );

Or of the only-meaningful-to-the-person-who-wrote-it kind:

// Create the cache container and client (two for one!)
$client = new \FileCache( DATABASE, INDEX, $alfred->data() . '/Cache' );

😕

As I say, I can't tell what it's actually doing, but this whole stopwords business makes me think you're not generating your SQL properly. If you were, using reserved words as table/column names wouldn't be a problem:

>>> import sqlite3
>>> db = sqlite3.connect(':memory:')
>>> cur = db.cursor()
# Bad SQL syntax raises an Exception with reserved words as names
>>> cur.execute('CREATE TABLE select (and, autoincrement, create, table);')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
OperationalError: near "select": syntax error
# Correct SQL syntax works just fine with reserved words as names
>>> cur.execute('CREATE TABLE `select` (`and`, `autoincrement`, `create`, `table`);')
<sqlite3.Cursor object at 0x1031dff80>

shawnrice avatar shawnrice commented on July 28, 2024

@deanishe Yeah. That workflow is a beast and very much in development, which is why it hasn't been released to the public at all, so the comments aren't really there yet. I think that I shared it with you guys a bit prematurely.

re: reserved_words: you're probably right. It was failing when going through the PHP Sqlite3 class. So, I either messed that up, or the PHP class doesn't allow it. I'll poke around.

Somewhere in my notes, I have it to write SimpleIndex and AdvancedIndex. What I've been referring to is what would be the AdvancedIndex. Right now, I can't remember why I didn't just do the single table with as many columns as fields. I know that I tried and that it didn't fit the advanced use-case that was FuzzyBooks. It might have partly stemmed from having many-to-many relationships and wanting those to get their own tables. I honestly can't remember right now.

It's probably a good idea, then, to ignore the PHP libraries that I've been writing for now, because they're a bit too inaccessible and too pre-alpha to offer a route here.

fractaledmind avatar fractaledmind commented on July 28, 2024

EDIT 3:

Alright, I've figured out how to use FTS MATCH in a query on the search table while getting the item data from the data table:

sql = """SELECT data.serialised_data
                 FROM search, data
                 WHERE
                    search MATCH ?
                    and data.uid = search.rowid
        """

My current issue is figuring out how to wrap the ranking function around this. To recall, this is the old version of the SQL I was using to query and rank:

sql = """SELECT * FROM
               (SELECT rank(matchinfo(search))
               AS score, {columns}
               FROM search
               WHERE search MATCH ?)
               ORDER BY score DESC;"""

The nested SELECT ensures only one call is made to the rank function (Dean, this is all cribbed from your original demo, so I'm leaving the ranking function out here). This is important for optimization. But I'm just not advanced enough in SQL to see how to combine these two queries into one...


EDIT 2 (I'm leaving old info for real-time followup):

Ok. SQLite actually takes care of this for me. Because the search.rowid column is automatically created, and the data.uid column is the primary key, they are naturally linked. All I need to do is use an implicit join in my queries (to get data.serialised_data from the MATCH query).


EDIT: The answer on the StackOverflow question here seems like it's on the right path, but I don't understand SQLite triggers or views, so I can't tell which parts I should use and which to leave out (if any):

CREATE TABLE metadata (document INTEGER, page INTEGER, UNIQUE(document, page));
CREATE VIRTUAL TABLE data USING fts4();

CREATE VIEW whole AS SELECT metadata.rowid AS rowid, document, page, content 
    FROM metadata JOIN data ON metadata.rowid = data.rowid;

CREATE TRIGGER whole_insert INSTEAD OF INSERT ON whole
BEGIN
  INSERT INTO metadata (document, page) VALUES (NEW.document, NEW.page);
  INSERT INTO data (rowid, content) VALUES (last_insert_rowid(), NEW.content);
END;

CREATE TRIGGER whole_delete INSTEAD OF DELETE ON whole
BEGIN
  DELETE FROM metadata WHERE rowid = OLD.rowid;
  DELETE FROM data WHERE rowid = OLD.rowid;
END;

Ok, I need a bit of SQLite help. I've so far been unable to get the data and search tables properly linked. I am trying to implement Dean's solution of having two id columns in each table where the numbers auto-increment and reference one another. The issue is that FTS3 autogenerates a row id (appropriately called rowid) in a hidden column. I can't figure out how to reference that id in the data table, ensuring that the items are all properly matched.

Any help would be greatly appreciated :)

shawnrice avatar shawnrice commented on July 28, 2024

Replying after edit 2: Think of views as aliases. They're just short-hand for queries (this is not actually correct, but it's correct enough). So, the sqlite view takes care of that implicit join for you. The more I read up on sqlite, the more I realize how well coded it is. So your SELECT statements should reference the view rather than having to write the join out each time.

Triggers are exactly what you think they would be: when something happens, SQLite does other things for you automatically. Since the FTS4 table is a separate table (even though it is virtual), you still need to keep it up to date, and the triggers are what do that for you.

Just for clarity, you should stop saying that you're using FTS3 when you're using FTS4. And it should be fine to use FTS4 over 3. As I'm sure you've read here, they're basically the same thing, but FTS4 is more optimized.

fractaledmind avatar fractaledmind commented on July 28, 2024

Python doesn't have access to FTS4, only FTS3 (at least Python 2.x; I haven't checked Python 3.x, as it's not user-installed yet, so it doesn't matter for Alfred workflows).

Good explanation on views and triggers tho. Thanks

shawnrice avatar shawnrice commented on July 28, 2024

So my ignorance of Python here might be beneficial: does it matter? The difference between FTS3 and FTS4 is simply in the SQL. The SQL that you wrote out above uses FTS4 rather than FTS3:

CREATE VIRTUAL TABLE data USING fts4();

You just created an FTS4 table there, not an FTS3 table. Is that going to break the Python? If you're writing out the SQL in the Python, then you have access to FTS4. What matters most is that most Macs that we'll deal with re: Alfred will have a version of sqlite that has FTS4 compiled into it.

fractaledmind avatar fractaledmind commented on July 28, 2024

That was example code copied from StackOverflow, not my own code.

shawnrice avatar shawnrice commented on July 28, 2024

So, back to my comment about my ignorance of Python: I just don't understand what about FTS4 wouldn't work in one version of Python versus another. If there are predefined functions (methods?) in the Python connector that you must use and that take care of the creation of virtual tables for you, then you can't use FTS4; but if there is something in the connector that is like an exec or query function, then you can. I just don't understand how Python could prevent you from making a certain type of SQL query while it would allow you to write others.

Anyway, the post you asked for help in has been edited enough times that I'm not sure whether or not you still need the help with the SQL. Did you resolve that?

deanishe avatar deanishe commented on July 28, 2024

I just don't understand what about FTS4 wouldn't work in one version of Python to another

Because Python 2's sqlite3 library is compiled without the FTS4 module. It's not like a MySQL library that mostly just sends queries to a separate server and gets data back.
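
If in doubt, you can probe what the bundled library supports at runtime; a quick sketch:

import sqlite3

def fts_version():
    """Return 'fts4', 'fts3' or None, depending on what sqlite3 supports."""
    db = sqlite3.connect(':memory:')
    for module in ('fts4', 'fts3'):
        try:
            db.execute('CREATE VIRTUAL TABLE probe USING '
                       '{0}(content)'.format(module))
            db.execute('DROP TABLE probe')
            return module
        except sqlite3.OperationalError:
            continue
    return None

print(fts_version())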

deanishe avatar deanishe commented on July 28, 2024

FWIW, the SQL should look something like this:

Create tables

con = sqlite3.connect(INDEX_DB)
with con:
    cur = con.cursor()
    # cur.execute(
    #     "CREATE TABLE books(id INT, author TEXT, title TEXT, url TEXT)")
    cur.execute(
        """CREATE TABLE data (id INTEGER PRIMARY KEY, data BLOB)""")
    cur.execute(
        "CREATE VIRTUAL TABLE search USING fts3(id, author, title, url)")

Populate/update tables

with con:
    cur = con.cursor()
    data = {
        'id': id_,
        'author': author,
        'title': title,
        'url': url
    }
    serialised = pickle.dumps(data, protocol=2)
    cur.execute("""INSERT OR IGNORE INTO
                data (id, data)
                VALUES (?, ?)
                """, (id_, buffer(serialised)))
    cur.execute("""INSERT OR IGNORE INTO
                search (id, author, title, url)
                VALUES (?, ?, ?, ?)
                """, (id_, author, title, url))

Query

cursor = con.cursor()
try:
    cursor.execute(
        """SELECT
                search.score,
                search.author,
                search.title,
                search.url,
                data.data
            FROM
                (SELECT
                    rank(matchinfo(search)) AS score,
                    id,
                    author,
                    title,
                    url
                    FROM search
                    WHERE search MATCH ?
                ) AS search
                    JOIN data
                        on search.id=data.id
            ORDER BY search.score
            DESC
            LIMIT 100""", (query,))
    results = cursor.fetchall()
except sqlite3.OperationalError as err:
    # show error if query is malformed
    ...

Here I'm returning score, author, title, url and data, seeing as id is workflow-internal. You still have to pull id from the search table, though, in order to JOIN on it.

I suppose the tricky bit to grasp is that your sub-SELECT is creating a kind of virtual table. I assign it the name search so I can JOIN it to data on the id column.

fractaledmind avatar fractaledmind commented on July 28, 2024

I suppose the tricky bit to grasp is that your sub-SELECT is creating a kind of virtual table. I assign it the name search so I can JOIN it to data on the id column.

Genius. Thank you.

Question tho: Why return data.data and the search fields? Shouldn't the content in those fields be identical to the serialised data blob? If so, you can just return data.data and then deserialise it to get the dict.

deanishe avatar deanishe commented on July 28, 2024

Why return data.data and the search fields?

Because beer.

You're right. You'd only want to return the deserialised data (i.e. the original item) plus maybe score.

fractaledmind avatar fractaledmind commented on July 28, 2024

Well, beer is a good reason. I've been in Texas the last couple weeks and enjoyed a good few myself :)

deanishe avatar deanishe commented on July 28, 2024

So what's the thinking re unique IDs?

In order to enable updating the search index (instead of simply replacing it), the function that returns the searchable data for an item (which I'll call get_search_data() for now) must provide an id.

Would it be better to have an argument to Index.__init__() that specifies the name of the id key in the dict returned by get_search_data() (e.g. Index(unique_key='url', ...)), or should get_search_data() have to return a dict with an explicit id key?

The former would be more complex to implement. The latter would presumably involve a fair amount of data duplication.

shawnrice avatar shawnrice commented on July 28, 2024

You could make it available both ways.

The ID specification matters, really, only when the data is fed into the index. I.e. if the user specifies a particular id field, then the data can match on that field and replace it. Otherwise, you can make the id for the record by hashing the data (thus removing exact duplicates). If the user (workflow author) neglects to specify the ID, then they have the responsibility to implement something that does a hard reset of the database on occasion to ensure its integrity.

deanishe avatar deanishe commented on July 28, 2024

I'm tending towards just rebuilding the database, tbh. That would solve the additional problem of removing items that have disappeared from the underlying dataset.

That way, ids can be kept nice and simple with an auto-incrementing int, instead of having to hash a bunch of serialised data.

And the API would be very simple, too. No reset() or update() or unique keys.

fractaledmind avatar fractaledmind commented on July 28, 2024

Thinking out loud:

Why not simply use the internal incremented integers? When a new index is created, all items are given a unique id. If a workflow author wants to update the information for an item, he can specify which one by the id. He knows the id because we return the id number along with the data dict when searching (useful as the uid for Alfred results). I'm thinking we keep it simple (especially on first implementation), but functional. Getting into user-defined ids is trickier, but also unnecessary now since we have unique ids already.

I personally don't want to completely rebuild the database for updates. A workflow like ZotQuery has a large dataset that changes infrequently and insignificantly after the first creation (add a few items, change a few fields). It would be much faster to simply add a handful of items to the index of thousands than rebuild. But, in order to keep this clean, we put the onus on the workflow author: you have to work with our ids. If you want to update the information for a pre-existing item, get its id from the index. That's it. But if you have that, then updating is simple.

The main API in my mind mirrors SQL:

  • create(items, key)
  • insert(item_dict)
  • update(item_id, column, value)

deanishe avatar deanishe commented on July 28, 2024

he can specify which one by the id. He knows the id because we return the id number along with the data dict when searching

Then the user has to cache the data again, but with the ids. Not sure it's such a good idea to expect users to do that.

(useful as the uid for Alfred results)

Only if the id were a hash of the data. Otherwise, if the search index is rebuilt, there's no guarantee the same item will end up with the same id.
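
(i.e. something along these lines; a sketch, assuming the searchable data is JSON-serialisable:)

import hashlib
import json

def item_uid(search_data):
    # Sketch: derive the uid from the item's own searchable data, so the same
    # item keeps the same uid even after the index is rebuilt from scratch
    blob = json.dumps(search_data, sort_keys=True).encode('utf-8')
    return hashlib.md5(blob).hexdigest()

print(item_uid({'title': 'The Future', 'author': 'Smith'}))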

The main API in my mind mirrors SQL:

If you want update, you also have to provide a way to delete entries.

I'm thinking something simpler:

>>> def get_search_items(item):
...     return dict(title=item.title, author=item.author)

>>> idx = Index('my-index-name')
>>> idx.build(items, get_search_items)
>>> idx.search('my search query')
[Item(1), Item(2), ...]
>>> idx.delete()  # delete the index
True
>>> idx = Index('my-index-name')
>>> idx.search('some query')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: index 'my-index-name' does not exist
>>>

deanishe avatar deanishe commented on July 28, 2024

I've been dicking around a bit with FTS (looking at reimplementing the default filter() method on top of it, as @smargh was doing).

I haven't got very far on the query side of things, just the database creation, but I'm thinking the best way to integrate a search index might be as a Serializer, alongside cpickle, JSON, etc. (regardless of whether it uses a filter()-like algorithm or standard sqlite FTS search).

Don't know how that'll be done exactly (it'll require at least one more argument for the function to generate the search key/dict), but I'm thinking it'd be a much simpler (and faster) API if the data being cached go straight into the search database, rather than requiring the user to cache and then explicitly index the data.

My thinking is that any reasonably large dataset will be downloaded/imported/generated in the background, so the indexing overhead would be relatively inconsequential.

The full dataset is stored, serialised, in the database anyway. It would mean, however, that only lists can be cached with the IndexedSerializer, but filter() requires a list anyway…

If this does turn out to be a good solution, I suspect a more layered API will be required (i.e. the ability to automatically index data via the Serializer API, but also to explicitly use the indexing class(es)).

@smargh: How far did you get regarding implementing the current filter() algorithm on top of sqlite?

What do you think of that? And how could it be implemented (in terms of the Serializer API)?

Should there be a parallel cache_indexed_data('name', data, key=some_func) method?

Should cache_data() have an extra, optional key argument in case indexed is the specified serializer?

If we go that way, would it be okay for cached_data() to return a searchable object instead of the actual data? Is that too non-obvious? Should it be cached_data_index() instead?

How would you separate the filter()-like implementation from an SQL-like one? Is that even necessary? Should there just be the current search algorithm?

My personal feeling is that it might be very useful for some workflows to be able to query specific attributes, e.g. author:smith, especially those with very large datasets.

My first idea in that regard is that if key returns a string, the filter() algorithm would be used, but if it returns a dict, standard sqlite search would be used. Also, how would column weightings/rankings be handled in the latter case?

Workflow.cache_data() currently looks like cache_data(name, data) (custom serializers are only supported by the store(d)_data() methods). Would cache_data(name, data, serializer=None, **serializer_kwargs) be a viable solution? Any arguments not understood by cache_data() would be passed to Serializer.dump(), i.e. Serializer.dump(obj, file_obj, **serializer_kwargs).

I'm thinking it needs to use the cache API, not the data API, as the cache API supports automatically expiring and updating stale data. Or should that API be extended to both cache and data (along with indexing), whereby store(d)_data() and cache(d)_data() both use the same mechanism and support the same options, just in different directories?

fractaledmind avatar fractaledmind commented on July 28, 2024

@smargh: How far did you get regarding implementing the current filter() algorithm on top of sqlite?

I got to these three queries:

  • MATCH_STARTSWITH (sql_query = '^' + ' '.join(words))
  • MATCH_ATOM (sql_query = ' '.join(words))
  • MATCH_SUBSTRING (sql_query = ' '.join([w + '*' for w in words]))

Note, tho, that MATCH_SUBSTRING won't match x{query}z, only {query}x.
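
Put together, the translation looks roughly like this (a sketch; the flag values here are placeholders, not the real constants):

MATCH_STARTSWITH, MATCH_ATOM, MATCH_SUBSTRING = 1, 2, 4  # placeholder values

def fts_query(query, mode):
    """Translate a user query into an FTS MATCH expression (sketch)."""
    words = query.split()
    if mode == MATCH_STARTSWITH:
        return '^' + ' '.join(words)
    if mode == MATCH_ATOM:
        return ' '.join(words)
    if mode == MATCH_SUBSTRING:
        # prefix match only; won't find the query in the middle of a word
        return ' '.join(w + '*' for w in words)
    raise ValueError('unsupported match rule: {0!r}'.format(mode))

print(fts_query('alfred work', MATCH_SUBSTRING))  # -> alfred* work*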

What do you think of that? And how could it be implemented (in terms of the Serializer API)?

I think that makes a lot of sense in general. I agree, why add an unnecessary middle step; generate the data and store it in the index first. The only thing I would add is that we should ensure that the layered API is such that you can index data without using the Serializer API. That is, take pre-existing data and simply index it. This might not be as commonly used, but it should help ensure clear decoupling of the layers.

Should there be a parallel cache_indexed_data('name', data, key=some_func) method?

Should cache_data() have an extra, optional key argument in case indexed is the specified serializer?

If we go that way, would it be okay for cached_data() to return a searchable object instead of the actual data? Is that too non-obvious? Should it be cached_data_index() instead?

I think it makes sense to conceptually decouple the searchable index from the data serialization. As we are currently going, we are just storing the pickled data in the database and indexing it in addition. But the data table isn't really functionally any different than saving the data to a text file in .pickle format. Moreover, we can store the data in the data table in any of the Serializer formats. So, why don't we simply make the indexing the additional option? You can either [1] plain serialize or [2] serialize + index. If you [1] simply serialize, then the dumped data is stored to a text file using the format as the extension (as Alfred-Workflow currently implements); if you [2] serialize and index, the dumped data is stored in the data table of a SQLite database (in the format specified via the Serializer) and the search index table is generated as well. If you want to access the data via cached_data() and it has been indexed, we simply return the loaded data read from the data table (i.e. the data is returned in exactly the same format as if it had been saved to a file). This then decouples the data serialization and storage from the searching. If you want to add FTS searchability for another part of the workflow, enable it. But you can store and retrieve the dataset exactly as before using cache_data() and cached_data().

How would you separate the filter()-like implementation from an SQL-like one? Is that even necessary? Should there just be the current search algorithm?

My personal feeling is that it might be very useful for some workflows to be able to query specific attributes, e.g. author:smith, especially those with very large datasets.

My first idea in that regard is that if key returns a string, the filter() algorithm would be used, but if it returns a dict, standard sqlite search would be used. Also, how would column weightings/rankings be handled in the latter case?

I think we should def keep the current filter() algorithm. It covers a usage scenario that the Serializer setup wouldn't; that is, it allows you to filter a stream of data against a query without storing that data stream. This is very simple and useful for workflows that are dealing with ever-changing datasets. Why store it when the next dataset will be different? I think plenty of Internet-based workflows function in this way and the current filter() works great for that. In that regard, I think the SQLite FTS searching should be a completely different function. It doesn't "filter" a stream of data, it "searches" a stored dataset. So I say we have a search() function for use with the SQLite stuff. Whether that's a class method or a function can be determined later as we settle the API. But that naming convention makes more sense to me.

And I totally agree that some datasets could really take advantage of attribute queries. ZotQuery is built around that principle almost entirely (Zotero itself doesn't allow attribute queries).

Workflow.cache_data() currently looks like cache_data(name, data) (custom serializers are only supported by the store(d)_data() methods). Would cache_data(name, data, serializer=None, **serializer_kwargs) be a viable solution? Any arguments not understood by cache_data() would be passed to Serializer.dump(), i.e. Serializer.dump(obj, file_obj, **serializer_kwargs).

I'm thinking it needs to use the cache API, not the data API, as the cache API supports automatically expiring and updating stale data. Or should that API be extended to both cache and data (along with indexing), whereby store(d)_data() and cache(d)_data() both use the same mechanism and support the same options, just in different directories?

I go back and forth on the store(d)_data() and cache(d)_data() functionality. On the one hand, I think delineating differing purposes for the two locations is a good idea. They are different directories; they should have some differing purposes. On the other hand, who am I to say so? I like to put stuff in the data directory because it's easier to get to. I guess it depends on whether you want to enforce certain best practices or not. I'm fine either way TBH. As for the serializers in cache_data(), I think we should add the serializer=None option as well as an indexed=False option to both store_data() and cache_data().

from alfred-workflow.

deanishe avatar deanishe commented on July 28, 2024

Moreover, we can store the data in the data table in any of the Serializer formats. So, why don't we simply make the indexing the additional option?

A serialiser format doesn't make a whole lot of sense in the context of indexed data. Its purpose is to enable you to save data files in a user-editable format such as JSON. Database BLOBs aren't editable, and the serialised data will never reach the user. That's why I think specifying "indexed" as the serialisation format is the most meaningful solution.

I think adding an indexed option might create confusion: if the JSON serializer exists to create JSON files, then where is the JSON file (it won't be there if indexed is True)?

It doesn't "filter" a stream of data, it "searches" a stored dataset.

filter() doesn't take a stream, either. It needs the full dataset. You don't have to store it, certainly, but an sqlite-based filter() would have to re-create the search database in memory every time. This is pretty fast.

Perhaps the best solution is to have search() and filter() methods on the underlying Index class. Users can then use either search method on cached data and just use the normal filter() on in-memory datasets.

I guess it depends on whether you want to enforce certain best practices or not.

Not so much enforce best practices as create clarity. There is also the issue that the cache directory might disappear at any time if the user runs one of those system-maintenance apps.

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

OK. That was a misunderstanding on my part then. I thought the primary purpose of the serialization was persistence, not user-editability. Given the importance of editability, your setup is clearly best.

And I was too glib about the "stream", I meant really that it needs an iterable, whereas SQLite-based searching searches against an object.

Perhaps the best solution is to have search() and filter() methods on the underlying Index class. Users can then use either search method on cached data and just use the normal filter() on in-memory datasets.

Interesting. Let me make sure I'm understanding fully tho. I interpret this as meaning that you could filter() (the same algorithm in Alfred-Workflow now) the data table, or search() (using SQLite queries) the search table. To filter() the data table, I'm imagining using a SELECT * from data to build an iterable of all the items that is passed to filter().
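A rough sketch of that idea, assuming a data table of pickled value blobs like the one we've been discussing (the table layout and names are illustrative):

import sqlite3
import cPickle


def iter_items(db_path):
    """Yield every deserialised item stored in the data table."""
    con = sqlite3.connect(db_path)
    try:
        for (blob,) in con.execute('SELECT value FROM data'):
            yield cPickle.loads(bytes(blob))
    finally:
        con.close()

# The iterable can then go straight to the existing algorithm, e.g.
# wf.filter(query, iter_items(db_path), key=lambda d: d['title'])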

Then, in addition to these two searching capabilities for any indexed data, we would keep all of the current implementation as well. Is this what you meant? Or am I off base?


Side note: Congrats on going over 100 stars.

from alfred-workflow.

deanishe avatar deanishe commented on July 28, 2024

I thought the primary purpose of the serialization was persistence, not user-editability.

It is. The purpose of adding alternative serializers (JSON etc.) was to allow users (well, you in fact) to store data in more user-friendly and interoperable formats.

I meant really that it needs an iterable, whereas SQLite-based searching searches against an object.

True, filter() will also work with generators, which can work like a stream, I think (I've never used it that way, and the tests don't cover generators). I think the distinction is largely moot, however. The only purpose of re-implementing filter() on top of sqlite would be to make it faster. If building an in-memory database of all the items up front and then querying it ain't faster, it probably ain't worth doing…

Interesting. Let me make sure I'm understanding fully tho. I interpret this as meaning that you could filter() (the same algorithm in Alfred-Workflow now) the data table, or search() (using SQLite queries) the search table.

Well, I'm working on the assumption that the existing filter() can be reimplemented in a similar (but probably not identical) fashion in sqlite, and will be much faster for it (otherwise it probably isn't worth doing).

The way I see it, there are persistent and in-memory datasets and 2 search algorithms: the existing filter() and the proposed search() as used in the FTS demo workflow (i.e. with explicit, user-defined columns, and the ability to query specific columns: author:smith AND title:nations).

If filter() and search() are both implemented as methods on the Index class and Index supports in-memory databases as well as on-disk ones, then both search algorithms can be used on both kinds of data (with different key functions, however). Workflow.filter() would just build an in-memory database of items using an Index instance then use its filter() method.

If a user doesn't want to rebuild the database on every query (and it is relatively slow), they can use cached_data(..., serializer='indexed') (possibly using background updates), and the filter() method on the returned object, which should provide massive speed improvements, and make filter() useable with much larger datasets.
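To make the in-memory case concrete, here's a minimal sketch: build a throwaway FTS table from a list of dicts and run a prefix query against it. It assumes the system sqlite3 has FTS4 compiled in, and it is not the proposed Index API, just the underlying idea.

import sqlite3


def query_in_memory(items, query, keys=('author', 'title')):
    """Index `items` (dicts) in a throwaway in-memory FTS table and run
    a prefix query against it."""
    con = sqlite3.connect(':memory:')
    con.execute('CREATE VIRTUAL TABLE search USING fts4({0})'.format(
        ', '.join(keys)))
    con.executemany(
        'INSERT INTO search VALUES ({0})'.format(
            ', '.join(['?'] * len(keys))),
        ([item[k] for k in keys] for item in items))
    sql = 'SELECT * FROM search WHERE search MATCH ?'
    return con.execute(sql, (query + '*',)).fetchall()


books = [{'author': u'kant', 'title': u'critique of pure reason'},
         {'author': u'hume', 'title': u'a treatise of human nature'}]
print(query_in_memory(books, u'kant'))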

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

OK, I've come back to this, but this thread is now quite long and diffuse, so I've gotten a bit lost. However, having a SQLite serializer seemed like the logical place to start, so that's what I did. Below is a simple serializer to save data in a SQLite database. I have not added any indexing features yet.

import sqlite3
import cPickle


class SQLiteSerializer(object):
    """Wrapper around :mod:`sqlite3` with FTS searching support.

    .. versionadded:: 2.0

    """

    @classmethod
    def load(cls, file_obj):
        """Load serialized object from open SQLite file.

        .. versionadded:: 2.0

        :param file_obj: file handle
        :type file_obj: ``file`` object
        :returns: object loaded from sqlite file
        :rtype: object

        """
        # Get path to file and then close connection
        filepath = file_obj.name
        file_obj.close()
        with sqlite3.connect(filepath) as con:
            data = con.execute('SELECT value FROM data').fetchall()
            return [cls.deserialize(x[0]) for x in data]

    @classmethod
    def dump(cls, obj, file_obj):
        """Serialize object ``obj`` to open SQLite file.

        .. versionadded:: 2.0

        :param obj: Python object to serialize
        :type obj: Python object
        :param file_obj: file handle
        :type file_obj: ``file`` object

        """

        filepath = file_obj.name
        file_obj.close()
        with sqlite3.connect(filepath) as con:
            # First make the `data` table
            MAKE_TABLE = """CREATE TABLE IF NOT EXISTS data
                            (id INTEGER PRIMARY KEY, value BLOB)"""
            con.execute(MAKE_TABLE)
            # Then fill the `data` table
            UPDATE_ITEMS = """INSERT OR IGNORE INTO data (value)
                              VALUES (?)"""
            args_gen = ((cls.serialize(item),) for item in obj)
            con.executemany(UPDATE_ITEMS, args_gen)

    @staticmethod
    def serialize(obj):
        """Serialize an object using :mod:`cPickle`
        to a binary format accepted by :mod:`sqlite3`.

        """
        return sqlite3.Binary(cPickle.dumps(obj, protocol=-1))

    @staticmethod
    def deserialize(obj):
        """Deserialize objects retrieved from :mod:`sqlite3`."""
        return cPickle.loads(bytes(obj))

This could be added to Alfred-Workflow as it stands right now (thus the need for the retrieval of the file path from the file_obj in the load() and dump() methods). However, it is unclear to me how we would implement indexing and searching given the serializer API currently in place.
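A hedged usage sketch (I haven't run this against the current release), assuming the existing serializer manager and store(d)_data() hooks work as they do today:

# Register the serializer and round-trip some data through the
# store(d)_data() methods, which accept a custom serializer name.
from workflow import Workflow, manager

manager.register('sqlite', SQLiteSerializer)

wf = Workflow()
wf.store_data('books', [{'title': u'Critique of Pure Reason'}],
              serializer='sqlite')
print(wf.stored_data('books'))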

store(d)_data() and cache(d)_data() both utilize a Python obj -> serializer -> Python obj chain, so the user only ever sees Python objs. As we have stated, we need an Index object with a search() method, so how is this retrieved?

Do we write a wrapper class that is returned by the stored_data() and cached_data() methods, regardless of serializer? Or what?

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

I've been fiddling with this all day, and I have a few thoughts and things I've learned.

First, sqlite3 does support FTS4 (at least in version 2.6.0 and higher). This is good news because FTS4 makes a few helpful things possible. [1] It allows the full-text index simply to "index" the data from another table. This means that instead of pickling each whole data item into the data table with a simple two-column (id, data) schema, we can allow for a robust schema in the data table. (Aside: one thing I've come to believe strongly while fiddling is that we should require dictionaries as the object type for individual items to be indexed. Indexing to allow for column-specific queries means you have a data schema, so have the author give it to us explicitly; we will then build the data table around it (keys = column names).) [2] It allows for compress and uncompress functions. I've borrowed Kapeli's idea of using these to store a simple suffix-tree for the item entries, thus adding suffix and substring querying to the FTS searches.

Based on these things (and my fiddling today), I really feel that the best option is to write a custom container class that stores each dictionary as a row in a SQLite data table which is indexed by FTS4. Basically, I say we go one step further than your Serializer idea. Why have the workflow author construct the dataset, clump it together into a list, and then pass that to a Serializer, when you could simply let them add the data to our container (skipping the intermediate list)? The way I see it, if a workflow has a dataset with a consistent schema (a prerequisite for this functionality), then it will be created via some form of iteration. In all of my workflows, I build dicts that I append to a list that is then passed to filter(). I propose we allow dicts to be inserted into our custom container, which can then be searched with search().

Those dicts define the schema of the data table:

id    user_key1   user_key2   etc.
1     key1_data1  key2_data1  etc.
2     key1_data2  key2_data2  etc.
3     key1_data3  key2_data3  etc.
etc.  etc.        etc.        etc.

We then use SQLite triggers to keep the data table and the search virtual table in sync. And, if the workflow author wants the plain dataset out as a Python object (list of dicts), we can give that to him too (basically the same functionality as stored_data() or cached_data()).
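The trigger idea, roughly (illustrative table and column names, not the gist's exact SQL):

# Keep a data table and an FTS4 search table in sync with triggers.
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE data (id INTEGER PRIMARY KEY, author TEXT, title TEXT);
    CREATE VIRTUAL TABLE search USING fts4(author, title);
    CREATE TRIGGER data_ai AFTER INSERT ON data BEGIN
        INSERT INTO search (rowid, author, title)
        VALUES (new.id, new.author, new.title);
    END;
    CREATE TRIGGER data_ad AFTER DELETE ON data BEGIN
        DELETE FROM search WHERE rowid = old.id;
    END;
""")
con.execute('INSERT INTO data (author, title) VALUES (?, ?)',
            (u'Immanuel Kant', u'Critique of Pure Reason'))
print(con.execute(
    "SELECT rowid FROM search WHERE search MATCH 'kant'").fetchall())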

For a basic example using the Gutenberg books dataset from your index demo, you can check out this gist. It's purely functional and specific to that dataset, but it should be clear what I'm doing, and how one could abstract it. Let me know your thoughts on this proposed plan of action.

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

Actually, after making a sample custom container, and running some tests, I think your original idea is better. Make an Index class that you can dump data into. It's faster to use sqlite3's executemany() function on an INSERT than to iteratively INSERT each item.

But, I am going to still use the SQL with the TRIGGERS from the gist linked in my previous comment.

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

I have a working version of a full Index class. I've made it work either as a dump-all-the-info-into-me container, or as a container that you can add to as you build the dataset. There is clear documentation in the code as well as sample code to show how it works with the Gutenberg books data. It took me a while with school having just started back up, but I really like this class and its API. Hopefully I'm 95% there with this. The code is in this gist.

You will see that I used much simpler SQL (no triggers, no content mirroring, just the data compression). And I've tried to keep the code clean and functional. Let me know what you think about the API, how close we are to closing this issue, and what you want to have changed.


FWIW, I ran some tests and the insertmany() method is basically 10x faster than the insert() method. For the Gutenberg data, I got these results:

  • Indexed 44549 items in 9.248 seconds w/ insertmany
  • Indexed 44549 items in 80.465 seconds w/ insert

The test code is:

from index import Index
from time import time
import csv


def test_insertmany():
    def books_data():
        d = []
        with open('/Users/smargh/Code/alfred-index-demo/src/books.tsv', 'rb') as f:
            reader = csv.reader(f, delimiter=b'\t')
            for row in reader:
                keys = ['id', 'author', 'title', 'url']
                values = [v.decode('utf-8') for v in row]
                s = dict(zip(keys, values))
                d.append(s)
        return d

    dataset = Index(name='gutenberg_many',
                    schema=['id', 'title', 'author', 'url'])

    start = time()
    dataset.insertmany(books_data())
    print('Indexed 44549 items in {:0.3f} seconds'.format(time() - start))
    # Takes ~11 seconds to index 44549 rows w/ `porter` tokenizer
    # Takes ~10 seconds to index 44549 rows w/ `simple` tokenizer


def test_insert():
    dataset = Index(name='gutenberg_singleton',
                    schema=['id', 'title', 'author', 'url'])

    start = time()
    with open('/Users/smargh/Code/alfred-index-demo/src/books.tsv', 'rb') as f:
        reader = csv.reader(f, delimiter=b'\t')
        for row in reader:
            keys = ['id', 'author', 'title', 'url']
            values = [v.decode('utf-8') for v in row]
            s = dict(zip(keys, values))
            dataset.insert(s)
    print('Indexed 44549 items in {:0.3f} seconds'.format(time() - start))

print('MANY')
test_insertmany()
print('SINGLETON')
test_insert()

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

FWIW, I wanted to detail some of my API thinking.

First, I made schema a required init param because [1] it makes things much easier (instead of trying to auto-create it when needed in the most efficient way each time) and [2] makes explicit the most important detail. The workflow author will need to feed consistent data into Index, where each dict has the exact same keys. Forcing him/her to state the keys will hopefully keep dumb data from getting in (or at least make debugging it easier for him/her).

Second, as I state in a comment on line 20, the sqlite3.Connection object needs to persist for the entirety of the Index object's life so that the various custom functions (like compress, uncompress, and rank) all function properly at their various points. I did extensive testing, and generating new Connection objects broke functions.

Third, I moved away from the content={table} FTS4 approach because this did not allow for the data compression. Basically, it seems that using the content param allows for implicit triggers that keep data in the specified table search-indexed. However, the trigger is hidden and doesn't seem to respect the compress and uncompress functions. Again, I tested this.

Fourth, I use the compress and uncompress functions for the same reason as Kapeli (see the link in my comment above). Basically, I use his trick to allow for suffix and contains searches, but the strings used for searching are big, and so large datasets create large files. The compression allows the actual file written to disk to be as small as possible, while keeping the advanced search capabilities working (roughly as sketched at the end of this comment).

Fifth, I split insert() and insertmany() for the sake of explicitness (trying to embrace and really understand the Zen of Python). No "magic" under the covers, just different methods for different (if related) tasks.

Sixth, I allow for update(), but within a limited scope. Right now, the searching methods (more on that later) will always return the score and the rowid. The rowid is fixed for each item, so you can use that to update an item's info. Since this is how SQLite would do it, this is how I do it. Again, no "magic", but functionality is there if you need it.

Seventh, I made the findall() method because I wanted a Python searching function. Why make people write SQL and pass a string into a search function? And why make people learn the odd conventions for the various search types (substring, suffix, prefix)? However, the flexibility of the searching also makes for a more complex API. I tried to make it straightforward but powerful. I feel good about the power; you can basically create any search using the findall() function. But I'm not so sure about its clarity. I'd love to get your feedback on that.

Eighth, since some people do know SQL, and since the findall() function does have some limitations, I also made the sql_search() method. This takes a single string as input and tries to use it as a SQL query. Nothing more, nothing less.

Ninth, since the sqlite3.Connection needs to stay open for the duration of the Index object's life, users also need to close() the connection. At this stage, I don't see a way around this. Do you? Would the __del__ magic method be useful?

Tenth, I wanted some flexibility in the column weighting, so I allow for tuples on initialization as well as an optional param on sql_search(). Should there be an optional param on findall()? Should I force any ranking specification to happen on initialization and drop all later overrides?

That's where I was coming from and some of my future questions. Like I said, I finally feel good about this, so I'm hoping we're close to done on this particular problem.
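For reference, here's roughly how the compress/uncompress hooks from point four get wired up. zlib stands in here for whatever compression is actually used, and the suffix-string generation itself is omitted; this is a sketch of the mechanism, not my Index code.

import sqlite3
import zlib


def compress(value):
    # Called by FTS4 before a column value is written to the content table.
    if value is None:
        return None
    return sqlite3.Binary(zlib.compress(value.encode('utf-8')))


def uncompress(blob):
    # Called by FTS4 when a stored column value is read back out.
    if blob is None:
        return None
    return zlib.decompress(bytes(blob)).decode('utf-8')


con = sqlite3.connect(':memory:')
con.create_function('compress', 1, compress)
con.create_function('uncompress', 1, uncompress)
con.execute('CREATE VIRTUAL TABLE search USING '
            'fts4(title, compress=compress, uncompress=uncompress)')
con.execute('INSERT INTO search (title) VALUES (?)',
            (u'critique of pure reason',))
print(con.execute(
    "SELECT title FROM search WHERE search MATCH 'pure'").fetchall())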

from alfred-workflow.

deanishe avatar deanishe commented on July 28, 2024

Indexed 44549 items in 9.248 seconds w/ insertmany
Indexed 44549 items in 80.465 seconds w/ insert

😮 Woah. That's an amazing difference. I tried the same change in my "vanilla" version of the Gutenberg demo and it went from ~8s to ~4s, so a definite must-have in any case.

What you've come up with is really awesome. Looks great.

My one suggestion would be to "de-sql" the variable names. rowid should be just id and I think schema should be renamed, too. search_keys? keys? search_items? Not sure. I just suspect terms like schema and row are not optimal for users looking at Alfred-Workflow from the API side.

Also, I think Kapeli's idea is very clever and all, but is that the right fit for a generalised search?

It is, in any case, bloody useful in some cases, and it could definitely go in as an option (perhaps as a flag in filter()?)

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

😮 Woah. That's an amazing difference. I tried the same change in my "vanilla" version of the Gutenberg demo and it went from ~8s to ~4s, so a definite must-have in any case.

Yeah, the transaction costs are quite high. Plus sqlite3 has optimized the insertmany() method, so it really adds up.
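A toy comparison of the two approaches (timings will obviously vary by machine); executemany() pushes the loop into C and reuses the prepared statement:

import sqlite3
import time

rows = [('author %d' % i, 'title %d' % i) for i in range(20000)]

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE data (author TEXT, title TEXT)')

# Row-by-row execute() calls.
start = time.time()
for row in rows:
    con.execute('INSERT INTO data VALUES (?, ?)', row)
con.commit()
print('one-at-a-time: {0:0.3f}s'.format(time.time() - start))

con.execute('DELETE FROM data')

# Single executemany() call with the same rows.
start = time.time()
con.executemany('INSERT INTO data VALUES (?, ?)', rows)
con.commit()
print('executemany:   {0:0.3f}s'.format(time.time() - start))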

What you've come up with is really awesome. Looks great.

I very much appreciate that.

My one suggestion would be to "de-sql" the variable names. rowid should be just id and I think schema should be renamed, too. search_keys? keys? search_items? Not sure. I just suspect terms like schema and row are not optimal for users looking at Alfred-Workflow from the API side.

Good point. How about id and keys? I like just keys over some prefixed form because they should line up with the dictionary keys passed into insert() or insertmany().

Also, I think Kapeli's idea is very clever and all, but is that the right fit for a generalised search? It is, in any case, bloody useful in some cases, and it could definitely go in as an option (perhaps as a flag in filter()?)

I honestly added it because I spent so much time trying to find a solution to the "contains" and "endswith" searches. I wanted to demonstrate that implementing filter() in pure SQLite could be possible. I personally am fine with only a "startswith" query type. And adding a flag to the searching methods makes sense. The only issue is that you need to pass the compress= and uncompress= functions on database creation, so you pay the processing cost no matter what. That is, if you think it best to leave it out for speed/overhead reasons, we need to leave it out entirely; there's no switching it on optionally later. Otherwise, we just leave it on. I'm flexible.

from alfred-workflow.

deanishe avatar deanishe commented on July 28, 2024

Oops, I forgot you need the suffixes for "contains". Have you tried it with longer keys? What's the performance hit like?

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

The performance hit only occurs on creation/indexing. Here are some results:

# Indexed 44549 items in 2.018 seconds w/out creating suffix-arrays
# Indexed 44549 items in 12.109 seconds w/ creating suffix-arrays

Clearly, it's a big difference. But there is no difference on searching, so if we do the indexing/creation in the background, it may be possible to keep the functionality.

Also, it really sucks that the std lib sqlite3 doesn't allow for loading extensions. I'd love to add some fuzzy searching, but running a Python Levenshtein function is way slower than loading a C extension. Is there any way to use sqlite3.Connection.create_function() to call out to a C script file, instead of a local Python function?

from alfred-workflow.

shawnrice avatar shawnrice commented on July 28, 2024

@smargh, I'm just jumping in here for a hot second.

I went through this exact same thing in either November or December (I can't remember which) and got obsessed with different possible stemmers and matchers for SQLite3 (in the context of PHP, but, actually, there is no difference here). Porter isn't the best stemmer out there theoretically, but it is the best in practice because it's the default and it covers every necessary case. Yes, a more exotic approach like Levenshtein matching (exotic because it isn't the default) could cover more, but it's not better, because it's not the default.

I'm also chiming in because I think that your Classics PhD vocational training/thinking might be taking over here (I chalk my obsession up to my Rhetoric PhD vocational training/thinking, which isn't substantially different from yours here), and it's something that easily leads you into edge cases rather than what should be supported by a standard library.

I haven't been active in this thread recently, but I've been watching it, and it looks like you're doing great things, and I completely intend to borrow from those great things (with credit). So, stick with the great things for the general library instead of the edge cases.

If you really need to scratch the itch for the edge case (which I completely understand that need), then you should push the plugin concept that is being discussed in the v2 thread and make the stemmer pluggable.

A last note: don't hold me to this because I do not know if I'm correct, but I think that a C file would need to be compiled to work for SQLite3. In other words, if someone does not have the dev-tools installed, then trying to invoke a C "script" file (it's not a script) to provide the function would fail because there would be no compiler installed to compile it. So, offering that functionality would make it fail far too often on a client's computer, because you have to assume that the user hasn't installed any sort of C compiler. But I haven't tested how that part works, so I won't say for certain.

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

Shawn,

Always feel free to jump in for a hot second. You're always welcome 😉

And you're right, I'm going down an ultimately unnecessary rabbit-hole. Things are working well enough. I was just enamored with the whole fuzzy-searching thing, and then I just wanted to learn about Levenshtein. But I do think that things are working well enough now. It's fair to expect people to type correctly 😄

from alfred-workflow.

deanishe avatar deanishe commented on July 28, 2024

C would definitely have to be compiled. Is it even possible to load SQLite modules at runtime?

You'd have to distribute the compiled modules, and at least 2 of them: one for Python 2.6 and another for 2.7. I think any compiled C extension is compatible with any patch version.

At any rate, I agree with @shawnrice.

The speed hit on the suffixes is worrying. 12s is a very long time, and it works out to about 0.25s per 1,000 items. That's already very close to where you'd want your ad-hoc search (Workflow.filter()) to have finished searching, not indexing.

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

Well, we can go full-tilt for speed (and get some crazy fast speed) but trade off query flexibility. Basically, users would be restricted to the equivalent of item for item in items if item.startswith(query). I personally am fine with this, as I suspect 99% of queries match words from their beginning (i.e. not from the middle of the word, and not just initials). We would also still have the ability to do column-specific searches, and, as I said, we then get the full speed of SQLite.

Should we just go ahead and simplify down?

from alfred-workflow.

deanishe avatar deanishe commented on July 28, 2024

I suspect 99% of queries match words from their beginning

Yeah. This is my suspicion, too. Kapeli is concerned with Dash, which is a very different case: he needs expanduser to match os.path.expanduser, otherwise he's going to have to do lots of language-specific parsing to build a search database, and indexing speed is simply not an issue for Dash. It could take 10 minutes to index a docset. That's not optimal, but it's not a show-stopper for Dash, which updates any given docset once every blue moon. We're in a very different boat.

That said, I don't think "contains" is a worthless search criterion.

It's something we'd have to test. But I think we could do that faster in Python:

if query in ' '.join(item.values()):  # Assuming `dict` objects
    # handle hit here

Currently, I'm thinking we can "shortcut" a lot of filtering by using a default of max_results=50. "contains" is below "capitals", "initials" and "startswith" in the result hierarchy, so we might be able to stop a lot of searches before we get that far down the list of searches (from MATCH_STARTSWITH down to MATCH_ALLCHARS).
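Something like the following, perhaps; the rule functions here are stand-ins, not Alfred-Workflow's real match rules:

# Very rough sketch of the "shortcut" idea: try the match rules in rank
# order and stop as soon as max_results hits have been collected.
def shortcut_filter(query, items, rules, max_results=50):
    results, seen = [], set()
    for rule in rules:  # cheapest/highest-ranked rule first
        for i, item in enumerate(items):
            if i in seen:
                continue
            if rule(query, item):
                seen.add(i)
                results.append(item)
                if len(results) >= max_results:
                    return results
    return results


# Stand-in rules, roughly: startswith first, "contains" last.
rules = [lambda q, s: s.startswith(q),   # MATCH_STARTSWITH-ish
         lambda q, s: q in s]            # "contains" (MATCH_SUBSTRING-ish)

print(shortcut_filter('kan', ['kant', 'akan', 'hume'], rules, max_results=2))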

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

I don't know anymore when or how I got off track on this, but re-reading the thread, it appears that this is very nearly done. I've graduated and started a new job, so I ought to have a much clearer sense of my mental and time commitments now.

Anyway, I still think this would be great to get into Alfred-Workflow. Maybe we should brainstorm the "conditions of acceptance" (new jargon term from Scrum, which is what my new job uses) for this feature and try to ship it.

PS. Sorry for falling off the map on this one. Graduating and looking for a job really consumed all of my mental RAM for a good while there.

from alfred-workflow.

deanishe avatar deanishe commented on July 28, 2024

No worries. Nothing has been done in months.

TBH, I'm still kinda stuck on the serializer API, which all other modifications depend on.

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

Ok. I'm refactoring a bit, want to ensure I have the key things to change:

  1. Change rowid attribute name to id
  2. Change schema param/attribute name to keys
  3. Remove make_suffixarray and de_suffixarray and the associated substring and endswith searches

Is there anything else?


FWIW, on my new MacBook Pro, when I remove all the suffix array stuff, I can index the ~44500 books from your Gutenberg .tsv file in ~1.2 seconds, and then search is functionally instantaneous (as already discussed). The actual results are:

Indexed 44549 items in 1.306 seconds
Found 95 items in 0.004 seconds (searching for things that start with "kant")

So, if we do the indexing in the background, it seems reasonable that the index would be built by the time the user has finished typing the query for the actual search.

from alfred-workflow.

fractaledmind avatar fractaledmind commented on July 28, 2024

Some more data to consider:

To make things as efficient as possible, I made the books_data() function, which spits out the array of dictionaries that Index.insertmany() uses, a generator:

def books_data():
    keys = ('id', 'author', 'title', 'url')
    with open('books.tsv', 'rb') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            yield dict(zip(keys,
                           (v.decode('utf-8').lower() for v in row)))

I realized, along the way in testing the search capabilities, that it would be possible to search using a simple list comprehension straight from the generator:

[item for item in books_data()
 if 'kant' in item['author']
 and 'reason' in item['title']
 and 'pure' in item['title']]

This is actually quite efficient, running the generator, filtering on the query and returning the results within ~0.4 seconds each time.

To compare, the time it takes to [1] create the Index db and then [2] search it is consistently ~1.2 seconds, with ~1.19 of those seconds consumed by the creation step [1]. The search itself, when the Index db is already created takes ~0.001 seconds. (and all of this testing has been done multiple times with that Gutenberg dataset on a MacBook Pro).

So, the main point, as I see it, is that Index creation has to be done in the background and needs to be done as soon as possible. If we can take that ~1.2 seconds and spend it when the user isn't "watching", the speed of any actual queries will be magical. The other thing is that this method makes most sense in scenarios where the data being searched won't change much. The speed gains compound as you query the Index again and again without needing to re-fetch or regenerate the data. If, however, the data being searched is dynamic, the current Python filter() method is your best bet.
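A hedged sketch of pushing that index build into the background with the existing background module; update_index.py is a hypothetical script that would (re)build the on-disk Index:

# Rebuild the index in the background so queries never block on it.
from workflow import Workflow
from workflow.background import run_in_background, is_running


def main(wf):
    if not is_running('indexer'):
        run_in_background('indexer',
                          ['/usr/bin/python',
                           wf.workflowfile('update_index.py')])
    # ...query whatever index already exists here...


if __name__ == '__main__':
    Workflow().run(main)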

from alfred-workflow.

deanishe avatar deanishe commented on July 28, 2024

Closing issue as interest in the feature seems to have disappeared.

from alfred-workflow.
