bitfunnel / bitfunnel
A signature-based search engine
Home Page: http://bitfunnel.org
License: MIT License
We should probably have some way to handle this for users that don't want to decide in advance how much memory they want to allocate to slices because, for example, they're running a single instance on a machine that's also running other software.
Will everything be ok if we just hook up jemalloc/tcmalloc/dlmalloc? Probably, but we should check to make sure this doesn't degrade performance in a surprising way or cause fatal errors due to fragmentation.
From the example code for std::atomic::compare_exchange_weak:
// Note: the above use is not thread-safe in at least
// GCC prior to 4.8.3 (bug 60272), clang prior to 2014-05-05 (bug 18899)
// MSVC prior to 2014-03-17 (bug 819819). The following is a workaround:
We already require GCC5 and VC2015, but we might have to bump our required version of clang if we use this inside our threading primitives.
This may not matter; I'm filing it anyway so that I don't forget about it in case I swap this in.
There's some code in CmdLineParser that, for old VC++ compatibility reasons, used auto_ptr. It's mostly been converted to use unique_ptr, but there's still an array of "raw" pointers that could be converted to an array of unique_ptr.
The old BitFunnel code has some uses of volatile that are safe on VC++ but may be unsafe on other compilers. In particular, on older versions of VC++, volatile acted as a memory barrier; on newer versions the situation is more complicated, and clang and gcc don't make the same guarantees. I've removed uses of volatile from inside the Token* code, and I'm presently doing the same for the SimpleHash code.
The old codebase also contains volatile in:
Band*
ThreadsafeFlag
ThreadsafeState
ShardUnitTest
RecyclerUnitTest
I believe we shouldn't need Band*, ThreadsafeFlag, or ThreadsafeState, but that still leaves a couple of places where we might need to excise volatile.
Document classes as of d3fc186.
We currently declare tests in two different locations, one of which is the top-level file. It's a bit weird to have to add tests from the top-level file. It's possible that moving enable_testing() before add_subdirectory will remove the need for the duplicate top-level test declarations. If not, we should figure out what the problem is.
It seems like we should use something like http://stackoverflow.com/questions/23995019/what-is-the-modern-method-for-setting-general-compile-flags-in-cmake, rather than slamming stuff into the COMMON_CXX_FLAGS / CMAKE_CXX_FLAGS string.
I may make this change as part of BitFunnel/NativeJIT#10, but if not we should keep this in mind when making changes.
There are now other hash functions that are allegedly substantially faster which also have nice randomness properties. When we get a proper benchmarking setup, we should see how much time we spend hashing and consider replacing our hash.
Some things we might want to look into include FarmHash and SipHash.
Considering how small the things we're hashing are, we should also consider less "fancy" hash functions, which have a good chance at being faster on really small chunks of data. We could even consider using the crc32 intrinsic, which has single-cycle throughput.
Word on the street is that the algorithm totally falls over if we remove the escape hatch when placement fails 20 times in a row. If that's true, that seems to indicate that many (most?) placements are into full locations, which implies that the algorithm isn't doing much to help us.
Additionally, this has been running in production for quite a while and the system still seems to work ok despite never having been re-trained to adjust for the change in frequency of words over time.
Some questions are:
We have a quickly written version in the index branch and should convert it to something with better performance after the first part of the pipeline works.
Just out of curiosity, if we grab main memory and run it through some compression algorithm, will it compress? Some architectural design decisions assume that the data is basically as compressed as is possible. Is that right?
On OS X, if you run make test from a clean build, it will fail, claiming it can't find the test binaries. If you run make && make test, they will be found and run. Therefore, we conclude that the test target does not depend on the test binaries.
We have a few places where we're temporarily using assert because LogAssert hasn't been ported to Linux yet. We should get rid of those and make everything consistent as soon as LogAssert is ported.
Right now CmdLineParser does 73 tests inside of a single gtest TEST(). Consider breaking these out into separate tests. Need to change the baseline generation code to generate gtest TEST() syntax.
As soon as we have at least one file that compiles in Linux/OSX, we should back out this change.
Right now CmdLineParser supports -flags and /flags. It would be more consistent with Linux and GNU conventions if it supported --flags.
We have a lot of C-isms in the older BitFunnel code. From talking to @MikeHopcroft, there seems to be no particular reason for this except that this code is old and was written by people coming in with a C background.
I suspect this bug is so low priority that we'll never get to it, but it would be nice to clean some of this up.
See the note in DocumentLengthHistogram.cpp, in which we currently use OutputColumn<uint64_t> to get around this compiler error. The essence of the error is that the call to WriteField becomes ambiguous when you use size_t.
We use BITFUNNEL_PLATFORM_WINDOWS and friends in ifdefs, but in a lot of cases what we really mean is MSVC vs. clang vs. gcc, though not in all cases. This means that, for example, building with clang on Windows probably isn't going to work.
Do we really need complicated query support that requires NativeJIT? It's possible that some naïve thing is actually sort of ok. For Bing, we have a lot of really complex training, but if we just support something like what Lucene supports, it's not clear that we need all this.
TODO: do the experiment of trying a naïve ranking algorithm to see how much this matters.
Move inc/LoggerInterfaces to BitFunnel/LoggerInterfaces.
Change namespace from Logging to BitFunnel.
The /src directory is not as relevant now that UnitTest is buried down in BitFunnel/src/project/UnitTest. The only peer to src is now inc.
With src removed, the root would look something like:

Bitfunnel
    inc
        BitFunnel
    Common
        CmdLineParser
            src
            UnitTest
        CsvTsv
            src
            UnitTest
        Utilities
            src
            UnitTest
Some mechanism with the same goal is obviously needed if we have 1T, or maybe even 1B, documents. It's not clear that any of our open source users will have anywhere near that many documents, and query boosting is both bolted on in a way that makes it easy to exclude, and contains a lot of complexity.
DocInfo gets padded out to a nice size to avoid having hits span two cachelines. However, on Haswell, spanning a cacheline isn't that expensive in the general case unless you span a page boundary, and I believe I've heard that on some newer chips even spanning a page boundary isn't that bad (of course, you still pay for bringing the page in).
This seems pretty simple to test experimentally once we have a reasonable benchmarking setup. Getting a reasonable benchmarking setup without pulling in a bunch of stuff we don't want to pull in may be non-trivial.
Changing to a signed type would make checking for underflow more intuitive.
Changing to a 64-bit type would preclude the possibility of running into trouble because we have > 2B tokens outstanding. I don't think we know of any actual instances of that, but considering how few instances of the variable we have, I don't think we're really saving much memory by using a potentially 32 bit value.
Right now certain files are commented out of the TOPLEVEL target because their inclusion either has no impact (i.e. they don't show up) or it causes the TOPLEVEL project to show no files.
Reading "-1" on Windows results in 0xffffffff. The same call on OSX fails. According to the discussion at http://www.cplusplus.com/forum/general/153669/, some implementations will silently return 0.
See also BitFunnel/NativeJIT#15
How should we treat header files in CMakeLists.txt? Are they listed explicitly? Which ones? Only public files? All of them? Rationale?
The rationale we have today is that it groups them in a folder in the build if you're on Windows (inside the .sln).
The current implementation is a hack. It's possible that's fine because of how it's used, but we should discuss this and make a conscious design decision if that's the case.
In the first version of table.h, line 737, Dan had to change
m_data.push_back(GetValue());
to
m_data.push_back(this->GetValue());
in order to get it to compile under GCC. We should understand why this change was actually necessary.
This might be a good place to start investigating: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp
It sounds like there's a change coming that lets us replace our uses of runtime_error. This is basically a TODO to myself to replace this in code that just went through code review and got checked in, and also to check for other instances of the same thing.
Want to find some good workloads to benchmark for query rate, document ingestion, etc.
To get the porting started, I'm going to change this to use argv[0] instead of making some kind of multi-platform shim for both Linux and OS X, but there may be some reason that didn't work.
Talked with @MikeHopcroft, who didn't remember the reason for using GetModuleFileNameA, and I don't know if we have anyone else who would know.
I'm going to start filing issues on us here so we don't lose track of the TODOs we discussed recently. I'm not really a fan of github issues, but it seems good enough for now and we can always migrate later if we have a large enough volume of issues that it's a problem.
Some things that I suspect aren't in the current README are that, on Windows
This would make the capitalization more consistent. Basically, inc, src, and test would all be lower case. Names of projects would be PascalCase. The name "test" is less specific than "UnitTest", but this is unlikely to cause a problem.
It's probably not reasonable to ship the 1.6GB that we get as output from training. It's probably also not reasonable to expect that most (any?) open source users have enough training data to be able to effectively train something similar.
It seems like we should be able to get away with something much simpler for anyone without, say, 1B documents, though.
The original rationale for IInputStream, from Andrija:
From what I remember, we found that reading serialized plan sent from MLA to OS from standard input stream was a bottleneck and needed to address that.
I think the bottleneck was related to the global locale lock. I forget whether we were using stringstream or our vector stream implementation. If the latter, in addition to the locale lock we likely had a problem there because it allowed reading only one character at a time.
IInputStream has a Read() method which can read a chunk of bytes and also doesn’t get affected by locale. The cost is the virtual function call, but that was OK because we were reading large chunks.
But this could be done with something that acts like an istream.
The following Utilities classes are missing unit tests as of d3fc186:
Given that we have our own hash table implementation, we should consider limiting the capacity to powers of 2 and using a mask instead of a modulus. The last time I wrote my own hash table implementation I got a substantial improvement on hash-heavy workloads by doing that.
This is minor compared to getting things up and running again, but it's a trivial change that can have an outsized performance impact so it's worth looking at once we have a system that runs.