bitfunnel / bitfunnel
A signature-based search engine
Home Page: http://bitfunnel.org
License: MIT License
We should probably have some way to handle this for users that don't want to decide in advance how much memory they want to allocate to slices because, for example, they're running a single instance on a machine that's also running other software.
Will everything be ok if we just hook up jemalloc/tcmalloc/dlmalloc? Probably, but we should check to make sure this doesn't degrade performance in a surprising way or cause fatal errors due to fragmentation.
From the example code for std::atomic::compare_exchange_weak:
// Note: the above use is not thread-safe in at least
// GCC prior to 4.8.3 (bug 60272), clang prior to 2014-05-05 (bug 18899)
// MSVC prior to 2014-03-17 (bug 819819). The following is a workaround:
We already require GCC5 and VC2015, but we might have to bump our required version of clang if we use this inside our threading primitives.
This may not matter; I'm filing it anyway so that I don't forget about it in case I swap this in.
There's some code in CmdLineParser that, for old VC++ compatibility reasons, used auto_ptr. It's mostly been converted to use unique_ptr, but there's still an array of "raw" pointers that could be converted to an array of unique_ptr.
The old BitFunnel code has some uses of volatile that are safe on VC++ but may be unsafe on other compilers. In particular, on older versions of VC++, volatile acted as a memory barrier; on newer versions the situation is more complicated, and clang and gcc don't make the same guarantees. I've removed uses of volatile from inside the Token* code, and I'm presently doing the same for the SimpleHash code.
The old codebase also contains volatile in:
Band*
ThreadsafeFlag
ThreadsafeState
ShardUnitTest
RecyclerUnitTest
I believe we shouldn't need Band*, ThreadsafeFlag, or ThreadsafeState, but that still leaves a couple of places where we might need to excise volatile.
Document classes as of d3fc186.
We currently declare tests in two different locations, one of which is the top-level file. It's a bit weird to have to add tests from the top-level file. It's possible that moving enable_testing() before add_subdirectory will remove the need for the duplicate top-level test declarations. If not, we should figure out what the problem is.
It seems like we should use something like http://stackoverflow.com/questions/23995019/what-is-the-modern-method-for-setting-general-compile-flags-in-cmake, rather than slamming stuff into the COMMON_CXX_FLAGS / CMAKE_CXX_FLAGS string.
I may make this change as part of BitFunnel/NativeJIT#10, but if not we should keep this in mind when making changes.
There are now other hash functions that are allegedly substantially faster which also have nice randomness properties. When we get a proper benchmarking setup, we should see how much time we spend hashing and consider replacing our hash.
Some things we might want to look into include FarmHash and SipHash.
Considering how small the things we're hashing are, we should also consider less "fancy" hash functions, which have a good chance at being faster on really small chunks of data. We could even consider using the crc32 intrinsic, which has single-cycle throughput.
Word on the street is that the algorithm totally falls over if we remove the escape hatch when placement fails 20 times in a row. If that's true, that seems to indicate that many (most?) placements are into full locations, which implies that the algorithm isn't doing much to help us.
Additionally, this has been running in production for quite a while and the system still seems to work ok despite never having been re-trained to adjust for the change in frequency of words over time.
Some questions are:
We have a quickly written version in the index branch and should convert it to something with better performance after the first part of the pipeline works.
Just out of curiosity, if we grab main memory and run it through some compression algorithm, will it compress? Some architectural design decisions assume that the data is basically as compressed as is possible. Is that right?
On OS X, if you run make test from a clean build, it will fail, claiming it can't find the test binaries. If you run make && make test, they will be found and run. Therefore, we conclude that the test target does not depend on the test binaries.
We have a few places where we're temporarily using assert because LogAssert hasn't been ported to Linux yet. We should get rid of those and make everything consistent as soon as LogAssert is ported.
Right now CmdLineParser does 73 tests inside of a single gtest TEST(). Consider breaking these out into separate tests. Need to change the baseline generation code to generate gtest TEST() syntax.
As soon as we have at least one file that compiles in Linux/OSX, we should back out this change.
Right now CmdLineParser supports -flags and /flags. It would be more consistent with Linux and GNU conventions if it supported --flags.
We have a lot of C-isms in the older BitFunnel code. From talking to @MikeHopcroft, there seems to be no particular reason for this except that this code is old and was written by people coming in with a C background.
I suspect this bug is so low priority that we'll never get to it, but it would be nice to clean some of this up.
See the note in DocumentLengthHistogram.cpp, in which we currently use OutputColumn<uint64_t> to get around this compiler error. The essence of the error is that the call to WriteField becomes ambiguous when you use size_t.
We use BITFUNNEL_PLATFORM_WINDOWS and friends in ifdefs, but in a lot of cases what we really mean is MSVC vs. clang vs. gcc, though not in all cases. This means that, for example, building with clang on Windows probably isn't going to work.
Do we really need complicated query support that requires NativeJIT? It's possible that some naïve thing is actually sort of ok. For Bing, we have a lot of really complex training, but if we just support something like what Lucene supports, it's not clear that we need all this.
TODO: do the experiment of trying a naïve ranking algorithm to see how much this matters.
Move inc/LoggerInterfaces to BitFunnel/LoggerInterfaces.
Change namespace from Logging to BitFunnel.
The /src directory is not as relevant now that UnitTest is buried down in BitFunnel/src/project/UnitTest. The only peer to src is now inc.
With src removed, the root would look something like:

Bitfunnel
    inc
        BitFunnel
    Common
        CmdLineParser
            src
            UnitTest
        CsvTsv
            src
            UnitTest
        Utilities
            src
            UnitTest
Some mechanism with the same goal is obviously needed if we have 1T, or maybe even 1B, documents. It's not clear that any of our open source users will have anywhere near that many documents, and query boosting is both bolted on in a way that makes it easy to exclude, and contains a lot of complexity.
DocInfo gets padded out to a nice size to avoid having hits span two cachelines. However, on Haswell, spanning a cacheline isn't that expensive in the general case unless you span a page boundary, and I believe I've heard that on some newer chips even spanning a page boundary isn't that bad (of course, you still pay for bringing the page in).
This seems pretty simple to test experimentally once we have a reasonable benchmarking setup. Getting a reasonable benchmarking setup without pulling in a bunch of stuff we don't want to pull in may be non-trivial.
Changing to a signed type would make checking for underflow more intuitive.
Changing to a 64-bit type would preclude the possibility of running into trouble because we have > 2B tokens outstanding. I don't think we know of any actual instances of that, but considering how few instances of the variable we have, I don't think we're really saving much memory by using a potentially 32 bit value.
Right now certain files are commented out of the TOPLEVEL target because their inclusion either has no impact (i.e. they don't show up) or it causes the TOPLEVEL project to show no files.
Reading "-1" on Windows results in 0xffffffff. The same call on OSX fails. According to the discussion at http://www.cplusplus.com/forum/general/153669/, some implementations will silently return 0.
See also BitFunnel/NativeJIT#15
How should we treat header files in CMakeLists.txt? Are they listed explicitly? Which ones? Only public files? All of them? Rationale?
The rationale we have today is that it groups them in a folder in the build if you're on Windows (inside the .sln).
The current implementation is a hack. It's possible that's fine because of how it's used, but we should discuss this and make a conscious design decision if that's the case.
In the first version of table.h, line 737, Dan had to change
m_data.push_back(GetValue());
to
m_data.push_back(this->GetValue());
in order to get it to compile under GCC. We should understand why this change was actually necessary.
This might be a good place to start investigating: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp
It sounds like there's a change coming that lets us replace our uses of runtime_error. This is basically a TODO to myself to replace this in code that just went through code review and got checked in, and also to check for other instances of the same thing.
Want to find some good workloads to benchmark for query rate, document ingestion, etc.
To get the porting started, I'm going to change this to use argv[0] instead of making some kind of multi-platform shim for both Linux and OS X, but there may be some reason that didn't work.
Talked with @MikeHopcroft, who didn't remember the reason for using GetModuleFileNameA, and I don't know if we have anyone else who would know.
I'm going to start filing issues on us here so we don't lose track of the TODOs we discussed recently. I'm not really a fan of github issues, but it seems good enough for now and we can always migrate later if we have a large enough volume of issues that it's a problem.
Some things that I suspect aren't in the current README are that, on Windows
This would make the capitalization more consistent. Basically, inc, src, and test would all be lower case. Names of projects would be PascalCase. The name "test" is less specific than "UnitTest", but this is unlikely to cause a problem.
It's probably not reasonable to ship the 1.6GB that we get as output from training. It's probably also not reasonable to expect that most (any?) open source users have enough training data to be able to effectively train something similar.
It seems like we should be able to get away with something much simpler for anyone without, say, 1B documents, though.
The original rationale for IInputStream, from Andrija:
From what I remember, we found that reading serialized plan sent from MLA to OS from standard input stream was a bottleneck and needed to address that.
I think the bottleneck was related to the global locale lock. I forget whether we were using stringstream or our vector stream implementation. If the latter, in addition to the locale lock we likely had a problem there because it allowed reading only one character at a time.
IInputStream has a Read() method which can read a chunk of bytes and also doesn’t get affected by locale. The cost is the virtual function call, but that was OK because we were reading large chunks.
But this could be done with something that acts like an istream.
The following Utilities classes are missing unit tests as of d3fc186:
Given that we have our own hash table implementation, we should consider limiting the capacity to powers of 2 and using a mask instead of a modulus. The last time I wrote my own hash table implementation I got a substantial improvement on hash-heavy workloads by doing that.
This is minor compared to getting things up and running again, but it's a trivial change that can have an outsized performance impact so it's worth looking at once we have a system that runs.