Giter Club home page Giter Club logo

vellum's Introduction

vellum vellum

NOTE: active development of the vellum library has moved to https://github.com/blevesearch/vellum

This repository will remain as is to support previous Couchbase builds.

Tests Coverage Status GoDoc Go Report Card License

A Go library implementing an FST (finite state transducer) capable of:

  • mapping between keys ([]byte) and a value (uint64)
  • enumerating keys in lexicographic order

Some additional goals of this implementation:

  • bounded memory use while building the FST
  • streaming out FST data while building
  • mmap FST runtime to support very large FTSs (optional)

Usage

Building an FST

To build an FST, create a new builder using the New() method. This method takes an io.Writer as an argument. As the FST is being built, data will be streamed to the writer as soon as possible. With this builder you MUST insert keys in lexicographic order. Inserting keys out of order will result in an error. After inserting the last key into the builder, you MUST call Close() on the builder. This will flush all remaining data to the underlying writer.

In memory:

  var buf bytes.Buffer
  builder, err := vellum.New(&buf, nil)
  if err != nil {
    log.Fatal(err)
  }

To disk:

  f, err := os.Create("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }
  builder, err := vellum.New(f, nil)
  if err != nil {
    log.Fatal(err)
  }

MUST insert keys in lexicographic order:

err = builder.Insert([]byte("cat"), 1)
if err != nil {
  log.Fatal(err)
}

err = builder.Insert([]byte("dog"), 2)
if err != nil {
  log.Fatal(err)
}

err = builder.Insert([]byte("fish"), 3)
if err != nil {
  log.Fatal(err)
}

err = builder.Close()
if err != nil {
  log.Fatal(err)
}

Using an FST

After closing the builder, the data can be used to instantiate an FST. If the data was written to disk, you can use the Open() method to mmap the file. If the data is already in memory, or you wish to load/mmap the data yourself, you can instantiate the FST with the Load() method.

Load in memory:

  fst, err := vellum.Load(buf.Bytes())
  if err != nil {
    log.Fatal(err)
  }

Open from disk:

  fst, err := vellum.Open("/tmp/vellum.fst")
  if err != nil {
    log.Fatal(err)
  }

Get key/value:

  val, exists, err = fst.Get([]byte("dog"))
  if err != nil {
    log.Fatal(err)
  }
  if exists {
    fmt.Printf("contains dog with val: %d\n", val)
  } else {
    fmt.Printf("does not contain dog")
  }

Iterate key/values:

  itr, err := fst.Iterator(startKeyInclusive, endKeyExclusive)
  for err == nil {
    key, val := itr.Current()
    fmt.Printf("contains key: %s val: %d", key, val)
    err = itr.Next()
  }
  if err != nil {
    log.Fatal(err)
  }

How does the FST get built?

A full example of the implementation is beyond the scope of this README, but let's consider a small example where we want to insert 3 key/value pairs.

First we insert "are" with the value 4.

step1

Next, we insert "ate" with the value 2.

step2

Notice how the values associated with the transitions were adjusted so that by summing them while traversing we still get the expected value.

At this point, we see that state 5 looks like state 3, and state 4 looks like state 2. But, we cannot yet combine them because future inserts could change this.

Now, we insert "see" with value 3. Once it has been added, we now know that states 5 and 4 can longer change. Since they are identical to 3 and 2, we replace them.

step3

Again, we see that states 7 and 8 appear to be identical to 2 and 3.

Having inserted our last key, we call Close() on the builder.

step4

Now, states 7 and 8 can safely be replaced with 2 and 3.

For additional information, see the references at the bottom of this document.

What does the serialized format look like?

We've broken out a separate document on the vellum disk format v1.

What if I want to use this on a system that doesn't have mmap?

The mmap library itself is guarded with system/architecture build tags, but we've also added an additional build tag in vellum. If you'd like to Open() a file based representation of an FST, but not use mmap, you can build the library with the nommap build tag. NOTE: if you do this, the entire FST will be read into memory.

Can I use this with Unicode strings?

Yes, however this implementation is only aware of the byte representation you choose. In order to find matches, you must work with some canonical byte representation of the string. In the future, some encoding-aware traversals may be possible on top of the lower-level byte transitions.

How did this library come to be?

In my work on the Bleve project I became aware of the power of the FST for many search-related tasks. The obvious starting point for such a thing in Go was the mafsa project. While working with mafsa I encountered some issues. First, it did not stream data to disk while building. Second, it chose to use a rune as the fundamental unit of transition in the FST, but I felt using a byte would be more powerful in the end. My hope is that higher-level encoding-aware traversals will be possible when necessary. Finally, as I reported bugs and submitted PRs I learned that the mafsa project was mainly a research project and no longer being maintained. I wanted to build something that could be used in production. As the project advanced more and more techniques from the BurntSushi/fst were adapted to our implementation.

Are there tools to work with vellum files?

Under the cmd/vellum subdirectory, there's a command-line tool which features subcommands that can allow you to create, inspect and query vellum files.

How can I generate a state transition diagram from a vellum file?

The vellum command-line tool has a "dot" subcommand that can emit graphviz dot output data from an input vellum file. The dot file can in turn be converted into an image using graphviz tools. Example...

$ vellum dot myFile.vellum > output.dot
$ dot -Tpng output.dot -o output.png

Related Work

Much credit goes to two existing projects:

Most of the original implementation here started with my digging into the internals of mafsa. As the implementation progressed, I continued to borrow ideas/approaches from the BurntSushi/fst library as well.

For a great introduction to this topic, please read the blog post Index 1,600,000,000 Keys with Automata and Rust

vellum's People

Contributors

abhinavdangeti avatar mschoch avatar prateek avatar rvncerr avatar sreekanth-cb avatar steveyen avatar techiev2 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

vellum's Issues

add runtime options to Load/Open of the FST?

If you wanted to run an FST from some untrusted source, you'd want to enable additional bounds/cycle checking. Currently we omit this for performance, but it could somewhat reasonably be added as an option.

Using string or map or JSON object as a value for a key.

Dear Team,
Is it possible to assign a string / map / JSON object to a key? The documentation and examples speak of uint64.
If not, is it possible to get an index to the key that uniquely points to the key as a kind of hash / pointer one could use as a key in a map to attach some more metadata to the key that just an integer value available by default?

Thanks for any hints...

add back dot and svg export

I removed the dot and svg export code while making substantial improvements to the build phase.

The previous implementations would only work with the Builder, not the final FST. The new implementations will only work with the final FST (out of the box) and not with the Builder. The reason is that the builder no longer has direct access to everything required. It only has the unfinished nodes (stack) and registry of frequently used nodes. All other nodes that have been flushed out are forgotten by the builder (by design).

Exporting of the final FST to dot/svg will be exposed through the cmd-line tool, and at least initially function much like the debug command.

memory use during building isn't actually bounded

Although the strategy we use allows for bounded memory usage during building, the current implementation doesn't actually achieve this.

I believe the problem is that we have pointers connecting the states, and even after flushing the frozen states to disk, other states still hold pointers to them, preventing them from being GC'd. In practice all we really need to know is the "compiled address" of these nodes once they're serialized.

We should clean this up, rewrite to just keep a stack of nodes like the fst crate, and identify nodes, either by their position in the stack (not yet serialized) or offset in the file (serialized).

Feature Req: []byte or interface{} etc for value

uint64 forces an extra lookup table for multiple valued terms ie word matching multiple meanings/ dictionary entries.

This was mentioned as an upcoming feature in a go conf video on youtube.

Performance issue in FSTIterator.next

Hi,

I'm using bleve full text search, which uses couchbase as backend. Sometimes I observe high CPU consumption taking several minutes, sometimes even golang garbage collector cannot operate. Tracking the issue down pointed to FSTIterator.next method.

There are around ~400.000 items in the statesStack in my instance. All represents linear sequence of nodes (NumTransitions() == 1 for whole list). There was an optimization made by @steveyen in 462e86d which cuts linear sequence of nodes and spares OUTER loop. But it does not do so when all nodes are in linear sequence. Here is the code:

		// if the top of the stack represents a linear chain of states
		// (i.e., a suffix of nodes linked by single transitions),
		// then optimize by popping the suffix in one shot without
		// going back all the way to the OUTER loop
		var popNum int
		for j := len(i.statesStack) - 1; j > 0; j-- {
			if i.statesStack[j].NumTransitions() != 1 {
				popNum = len(i.statesStack) - 1 - j
				break
			}
		}
		if popNum < 1 { // always pop at least 1 entry from the stacks
			popNum = 1
		}

When all nodes are in linear sequence, popNum remains zero and is set to 1 after the cycle. In my opinion the condition is there for the case the stack top would have more transitions, but when all have 1, it causes looping.

Is it a bug in optimization or is it by intent?

Thanks for reviewing

add unsorted options to cmd-line tool

We now have a PR which adds the merge functionality. The cmd-line utilities should now gain the ability to work on unsorted input. The approach is to process chunks of data at a time, sort it, produce FST, then merged these smaller FSTs into larger ones, until you're done.

case insensitive match not working

First reported to me by @aviadl, it appears that case-insensitive matching was not working.

We should be able to write regexes like (?i)marty and match marty, MARTY and mArTy among others.

When I first reviewed the code (and the rust version it was based on) it appeared that this was intended to work. See: https://github.com/couchbase/vellum/blob/master/regexp/compile.go#L71

But, looking at the impl it was clearly wrong. It detected that we wanted case-insensitive match, but proceeded to build a state machine with the single case. The rust version called a separate case_fold method on the character class, which added the other cases. Ours didn't have that method, and also seemed not be correctly ranging over all runes when case-insensitive was requested.

I have a potential fix coming, once I remember how gerrit works.

Panic seen with FST's Reader

Following panic observed with FST Reader:

panic: runtime error: index out of range
--
 
goroutine 13541 [running]:
panic(0xcc3ca0, 0x142f580)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/panic.go:540 +0x45e fp=0xc42ec19310 sp=0xc42ec19268 pc=0x42d32e
runtime.panicindex()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/panic.go:28 +0x5e fp=0xc42ec19330 sp=0xc42ec19310 pc=0x42be2e
github.com/couchbase/vellum.(*fstStateV1).atMulti(0xc42b4997c8, 0x7fc43c792082, 0x66d7a, 0x66e11, 0x4db45, 0x0, 0x0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/couchbase/vellum/decoder_v1.go:191 +0x153 fp=0xc42ec19340 sp=0xc42ec19330 pc=0x580bd3
github.com/couchbase/vellum.(*fstStateV1).at(0xc42b4997c8, 0x7fc43c792082, 0x66d7a, 0x66e11, 0x4db45, 0x14, 0x14)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/couchbase/vellum/decoder_v1.go:123 +0x1f1 fp=0xc42ec193b8 sp=0xc42ec19340 pc=0x580731
github.com/couchbase/vellum.(*decoderV1).stateAt(0xc4336cd1d0, 0x4db45, 0x1266ac0, 0xc42b4997c8, 0x295e, 0xc42b4997c8, 0x0, 0x0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/couchbase/vellum/decoder_v1.go:67 +0xb4 fp=0xc42ec19408 sp=0xc42ec193b8 pc=0x580474
github.com/couchbase/vellum.(*FST).get(0xc424974e60, 0xc422ffe090, 0x9, 0x10, 0x1266ac0, 0xc42b4997c8, 0xd14fc, 0xc4336cd140, 0x0, 0x1)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/couchbase/vellum/fst.go:83 +0x12c fp=0xc42ec19480 sp=0xc42ec19408 pc=0x583a3c
github.com/couchbase/vellum.(*Reader).Get(0xc42b4997c0, 0xc422ffe090, 0x9, 0x10, 0xc472921880, 0xc420c89678, 0x0, 0x0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/couchbase/vellum/fst.go:253 +0x65 fp=0xc42ec194e0 sp=0xc42ec19480 pc=0x584655
github.com/blevesearch/bleve/index/scorch/segment/zap.(*Dictionary).postingsList(0xc4336cd1a0, 0xc422ffe090, 0x9, 0x10, 0x0, 0x0, 0xc472921880, 0x0, 0x0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/index/scorch/segment/zap/dict.go:57 +0xee fp=0xc42ec19540 sp=0xc42ec194e0 pc=0x5a64de
github.com/blevesearch/bleve/index/scorch/segment/zap.(*Dictionary).PostingsList(0xc4336cd1a0, 0xc422ffe090, 0x9, 0x10, 0x0, 0x0, 0x0, 0xc472921880, 0xc472000101, 0xc420c89600, ...)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/index/scorch/segment/zap/dict.go:46 +0x87 fp=0xc42ec19598 sp=0xc42ec19540 pc=0x5a6387
github.com/blevesearch/bleve/index/scorch.(*SegmentDictionarySnapshot).PostingsList(0xc431e37a20, 0xc422ffe090, 0x9, 0x10, 0x0, 0x0, 0x0, 0x1260cc0, 0xc472921880, 0x0, ...)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/index/scorch/snapshot_segment.go:40 +0x7f fp=0xc42ec19600 sp=0xc42ec19598 pc=0x6089ff
github.com/blevesearch/bleve/index/scorch.(*IndexSnapshot).TermFieldReader(0xc420476240, 0xc422ffe090, 0x9, 0x10, 0xc4320200b8, 0x4, 0x101, 0x40a2, 0xc433f10000, 0xc445b86770, ...)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/index/scorch/snapshot_index.go:423 +0x295 fp=0xc42ec196e0 sp=0xc42ec19600 pc=0x603dc5
github.com/blevesearch/bleve/search/searcher.NewTermSearcher(0x126b160, 0xc420476240, 0xc4320200d0, 0x9, 0xc4320200b8, 0x4, 0x3ff0000000000000, 0x0, 0xc44ad76000, 0x0, ...)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/search/searcher/search_term.go:41 +0xbf fp=0xc42ec197c0 sp=0xc42ec196e0 pc=0x9de25f
github.com/blevesearch/bleve/search/query.(*TermQuery).Searcher(0xc45fe92150, 0x126b160, 0xc420476240, 0x12666a0, 0xc420753c80, 0x0, 0x12689a0, 0xc44ad76000, 0x0, 0x0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/search/query/term.go:60 +0x99 fp=0xc42ec19828 sp=0xc42ec197c0 pc=0x9f4d09
github.com/blevesearch/bleve/search/query.(*ConjunctionQuery).Searcher(0xc45fe92090, 0x126b160, 0xc420476240, 0x12666a0, 0xc420753c80, 0xc422610000, 0x125e440, 0xc4949e3e40, 0xc422a8dd50, 0x0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/search/query/conjunction.go:58 +0x135 fp=0xc42ec19900 sp=0xc42ec19828 pc=0x9e3755
github.com/blevesearch/bleve.(*indexImpl).SearchInContext(0xc4203c6230, 0x12631c0, 0xc45fe92390, 0xc4400a2070, 0x0, 0x0, 0x0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/index_impl.go:458 +0x2c2 fp=0xc42ec19e58 sp=0xc42ec19900 pc=0xa01b82
github.com/couchbase/cbft.(*cacheBleveIndex).SearchInContext(0xc45fe92270, 0x12631c0, 0xc45fe92390, 0xc4400a2070, 0xc44d411ac0, 0xc444cd8800, 0x0)
/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/cbft/cache_bleve.go:110 +0x38a fp=0xc42ec19f18 sp=0xc42ec19e58 pc=0xb3f77a
github.com/blevesearch/bleve.MultiSearch.func1(0x126d500, 0xc45fe92270, 0xc4400a2070)
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/index_alias_impl.go:458 +0x152 fp=0xc42ec19fc8 sp=0xc42ec19f18 pc=0xa0a9d2
runtime.goexit()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.9.6/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc42ec19fd0 sp=0xc42ec19fc8 pc=0x45e8b1
created by github.com/blevesearch/bleve.MultiSearch
/home/couchbase/jenkins/workspace/couchbase-server-unix/godeps/src/github.com/blevesearch/bleve/index_alias_impl.go:465 +0x19d

More information here: https://issues.couchbase.com/browse/MB-29763

add MergeBuilder

The MergeBuilder would take n FST instances and a Writer as input, and produce a new FST containing the merged contents. Implementation just has to create iterators on the underlying FSTs, walk them, and insert to the new one in order.

There are more typos

didnt
erorr
exptect
furhter
lexicogrpahic
opertion
purproses
revesed
wouldnt

in the following files:

decoder_v1_test.go
docs/format.md
encoder_v1.go
encoder_v1_test.go
fst.go
fst_iterator.go
fst_iterator_test.go
vellum.go
vellum_test.go

and a missing word in docs/format.md: This *** us to get an early start on I/O

Aside: funnily enough, just yesterday I wrote my own implementation of DAWG/MA-FSA, which allowed me to save over 3 000 000 words with associated part-of-speech tags to just 1.5Mb on disk (4 bytes per edge). With this implementation (without tuning) I get 14Mb.

panic building FST

Input: first 1000 lines of sorted MacOS dictionary

A Aani Aaron Aaronic Aaronical Aaronite Aaronitic Aaru Ab Ababdeh Ababua Abadite Abama Abanic Abantes Abarambo Abaris Abasgi Abassin Abatua Abba Abbadide Abbasside Abbie Abby Abderian Abderite Abdiel Abdominales Abe Abel Abelia Abelian Abelicea Abelite Abelmoschus Abelonian Abencerrages Aberdeen Aberdonian Aberia Abhorson Abie Abies Abietineae Abiezer Abigail Abipon Abitibi Abkhas Abkhasian Ablepharus Abnaki Abner Abo Abobra Abongo Abraham Abrahamic Abrahamidae Abrahamite Abrahamitic Abram Abramis Abranchiata Abrocoma Abroma Abronia Abrus Absalom Absaroka Absi Absyrtus Abu Abundantia Abuta Abutilon Abyssinian Acacia Acacian Academic Academus Acadia Acadian Acadie Acaena Acalepha Acalephae Acalypha Acalypterae Acalyptrata Acalyptratae Acamar Acanthaceae Acantharia Acanthia Acanthocephala Acanthocephali Acanthocereus Acanthodea Acanthodei Acanthodes Acanthodidae Acanthodii Acanthodini Acantholimon Acanthomeridae Acanthopanax Acanthophis Acanthopteri Acanthopterygii Acanthuridae Acanthurus Acarapis Acarida Acaridea Acarina Acarus Acastus Accipiter Accipitres Aceldama Acemetae Acemetic Acephala Acephali Acephalina Acephalite Acer Aceraceae Acerae Acerata Acerates Aceratherium Acerbas Acestes Acetabularia Acetobacter Achaean Achaemenian Achaemenid Achaemenidae Achaemenidian Achaenodon Achaeta Achagua Achakzai Achamoth Achango Achariaceae Achariaceous Achates Achatina Achatinella Achatinidae Achen Achernar Acheronian Acherontic Acherontical Achetidae Acheulean Achillea Achillean Achilleid Achillize Achimenes Achinese Achitophel Achlamydeae Achmetha Acholoe Achomawi Achordata Achorion Achras Achroanthes Achromatiaceae Achromatium Achromobacter Achromobacterieae Achuas Achyranthes Achyrodes Acidanthera Acidaspis Acieral Acilius Acineta Acinetae Acinetaria Acinetina Acipenser Acipenseres Acipenseridae Acipenseroidei Acis Aclemon Acmaea Acmaeidae Acmispon Acnida Acocanthera Acoela Acoelomata Acoelomi Acoemetae Acoemeti Acoemetic Acolapissa Acolhua Acolhuan Acoma Aconitum Acontias Acontius Acorus Acousticon Acrab Acraeinae Acrania Acrasiaceae Acrasiales Acrasida Acrasieae Acraspeda Acredula Acrididae Acridiidae Acridium Acrisius Acrita Acroa Acrobates Acrocarpi Acrocera Acroceratidae Acroceraunian Acroceridae Acrochordidae Acrochordinae Acroclinium Acrocomia Acrodus Acrogynae Acromyodi Acronycta Acropora Acrosticheae Acrostichum Acrothoracica Acrotreta Acrotretidae Acrux Acrydium Actaea Actaeaceae Actaeon Actaeonidae Actiad Actian Actinia Actiniaria Actinidia Actinidiaceae Actiniomorpha Actinistia Actinobacillus Actinocrinidae Actinocrinus Actinoida Actinoidea Actinomyces Actinomycetaceae Actinomycetales Actinomyxidia Actinomyxidiida Actinonema Actinophrys Actinopoda Actinopteri Actinopterygii Actinosphaerium Actinozoa Actipylea Actium Acts Acuan Acubens Aculeata Acutilinguae Ada Adad Adai Adaize Adam Adamastor Adamhood Adamic Adamical Adamically Adamite Adamitic Adamitical Adamitism Adamsia Adansonia Adapa Adapis Adar Adda Addie Addisonian Addisoniana Addressograph Addu Addy Ade Adela Adelaide Adelarthra Adelarthrosomata Adelbert Adelea Adeleidae Adelges Adelia Adelina Adeline Adeliza Adelochorda Adelops Adelphi Adelphian Adelphoi Adenanthera Adenophora Adenostoma Adeodatus Adeona Adephaga Adessenarian Adhafera Adhara Adiantum Adib Adicea Adiel Adigei Adighe Adigranth Adin Adinida Adirondack Adlai Adlumia Adolph Adolphus Adonai Adonean Adonia Adoniad Adonian Adonic Adoniram Adonis Adorantes Adoretus Adoxa Adoxaceae Adramelech Adrammelech Adrenalin Adrian Adriana Adriatic Adrienne Adullam Adullamite Advaita Advent Adventism Adventist Aeacides Aeacus Aeaean Aechmophorus Aecidiaceae Aecidiomycetes Aedes Aegean Aegeriidae Aegialitis Aegina Aeginetan Aeginetic Aegipan Aegisthus Aegithalos Aegithognathae Aegle Aegopodium Aeluroidea Aeolia Aeolian Aeolic Aeolicism Aeolidae Aeolididae Aeolis Aeolism Aeolist Aepyceros Aepyornis Aepyornithidae Aepyornithiformes Aequi Aequian Aequiculi Aequipalpia Aerides Aerobacter Aerobranchia Aerocharidae Aerope Aerosol Aeschylean Aeschynanthus Aeschynomene Aesculaceae Aesculapian Aesculapius Aesculus Aesopian Aesopic Aestii Aethionema Aethusa Aetian Aetobatidae Aetobatus Aetolian Aetomorphae Aetosaurus Afar Afenil Afghan Afifi Aframerican Afrasia Afrasian Afric African Africana Africanism Africanist Africanization Africanize Africanoid Africanthropus Afridi Afrikaans Afrikander Afrikanderdom Afrikanderism Afrikaner Afrogaea Afrogaean Afshah Afshar Aftonian Afzelia Agaces Agade Agag Agalena Agalenidae Agalinis Agama Agamae Agamemnon Agamidae Aganice Aganippe Agao Agaonidae Agapanthus Agapemone Agapemonian Agapemonist Agapemonite Agapetidae Agapornis Agaricales Agaricus Agaristidae Agarum Agastache Agastreae Agatha Agathaea Agathaumas Agathis Agathosma Agau Agave Agawam Agaz Agdistis Agelacrinites Agelacrinitidae Agelaius Agelaus Agena Ageratum Aggie Aggregata Aggregatae Aghan Aghlabite Aghorapanthi Aghori Agialid Agib Agiel Agkistrodon Aglaia Aglaonema Aglaos Aglaspis Aglauros Aglipayan Aglipayano Aglossa Aglypha Aglyphodonta Aglyphodontia Agnatha Agnathostomata Agnes Agnoetae Agnoete Agnoetism Agnoite Agnostus Agnotozoic Agoniatites Agonista Agonostomus Agra Agrania Agrapha Agrauleum Agrilus Agrimonia Agriochoeridae Agriochoerus Agrionia Agrionidae Agriotes Agriotypidae Agriotypus Agromyza Agromyzidae Agropyron Agrostemma Agrostis Agrotis Aguacateca Agudist Agyieus Ah Ahantchuyuk Ahepatokla Ahet Ahir Ahmadiya Ahmed Ahmet Ahnfeltia Ahom Ahousaht Ahrendahronon Ahriman Ahrimanian Aht Ahtena Aias Aiawong Aidenn Aides Aigialosauridae Aigialosaurus Ailanthus Aileen Ailie Ailuridae Ailuroidea Ailuropoda Ailuropus Ailurus Aimak Aimee Aimore Ainu Aira Airedale Aissaoua Aissor Aistopoda Aistopodes Aitkenite Aitutakian Aix Aizoaceae Aizoon Ajaja Ajatasatru Ajuga Aka Akal Akali Akamnik Akan Akanekunik Akania Akaniaceae Akawai Akebia Akha Akhissar Akhlame Akhmimic Akim Akiskemikinik Akiyenik Akka Akkad Akkadian Akkadist Akontae Akoulalion Akra Akrabattine Aktistetae Aktistete Aktivismus Aktivist Akwapim Al Alabama Alabaman Alabamian Alactaga Aladdin Aladdinize Aladfar Aladinist Alain Alaki Alala Alamanni Alamannian Alamannic Alan Alangiaceae Alangium Alans Alarbus Alaria Alaric Alarodian Alascan Alaska Alaskan Alastair Alaster Alauda Alaudidae Alaunian Alawi Alb Albainn Alban Albanenses Albanensian Albania Albanian Albany Albatros Alberene Albert Alberta Albertina Albertine Albertinian Albertist Alberto Albi Albian Albigenses Albigensian Albigensianism Albin Albion Albireo Albizzia Albococcus Alboin Albrecht Albright Albruna Albuca Albuginaceae Albyn Alca Alcaaba Alcae Alcaic Alcaligenes Alcalzar Alcantara Alcantarines Alcedines Alcedinidae Alcedininae Alcedo Alcelaphus Alces Alchemilla Alchornea Alcibiadean Alcicornium Alcidae Alcippe Alcor Alcoran Alcoranic Alcoranist Alcotate Alcuinian Alcyonacea Alcyonaria Alcyone Alcyones Alcyoniaceae Alcyonium Aldebaran Alderamin Alderney Aldhafara Aldhafera Aldine Aldrovanda Aldus Alea Alebion Aleck Alectoria Alectorides Alectoris Alectoromorphae Alectoropodes Alectrion Alectrionidae Alectryon Alejandro Alemanni Alemannian Alemannic Alemannish Alemite Alencon Aleochara Aleppine Aleppo Alethea Aletris Aleurites Aleurobius Aleurodes Aleurodidae Aleut Aleutian Aleutic Alex Alexander Alexandra Alexandreid Alexandrian Alexandrianism Alexandrina Alexandrine Alexas Alexia Alexian Alexis Alexius Aleyrodes Aleyrodidae Alf Alfirk Alfred Alfreda Alfur Alfurese Alfuro Algaroth Algarsife Algarsyf Algebar Algedi Algenib Algerian Algerine Algernon Algic Algieba Algol Algoman Algomian Algomic Algonkian Algonquian Algonquin Algorab Algores Algy Alhagi Alhambra Alhambraic Alhambresque Alhena Alibamu Alicant Alice Alichino Alicia Alick Alida Alids Alikuluf Alikulufan Aline Alioth Alisma Alismaceae Alismales Alismataceae Alison Alister Alix Alkaid Alkalurops Alkanna Alkaphrah Alkes Alkoran Alkoranic Allah Allamanda Allan Allantoidea Allasch Alle Alleghenian Allegheny Allen Allentiac Allentiacan Allhallow Allhallowtide Alliaceae Alliaria Allie Allies Allionia Allioniaceae Allium Allobroges Allophylus Allosaurus Allotheria Allotriognathi Allworthy Ally Alma Almach Almain Alman Almerian Almida Almira Almohad Almohade Almohades Almon Almoravid Almoravide Almoravides Almuredin Alnaschar Alnascharism Alnilam Alnitak Alnitham Alnus Aloadae Alocasia Alogian Alois Alonso Alonsoa Alonzo Alopecias Alopecurus Alopias Alopiidae Alosa Alouatta Aloxite Aloysia Aloysius Alpax Alpen Alphard Alphean Alphecca Alpheratz Alphonist Alphonse Alphonsine Alphonsism Alphonso Alpian Alpid Alpine Alpinia Alpiniaceae Alpinism Alpinist Alpujarra Alsatia Alsatian Alshain Alsinaceae Alsine Alsophila Alstonia Alstroemeria Altaian Altaic Altaid Altair Altamira Alternanthera Alternaria Althaea Althea Altica Alticamelus Altingiaceae Aluco Aluconidae Aluconinae Aludra Alulim Alumel Alumnol Alundum Alur Alvah Alvan Alveolites Alvin Alvina Alvissmal Alya Alyssum Alytes Amabel Amadi Amadis Amaethon Amafingo Amahuaca Amakosa Amalfian Amalfitan Amalings Amalrician Amampondo Amanda Amandus Amanist Amanita Amanitopsis Amapondo Amara Amarantaceae

Output:

 vellum set --sorted 1000.txt 1000.fst
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x10c7906]

goroutine 1 [running]:
github.com/couchbaselabs/vellum.(*Builder).optimize(0xc42006a180, 0x0, 0x6, 0xff7)
	/Users/mschoch/go/src/github.com/couchbaselabs/vellum/builder.go:165 +0x26
github.com/couchbaselabs/vellum.(*Builder).Insert(0xc42006a180, 0xc4200c1009, 0x7, 0xff7, 0x0, 0x0, 0x0)
	/Users/mschoch/go/src/github.com/couchbaselabs/vellum/builder.go:83 +0x141
github.com/couchbaselabs/vellum/cmd/vellum/cmd.glob..func8(0x1266700, 0xc420016720, 0x2, 0x3, 0x0, 0x0)
	/Users/mschoch/go/src/github.com/couchbaselabs/vellum/cmd/vellum/cmd/set.go:66 +0x22a
github.com/couchbaselabs/vellum/vendor/github.com/spf13/cobra.(*Command).execute(0x1266700, 0xc420016690, 0x3, 0x3, 0x1266700, 0xc420016690)
	/Users/mschoch/go/src/github.com/couchbaselabs/vellum/vendor/github.com/spf13/cobra/command.go:644 +0x3ef
github.com/couchbaselabs/vellum/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x12664e0, 0xc42006a058, 0x0, 0x1137c8b)
	/Users/mschoch/go/src/github.com/couchbaselabs/vellum/vendor/github.com/spf13/cobra/command.go:734 +0x339
github.com/couchbaselabs/vellum/vendor/github.com/spf13/cobra.(*Command).Execute(0x12664e0, 0x6, 0xc420047f00)
	/Users/mschoch/go/src/github.com/couchbaselabs/vellum/vendor/github.com/spf13/cobra/command.go:693 +0x2b
github.com/couchbaselabs/vellum/cmd/vellum/cmd.Execute()
	/Users/mschoch/go/src/github.com/couchbaselabs/vellum/cmd/vellum/cmd/root.go:36 +0x31
main.main()
	/Users/mschoch/go/src/github.com/couchbaselabs/vellum/cmd/vellum/main.go:20 +0x20

Performance impact of Pooling

Hey! A little quick background: I'm using Vellum in a timeseries database to reverse index timeseries by tags. We use a combination of map backed segments and those using FSTs (using Vellum), periodically compacting map backed segments to FSTs.

I was doing some performance work, because we were seeing about a 20% spike during the period of FST building v steady state. I noticed lots of alloc/inuse space in func (p *builderNodePool) alloc() *builderNode and func (p *transitionPool) alloc() *transition. So I reverted 5083a46 and 5c7eccd and it helped quite a bit. The delta spikes we saw were now around ~8-10%.

Couple of questions:

  • What was the motivation for those commits? Was their a workload they helped a lot?
  • Any other advise about tuning Vellum to have more bounded memory usage?

fuzz testing?

Seems a reasonable candidate. Two main challenges are:

a) at the build state, keys must be inserted in order, so using fuzz data as input isn't straightfoward
b) at runtime we don't do much/any bounds checking on the FST, so we surely panic on this today

Regarding challenge b, see #1 which might offer a safe-mode. In this mode we should return error and not panic.

Merge performance improvements into upstream

Hello!

My team has been maintaining a forked branch of vellum with some performance pooling improvements. These changes have significantly reduced memory allocations and memory consumption of the FST building process in our timeseries database.

I know a previous teammate of mine (who is unfortunately on leave of absence) had begun contributing some code back to upstream. Are you open to me opening P.Rs to contribute the remaining performance improvements back? We'd love to not maintain our own fork, as well as contribute back to the library that saved us a ton of work :)

Thanks again for open sourcing such a great library.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.