
ropey's Issues

Leaf node as Gap buffer (Just Idea)

I thought of an idea while participating in an issue to make emacs-ng multi-threaded.

What do you think about using each leaf node as a gap buffer?
I think that simple inserts and deletes that do not cause grow operations can be faster.
(If there is a need to change the structure of the gap buffer, the existing rope is used.)

Emacs uses about 20 bytes as the size of the gap.
https://github.com/emacs-mirror/emacs/blob/55e77a811a6f137b02eb786fb00c43f0f14bb028/src/buffer.c#L571

Could it consume too much RAM, or cause thread-safety problems?
An alternative to the memory problem is to keep track of the cursor position, and only use the gap buffer on the leaf node the cursor is on.

Add feature to customize segmentation

Right now Ropey is hard-coded to segment text based on extended grapheme clusters as defined in Unicode Standard Annex 29. However, grapheme clusters are supposed to be customizable to some extent, based on application. Annex 29 itself even mentions this. Moreover, some applications may want to segment based on something other than graphemes, or even not segment at all for even better performance.

Ropey should have a way to customize this behavior, while still having sane defaults for the large majority of applications.

Design

The tentative design for this feature is to have a trait Segmenter that has two methods, is_break() and seam_is_break():

trait Segmenter {
    fn is_break(byte_idx: usize, text: &str) -> bool;
    fn seam_is_break(left: &str, right: &str) -> bool;
}

These methods return whether or not a point in a text is a valid break between segments. Ropey will then use them to determine where it can and cannot split leaf nodes. is_break() will have a default (but slightly inefficient) implementation that calls out to seam_is_break().

The Rope type will then take as a type parameter any type that implements Segmenter, like so:

struct Rope<T: Segmenter> {
    // ...
    _seg: PhantomData<T>,
}

The Nodes of a Rope will also take that same type parameter. Then T::is_break() and T::seam_is_break() will be used throughout the code to determine valid breaks in the text.

Note that neither method of Segmenter takes a self parameter. The trait is just a way to inject the functions into the Rope type without run-time overhead. The Segmenter types themselves will never be instantiated.

Finally, Rope will have a default Segmenter that segments based on extended graphemes, so most users of the library won't need to concern themselves with any of this. Ideally, this whole thing should be invisible to users that don't care about it.

Compatibility Between Ropes

All operations except append() should be compatible between Ropes with different Segmenters. Append can't work (at least not efficiently) because all nodes of the Rope being appended would have to be re-evaluated to conform to the segmentation strategy of the Rope being appended to. But things like PartialEq impls, etc. should work fine across different segmentation strategies.

Open Questions

Right now Ropes have methods for checking and working with grapheme boundaries, and an iterator for iterating over grapheme clusters. These will be easy to port to work via whatever Segmenter is given. However, their names (e.g. is_grapheme_boundary()) only make sense for graphemes.

So... this is a bit bike-sheddy, but what should they be called? My main concern with renaming them is that I don't want them to feel weird or too abstract for the 90+% use-case of just segmenting on graphemes. On the other hand, they shouldn't be outright wrong for other cases either.

At the moment, just substituting "grapheme" with "segment" everywhere seems like the least bad idea. But that may also get a little confusing compared to e.g. the various "chunk" functionality.

Drawbacks

I'm increasingly convinced that abstraction has a cognitive cost for users of a library. Making something simple, straightforward, and purpose-built is usually a good default, and the choice to abstract things should undergo a cost-benefit analysis.

Making this change will make Ropey a little more abstract, and a little less concrete. The naming question above is a hint at that. People browsing the API docs will likely wonder "What is this Segmenter thing, and what is this weird type parameter on Rope and RopeSlice?", and the various methods that depend on Segmenter will be less clearly named for the common case. All of this imposes cognitive load on users of the library.

I still think this is the right choice, because how text is segmented is purpose-specific, and I would prefer for Ropey to not be unusable for anyone just because of its segmentation strategy. Nevertheless, I want to acknowledge that this comes at a cost.

Lower memory overhead

(This is a tracking issue for reducing the memory overhead of Ropey. The stats listed below are updated as improvements are made.)

The memory overhead for creating new ropes from existing text is pretty good (~10%). However, the memory overhead of building up a rope from many small insertions is worse (up to ~55% depending on insertion placement).

Improving this would be great!

The main culprit here is that since leaf nodes generally have a fixed-size text buffer, leaf nodes that aren't maximally filled contain wasted space. Trying to keep leaf nodes close to maximally filled should resolve the memory overhead back down to close to 10%. But how to do that without negatively impacting performance too much?

Reversible Iterators

Ropey's iterators are now bidirectional (see issue #18). However, there are still use-cases where it is useful or necessary to reverse the iterator so that next() moves it backwards instead of forwards. In particular, right now you can't use any of Rust's iterator methods when iterating backwards.

One possibility would be to implement the DoubleEndedIterator trait, which behaves like two independent iterators, one on each end, both moving in opposite directions. By calling rev() you can swap the ends. However, it's not entirely clear what it would mean, then, to create an iterator starting e.g. in the middle of the rope.

Another possibility is to add a method that simply reverses the direction of the iterator in-place. This would side-step any awkwardness with DoubleEndedIterator and I believe would satisfy the large majority of use-cases. Moreover, it would make it obvious how to create a reversed iterator starting in the middle of a Rope. However, it would move Ropey even further away from the way people may initially expect things to work based on iterators in the standard library.

Share immutable chunks externally

Is it possible to get a copy-on-write reference to a chunk? Or alternatively, a rope that guarantees it only ever consists of a single chunk?

The context here is a struct which needs a memory-contiguous string for some operations, but a rope containing that string for other operations. Currently, I am storing both a Rope and a String that have the same content, but this isn't ideal.

still differences in lines() between slice and rope

Thanks for fixing my previous issue. I'm not sure it's exactly correct, however. Now it is different if the document ends with \n. In this case, the rope correctly has an empty last line, but the slice of that rope does not. This is using ropey 0.9.1.

extern crate ropey;

use ropey::Rope;

fn main() {

    let text = "Hello there!\nHow goes it?\n";
    let rope = Rope::from_str(text);

    // should be the same as rope
    let slice = rope.slice(..);

    // but now the slice is missing the empty last line that the rope has
    assert!(slice.lines().count() == rope.lines().count());
}

B Tree-Rope design

I was learning about the Rope data structure when I saw this project.

A normal Rope uses a binary tree, and every node's key is the sum of its left subtree. (According to https://en.wikipedia.org/wiki/Rope_(data_structure))
The Rope's leaf nodes are what actually contain the string, and their key is the length of that string.

I saw in https://github.com/cessen/ropey/blob/master/design/design.md that Ropey uses a B-Tree. I also noticed that in that design paper, it shows that every single Node, not just the leaf nodes, contains a String.

I was kinda confused by this, and was wondering if you could explain how that works. For example, how would searching work? (Like finding what byte is at a specific position.) In a normal B-Tree, to search for a key k, you:

Start at the root
Go through each key in the current node, and then find where k falls in that range of keys.
You then go to the appropriate child, and repeat.

However, I'm pretty sure that only works without duplicates.

How does Ropey's design for that work? Because if the key in Ropey's design is the byte count of its child, then there could be multiple duplicate byte counts.

I'm probably missing something very basic, and it would be great if you could explain this.

Also, I'm kind of curious as to why you did not use a B+ tree. I feel like a B+ tree is more fitting because all of the values are stored in its leaf nodes, like a Rope.

Here is an image that illustrates what I mean: (image omitted)

Buffering writes for performance uplift

I'm not sure where to put this - it's a bit of a stretch to make this an issue. But it might be something you want to add to ropey.

I've been experimenting with pre-write concatenation of edits to see how much of a difference it makes to performance. The use case here is replaying editing histories in a CRDT I'm working on, and outputting the resulting string. (I'm storing the editing history anyway, so if I can do this quickly I can skip also storing the resulting document).

To improve performance I'm essentially wrapping jumprope in a little wrapper structure which greedily concatenates incoming changes when they're adjacent and the same type. So instead of insert 'a' then insert 'b', the changes become insert 'ab'. Obviously this only works when the changes are the same type and adjacent. But the performance figures when using real-world editing histories are pretty unreal. I'm seeing a 3-10x performance improvement from this change.

The automerge-perf dataset shows the biggest improvement, because each edit in that dataset is a single keystroke. This concatenation approach brings the number of "real edits" to the structure down from 260k to 10k, at the expense of a bit of bookkeeping. In aggregate performance improves by 9.5x. I'm now replaying this 260k edit history in 670 microseconds (!), or 390M keystrokes per second. (Up from 40M/sec, which is already obnoxiously fast.)

Dataset          Edit Direct (ms)   Edit Buffered (ms)   Performance multiplier
automerge-perf   6.36               0.67                 9.55x
seph-blog1       3.61               0.97                 3.72x

Anyway, this approach is only useful with specific use cases, and it won't help at all with random editing traces.

It's also entirely rope-agnostic. Ropey would see a similar performance improvement with these data sets.

So yeah, I'm not sure if you care about this, but it's interesting!

Buffered rope wrapper here

getting an extra blank line when iterating over a slice

I'm new to Rust, so sorry if I'm doing something dumb, but I'm seeing a weird bug. When iterating over a slice with lines(), I get an extra empty line. If you make a new program with cargo like this, it fails. However, I added the same test to ropey as a test in the iter.rs file, and it passed. I don't understand how that can be.

Thanks,
Craig

extern crate ropey;

use ropey::Rope;

fn main() {

    let text = "Hello there!\nHow goes it?";
    let rope = Rope::from_str(text);

    // should be the same as rope
    let slice = rope.slice(..);

    // but we get an extra empty line when iterating over the slice
    assert!(slice.lines().count() == rope.lines().count());
}


#[cfg(test)]
mod tests {
    use Rope;

    #[test]
    fn lines_from_slice() {
        
        let text = "Hello there!\nHow goes it?";
        let rope = Rope::from_str(text);

        // should be the same as rope
        let slice = rope.slice(..);

        // but we get an extra empty line when iterating over the slice
        assert!(slice.lines().count() == rope.lines().count());
    }

}

Support converting to and from UTF-16 code unit indices.

Some software and APIs, especially on the Windows platform, still operate in terms of UTF-16 code units when working with text. One example is the Language Server Protocol, which specifies text offsets in UTF-16 code units.

Being able to efficiently interoperate with these APIs is useful, especially for code editors. To better support this use-case, and to generally round-out Ropey's Unicode support, we should provide a way to convert between Unicode scalar value indices (which Ropey uses) and UTF-16 code unit indices.

I think tracking the data for this can be done with relatively little overhead. I'm not as concerned that the actual conversions be super optimized (though they should still be efficient). But I don't want the feature to negatively impact clients that aren't using it.

Update:
This feature is now implemented and in master, and will be in the next release.

Bidirectional iterators

Some client code may want to iterate over e.g. chunks in both directions. For example, DFA-based regex engines need to scan backwards after finding a match to calculate where the match begins.

The tricky bit isn't technical, however. It's having a standard API to interoperate with other libraries. Unfortunately, the Rust standard library doesn't provide a bidirectional iterator trait, so this probably isn't useful to implement yet: any API I come up with will be specific to Ropey, and therefore wouldn't interoperate with other libraries anyway.

I'm creating this issue as a reminder to add this functionality if/when a bidirectional iterator trait (or equivalent) is added.

Large insertions are comparatively slow

Inserting a large amount of text with a single insert is somewhat slow right now, because it is special-cased to the construction of a new rope, and then a split and two appends. It's still O(log n) performance, but the constant is much higher. It's likely that there is a faster way to do this.

Insert to another rope

I know that you can combine/insert a string into a rope in O(log n) time, but I wonder if it is possible to reuse the information from another rope, rather than turning it into a string and then inserting it back, i.e. a tree merge algorithm is probably needed.

Related: #4

This is needed for a preprocessor algorithm, so that we could map the parent node of any rope to a specific pointer (such as a file) for better diagnostics when the parser goes wrong. Another feature needed is a splice algorithm, which requires this, but let's leave that for another time.

Async IO support for `Rope`?

I've been trying to read from/write to files from a Rope. The documentation suggests this:

let mut text = Rope::from_reader(
    BufReader::new(File::open("my_great_book.txt")?)
)?;

This works well for synchronous programs, but I was unable to find any asynchronous counterpart to Rope::from_reader. I tried to use tokio::io::BufReader, but since that doesn't implement std::io::Read (instead tokio::io::AsyncRead) it isn't compatible with from_reader.

I'm wondering if async support is a possibility, for example something like Rope::from_reader_async behind a feature flag.

Thanks!

Typo in node size calculations

Is it perhaps an error that this line is
pub(crate) const MAX_BYTES: usize = TARGET_NODE_SIZE - 1 - (PTR_SIZE * 2);
rather than
pub(crate) const MAX_BYTES: usize = TARGET_NODE_SIZE - size_of::<usize>()*2;
or
pub(crate) const MAX_BYTES: usize = 1024 - 1 - size_of::<usize>() - (PTR_SIZE * 2);
as you have already subtracted the Arc counters in TARGET_NODE_SIZE, and the layout is actually
[ smallvec enum tag ][ smallvec capacity ][ BackingArray ]
except when the "union" feature is enabled.

I found that this allows the correct maximum of 1024-16 = 1008 bytes to be achieved on my 64-bit system.

`ropey` doesn't agree with `wc`

I found it confusing how ropey's line count doesn't agree with BSD or GNU wc's line count.

I wrote a simple reproduction of this difference here: https://github.com/evanrelf/ropey-confusion

The TL;DR is that ropey considers an empty file (e.g. created with touch) to have one line, whereas wc considers it to have zero lines.

I don't necessarily think there's anything wrong with this behavior. But I found it confusing and I thought it might be worth mentioning in the documentation.

Other than this little thing, ropey has been a joy to use while working on building my own text editor. Thanks for making such a great library! 🙂

Specify the lifetime on RopeSlice::get_chars_at() and similar functions as it is specified for get_chars()

Both the get_chars() and get_chars_at() methods on RopeSlice<'a> return a value of type Chars, however, the get_chars() version returns a Chars<'a> with a lifetime matching the RopeSlice<'a>, while get_chars_at()'s return value uses an anonymous lifetime, Chars<'_>, which is limited to the scope in which the method was called.

I encountered this while trying to write a previous-word implementation, under some constraints. I found myself needing to write a function that took a slice and the cursor position, and returned an iterator over the characters in the slice traveling towards the start of the slice. I ended up with some functions that looked a bit like the following:

    fn iter_slice_rev<'a>(text: RopeSlice<'a>) -> Chars<'a> {
        let mut it = text.chars_at(text.len_chars());
        it.reverse();
        it
    }

    fn iter_slice_forward<'a>(text: RopeSlice<'a>) -> Chars<'a> {
        let it = text.chars();
        it
    }

The iter_slice_forward() function will compile, while the iter_slice_rev() function has borrow checker errors because the Chars instance it returns has its anonymous lifetime set to be the same lifetime as the borrow in the function which starts when chars_at() is called, and thus cannot outlive the function.

Similar function attempts also fail with lifetime errors:

    fn iter_slice_rev2<'a, 'b>(text: RopeSlice<'a>) -> Chars<'b> {
        let mut it = text.chars_at(text.len_chars());
        it.reverse();
        it
    }

    fn iter_slice_rev3<'a, 'b>(text: &RopeSlice<'a>) -> Chars<'b> {
        let mut it = text.chars_at(text.len_chars());
        it.reverse();
        it
    }

    fn iter_slice_rev4(text: RopeSlice<'_>) -> Chars<'_> {
        let mut it = text.chars_at(text.len_chars());
        it.reverse();
        it
    }

I attempted to see if there were any problems with simply adjusting the signature of the functions to return the more general lifetime, changing pub fn get_chars_at(&self, char_idx: usize) -> Option<Chars> to pub fn get_chars_at(&self, char_idx: usize) -> Option<Chars<'a>> in slice.rs on line 1261, and making a similar change for get_chars(). It did not cause any issues with the test suite.

I do not believe that such a change would be a breaking change, since the lifetime is only being expanded.

A similar difference exists for chars_at(), bytes_at(), lines_at(), and chunks_at_line_break(), vs. chars(), bytes(), lines(), and all other chunks_at_*() methods.

The functions I have noticed that use the anonymous lifetime and have counterparts that use the slice's lifetime are:

  • get_chunks_at_line_break()
  • get_lines_at()
  • get_chars_at()
  • get_bytes_at()
  • chunks_at_line_break()
  • lines_at()
  • chars_at()
  • bytes_at()

The differences among the chunks_at_*() methods seem to indicate that the use of the anonymous lifetime for the former group is accidental and not a specific API choice. I am not entirely certain that such a guess would be correct however.

If this is indeed not intended, and it would be an acceptable change, then I would like to ask for the lifetime annotations to be added.

If it is acceptable and you would prefer it, I could put together a pull request for the change.

Document Ropey's design

The code base may be a little tricky to understand for newcomers at the moment, because of how much fiddly stuff is going on. To make it easier for people to contribute, Ropey's design should be documented.

Suggestions for tracking marked ranges

I'm wondering if you have any suggestion for efficiently tracking marked ranges? Issue #13 touched on this but was closed before you had a chance to comment.

I agree with issue #13, it can (and probably should) be done in an external structure. But an efficient implementation is non-trivial. I'm thinking of something like superstring's marker-index (https://github.com/atom/superstring) as a good example of what is needed. It would be great to have this structure as an option to go with ropey. Or at least to have some suggestions in the documentation (maybe another crate) on how to handle the use case of tracking ranges in a ropey buffer.

Performance issue with redundant operations

fn next(&mut self) -> Option<(usize, usize)> {

For the next() function here, changing the linear search to the Boyer-Moore string search algorithm lets it skip some unnecessary char comparisons, reducing memory operations and improving performance.

According to my test it will get a 96.5% speedup (12.2s to 6.22s).

Here's the modified code:

fn next(&mut self) -> Option<(usize, usize)> {
        let search_char = self.search_pattern.chars();
        while let Some(next_char) = self.char_iter.next() {//char_iter redundancy
            self.cur_index += 1;
            if let Some(last_char) = self.char_iter.nth(self.search_pattern_char_len - 2){
                if last_char != search_char.clone().last().unwrap(){
                    let n_place = self.search_pattern.rfind(last_char);
                    if n_place == None{// if the last char is not matched and is not in the search_pattern, then we can skip the whole length
                        // self.possible_matches.clear();
                        for n in 0..(self.search_pattern_char_len - 1){
                            self.char_iter.next();
                            self.cur_index += 1;
                        }
                        continue;
                    }else{// if the last char is in the search_pattern, we can skip some of the letters.
                        for n in 0..(self.search_pattern_char_len - n_place.unwrap() - 1){
                            self.char_iter.next();
                            self.cur_index += 1;
                        }
                        continue;
                    }
                }
                // It is matched with the last char, so we start to compare them
                // first match the first char, if not match, go to next loop

                if next_char != search_char.clone().next().unwrap(){
                    continue;
                }
                let mut if_match = true;
                for i in 0..(self.search_pattern_char_len - 3){
                    let next_last_char = self.char_iter.nth(self.search_pattern_char_len - 3 - i);
                    if next_last_char != search_char.clone().nth(self.search_pattern_char_len - 2 - i){
                            // self.char_iter.next();
                            // self.cur_index += 1;
                            if_match = false;
                            break;
                    }
                }
                if if_match{
                    let char_match_range = (
                        self.cur_index - 1,
                        self.cur_index + self.search_pattern_char_len - 1,
                    );
                    return Some(char_match_range);
                }else{
                    continue;
                }
            } else {
                break;
            }

        }

        None
    }

Hope this information helps!

Roadmap to 1.0

I plan for Ropey to reach 1.0 at some point. I don't want it to be a library that is perpetually stuck in beta. This is the list of things that need to happen before I declare 1.0.

  • Remove Grapheme support. After working with Ropey for a bit, it seems clear that this should be handled on top of Ropey, not inside it.
  • Verify that grapheme support (iterators, finding prev/next grapheme boundary, etc.) can be implemented sanely and efficiently on top of Ropey. Probably do this as an example in the repo.
  • Decide on conventions for converting between line endings and chars/bytes (see issue #11).
  • Implement non-trivial text editor functionality using Ropey as the backing text buffer, to validate Ropey's design and APIs:
    • Text search (done: see the search-and-replace example in the repo)
    • Syntax highlighting (done: see Smith editor)
    • Word wrapping (done: see Led editor)
    • Basic undo/redo (done: see Led editor)
    • Loading/saving text files with non-utf8 text encodings (done: see the latin1 example in the repo)

Non-panicking versions of functions

I'd like to know if you would accept a PR adding a feature-gated api extension which includes indexing functions that return Option<A> instead of panicking. Thanks!

attempt to compute `0_usize - 1_usize`, which would overflow

Hello, while trying to compile some cargo package (zee) which has ropey as a dependency, I ran into this error:
attempt to compute 0_usize - 1_usize, which would overflow

Viewing the source, the comment above this declaration specifies this (screenshot omitted).

I am unsure whether this is an error on my side, relevant info:
rustc --version: rustc 1.65.0 (897e37553 2022-11-02)
rustup toolchain list:
stable-x86_64-unknown-linux-musl (default)
nightly-x86_64-unknown-linux-musl
Operating system: Alpine linux edge x86_64

`clone_slice(range: Range) -> Rope`?

For implementing copy-paste, it is useful to take part of the Rope, as a new Rope. Right now I will just have to clone() and remove prefix and suffix, which is not too bad, but not ideal. :)

Behaviour of "char_to_line" at index one-past-the-end

Firstly, thanks for the excellent library! I have run into one bit of confusing behaviour though, and I couldn't see an issue about it so I thought I'd raise it as a question.

Quoting the documentation:

If char_idx is one-past-the-end, then one-past-the-end line index is returned. This is mainly unintuitive for empty ropes, which will return a line index of 1 for a char_idx of zero. Otherwise it behaves as expected.

https://docs.rs/ropey/0.6.3/ropey/struct.Rope.html#method.char_to_line

The code works as described, but the behaviour seems strange to me, whether the rope is empty or not. It is causing me to add quite a few special cases to my code. Why does it not return the index of the last line, rather than a hypothetical new line?

Best way to go from char_idx to (line_idx, column_idx)

I have a char_index and I would like to convert to a (line, column) point.

Untested but right now I think something like this will work:

fn char_to_point(char_idx: usize, rope: &Rope) -> (usize, usize) {
  let line_idx = rope.char_to_line(char_idx);
  let line_char_index = rope.line_to_char(line_idx);
  (line_idx, char_idx - line_char_index)
}

I wonder if there's a more efficient approach though... it seems that internally char_to_line probably already has easy/efficient access to the column offset information. Could it (or maybe some other method) just return the line and offset, so that the second call to line_to_char isn't necessary?

Incoming breakage due to static assert on size

rust-lang/rust#94075 optimizes repr(Rust) enums better, making them smaller in many cases. This breaks ropey (see https://crater-reports.s3.amazonaws.com/pr-94075/try%23419e70b02dc61a51433bb55fe5a482c6ab6f7da5/gh/cessen.led/log.txt for an example). I'm not sure when this change will be merged and hit stable, but relying on the exact size of repr(Rust) types is suboptimal. The easy fix is to use repr(C) for the affected enums, which will disable any such optimizations.

If you have opinions on this topic, please don't hesitate to let me know here or post on the PR.

When to use from_reader() vs RopeBuilder

let file = File::open(&path).context(format!("unable to open {:?}", path))?;   
Rope::from_reader(BufReader::new(file))?                                      

Should we wrap the reader in a BufReader for from_reader, or does it already buffer internally? If it does, maybe the docs should mention that BufReader isn't needed. Looking at the code, it seemed like it does some buffering of its own. I just pondered this while working on helix.

Question about data sharing of rope clones

I read ropey's design documentation but still can't quite get it. When cloning ropes the actual content is shared between them. If we then continue to insert text into one of the clones, the other doesn't change. But we are tracking information about the text in the internal nodes, so this means that at least all of the internal nodes from the root to the insert position must change to reflect the edit.
So are we actually cloning (or duplicating) all of the internal nodes from the root to the insert position in this case?
What is actually cloned in this case?

Add ability to insert rope slices

Right now you can only insert &str slices into ropes. However, in an editor it may be useful to insert text from other ropes (e.g. for copy-pasting a large amount of text). Adding the ability to insert rope slices would allow for this.

test case failed lines_exact_size_iter_04

Not sure whether it's just me. I just cloned the repo and ran cargo test.
Environment: Mac M1.

---- iter::tests::lines_exact_size_iter_04 stdout ----
thread 'iter::tests::lines_exact_size_iter_04' panicked at 'assertion failed: `(left == right)`
  left: `6`,
 right: `5`', src/iter.rs:2460:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Implement Range on lines

This may already be a thing, but how could you get only certain lines (e.g. 15-65)? Currently I have a loop and use .line(idx), which seems inefficient. Is there any way to do something like .lines()[14..65]?

Performance vs other methods (Question)

Hi!
I noticed this cool project while developing my own text-editor. I really apologize if this is the wrong place to put it, but I have a couple of questions.

  1. How does the performance of this library compare to Xi-Rope? I heard here that Xi-Rope was slightly faster than ropey, but that article was 2 years ago. By performance I mean inserting & indexing (getting a byte at position x) at random locations.
  2. How does the performance of a B-Tree rope (the design that ropey and xi-rope uses) compare to that of say a Piece-Table?
    ============================================
    Suppose you have 20GB of data loaded into memory. I'm assuming what ropey and xi-rope do is that they break that 20GB of data into a bunch of nodes, where each node contains BUFFER_SIZE bytes. If you have 20GB of data & 1K node size, then you would need 20,971,520 nodes. This really bogs down random insertions & indexes.
    ============================================
    However, a Piece-Table could just read all 20GB into memory, and a random insertion/deletion would then be instant (assuming you have made 0 edits). The downside of a Piece-Table is that performance scales with how many pieces are in the table (i.e. edits). If you have made 10,000 edits, then you would have to iterate through 10,000+ pieces to resolve an index. I think this can be brought down to O(log(m)) time with an RB-Tree, where m is the number of edits rather than the number of characters. I believe this is what VSCode does.

I was wondering if the chunk of text above (about the piece table) is true. If it is, does that mean a Piece-Table could theoretically be faster for general operations like insertion, deletion, and indexing?

n = number of bytes
m = number of edits

B-Tree Rope Random Insertion: O(log n)
B-Tree Rope Random Indexing: O(log n)

Piece-Table Random Insertion: O(m)
Piece-Table Random Indexing: O(m)

Piece-Table-With-Tree Random Insertion: O(log m)
Piece-Table-With-Tree Random Indexing: O(log m)

I'm actually using c++, but I was curious as to how this library performs against other methods.

Thanks!
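To make the trade-off above concrete, here is a minimal piece-table sketch (the type and method names are illustrative, not taken from any real editor). Indexing walks the piece list, so it is O(m) in the number of pieces, which is exactly the scaling described:

```rust
// One span of text, living either in the read-only original buffer
// or in the append-only add buffer.
struct Piece { from_add: bool, start: usize, len: usize }

struct PieceTable { original: String, add: String, pieces: Vec<Piece> }

impl PieceTable {
    // A fresh table is a single piece covering the whole original buffer,
    // so before any edits (m = 1) every lookup is effectively instant.
    fn new(text: &str) -> Self {
        PieceTable {
            original: text.to_string(),
            add: String::new(),
            pieces: vec![Piece { from_add: false, start: 0, len: text.len() }],
        }
    }

    // O(m): scan pieces until the offset falls inside one. Every edit
    // splits or adds pieces, so m grows and lookups slow down.
    fn byte_at(&self, mut idx: usize) -> u8 {
        for p in &self.pieces {
            if idx < p.len {
                let buf = if p.from_add { &self.add } else { &self.original };
                return buf.as_bytes()[p.start + idx];
            }
            idx -= p.len;
        }
        panic!("index out of bounds");
    }
}
```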

Have you considered making b-tree generic and publishing it?

Ropey's b-tree implementation is very nice/fast/documented.

But right now it is tied to the string type. Have you considered making it generic, as xi-editor or the xray editor do with their b-tree implementations? I think it would be a generally useful crate on its own if made generic.

Add methods for manual memory-overhead management

Although ideally memory overhead would be minimal all the time, in practice there will always be some overhead from manipulating ropes, much like Vec, which allocates more capacity than its contents require when pushed to.

Let's add methods analogous to capacity() and shrink_to_fit() from Vec, which could be useful in some situations.
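For reference, the Vec methods this request is modeled on behave like this (the function name is just for the demo):

```rust
// with_capacity over-allocates up front; shrink_to_fit asks the allocator
// to drop the slack down to (at least) the current length.
fn vec_overhead_demo() -> (usize, usize) {
    let mut v: Vec<u8> = Vec::with_capacity(1024);
    v.extend_from_slice(b"abc");
    let before = v.capacity(); // >= 1024: mostly unused overhead
    v.shrink_to_fit();
    (before, v.capacity())
}
```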

Implementing Eq, PartialOrd and Ord

Hey,

are there any fundamental reasons why the Rope type doesn't implement Eq, PartialOrd, and Ord? I would have expected implementations of those, with behavior equivalent to that of Strings containing the same Unicode data. But I suppose you had good reasons for not implementing Eq, at the very least.

Add a feature flag to only count LF and CRLF as line breaks.

Most other software does not conform to the Unicode spec when it comes to line breaks. This (understandably) gives people pause when deciding whether to use Ropey or not, because although Ropey is "correct" in the way it counts line breaks, it's not compatible with other software if any line breaks other than LF/CRLF are present in a file.

We can solve this with a feature flag that implements the simpler, non-Unicode-conformant line break counting that is common to most other software.
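For reference, the non-Unicode-conformant counting that most software does amounts to counting LF bytes (a sketch of the behavior the flag would enable, not Ropey's actual implementation):

```rust
// LF-only line-break counting. "\r\n" contains exactly one LF, so CRLF is
// naturally counted once; NEL, LS, PS, etc. are ignored entirely.
fn count_line_breaks_lf(text: &str) -> usize {
    text.bytes().filter(|&b| b == b'\n').count()
}
```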

Line length in columns

Hi, and thank you for the great crate!

I'm not sure that this will match goals of the project, but I came to a suggestion:
likely, this crate will be used for a text editor or something similar that depends on line lengths to compute
selections, sticky positions, etc. While it is possible to get a line by index as a RopeSlice and get its length, to get the actual number of columns the developer has to account for CRLF and whether this is the last line of the file. Having a method that returns the number of columns for a line, trimming \r\n (and the other line breaks you track to split on lines) from the end, would be useful, at least for me.
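One possible shape for such a method, sketched on &str (the suffix list would be extended with whatever other line breaks the rope tracks):

```rust
// Count chars after trimming a trailing "\r\n" or "\n". The order matters:
// "\r\n" must be tried first so a bare "\n" strip doesn't leave the "\r".
fn columns(line: &str) -> usize {
    line.strip_suffix("\r\n")
        .or_else(|| line.strip_suffix('\n'))
        .unwrap_or(line) // last line of the file: no trailing break
        .chars()
        .count()
}
```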

Complete test coverage

There are already quite a few unit tests in Ropey, but it doesn't have complete coverage yet.

The most important thing is complete coverage of all public-facing APIs, making sure they work as expected. Then unit tests of internal bits can follow.

This will be really useful when further optimizing Ropey, to make sure optimizations don't break anything.

Make Lines iterator more efficient

Currently the Lines iterator is roughly equivalent to just calling Rope::line() repeatedly with an incrementing index. This is O(log N) for each call to Lines::next(), and also is just generally less efficient than it needs to be. This is not only sub-optimal, but also stands out compared to the other iterators which are all O(1) and very fast.

It should be possible to also make Lines O(1) and just generally more efficient.

Add slice_byte() / get_slice_byte()

It would be a nice optimization in Helix: we receive byte ranges from tree-sitter and convert them into char indices, but then Ropey converts back to bytes and uses byte calculations internally.

Add ability to insert individual characters

There seems to be no way to insert individual characters.

I'd love to use ropey as the backing buffer of a terminal editor, but without the ability to edit the rope on a per-keypress basis there's just a lot of needless overhead.
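A zero-allocation workaround is possible today: a char encodes to at most four UTF-8 bytes, so it can be turned into a &str on the stack and then inserted like any other string slice. A sketch (the helper name is made up for the demo):

```rust
// Encode a single char into a caller-provided stack buffer and return the
// resulting &str view; char::encode_utf8 never needs more than 4 bytes.
fn char_as_str(c: char, buf: &mut [u8; 4]) -> &str {
    c.encode_utf8(buf)
}
```

The returned &str can then be passed straight to a normal &str insert on every keypress.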

What would be the proper way to process lines as &str instances?

Hi,

I am trying to read a file into a rope, parse it line by line with Nom and then, in the future, maybe write back changes.
My files are self-contained lines, so it makes sense to parse them line by line. I naively do a line.to_str() and try to parse the result, but the output surprised me, as not all lines convert properly to &str:

line 20: ["option \"something\"]
line 20 as_str: Some("option \"something\"\n")
line 21: ["\n"]
line 21 as_str: Some("\n")
line 22: ["option \"i", "rrelevant\"\n"]
line 22 as_str: None

So this is probably where the buffered reader overflows into another chunk, but what is my best option for retrieving a &str for the line, so that I can parse it?

Or should I try to run Nom directly on the RopeSlice somehow?
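The usual pattern for this situation is to borrow when the line happens to be contiguous and allocate only when it isn't. Sketched here with a plain chunk list standing in for the rope's chunk iterator:

```rust
use std::borrow::Cow;

// If the line lives in a single chunk, hand out a borrowed &str for free;
// otherwise concatenate the chunks into an owned String.
fn line_to_cow<'a>(chunks: &[&'a str]) -> Cow<'a, str> {
    match chunks {
        [single] => Cow::Borrowed(*single),
        _ => Cow::Owned(chunks.concat()),
    }
}
```

Either way the result derefs to &str, so a parser like Nom can run on it; only the rare chunk-straddling lines pay for an allocation.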

Add possibility to add "metadata" to slices of text

Hi,

I like ropey very much, but I have one issue I do not know how to work around. I need to add syntax highlighting, and I do not want to recompute it lazily. So I need to attach short pieces of information to ranges (in this example: exclusive ranges) of text. Depending on context, I would like to have the choice to see whether I am breaking a piece of color (keyword) or keeping it (comment). In some cases I could avoid recalculating the highlighting altogether.

What do you think?

Configurable line endings

I'm using Ropey for a LSP language server, which it seems was one of the motivating use cases for UTF-16 code unit indexing. One slight complication I noticed is that LSP mandates that only \n and \r\n be treated as line endings while Ropey uses a more general set of Unicode line boundary characters.

I think I can work around this by replacing the alternate line breaks Ropey recognizes with dummy whitespace characters if I receive them from the LSP client, so line indexing stays coherent. But it would be nice if there were a native way to configure this so no workaround was required. This strategy isn't really ideal, since the internal buffer doesn't necessarily match the file, and correctly implementing certain features might require extra tracking of where to restore the original characters when sending text back to the client.

It looks like the set of newline characters Ropey recognizes is hard-coded, so maybe there isn't an API-compatible way to implement this without sacrificing non-negligible performance. If so, feel free to close this.
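The workaround described above could look like this. The character set is an assumption (the Unicode line breaks beyond LF/CRLF: VT, FF, NEL, LS, PS), and a real implementation would also record the positions so the originals can be restored on the way back out:

```rust
// Map the line breaks the LSP client does not treat as newlines to plain
// spaces, so the rope's line indexing matches the client's. Lone '\r' is
// deliberately left alone so CRLF detection still works.
fn neutralize_exotic_breaks(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '\u{000B}' | '\u{000C}' | '\u{0085}' | '\u{2028}' | '\u{2029}' => ' ',
            other => other,
        })
        .collect()
}
```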
