cessen / ropey
A utf8 text rope for manipulating and editing large texts.
License: MIT License
If, after next() has returned None, you call prev() followed by next() again, that last call to next() should return None a second time. Instead, the last two calls (prev() and next()) both return the same chunk.
I've created a reproduction repository here: https://github.com/smohekey/test-ropey.
I thought of an idea while participating in an issue to make emacs-ng multi-threaded.
What do you think about using each leaf node as a gap buffer?
I think that simple inserts and deletes that do not cause grow operations can be faster.
(If there is a need to change the structure of the gap buffer, the existing rope is used.)
Emacs uses about 20 bytes as the size of the gap.
https://github.com/emacs-mirror/emacs/blob/55e77a811a6f137b02eb786fb00c43f0f14bb028/src/buffer.c#L571
Could it be that it consumes too much RAM or causes thread-safety problems?
An alternative to the memory problem is to keep track of the cursor position, and only use the gap buffer on the leaf node the cursor is on.
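To make the idea concrete, here is a minimal sketch of a gap-buffer leaf (illustrative only, not Ropey internals; names and the grow path are assumptions): the leaf keeps a run of free space at the last edit point, so repeated inserts at the cursor are cheap byte copies until the gap is exhausted.

```rust
// Illustrative sketch only, not Ropey internals: a leaf that keeps a gap
// of free space at the last edit point, so repeated inserts at the cursor
// are cheap byte copies until the gap is exhausted.
struct GapLeaf {
    buf: Vec<u8>,     // text bytes with a gap of free space inside
    gap_start: usize, // byte offset where the gap begins
    gap_len: usize,   // size of the gap in bytes
}

impl GapLeaf {
    fn new(text: &str, gap: usize) -> Self {
        let mut buf = text.as_bytes().to_vec();
        let gap_start = buf.len(); // open the gap at the end initially
        buf.resize(gap_start + gap, 0);
        GapLeaf { buf, gap_start, gap_len: gap }
    }

    // Slide the gap so it begins at text offset `at`.
    fn move_gap(&mut self, at: usize) {
        if at < self.gap_start {
            // Shift the bytes between `at` and the gap to the right.
            self.buf.copy_within(at..self.gap_start, at + self.gap_len);
        } else if at > self.gap_start {
            // Shift the bytes between the gap and `at` to the left.
            self.buf
                .copy_within(self.gap_start + self.gap_len..at + self.gap_len, self.gap_start);
        }
        self.gap_start = at;
    }

    // Cheap byte copies when the gap is already at `at`; the grow path
    // (falling back to the regular rope machinery) is deliberately not sketched.
    fn insert(&mut self, at: usize, text: &str) {
        assert!(text.len() <= self.gap_len, "gap full: would need a grow operation");
        self.move_gap(at);
        self.buf[self.gap_start..self.gap_start + text.len()].copy_from_slice(text.as_bytes());
        self.gap_start += text.len();
        self.gap_len -= text.len();
    }

    fn text(&self) -> String {
        let mut bytes = self.buf[..self.gap_start].to_vec();
        bytes.extend_from_slice(&self.buf[self.gap_start + self.gap_len..]);
        String::from_utf8(bytes).unwrap()
    }
}
```

Tracking only the cursor's leaf, as suggested, would mean at most one such gap exists at a time, bounding the extra memory to a single gap's worth.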
Right now Ropey is hard-coded to segment text based on extended grapheme clusters as defined in Unicode Standard Annex 29. However, grapheme clusters are supposed to be customizable to some extent, based on application. Annex 29 itself even mentions this. Moreover, some applications may want to segment based on something other than graphemes, or even not segment at all for even better performance.
Ropey should have a way to customize this behavior, while still having sane defaults for the large majority of applications.
The tentative design for this feature is to have a trait Segmenter that has two methods, is_break() and seam_is_break():
trait Segmenter {
    fn is_break(byte_idx: usize, text: &str) -> bool;
    fn seam_is_break(left: &str, right: &str) -> bool;
}
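For illustration, here is what a hypothetical implementation of that proposed trait might look like (the trait is restated so the sketch is self-contained; none of this is in Ropey today): a segmenter that allows a break at every char boundary, i.e. no segmentation beyond UTF-8 validity, for applications that want maximum performance.

```rust
// The proposed trait, restated so this sketch compiles on its own.
trait Segmenter {
    fn is_break(byte_idx: usize, text: &str) -> bool;
    fn seam_is_break(left: &str, right: &str) -> bool;
}

// Hypothetical segmenter: every char boundary is a valid break,
// i.e. no segmentation at all beyond UTF-8 validity.
struct NoSegmentation;

impl Segmenter for NoSegmentation {
    fn is_break(byte_idx: usize, text: &str) -> bool {
        text.is_char_boundary(byte_idx)
    }
    fn seam_is_break(_left: &str, _right: &str) -> bool {
        // The seam between two chunks is always a char boundary.
        true
    }
}
```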
These methods return whether or not a point in a text is a valid break between segments. Ropey will then use them to determine where it can and cannot split leaf nodes. is_break() will have a default (but slightly inefficient) implementation that calls out to seam_is_break().
The Rope type will then take as a type parameter any type that implements Segmenter, like so:
struct Rope<T: Segmenter> {
    // ...
    _seg: PhantomData<T>,
}
The Nodes of a Rope will also take that same type parameter. Then T::is_break() and T::seam_is_break() will be used throughout the code to determine valid breaks in the text.
Note that neither method of Segmenter takes a self parameter. The trait is just a way to inject the functions into the Rope type without run-time overhead. The Segmenter types themselves will never be instantiated.
Finally, Rope will have a default Segmenter that segments based on extended graphemes, so most users of the library won't need to concern themselves with any of this. Ideally, this whole thing should be invisible to users that don't care about it.
All operations except append() should be compatible between Ropes with different Segmenters. Append can't work (at least not efficiently) because all nodes of the Rope being appended would have to be re-evaluated to conform to the segmentation strategy of the Rope being appended to. But things like PartialEq impls, etc. should work fine across different segmentation strategies.
Right now Ropes have methods for checking and working with grapheme boundaries, and an iterator for iterating over grapheme clusters. These will be easy to port to work via whatever Segmenter is given. However, their names (e.g. is_grapheme_boundary()) only make sense for graphemes.
So... this is a bit bike-sheddy, but what should they be called? My main concern with renaming them is that I don't want them to feel weird or too abstract for the 90+% use-case of just segmenting on graphemes. On the other hand, they shouldn't be outright wrong for other cases either.
At the moment, just substituting "grapheme" with "segment" everywhere seems like the least bad idea. But that may also get a little confusing compared to e.g. the various "chunk" functionality.
I'm increasingly convinced that abstraction has a cognitive cost for users of a library. Making something simple, straight-forward, and purpose-built is usually a good default, and the choice to abstract things should undergo a cost-benefit analysis.
Making this change will make Ropey a little more abstract, and a little less concrete. The naming question above is a hint at that. People browsing the API docs will likely wonder "What is this Segmenter thing, and what is this weird type parameter on Rope and RopeSlice?", and the various methods that depend on Segmenter will be less clearly named for the common case. All of this imposes cognitive load on users of the library.
I still think this is the right choice, because how text is segmented is purpose-specific, and I would prefer for Ropey to not be unusable for anyone just because of its segmentation strategy. Nevertheless, I want to acknowledge that this comes at a cost.
(This is a tracking issue for reducing the memory overhead of Ropey. The stats listed below are updated as improvements are made.)
The memory overhead for creating new ropes from existing text is pretty good (~10%). However, the memory overhead of building up a rope from many small insertions is worse (up to ~55% depending on insertion placement).
Improving this would be great!
The main culprit here is that since leaf nodes generally have a fixed-size text buffer, leaf nodes that aren't maximally filled contain wasted space. Trying to keep leaf nodes close to maximally filled should resolve the memory overhead back down to close to 10%. But how to do that without negatively impacting performance too much?
Ropey's iterators are now bidirectional (see issue #18). However, there are still use-cases where it is useful or necessary to reverse the iterator so that next() moves it backwards instead of forwards. In particular, right now you can't use any of Rust's iterator methods when iterating backwards.
One possibility would be to implement the DoubleEndedIterator trait, which behaves like two independent iterators, one on each end, both moving in opposite directions. By calling rev() you can swap the ends. However, it's not entirely clear what it would mean, then, to create an iterator starting e.g. in the middle of the rope.
Another possibility is to add a method that simply reverses the direction of the iterator in-place. This would side-step any awkwardness with DoubleEndedIterator and I believe would satisfy the large majority of use-cases. Moreover, it would make it obvious how to create a reversed iterator starting in the middle of a Rope. However, it would move Ropey even further away from the way people may initially expect things to work based on iterators in the standard library.
Is it possible to get a copy-on-write reference to a chunk? Or alternatively, a rope that guarantees it only ever consists of a single chunk?
The context here is a struct which needs a memory contiguous string for some operations, but to produce a rope containing that string for other operations. Currently, I am storing both a Rope and a String that have the same content, but this isn't ideal.
Thanks for fixing my previous issue. I'm not sure it's exactly correct, however. The behavior now differs if the document ends with \n. In that case, the rope correctly has an empty last line, but a slice of that rope does not. This is using ropey 0.9.1.
extern crate ropey;
use ropey::Rope;

fn main() {
    let text = "Hello there!\nHow goes it?\n";
    let rope = Rope::from_str(text);
    // should be the same as rope
    let slice = rope.slice(..);
    // but now the slice is missing the empty last line that the rope has
    assert!(slice.lines().count() == rope.lines().count());
}
I was learning about the Rope data structure when I saw this project.
A normal Rope uses a binary tree, and every node's key is the sum of the string lengths in its left subtree. (According to https://en.wikipedia.org/wiki/Rope_(data_structure).) The Rope's leaf nodes are what actually contain the string, and each leaf's key is the length of that string.
I saw in https://github.com/cessen/ropey/blob/master/design/design.md that Ropey uses a B-Tree. I also noticed that in that design paper, it shows that every single Node, not just the leaf nodes, contains a String.
I was kinda confused by this, and was wondering if you could explain how that works. For example, how would searching work? (Like finding what byte is at a specific position.) In a normal B-Tree, to search for a key k, you:
1. Start at the root.
2. Go through each key in the current node, and find where k falls in that range of keys.
3. Go to the appropriate child, and repeat.
However, I'm pretty sure that only works without duplicates.
How does Ropey's design handle that? Because if the key in Ropey's design is the byte count of its child, then there could be multiple duplicate byte counts.
I'm probably missing something very basic, and it would be great if you could explain this.
Also, I'm kind of curious as to why you did not use a B+ Tree. I feel like a B+ Tree is more fitting because all of the values are stored in its leaf nodes, like a Rope.
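For context, the short answer to the duplicates question is that a rope tree isn't searched by key at all: each child records the total text length of its subtree, and lookup descends by running subtraction, so duplicate lengths never matter. A toy sketch (my own illustration, not Ropey's actual node layout; real implementations cache the per-child lengths in the parent instead of recomputing them):

```rust
// Toy rope tree: internal nodes hold children, leaves hold text.
// Lookup walks down, subtracting each skipped child's total length.
enum Node {
    Internal(Vec<Node>),
    Leaf(String),
}

impl Node {
    // Total byte length of this subtree. A real rope caches this in the
    // parent node rather than recomputing it on every query.
    fn len(&self) -> usize {
        match self {
            Node::Leaf(s) => s.len(),
            Node::Internal(children) => children.iter().map(|c| c.len()).sum(),
        }
    }

    // Find the byte at `idx` by descending with running subtraction.
    fn byte_at(&self, mut idx: usize) -> u8 {
        match self {
            Node::Leaf(s) => s.as_bytes()[idx],
            Node::Internal(children) => {
                for child in children {
                    let n = child.len();
                    if idx < n {
                        return child.byte_at(idx);
                    }
                    idx -= n; // skip this whole subtree
                }
                panic!("index out of bounds");
            }
        }
    }
}
```

Because the "keys" are subtree sizes rather than searchable values, this is an order-statistics structure: two subtrees with the same length are never ambiguous, since position in the child list disambiguates them.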
I'm not sure where to put this; it's a bit of a stretch to make this an issue. But it might be something you want to add to ropey.
I've been experimenting with pre-write concatenation of edits to see how much of a difference it makes to performance. The use case here is replaying editing histories in a CRDT I'm working on, and outputting the resulting string. (I'm storing the editing history anyway, so if I can do this quickly I can skip also storing the resulting document).
To improve performance I'm essentially wrapping jumprope in a little wrapper structure which greedily concatenates incoming changes when they're adjacent and the same type. So instead of insert 'a' then insert 'b', the changes become insert 'ab'. Obviously this only works when the changes are the same type and adjacent. But the performance figures when using real-world editing histories are pretty unreal. I'm seeing a 3-10x performance improvement from this change.
The automerge-perf dataset shows the biggest improvement, because each edit in that dataset is a single keystroke. This concatenation approach brings the number of "real edits" to the structure down from 260k to 10k, at the expense of a bit of bookkeeping. In aggregate performance improves by 9.5x. I'm now replaying this 260k edit history in 670 microseconds (!), or 390M keystrokes per second. (Up from 40M/sec, which is already obnoxiously fast.)
Dataset | Edit Direct (ms) | Edit Buffered (ms) | Performance multiplier |
---|---|---|---|
automerge-perf | 6.36 | 0.67 | 9.55x |
seph-blog1 | 3.61 | 0.97 | 3.72x |
Anyway, this approach is only useful with specific use cases, and it won't help at all with random editing traces.
It's also entirely rope-agnostic. Ropey would see a similar performance improvement with these data sets.
So yeah, I'm not sure if you care about this, but it's interesting!
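The greedy merging described above can be sketched rope-agnostically. This is my own illustrative reconstruction (only the insert case; deletes merge analogously), not the actual wrapper code:

```rust
// Buffer that merges adjacent inserts into one pending run, flushing
// the run only when a non-adjacent insert arrives.
struct InsertBuffer {
    pos: usize,   // where the pending text starts
    text: String, // pending concatenated inserts
}

impl InsertBuffer {
    // Merge a new insert if it extends the pending run; otherwise return
    // the flushed run (to be applied to the rope as a single edit).
    fn push(&mut self, pos: usize, s: &str) -> Option<(usize, String)> {
        if pos == self.pos + self.text.len() {
            // Adjacent: 'a' then 'b' becomes one pending "ab".
            self.text.push_str(s);
            None
        } else {
            // Not adjacent: hand back the pending run, start a new one.
            let flushed = (self.pos, std::mem::take(&mut self.text));
            self.pos = pos;
            self.text.push_str(s);
            Some(flushed)
        }
    }
}
```

With keystroke-granularity histories like automerge-perf, nearly every edit extends the pending run, which is why the number of real edits applied to the underlying structure collapses so dramatically.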
I'm new to rust, so sorry if I'm doing something dumb, but I'm seeing a weird bug. When iterating over a slice with lines(), I get an extra empty line. If you make a new program with cargo like this, it fails. However, I added the same test to ropey as a test in the iter.rs file, and it passed. I don't understand how that can be.
Thanks,
Craig
extern crate ropey;
use ropey::Rope;

fn main() {
    let text = "Hello there!\nHow goes it?";
    let rope = Rope::from_str(text);
    // should be the same as rope
    let slice = rope.slice(..);
    // but we get an extra empty line when iterating over the slice
    assert!(slice.lines().count() == rope.lines().count());
}
#[cfg(test)]
mod tests {
    use Rope;

    #[test]
    fn lines_from_slice() {
        let text = "Hello there!\nHow goes it?";
        let rope = Rope::from_str(text);
        // should be the same as rope
        let slice = rope.slice(..);
        // but we get an extra empty line when iterating over the slice
        assert!(slice.lines().count() == rope.lines().count());
    }
}
Some software and APIs, especially on the Windows platform, still operate in terms of UTF-16 code units when working with text. One example is the Language Server Protocol, which specifies text offsets in UTF-16 code units.
Being able to efficiently interoperate with these APIs is useful, especially for code editors. To better support this use-case, and to generally round-out Ropey's Unicode support, we should provide a way to convert between Unicode scalar value indices (which Ropey uses) and UTF-16 code unit indices.
I think tracking the data for this can be done with relatively little overhead. I'm not as concerned that the actual conversions be super optimized (though they should still be efficient). But I don't want the feature to negatively impact clients that aren't using it.
Update:
This feature is now implemented and in master, and will be in the next release.
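The conversion semantics can be sketched on a plain &str (this is not Ropey's implementation: Ropey tracks these counts in the tree, making conversions O(log n) rather than this O(n) scan). Chars outside the Basic Multilingual Plane occupy one scalar-value index but two UTF-16 code units.

```rust
// Convert a char (Unicode scalar value) index to a UTF-16 code unit index.
fn char_to_utf16_cu(text: &str, char_idx: usize) -> usize {
    text.chars().take(char_idx).map(char::len_utf16).sum()
}

// Convert a UTF-16 code unit index back to a char index.
fn utf16_cu_to_char(text: &str, cu_idx: usize) -> usize {
    let mut cu = 0;
    for (char_idx, c) in text.chars().enumerate() {
        if cu >= cu_idx {
            return char_idx;
        }
        cu += c.len_utf16();
    }
    text.chars().count()
}
```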
Some client code may want to iterate over e.g. chunks in both directions. For example, DFA-based regex engines need to scan backwards after finding a match to calculate where the match begins.
The tricky bit isn't technical, however. It's having a standard API to interoperate with other libraries. Unfortunately, the Rust standard library doesn't provide a bidirectional iterator trait, so this probably isn't useful to implement yet: any API I come up with will be specific to Ropey, and therefore wouldn't interoperate with other libraries anyway.
I'm creating this issue as a reminder to add this functionality if/when a bidirectional iterator trait (or equivalent) is added.
Inserting a large amount of text with a single insert is somewhat slow right now, because it is special-cased to the construction of a new rope, and then a split and two appends. It's still O(log n) performance, but the constant is much higher. It's likely that there is a faster way to do this.
I know that you can combine/insert a string into a rope in O(log n) time, but I wonder if it is possible to reuse the information from another rope, rather than turning it into a string and then inserting it back; i.e. a tree merge algorithm is probably needed.
Related: #4
This is needed for a preprocessor algorithm, so that we could map the parent node of any rope to a specific pointer (such as a file) for better diagnostics when the parser goes wrong. Another feature needed is a splice algorithm, which requires this, but let's set that aside for next time.
I noticed this in helix-editor/helix#4983. It looks like the cause is that the stack is not initialized with a single starting node.
I've been trying to read from/write to files from a Rope. The documentation suggests this:

let mut text = Rope::from_reader(
    BufReader::new(File::open("my_great_book.txt")?)
)?;

This works well for synchronous programs, but I was unable to find any asynchronous counterpart to Rope::from_reader. I tried to use tokio::io::BufReader, but since that doesn't implement std::io::Read (it implements tokio::io::AsyncRead instead), it isn't compatible with from_reader.
I'm wondering if async support is a possibility, for example something like Rope::from_reader_async behind a feature flag.
Thanks!
Is it perhaps an error that this line is
pub(crate) const MAX_BYTES: usize = TARGET_NODE_SIZE - 1 - (PTR_SIZE * 2);
rather than
pub(crate) const MAX_BYTES: usize = TARGET_NODE_SIZE - size_of::<usize>()*2;
or
pub(crate) const MAX_BYTES: usize = 1024 - 1 - size_of::<usize>() - (PTR_SIZE * 2);
as you have already subtracted the arc counters in TARGET_NODE_SIZE, and the layout is actually
[ smallvec enum tag ][ smallvec capacity ][ BackingArray ]
except when the "union" feature is enabled.
I found that this allows the correct maximum of 1024-16 = 1008 bytes to be achieved on my 64-bit system.
I found it confusing how ropey's line count doesn't agree with BSD or GNU wc's line count.
I wrote a simple reproduction of this difference here: https://github.com/evanrelf/ropey-confusion
The TL;DR is that ropey considers an empty file (e.g. created with touch) to have one line, whereas wc considers it to have zero lines.
I don't necessarily think there's anything wrong with this behavior. But I found it confusing and I thought it might be worth mentioning in the documentation.
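The two conventions can be stated precisely on plain strings (just an illustration of the disagreement, not Ropey's code): ropey counts the segments that line breaks separate, so even an empty text contains one (empty) line, while wc counts newline characters, i.e. completed lines.

```rust
// Ropey-style: lines are what line breaks separate.
// An empty text is one empty line; a trailing \n implies an empty last line.
fn ropey_style_lines(text: &str) -> usize {
    text.split('\n').count()
}

// wc-style: count newline characters, i.e. only completed lines.
fn wc_style_lines(text: &str) -> usize {
    text.matches('\n').count()
}
```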
Other than this little thing, ropey has been a joy to use while working on building my own text editor. Thanks for making such a great library!
Both the get_chars() and get_chars_at() methods on RopeSlice<'a> return a value of type Chars. However, the get_chars() version returns a Chars<'a> with a lifetime matching the RopeSlice<'a>, while get_chars_at()'s return value uses an anonymous lifetime, Chars<'_>, which is limited to the scope in which the method was called.
I encountered this while trying to write a previous-word implementation, under some constraints. I found myself needing to write a function that took a slice and the cursor position, and returned an iterator over the characters in the slice traveling towards the start of the slice. I ended up with some functions that looked a bit like the following:
fn iter_slice_rev<'a>(text: RopeSlice<'a>) -> Chars<'a> {
    let mut it = text.chars_at(text.len_chars());
    it.reverse();
    it
}

fn iter_slice_forward<'a>(text: RopeSlice<'a>) -> Chars<'a> {
    let it = text.chars();
    it
}
The iter_slice_forward() function will compile, while the iter_slice_rev() function has borrow checker errors, because the Chars instance it returns has its anonymous lifetime tied to the borrow that starts when chars_at() is called, and thus cannot outlive the function.
Similar function attempts also fail with lifetime errors:
fn iter_slice_rev2<'a, 'b>(text: RopeSlice<'a>) -> Chars<'b> {
    let mut it = text.chars_at(text.len_chars());
    it.reverse();
    it
}

fn iter_slice_rev3<'a, 'b>(text: &RopeSlice<'a>) -> Chars<'b> {
    let mut it = text.chars_at(text.len_chars());
    it.reverse();
    it
}

fn iter_slice_rev4(text: RopeSlice<'_>) -> Chars<'_> {
    let mut it = text.chars_at(text.len_chars());
    it.reverse();
    it
}
I attempted to see if there were any problems with simply adjusting the signatures to return the more general lifetime, changing pub fn get_chars_at(&self, char_idx: usize) -> Option<Chars> to pub fn get_chars_at(&self, char_idx: usize) -> Option<Chars<'a>> in slice.rs on line 1261, and making a similar change for get_chars(). It did not cause any issues with the test suite.
I do not believe that such a change would be a breaking change, since the lifetime is only being expanded.
A similar difference exists for chars_at(), bytes_at(), lines_at(), and chunks_at_line_break(), vs. chars(), bytes(), lines(), and all other chunks_at_*() methods.
The functions I have noticed that use the anonymous lifetime and have counterparts that use the slice's lifetime are:
The differences among the chunks_at_*() methods seem to indicate that the use of the anonymous lifetime for the former group is accidental and not a specific API choice. I am not entirely certain that such a guess is correct, however.
If this is indeed not intended, and it would be an acceptable change, then I would like to ask for the lifetime annotations to be added.
If it is acceptable and you would prefer it, I could put together a pull request for the change.
The code base may be a little tricky to understand for newcomers at the moment, because of how much fiddly stuff is going on. To make it easier for people to contribute, Ropey's design should be documented.
I'm wondering if you have any suggestion for efficiently tracking marked ranges? Issue #13 touched on this but was closed before you had a chance to comment.
I agree with issue #13 that it can (and probably should) be done in an external structure. But an efficient implementation is non-trivial. I'm thinking of something like superstring's marker_index (https://github.com/atom/superstring) as a good example of what is needed. It would be great to have such a structure as an option to go with ropey, or at least to have some suggestions in the documentation (maybe another crate) on how to handle the use case of tracking ranges in a ropey buffer.
ropey/examples/search_and_replace.rs
Line 156 in d9c841b
For the next() function here, changing the linear search to the Boyer-Moore string search algorithm lets it skip some unnecessary char comparisons, reducing memory operations and improving performance.
According to my test it gets a 96.5% speedup (12.2s to 6.22s).
Here's the modified code:
fn next(&mut self) -> Option<(usize, usize)> {
    let search_char = self.search_pattern.chars();
    while let Some(next_char) = self.char_iter.next() {
        self.cur_index += 1;
        if let Some(last_char) = self.char_iter.nth(self.search_pattern_char_len - 2) {
            if last_char != search_char.clone().last().unwrap() {
                let n_place = self.search_pattern.rfind(last_char);
                if n_place == None {
                    // If the last char doesn't match and isn't in the search
                    // pattern at all, we can skip the whole pattern length.
                    for _ in 0..(self.search_pattern_char_len - 1) {
                        self.char_iter.next();
                        self.cur_index += 1;
                    }
                    continue;
                } else {
                    // If the last char is in the search pattern,
                    // we can skip some of the letters.
                    for _ in 0..(self.search_pattern_char_len - n_place.unwrap() - 1) {
                        self.char_iter.next();
                        self.cur_index += 1;
                    }
                    continue;
                }
            }
            // The last char matched, so start comparing the rest:
            // first check the first char; if it doesn't match, continue.
            if next_char != search_char.clone().next().unwrap() {
                continue;
            }
            let mut if_match = true;
            for i in 0..(self.search_pattern_char_len - 3) {
                let next_last_char = self.char_iter.nth(self.search_pattern_char_len - 3 - i);
                if next_last_char != search_char.clone().nth(self.search_pattern_char_len - 2 - i) {
                    if_match = false;
                    break;
                }
            }
            if if_match {
                let char_match_range = (
                    self.cur_index - 1,
                    self.cur_index + self.search_pattern_char_len - 1,
                );
                return Some(char_match_range);
            } else {
                continue;
            }
        } else {
            break;
        }
    }
    None
}
Hope this information helps!
I plan for Ropey to reach 1.0 at some point. I don't want it to be a library that is perpetually stuck in beta. This is the list of things that need to happen before I declare 1.0.
I'd like to know if you would accept a PR adding a feature-gated API extension which includes indexing functions that return Option<A> instead of panicking. Thanks!
Hello, while trying to compile a cargo package (zee) which has ropey as a dependency, I ran into this error:

attempt to compute `0_usize - 1_usize`, which would overflow

Viewing the source, the comment above this declaration specifies this:
I am unsure whether this is an error on my side; relevant info:
rustc --version: rustc 1.65.0 (897e37553 2022-11-02)
rustup toolchain list:
stable-x86_64-unknown-linux-musl (default)
nightly-x86_64-unknown-linux-musl
Operating system: Alpine linux edge x86_64
For implementing copy-paste, it is useful to take part of a Rope as a new Rope. Right now I will just have to clone() and remove the prefix and suffix, which is not too bad, but not ideal. :)
Firstly, thanks for the excellent library! I have run into one bit of confusing behaviour though, and I couldn't see an issue about it so I thought I'd raise it as a question.
Quoting the documentation:
If char_idx is one-past-the-end, then one-past-the-end line index is returned. This is mainly unintuitive for empty ropes, which will return a line index of 1 for a char_idx of zero. Otherwise it behaves as expected.
https://docs.rs/ropey/0.6.3/ropey/struct.Rope.html#method.char_to_line
The code works as described, but the behaviour seems strange to me, whether the rope is empty or not. It is causing me to add quite a few special cases to my code. Why does it not return the index of the last line, rather than a hypothetical new line?
I have a char_index and I would like to convert to a (line, column) point.
Untested but right now I think something like this will work:
fn char_to_point(char_idx: usize, rope: &Rope) -> (usize, usize) {
    let line_idx = rope.char_to_line(char_idx);
    let line_char_index = rope.line_to_char(line_idx);
    (line_idx, char_idx - line_char_index)
}
I wonder if there's a more efficient approach though... it seems that internally char_to_line probably already has easy/efficient access to the column offset information. Could it (or maybe some other method) just return the line and offset, so that the second call to line_to_char isn't necessary?
rust-lang/rust#94075 optimizes repr(Rust) enums better, making them smaller in many cases. This breaks ropey (see https://crater-reports.s3.amazonaws.com/pr-94075/try%23419e70b02dc61a51433bb55fe5a482c6ab6f7da5/gh/cessen.led/log.txt for an example). I'm not sure when this change gets merged and will hit stable, but relying on the exact size of repr(Rust) types is suboptimal. The easy fix is to use repr(C) for the affected enums, which will disable any such optimizations.
If you have opinions on this topic, please don't hesitate to let me know here or post on the PR.
let file = File::open(&path).context(format!("unable to open {:?}", path))?;
Rope::from_reader(BufReader::new(file))?
Should we wrap the reader in a BufReader before calling from_reader, or does from_reader already buffer internally, making BufReader unnecessary? If so, maybe the docs should mention it. Looking at the code, it seems like it does do some buffering. I just pondered this while working on helix.
I read Ropey's design documentation but still can't quite get it. When cloning ropes, the actual content is shared between them. If we then continue to insert text into one of the clones, the other doesn't change. But we track information about the text in the internal nodes, so at least all of the internal nodes from the root to the insert position must change according to the edit.
So are we actually cloning (or duplicating) all of the internal nodes from the root to the insert position in this case? What is actually cloned?
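My understanding of the mechanism, sketched with toy types (not Ropey's actual node layout): the nodes are reference-counted, and an edit clones only the nodes on the root-to-leaf path whose reference count shows they are shared; every untouched subtree stays shared between the clones. Rust's Arc::make_mut captures exactly this path-copying pattern:

```rust
use std::sync::Arc;

// Toy rope node; a real rope would also store cached lengths, etc.
#[derive(Clone)]
enum Node {
    Internal(Vec<Arc<Node>>),
    Leaf(String),
}

// Append `s` to the leftmost leaf, cloning only shared nodes on the path.
fn edit_leftmost(node: &mut Arc<Node>, s: &str) {
    // make_mut clones this node only if its Arc is shared (path copying).
    match Arc::make_mut(node) {
        Node::Leaf(text) => text.push_str(s),
        Node::Internal(children) => edit_leftmost(&mut children[0], s),
    }
}

// Collect the full text, for demonstration.
fn concat(node: &Node) -> String {
    match node {
        Node::Leaf(s) => s.clone(),
        Node::Internal(children) => children.iter().map(|c| concat(c)).collect(),
    }
}
```

So the answer to "what is cloned" under this model is: only the O(log n) nodes on the edited path, while siblings of that path remain shared.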
Right now you can only insert &str slices into ropes. However, in an editor it may be useful to insert text from other ropes (e.g. for copy-pasting a large amount of text). Adding the ability to insert rope slices would allow for this.
Not sure whether it's just me. I just cloned the repo and ran cargo test.
Environment: Mac M1.
---- iter::tests::lines_exact_size_iter_04 stdout ----
thread 'iter::tests::lines_exact_size_iter_04' panicked at 'assertion failed: `(left == right)`
left: `6`,
right: `5`', src/iter.rs:2460:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This may already be a thing, but how could you get only certain lines (e.g. 15-65)? Currently I have a loop and use .line(idx), which seems inefficient. Is there any way to do something like this: .lines()[14..65]?
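There is no slicing syntax on an iterator, but the standard skip/take adapters express the same thing. Sketched here on a plain &str (the same adapter combination applies to a rope's line iterator; starting the iteration at a mid-rope line rather than skipping from the top avoids even the linear skip):

```rust
// Lines 15..=65 (1-based) are conceptually lines()[14..65];
// skip/take expresses that range on any iterator.
fn lines_range<'a>(text: &'a str, start: usize, end: usize) -> Vec<&'a str> {
    text.lines().skip(start).take(end - start).collect()
}
```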
Hi!
I noticed this cool project while developing my own text-editor. I really apologize if this is the wrong place to put it, but I have a couple of questions.
1. How does ropey perform these days? (The comparison I read was from 2 years ago.) By performance I mean inserting & indexing (getting a byte at position x) at random locations.
2. How does the B-Tree rope approach (which ropey and xi-rope use) compare to that of, say, a Piece-Table?
3. My understanding of what ropey and xi-rope do is that they break, say, 20GB of data into a bunch of nodes, where each node contains BUFFER_SIZE bytes. If you have 20GB of data & 1K node size, then you would need 20,971,520 nodes. This really bogs down random insertions & indexes. A Piece-Table could just read all the 20GB into memory, and when you want to perform a random insertion/deletion it would be instant (assuming you have made 0 edits). The downside of a Piece-Table is that the performance scales with how many pieces you have in the table (i.e. edits). If you have made 10,000 edits, then you would have to iterate through 10,000+ pieces to find indexes. I think this can be brought down to O(log(m)) time with an RB-Tree, where m is the number of edits made instead of the number of characters. I believe this is what VSCode does.
I was wondering if the chunk of text above (about the piece table) is true. If it's true, does that mean a Piece-Table could be theoretically faster for general things like insertions, deletions, and indexing?
n = number of bytes, m = number of edits

Structure | Random Insertion | Random Indexing
---|---|---
B-Tree Rope | O(log n) | O(log n)
Piece-Table | O(m) | O(m)
Piece-Table-With-Tree | O(log m) | O(log m)
I'm actually using c++, but I was curious as to how this library performs against other methods.
Thanks!
Ropey's b-tree implementation is very nice/fast/documented.
But right now it is tied to string type. Have you considered making it generic such as what xi-editor or xray editor do with their b-tree implementations? I think it would be a generally useful crate on its own if made generic.
Although ideally memory overhead would be minimal all the time, in practice there will always be some memory overhead from manipulating ropes, much like Vecs, which allocate more capacity than their contents require as you push to them.
Let's add methods analogous to Vec's capacity() and shrink_to_fit(), which could be useful in some situations.
Hey,
are there any fundamental reasons why the Rope type doesn't implement Eq, PartialOrd and Ord? I would have expected implementations for those, with behavior equivalent to that of Strings containing the same unicode data. But I suppose you had good reasons for not implementing Eq at the very least.
Most other software does not conform to the Unicode spec when it comes to line breaks. This (understandably) gives people pause when deciding whether to use Ropey or not, because although Ropey is "correct" in the way it counts line breaks, it's not compatible with other software if any line breaks other than LF/CRLF are present in a file.
We can solve this with a feature flag that implements the simpler, non-Unicode-conformant line break counting that is common to most other software.
Hi, and thank you for the great crate!
I'm not sure this matches the goals of the project, but I have a suggestion: this crate will likely be used in a text editor or something similar which depends on line lengths to compute selections, sticky positions, etc. While it is possible to get a line by index as a RopeSlice and take its length, to get the actual number of columns the developer has to account for CRLF and for whether this is the last line of the file. Having a method which returns the number of columns for a line, trimming \r\n (and the other line breaks you split on) from the end, would be useful, at least for me.
There are already quite a few unit tests in Ropey, but it doesn't have complete coverage yet.
The most important thing is complete coverage of all public-facing API's, making sure they work as expected. Then unit tests of internal bits can follow.
This will be really useful when further optimizing Ropey, to make sure optimizations don't break anything.
Currently the Lines iterator is roughly equivalent to just calling Rope::line() repeatedly with an incrementing index. This is O(log N) for each call to Lines::next(), and is also just generally less efficient than it needs to be. This is not only sub-optimal, but also stands out compared to the other iterators, which are all O(1) and very fast.
It should be possible to make Lines O(1) and just generally more efficient as well.
It would be a nice optimization in Helix, since we receive byte ranges from tree-sitter and convert them into char indices, but then ropey converts back to bytes and uses byte calculations internally.
It would be great if there was a simple way to remove the newline between two lines, turning it into one line
There seems to be no way to insert individual characters.
I'd love to use ropey as the backing buffer of a terminal editor, but without the ability to edit the rope on a per-keypress basis there's just a lot of needless overhead.
Hi,
I am trying to read a file into a rope, parse it line by line with Nom and then, in the future, maybe write back changes.
My files are self-contained lines, so it makes sense to parse them line by line. I naively do a line.to_str() and try to parse them, but the results surprised me, as not all lines convert properly to &str:
line 20: ["option \"something\"]
line 20 as_str: Some("option \"something\"\n")
line 21: ["\n"]
line 21 as_str: Some("\n")
line 22: ["option \"i", "rrelevant\"\n"]
line 22 as_str: None
So this is probably where the buffered reader overflows into another chunk, but what is my best option for retrieving a &str for the line, so that I can parse it?
Or should I try to run Nom directly on the RopeSlice somehow?
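One pattern for this, sketched standalone (plain &str chunks stand in for what you would pull out of the slice): borrow when the line happens to sit in a single chunk, and fall back to a single allocation only when it spans a chunk boundary. I believe a RopeSlice can also be converted into a Cow<str> with exactly these semantics, which may be the simplest route for feeding Nom.

```rust
use std::borrow::Cow;

// Borrow when contiguous; allocate only when the line spans chunks.
fn line_as_cow<'a>(chunks: &[&'a str]) -> Cow<'a, str> {
    match chunks {
        [single] => Cow::Borrowed(*single),
        _ => Cow::Owned(chunks.concat()),
    }
}
```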
Hi,
I like ropey very much, but I have one issue I do not know how to walk around. I need to add syntax highlighting and I do not want to recompute it lazily. So I need to attach short information to ranges (this example: exclusive ranges) of text. Depending on context, I would like to have a choice to see if I am breaking a piece of color (keyword) or keeping it (comment). In some cases I could avoid recalculating highlighting altogether.
What do you think?
I'm using Ropey for a LSP language server, which it seems was one of the motivating use cases for UTF-16 code unit indexing. One slight complication I noticed is that LSP mandates that only \n and \r\n be treated as line endings, while Ropey uses a more general set of Unicode line boundary characters.
I think I can work around this by just replacing the alternate line breaks that Ropey recognizes with dummy whitespace characters if I receive them from the LSP client, so line indexing will stay coherent. But it would be nice if there was a native way to configure this so no workaround was required. This strategy isn't really ideal since the internal buffer doesn't necessarily match the file and correctly implementing certain features might require extra tracking of where to replace the dummy characters with the original values when sending text back to the client.
It looks like the set of newline characters Ropey recognizes is hard-coded, so maybe there isn't an API-compatible way to implement this without sacrificing non-negligible performance. If so, feel free to close this.
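To make the mismatch concrete, here is a standalone sketch of LSP-style position math on a plain &str (not Ropey's API): only \n terminates a line, which covers \r\n as well, and the character offset within the line is measured in UTF-16 code units per the LSP spec.

```rust
// Byte index -> LSP (line, character) position. Assumes `byte_idx` falls
// on a char boundary. Only '\n' ends a line (this also handles "\r\n");
// other Unicode line breaks like U+2028 are deliberately NOT line endings.
fn lsp_position(text: &str, byte_idx: usize) -> (usize, usize) {
    let before = &text[..byte_idx];
    let line = before.matches('\n').count();
    let line_start = before.rfind('\n').map(|i| i + 1).unwrap_or(0);
    // LSP measures the character offset in UTF-16 code units.
    let character: usize = before[line_start..].chars().map(char::len_utf16).sum();
    (line, character)
}
```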
Hi!
Today I looked at how IntelliJ handles text documents, and was really surprised by how minimal and simple the core data structure is at just 500 lines of Java. I thought that you might be interested in taking a look as well: