I ran the users.sh and I'm only getting to ticket #3735 out of 18,000+ tickets. I ass

xml.parsers.expat.ExpatError: not well-formed (invalid token), tracboat/tracboat

Comments (37)

glensc commented on June 8, 2024

started refactoring to abstract out data sources:

i'm not from Python world, so lets see where it ends up :)

from tracboat.

glensc commented on June 8, 2024

tracboat looks for comments as well, and description edits are also available

from tracboat.

kerrhome commented on June 8, 2024

I knew it looked at comments and description, my second question (outside of the main issue of the xml parse failure) is just about the change history of each ticket comment. I think you are saying that the comment change history is also examined by tracboat.

from tracboat.

kerrhome commented on June 8, 2024

I am blocked by this, so any help or guidance on what I can do to get past this would be very much appreciated.

from tracboat.

glensc commented on June 8, 2024

I wish there was some intermediate transfer format that tracboat could process.

This way, re-running the import would not need to hammer trac with requests.

It would also allow other means of creating the intermediate format.

from tracboat.

glensc commented on June 8, 2024

actually, i think it exists:

https://github.com/tracboat/tracboat/blob/ccc272388fcdb5a251d93d70c804f93f8a262f23/tests/trac-exampleproject-exported.json

from tracboat.

glensc commented on June 8, 2024

@kerrhome not sure what you mean, but as i recall, comments come from changelog list:

tracboat/tests/trac-exampleproject-exported.json

Line 135 in ccc2723

"changelog": [

that's example dump, which data is pulled out of trac

from tracboat.

glensc commented on June 8, 2024

but the export has issues with binary data, as already explored elsewhere

$ ./tracboat.sh --config-file=news-cms.toml export --format=json --out-file=news-cms.json
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

ps: when i scanned for issues, maybe trac just truncates something: #31

from tracboat.

kerrhome commented on June 8, 2024

You've given me a bunch to look at :) I'll get to work now. Thank you!

from tracboat.

glensc commented on June 8, 2024

some suggestions:

comment out that block that processes changelogs, then you could see other results

  File "/home/user1/tracboat/src/tracboat/trac.py", line 52, in ticket_get_changelog
    for c in source.ticket.changeLog(ticket_id)

or let it go with empty set:

    for c in []:

try to use export command, maybe can see what's the problematic data.
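A less invasive variant of the two suggestions above is to fall back to an empty changelog only for the tickets whose reply fails to parse, and to record which ones they were. A minimal sketch, assuming the changelog fetcher can be passed in as a plain callable; `changelog_or_empty` and `fetch_changelog` are hypothetical names, not part of tracboat:

```python
from xml.parsers.expat import ExpatError


def changelog_or_empty(fetch_changelog, ticket_id, failed):
    """Return the ticket's changelog, or [] if the XML-RPC reply will not parse.

    fetch_changelog is any callable taking a ticket id (a hypothetical
    stand-in for the source.ticket.changeLog call in the traceback above).
    failed collects (ticket_id, error message) pairs for later inspection.
    """
    try:
        return list(fetch_changelog(ticket_id))
    except ExpatError as exc:
        # Remember the ticket so the bad data can be examined afterwards.
        failed.append((ticket_id, str(exc)))
        return []
```

After a run, the `failed` list names the tickets worth inspecting, while every other ticket still goes through with its full changelog.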

from tracboat.

kerrhome commented on June 8, 2024

@kerrhome not sure what you mean, but as i recall, comments come from changelog list:

tracboat/tests/trac-exampleproject-exported.json

Line 135 in ccc2723

"changelog": [

that's example dump, which data is pulled out of trac

That was exactly what I meant. Good. So for a specific changelog comment, you take the whole history for it. That means a special character (like &) could be in the history of the comment and not be visible when I just go and look at the ticket right now.

from tracboat.

kerrhome commented on June 8, 2024

We modified python2.7/xmlrpclib.py to dump some debug info when it fails. I updated the for loop to just process the failing ticket to reduce the time on this. I'll let you know when we have something to share. Thank you!

from tracboat.

kerrhome commented on June 8, 2024

We found it. For that ticket #3735 comment 11, one of the sentences looks like this:

Just installed v5.2.2 and it just showed the same behavior as v5.3.3 did last week.

When we look at the dump from xmlrpclib.py there is a \x10 between "week" and the "." (so week\x10.).
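An invisible character like this can also be located without reading a raw dump by eye. A minimal sketch, assuming the response body has already been decoded to text; `find_control_chars` is a hypothetical helper, not something xmlrpclib or tracboat provides:

```python
import re

# C0 control characters other than tab, newline and carriage return.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_control_chars(text, context=10):
    """Return (offset, hex code, surrounding text) for each control character."""
    hits = []
    for match in _CONTROL.finditer(text):
        start = max(match.start() - context, 0)
        snippet = text[start:match.end() + context]
        hits.append((match.start(), "0x%02x" % ord(match.group()), snippet))
    return hits
```

Printing the snippet with `repr()` makes the control character show up as an escape such as `\x10`.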

My colleague stripped out the \x10 via xmlrpclib.py and we're able to parse it just fine now. Here is what the diff looks like:

diff -C 5 /usr/lib/python2.7/xmlrpclib.py.orig /usr/lib/python2.7/xmlrpclib.py
*** /usr/lib/python2.7/xmlrpclib.py.orig	2019-02-28 11:24:27.489455298 -0500
--- /usr/lib/python2.7/xmlrpclib.py	2019-02-28 12:17:56.685429129 -0500
***************
*** 553,563 ****
              if not parser.returns_unicode:
                  encoding = "utf-8"
              target.xml(encoding, None)
  
          def feed(self, data):
!             self._parser.Parse(data, 0)
  
          def close(self):
              try:
                  parser = self._parser
              except AttributeError:
--- 553,569 ----
              if not parser.returns_unicode:
                  encoding = "utf-8"
              target.xml(encoding, None)
  
          def feed(self, data):
!             try:
!                 data = data.replace("\x10", "")
!                 self._parser.Parse(data, 0)
!             except Exception as e:
!                 print(repr(e))
!                 print(repr(data))
!                 raise
  
          def close(self):
              try:
                  parser = self._parser
              except AttributeError:

So, not a bug with tracboat or trac_to_git, but rather something in our Trac ticket that you cannot see from the Trac WebUI at all. We're fine stripping out \x10 from everything, so we'll try that and see how far we get. Thanks for your tips and assistance.
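The same filtering can be done without editing the system copy of the library, by wrapping the response parser in a custom transport. A minimal sketch, assuming Python 3's `xmlrpc.client` (the thread itself runs Python 2.7, where the module is `xmlrpclib`); `SanitizingTransport` and `_FilteringParser` are hypothetical names:

```python
import re
import xmlrpc.client

# C0 control bytes that XML 1.0 forbids (tab, LF and CR are allowed).
_FORBIDDEN = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


class _FilteringParser:
    """Wraps the stock parser and drops forbidden bytes before parsing."""

    def __init__(self, parser):
        self._parser = parser

    def feed(self, data):
        # Responses arrive as bytes chunks; filter each chunk on its own.
        self._parser.feed(_FORBIDDEN.sub(b"", data))

    def close(self):
        self._parser.close()


class SanitizingTransport(xmlrpc.client.Transport):
    """HTTP transport whose response parser ignores forbidden control bytes."""

    def getparser(self):
        parser, unmarshaller = super().getparser()
        return _FilteringParser(parser), unmarshaller
```

It would be passed as `xmlrpc.client.ServerProxy(url, transport=SanitizingTransport())`; for an https URL the base class would need to be `xmlrpc.client.SafeTransport` instead. Filtering at the byte level is safe for UTF-8 data because bytes below 0x20 never occur inside a multi-byte sequence.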

from tracboat.

kerrhome commented on June 8, 2024

We got up to #5485 and then found a couple null characters in the data. Just added this to xmlrpclib.py:

                  data = data.replace("\x10", "")
!                 data = data.replace("\x00", "")

And we're moving forward.

from tracboat.

kerrhome commented on June 8, 2024

Just to clarify that I understand the process here, I run users.sh to get the users list to plug into the mytrac.toml under [tracboat.usermap], run tracboat.sh export to do the Trac json dump, and then tracboat.sh migrate with the json dump and users updated in mytrac.toml?

from tracboat.

glensc commented on June 8, 2024

the thing is that XML does not allow control characters, it's utf-8 based mostly.

out of curiosity, from what trac version you think those were input to your system? and those were submitted as comments via trac web?

from tracboat.

glensc commented on June 8, 2024

for me no dump format worked (the binary blobs issue i pointed already), so i ran my migration using tracxmlrpc

from tracboat.

kerrhome commented on June 8, 2024

Trac 1.0 is what we're on. So I ended up having to filter out more than just \x10 and \x00. I hit several more (1e, 0c etc). But I finally made it all the way through. Hurray! Could you please confirm I'm on the right track in this comment? #59 (comment) Thank you!

from tracboat.

kerrhome commented on June 8, 2024

the thing is that XML does not allow control characters, it's utf-8 based mostly.

out of curiosity, from what trac version you think those were input to your system? and those were submitted as comments via trac web?

I missed that last question. Yes, they were comments submitted via Trac Web UI.

from tracboat.

glensc commented on June 8, 2024

I mean you are on 1.0 right now, but what version was installed (in your best guess) when they were input there? as specific as you can, i.e. the full version.

you are on the right track because control chars are not valid in xml, those must be encoded otherwise, or stripped. which is appropriate depends on why they are there in the first place.

if in comment body, then trac should filter out non-text when accepting user input. not sure if they do that now, maybe you want to get deep on this, create curl request that submits those invalid chars, and see what your trac does on that.

https://en.wikipedia.org/wiki/Valid_characters_in_XML

in any case, the problem is in trac and/or tracxmlrpc plugin and should be addressed there.

from tracboat.

glensc commented on June 8, 2024

you may copy this regex for xml 1.0:

https://stackoverflow.com/a/14323524/2314626

not sure what xml version the xmlrpc is using here. you need to somehow dump out the xmlrpc requests/responses to see the answer.
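For reference, the XML 1.0 specification allows only tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. A minimal sketch of a filter built from that rule, assuming Python 3 and already-decoded text; this is a direct rendering of the spec's `Char` production rather than a copy of the linked answer, and `strip_invalid_xml_chars` is a hypothetical name:

```python
import re

# Anything outside the XML 1.0 "Char" production is not allowed in a document.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def strip_invalid_xml_chars(text):
    """Remove every character that XML 1.0 does not permit."""
    return _INVALID_XML10.sub("", text)
```

Whether stripping or replacing with a placeholder is the better choice depends on the data; stripping matches what was done in this thread.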

from tracboat.

kerrhome commented on June 8, 2024

Sorry about that. Hit the wrong button. I've asked someone who was one of the first employees of our company if he knows what version we may have started out on. Will let you know.

I do know that years ago we switched back-end databases, so that migration could have caused problems. So far I've seen control characters inserted into comments in a few different places and I've seen wiki and ticket attachment names corrupted with control characters (2 of these so far). Something went awry over the past decade.

My guess is that these were not user input issues but rather data corruption during updates or database migration.

from tracboat.

kerrhome commented on June 8, 2024

We're pretty sure this was our Trac migration starting with version 0.9: 0.9 -> 0.11 -> 0.12 -> 1.0

from tracboat.

kerrhome commented on June 8, 2024

We made it through users, wikis and tickets now and have resolved all issues for those (by replacing the control characters with empty string or by re-uploading attachments with corrupted file names). I did move the new error with project_get to a new ticket since it was unrelated to this.

from tracboat.

kerrhome commented on June 8, 2024

I suppose we can close this now since the issue was not with tracboat, but rather with corrupted data in our Trac instance and we've worked around that.

from tracboat.

kerrhome commented on June 8, 2024

This issue is when running tracboat export. I made it all the way up to the cli.py _dumps() call and failed on json.dumps:

2019-03-01 20:37:26,687 DEBUG tracboat.trac: milestone_get_all
2019-03-01 20:37:26,687 DEBUG tracboat.trac: milestone_get_all_names
2019-03-01 20:37:58,756 DEBUG tracboat.trac: project_get is collecting authors from project
Traceback (most recent call last):
  File "/home/user1/tracboat/VENV/bin/tracboat", line 11, in <module>
    load_entry_point('tracboat', 'console_scripts', 'tracboat')()
  File "/home/user1/tracboat/src/tracboat/cli.py", line 428, in main
    cli(obj={})  # pylint: disable=unexpected-keyword-arg,no-value-for-parameter
  File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/user1/tracboat/src/tracboat/cli.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/user1/tracboat/src/tracboat/cli.py", line 268, in export
    project = _dumps(project, fmt=format)
  File "/home/user1/tracboat/src/tracboat/cli.py", line 44, in _dumps
    return json.dumps(obj, sort_keys=True, indent=2, default=json_util.default)
  File "/usr/lib/python2.7/json/__init__.py", line 251, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 209, in encode
    chunks = list(chunks)
  File "/usr/lib/python2.7/json/encoder.py", line 434, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/usr/lib/python2.7/json/encoder.py", line 390, in _iterencode_dict
    yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

from tracboat.

kerrhome commented on June 8, 2024

Besides us doing a pickle.dump(obj, open("file.p", "wb"), pickle.HIGHEST_PROTOCOL) right before the json.dumps call, any other advice you can give us? Thank you!

from tracboat.

kerrhome commented on June 8, 2024

The json.dump is failing on an attachment which is a png file being written as a utf8 string. We assume attachments are supported. Why would tracboat try to write a binary attachment as a string?

from tracboat.

kerrhome commented on June 8, 2024

So far it looks like this is only happening with .png files.

from tracboat.

kerrhome commented on June 8, 2024

We just hit a 7zip file (.7z). So, it appears that binary attachments to wiki pages are not being processed correctly.
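The likely mechanism is that Python 2 byte strings are treated as UTF-8 text by `json.dumps`, which fails on arbitrary binary content. One common way around it is to base64-encode binary values before serializing. A minimal sketch, assuming Python 3, where binary data is a distinct `bytes` type; the `__base64__` wrapper key and both function names are hypothetical, and tracboat's import step would not understand this format without a matching change:

```python
import base64
import json


def encode_binary(obj):
    """json.dumps 'default' hook: wrap bytes in a base64 text envelope."""
    if isinstance(obj, (bytes, bytearray)):
        return {"__base64__": base64.b64encode(bytes(obj)).decode("ascii")}
    raise TypeError("not JSON serializable: %r" % type(obj))


def decode_binary(d):
    """json.loads 'object_hook': unwrap the envelope back into bytes."""
    if set(d) == {"__base64__"}:
        return base64.b64decode(d["__base64__"])
    return d


# Made-up attachment payload containing non-UTF-8 bytes.
payload = {"attachments": {"logo.png": {"data": b"\x89PNG\r\n\x1a\n\xd0"}}}
text = json.dumps(payload, sort_keys=True, indent=2, default=encode_binary)
restored = json.loads(text, object_hook=decode_binary)
```

The dump stays plain ASCII JSON, at the cost of roughly a third more space for each attachment.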

from tracboat.

kerrhome commented on June 8, 2024

We decided to strip all wiki attachments for now. We just want a dump that we can import into our gitlab sandbox instance so we can get back to our pilot, but in our final run we will need wiki binary attachments working.

Here is the script we used to strip attachments from our pickle.dump:

#!python
import pickle
import json

# Load the pickled export; pickle files must be opened in binary mode.
with open("pickle_dump.p", "rb") as f:
    x = pickle.load(f)

def check_and_strip(o, path=None):
    """Drop every "attachments" entry in place and fail on a stray 0xd0 byte."""
    path = path or []
    if isinstance(o, dict):
        print("dict")
        o.pop("attachments", None)
        for k, v in o.items():
            check_and_strip(k, path + ['key:' + repr(k)])
            check_and_strip(v, path + ['value:' + repr(k)])
    elif isinstance(o, list):
        print("list")
        for i, j in enumerate(o):
            check_and_strip(j, path + ['list:' + str(i)])
    elif isinstance(o, str):
        # Python 2 str is a byte string, so this matches the raw byte 0xd0.
        if '\xd0' in o:
            print("found bad char in str at {}".format(path))
            raise ValueError()

print("checking")
check_and_strip(x)

The above script is not cleaned up and started out as our script to find the issue (binary attachments to wikis), so forgive the irrelevant code.

from tracboat.

kerrhome commented on June 8, 2024

btw, we are using Python2. Do you typically use Python 2 or 3? Wondering if 3 handles binary attachments better.

from tracboat.

kerrhome commented on June 8, 2024

We're also going to try importing/migrating our pickle dump that still has binary attachments.

from tracboat.

kerrhome commented on June 8, 2024

The pickle dump would not import complaining about the binary attachments (non-utf8 chars).

The json dump failed because of our removed attachments.

from tracboat.

kerrhome commented on June 8, 2024

Looks like the binary attachment json.dumps issue is known #9. I guess our solution to exclude attachments for now will have to do during our piloting, but we need that fixed before we run our official migration.

from tracboat.

kerrhome commented on June 8, 2024

We did get our json dump to import into gitlab successfully, so that's good news. It did not import the wikis, though, which is bad news. I thought they would import automatically, but that was not the case. Trying to figure out what to do about that.

from tracboat.

kerrhome commented on June 8, 2024

I was able to get the attachments to dump via this solution: #6 (comment). At least it didn't fail when using this solution. Since issue #6 is tracking the attachment issue and I've resolved, in one way or another, all of my utf8 related issues, I'm closing this ticket.

from tracboat.
