Comments (37)
started refactoring to abstract out data sources:
i'm not from the Python world, so let's see where it ends up :)
from tracboat.
tracboat looks for comments as well, and description edits are also available
from tracboat.
I knew it looked at comments and description, my second question (outside of the main issue of the xml parse failure) is just about the change history of each ticket comment. I think you are saying that the comment change history is also examined by tracboat.
from tracboat.
I am blocked by this, so any help or guidance on what I can do to get past this would be very much appreciated.
from tracboat.
I wish there was some intermediate transfer format, that tracboat could process.
This way when re-running the import, would not need to hammer trac with the requests.
This would allow some other means to create the intermediate format.
from tracboat.
actually, i think it exists:
from tracboat.
@kerrhome not sure what you mean, but as i recall, comments come from changelog list:
tracboat/tests/trac-exampleproject-exported.json, line 135 in ccc2723:
"changelog": [
that's an example dump, showing which data is pulled out of trac
from tracboat.
but the export has issues with binary data, as already explored elsewhere
$ ./tracboat.sh --config-file=news-cms.toml export --format=json --out-file=news-cms.json
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
ps: when i scanned for issues, maybe trac just truncates something: #31
from tracboat.
You've given me a bunch to look at :) I'll get to work now. Thank you!
from tracboat.
some suggestions:
- comment out that block that processes changelogs, then you could see other results
File "/home/user1/tracboat/src/tracboat/trac.py", line 52, in ticket_get_changelog
for c in source.ticket.changeLog(ticket_id)
or let it go with empty set:
for c in []:
- try to use export command, maybe can see what's the problematic data.
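The "let it go with empty set" suggestion can also be done without editing the loop by hand. Below is a minimal Python 3 sketch, assuming the parse failure surfaces as `xml.parsers.expat.ExpatError`; the helper name `changelog_or_empty` and its `fetch_changelog` parameter are hypothetical and not part of tracboat:

```python
import logging
from xml.parsers.expat import ExpatError

log = logging.getLogger(__name__)


def changelog_or_empty(fetch_changelog, ticket_id):
    """Return the ticket's changelog as a list, or [] if the XML response is unparsable.

    fetch_changelog is any callable taking a ticket id (hypothetical stand-in
    for something like source.ticket.changeLog).
    """
    try:
        return list(fetch_changelog(ticket_id))
    except ExpatError as exc:
        # Log the ticket id so the bad data can be inspected later.
        log.warning("ticket %s: changelog skipped (%s)", ticket_id, exc)
        return []
```

This keeps the rest of the export running and leaves a log line for each ticket that needs a closer look.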
from tracboat.
@kerrhome not sure what you mean, but as i recall, comments come from changelog list:
tracboat/tests/trac-exampleproject-exported.json, line 135 in ccc2723 (ccc272388fcdb5a251d93d70c804f93f8a262f23):
"changelog": [
that's example dump, which data is pulled out of trac
That was exactly what I meant. Good. So for a specific changelog comment, you take the whole history for it. That means a special character (like &) could be in the history of the comment and not be visible when I just go and look at the ticket right now.
from tracboat.
We modified python2.7/xmlrpclib.py to dump some debug info when it fails. I updated the for loop to just process the failing ticket to reduce the time on this. I'll let you know when we have something to share. Thank you!
from tracboat.
We found it. For that ticket #3735 comment 11, one of the sentences looks like this:
Just installed v5.2.2 and it just showed the same behavior as v5.3.3 did last week.
When we look at the dump from xmlrpclib.py there is a \x10 between "week" and the "." (so week\x10.).
My colleague stripped out the \x10 via xmlrpclib.py and we're able to parse it just fine now. Here is what the diff looks like:
diff -C 5 /usr/lib/python2.7/xmlrpclib.py.orig /usr/lib/python2.7/xmlrpclib.py
*** /usr/lib/python2.7/xmlrpclib.py.orig 2019-02-28 11:24:27.489455298 -0500
--- /usr/lib/python2.7/xmlrpclib.py 2019-02-28 12:17:56.685429129 -0500
***************
*** 553,563 ****
          if not parser.returns_unicode:
              encoding = "utf-8"
          target.xml(encoding, None)

      def feed(self, data):
!         self._parser.Parse(data, 0)

      def close(self):
          try:
              parser = self._parser
          except AttributeError:
--- 553,569 ----
          if not parser.returns_unicode:
              encoding = "utf-8"
          target.xml(encoding, None)

      def feed(self, data):
!         try:
!             data = data.replace("\x10", "")
!             self._parser.Parse(data, 0)
!         except Exception as e:
!             print(repr(e))
!             print(repr(data))
!             raise

      def close(self):
          try:
              parser = self._parser
          except AttributeError:
So, not a bug with tracboat or trac_to_git, but rather something in our Trac ticket that you cannot see from the Trac WebUI at all. We're fine stripping out \x10 from everything, so we'll try that and see how far we get. Thanks for your tips and assistance.
from tracboat.
We got up to #5485 and then found a couple null characters in the data. Just added this to xmlrpclib.py:
data = data.replace("\x10", "")
! data = data.replace("\x00", "")
And we're moving forward.
from tracboat.
Just to clarify that I understand the process here: I run users.sh to get the users list to plug into mytrac.toml under [tracboat.usermap], run tracboat.sh export to do the Trac json dump, and then run tracboat.sh migrate with the json dump and the updated users in mytrac.toml?
from tracboat.
the thing is that XML does not allow control characters, it's utf-8 based mostly.
out of curiosity, from what trac version you think those were input to your system? and those were submitted as comments via trac web?
from tracboat.
for me no dump format worked (the binary blobs issue i pointed already), so i ran my migration using tracxmlrpc
from tracboat.
Trac 1.0 is what we're on. So I ended up having to filter out more than just \x10 and \x00. I hit several more (\x1e, \x0c, etc.). But I finally made it all the way through. Hurray! Could you please confirm I'm on the right track in this comment? #59 (comment) Thank you!
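Rather than adding one replace() call per offending byte, the whole class can be removed in one pass. The following is a minimal Python 3 sketch, not the patch we actually ran; the name `strip_control_bytes` is hypothetical. It relies on the fact that XML 1.0 permits only tab, line feed and carriage return below 0x20, and that in UTF-8 those low byte values never occur inside a multi-byte sequence:

```python
import re

# C0 control bytes that XML 1.0 forbids: everything below 0x20 except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
_FORBIDDEN = re.compile(br"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_bytes(data: bytes) -> bytes:
    """Remove forbidden control bytes from a raw (UTF-8) XML payload."""
    return _FORBIDDEN.sub(b"", data)


print(strip_control_bytes(b"last week\x10."))  # prints b'last week.'
```

Stripping is only appropriate when the control bytes are known to be junk, as they were in our data; it silently changes the content otherwise.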
from tracboat.
the thing is that XML does not allow control characters, it's utf-8 based mostly.
out of curiosity, from what trac version you think those were input to your system? and those were submitted as comments via trac web?
I missed that last question. Yes, they were comments submitted via Trac Web UI.
from tracboat.
I mean you are on 1.0 right now, but what version was installed (in your best guess) when they were input there? be as specific as you can, i.e. the full version.
you are on right track because control chars are not valid in xml, those must be encoded otherwise, or stripped. which is the appropriate depends why they are there in first place.
if in comment body, then trac should filter out non-text when accepting user input. not sure if they do that now, maybe you want to get deep on this, create curl request that submits those invalid chars, and see what your trac does on that.
https://en.wikipedia.org/wiki/Valid_characters_in_XML
in any case, the problem is in trac and/or tracxmlrpc plugin and should be addressed there.
from tracboat.
you may copy this regex for xml 1.0:
not sure what xml version the xmlrpc is using here. you need to somehow dump out the xmlrpc requests/responses to see the answer.
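For reference, a pattern for XML 1.0 can be derived directly from the "Char" production of the spec. This is a minimal Python 3 sketch and not necessarily the regex referred to above; `INVALID_XML10` and `find_invalid_xml_chars` are hypothetical names. It reports where the invalid characters sit instead of removing them, which helps when inspecting a dumped response:

```python
import re

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, hex code point) for every character XML 1.0 rejects."""
    return [(m.start(), hex(ord(m.group()))) for m in INVALID_XML10.finditer(text)]


print(find_invalid_xml_chars("did last week\x10."))  # prints [(13, '0x10')]
```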
from tracboat.
Sorry about that. Hit the wrong button. I've asked someone who was one of the first employees of our company if he knows what version we may have started out on. Will let you know.
I do know that years ago we switched back-end databases, so that migration could have caused problems. So far I've seen control characters inserted into comments in a few different places and I've seen wiki and ticket attachment names corrupted with control characters (2 of these so far). Something went awry over the past decade.
My guess is that these were not user input issues but rather data corruption during updates or database migration.
from tracboat.
We're pretty sure this was our Trac migration starting with version 0.9: 0.9 -> 0.11 -> 0.12 -> 1.0
from tracboat.
We made it through users, wikis and tickets now and have resolved all issues for those (by replacing the control characters with empty string or by re-uploading attachments with corrupted file names). I did move the new error with project_get to a new ticket since it was unrelated to this.
from tracboat.
I suppose we can close this now since the issue was not with tracboat, but rather with corrupted data in our Trac instance and we've worked around that.
from tracboat.
This issue is when running tracboat export. I made it all the way up to the cli.py _dumps() call and failed on json.dumps:
2019-03-01 20:37:26,687 DEBUG tracboat.trac: milestone_get_all
2019-03-01 20:37:26,687 DEBUG tracboat.trac: milestone_get_all_names
2019-03-01 20:37:58,756 DEBUG tracboat.trac: project_get is collecting authors from project
Traceback (most recent call last):
File "/home/user1/tracboat/VENV/bin/tracboat", line 11, in <module>
load_entry_point('tracboat', 'console_scripts', 'tracboat')()
File "/home/user1/tracboat/src/tracboat/cli.py", line 428, in main
cli(obj={}) # pylint: disable=unexpected-keyword-arg,no-value-for-parameter
File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/user1/tracboat/src/tracboat/cli.py", line 118, in wrapper
return func(*args, **kwargs)
File "/home/user1/tracboat/VENV/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/user1/tracboat/src/tracboat/cli.py", line 268, in export
project = _dumps(project, fmt=format)
File "/home/user1/tracboat/src/tracboat/cli.py", line 44, in _dumps
return json.dumps(obj, sort_keys=True, indent=2, default=json_util.default)
File "/usr/lib/python2.7/json/__init__.py", line 251, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 209, in encode
chunks = list(chunks)
File "/usr/lib/python2.7/json/encoder.py", line 434, in _iterencode
for chunk in _iterencode_dict(o, _current_indent_level):
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/usr/lib/python2.7/json/encoder.py", line 390, in _iterencode_dict
yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
from tracboat.
Besides us doing a pickle.dump(obj, open("file.p", "wb"), pickle.HIGHEST_PROTOCOL) right before the json.dumps call, any other advice you can give us? Thank you!
from tracboat.
The json.dumps call is failing on an attachment: a png file that is being written as a utf8 string. We assume attachments are supported. Why would tracboat try to write a binary attachment as a string?
from tracboat.
So far it looks like this is only happening with .png files.
from tracboat.
We just hit a 7zip file (.7z). So, it appears that binary attachments to wiki pages are not being processed correctly.
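One common way to get binary data through JSON is to base64-encode it in a `default` hook. The following is a minimal Python 3 sketch under stated assumptions, not how tracboat handles attachments: `encode_binary`, `dumps_with_binary`, the `"__base64__"` marker key and the nested layout in the usage lines are all hypothetical. Note that the hook is only consulted in Python 3, where bytes is a distinct type; judging by the traceback above, Python 2's encoder treats a str as text and tries to decode it as UTF-8 before `default` is ever called.

```python
import base64
import json


def encode_binary(value):
    """json.dumps default hook: wrap raw bytes as base64 text."""
    if isinstance(value, (bytes, bytearray)):
        return {"__base64__": base64.b64encode(bytes(value)).decode("ascii")}
    raise TypeError("not JSON serializable: %r" % type(value))


def dumps_with_binary(obj):
    """Serialize obj to JSON, base64-encoding any bytes values found."""
    return json.dumps(obj, sort_keys=True, indent=2, default=encode_binary)


# Hypothetical layout, for illustration only.
png = b"\x89PNG\r\n\x1a\n\xd0"
text = dumps_with_binary({"attachments": {"logo.png": {"data": png}}})
restored = json.loads(text)["attachments"]["logo.png"]["data"]["__base64__"]
assert base64.b64decode(restored) == png
```

The reader of such a dump has to reverse the wrapping, so both export and import sides must agree on the marker key.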
from tracboat.
We decided to strip all wiki attachments for now. We just want a dump that we can import into our gitlab sandbox instance so we can get back to our pilot, but in our final run we will need wiki binary attachments working.
Here is the script we used to strip attachments from our pickle.dump:
#!python
import pickle
import json

# Load the pickle written right before the failing json.dumps call.
x = pickle.load(open("pickle_dump.p", "rb"))

def check_and_strip(o, path=[]):
    if isinstance(o, dict):
        print("dict")
        # Drop the binary attachments that json.dumps cannot encode.
        o.pop("attachments", None)
        for k, v in o.items():
            check_and_strip(k, path + ['key:' + repr(k)])
            check_and_strip(v, path + ['value:' + repr(k)])
    elif isinstance(o, list):
        print("list")
        for i, j in enumerate(o):
            check_and_strip(j, path + ['list:' + str(i)])
    elif isinstance(o, str):
        #print("str")
        # Leftover from debugging: report where the 0xd0 byte shows up.
        if '\xd0' in o:
            print("found bad char in str at {}".format(path))
            raise ValueError()

print("checking")
check_and_strip(x)
The above script is not cleaned up and started out as our script to find the issue (binary attachments to wikis), so forgive the irrelevant code.
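For anyone who only needs the "find the issue" half, here is a tidier Python 3 sketch of the same walk. It is an illustration rather than the script we ran: the name `find_non_utf8` is hypothetical, and it assumes the dump is made of nested dicts, lists and byte strings. It yields the path to every bytes value that is not valid UTF-8 instead of looking for one specific byte:

```python
def find_non_utf8(obj, path=()):
    """Yield the path (tuple of keys/indexes) to each bytes value that is not valid UTF-8."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_non_utf8(value, path + (key,))
    elif isinstance(obj, (list, tuple)):
        for index, item in enumerate(obj):
            yield from find_non_utf8(item, path + (index,))
    elif isinstance(obj, bytes):
        try:
            obj.decode("utf-8")
        except UnicodeDecodeError:
            yield path


# Hypothetical data, for illustration only.
dump = {"wiki": {"Page": {"attachments": {"a.png": b"\xd0\x00"}, "text": b"ok"}}}
print(list(find_non_utf8(dump)))  # prints [('wiki', 'Page', 'attachments', 'a.png')]
```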
from tracboat.
btw, we are using Python2. Do you typically use Python 2 or 3? Wondering if 3 handles binary attachments better.
from tracboat.
We're also going to try importing/migrating our pickle dump that still has binary attachments.
from tracboat.
The pickle dump would not import; it complained about the binary attachments (non-utf8 chars).
The json dump failed because of our removed attachments.
from tracboat.
Looks like the binary attachment json.dumps issue is already known: #9. I guess our solution to exclude attachments for now will have to do during our piloting, but we need that fixed before we run our official migration.
from tracboat.
We did get our json dump to import into gitlab successfully, so that's good news. It did not import the wikis, though, which is bad news. I thought they would import automatically, but that was not the case. Trying to figure out what to do about that.
from tracboat.
I was able to get the attachments to dump via this solution: #6 (comment). At least it didn't fail when using this solution. Since issue #6 is tracking the attachment issue and I've resolved, in one way or another, all of my utf8 related issues, I'm closing this ticket.
from tracboat.