Comments (15)
@shivam-tripathi Sure, go ahead :-) I already did a basic comparison of MD5 and SHA1. MD5 was faster and looks suitable for our use case. Let me know if you need any assistance from my side.
from briefcase.
@shivam-tripathi Sure, you can look into it. I don't think anyone is currently working on it. I took a quick look, and it seems the issue comes from the file ConvertToCSV.java, in the method emitSubmissionCsv, in the case org.javarosa.core.model.Constants.DATATYPE_BINARY. Correct me if I'm wrong. Hope that helps.
@joeflack4 Hi!
Presently, we are trying to remove redundancy when you export data collected with Briefcase after pulling it from the server. This is essentially an offline process.
However, duplication while pulling the data off the server remains an issue. I hope a solution surfaces in the future. If I am correct, it needs to be handled at the Aggregate level, since in Briefcase we cannot determine whether a media file is a duplicate before fetching it.
@yanokwa Tested on OS X 10.12 with the Birds form.
There were two images with the same timestamp in two different instance folders.
After exporting, one of them was renamed with the suffix -2 in the media folder.
@yanokwa: Confirmed the same thing as @rclakmal on Linux Ubuntu 16.04.
It appears that when Briefcase encounters two separate instance media files with the same timestamp, it creates a duplicate of only the one already copied.
I renamed one of the instance media files to match one existing in another instance. It also appears that when re-fetching a form, if the instance media has been tampered with, Briefcase doesn't verify the contents of the instance media folder and skips it.
I looked at the code and I was mistaken (see the crossed-out remarks in the comment above). Making changes to the instances folder makes no difference, as names are read from the XML response. If a file is not found, it is skipped.
The redundant image in the comment above (which unfortunately caused the confusion) was due to the file actually being present twice.
The code handles files with the same timestamp by adding a suffix.
I confirm what @shivam-tripathi said earlier. When the code encounters files with the same time stamp it solves the problem by adding an incremental suffix.
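For anyone following along, here is a minimal sketch of the incremental-suffix behavior described above. It is not the actual Briefcase code: the class `SuffixNamer`, the method `claim`, and the in-memory set standing in for the export media folder are all hypothetical names used only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class SuffixNamer {
    // Hypothetical stand-in for the names already present in the export media folder.
    private final Set<String> used = new HashSet<>();

    /**
     * Returns fileName unchanged if it is still free; otherwise appends
     * -2, -3, ... before the extension until a free name is found.
     */
    public String claim(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot < 0 ? fileName : fileName.substring(0, dot);
        String ext = dot < 0 ? "" : fileName.substring(dot);

        String candidate = fileName;
        int n = 2;
        // Set.add returns false when the name is already taken.
        while (!used.add(candidate)) {
            candidate = base + "-" + n + ext;
            n++;
        }
        return candidate;
    }
}
```

With this sketch, claiming the same name twice yields the original name first and then the `-2` variant, which matches what was observed in the media folder.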
Thanks so much for the confirmation, gentlemen! I'm closing this issue because this is exactly the behavior we want.
Actually. Instead of adding image-2.jpg, I wonder if we can check the MD5 hash and only append a number if those files are actually different. What do you think @shivam-tripathi @icemc @rclakmal?
@yanokwa We can store MD5 hashes and file paths in a hash table, with the MD5 hash as the key. This will allow us to skip the duplicates. As far as I know, there is a theoretical possibility, however small, that two different files could return the same hash. I think we can ignore this for practical purposes?
Don't you think we should provide this as an option? It introduces extra overhead to the export process, and with a big form result set the delay could be noticeable.
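A rough sketch of the hash table idea, assuming file contents are available as byte arrays. The class `MediaIndex` and its methods are hypothetical and are not part of Briefcase; only `MessageDigest` and `Map.putIfAbsent` are standard Java APIs.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class MediaIndex {
    // MD5 hex digest -> path of the first file seen with that content.
    private final Map<String, String> pathByHash = new HashMap<>();

    /** Computes the MD5 digest of the content as a 32-character lowercase hex string. */
    public static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Returns the path that already holds identical content (a duplicate to skip),
     * or null if the content is new, in which case the given path is recorded.
     */
    public String findOrRegister(byte[] content, String path) {
        return pathByHash.putIfAbsent(md5Hex(content), path);
    }
}
```

A non-null return value means the file can be skipped. Note that this version hashes every file, which is exactly the overhead being discussed.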
@yanokwa Following what @rclakmal said, an MD5 hash will solve the problem (in most cases) but will decrease performance when exporting forms with a large number of instances. It is up to the community to decide whether this extra cost is worth it.
I'd rather we decide what is best than add options to the app. I think this should be pretty fast because you'd only be doing the MD5 check when you have a matching filename, no? Either way, this is something that can be tested empirically if either of you are up for it.
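To make the "only hash on a matching filename" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the class `CollisionAwareExporter`, the in-memory map standing in for the media folder, and the `hashCount` counter are hypothetical and are not taken from the Briefcase code.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CollisionAwareExporter {
    // Exported file name -> content (hypothetical in-memory stand-in for the media folder).
    private final Map<String, byte[]> exported = new HashMap<>();
    // Number of MD5 digests computed so far, kept only to show the cost.
    public int hashCount = 0;

    private byte[] md5(byte[] content) {
        try {
            hashCount++;
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns the name under which the content ends up in the export. */
    public String export(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String base = dot < 0 ? fileName : fileName.substring(0, dot);
        String ext = dot < 0 ? "" : fileName.substring(dot);

        String candidate = fileName;
        byte[] incoming = null; // computed lazily, only on a name collision
        int n = 2;
        while (exported.containsKey(candidate)) {
            if (incoming == null) {
                incoming = md5(content);
            }
            if (Arrays.equals(md5(exported.get(candidate)), incoming)) {
                return candidate; // same name and same content: reuse the existing file
            }
            candidate = base + "-" + n + ext; // same name, different content: try next suffix
            n++;
        }
        exported.put(candidate, content);
        return candidate;
    }
}
```

Files with unique names never trigger a digest in this sketch, so the extra cost is limited to actual name collisions. Whether that is fast enough in practice would still need to be measured on a large export.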
With everyone's permission, I would be glad to look into this.
Thanks for your efforts with this. Some of our partners have very limited connections, so duplicates can be a huge problem, particularly with our fork of Collect, which uses form linking, as some images are shared between our household and individual questionnaires.