joeyaurel / python-gedcom

This project forked from madprime/python-gedcom

Python module for parsing, analyzing, and manipulating GEDCOM files

Home Page: https://gedcom.joeyaurel.dev

License: GNU General Public License v2.0

Python 100.00%

ancestry gedcom gedcom-parser genealogy myheritage parser python

python-gedcom's Introduction

Hi 👋 I am Joey.

A passionate full-stack developer and DevOps engineer from Germany

🐳 I’m currently learning about Docker, Kubernetes, Terraform and all things DevOps.
🌱 I’m currently learning about psychology, mass psychology, marketing and math.

python-gedcom's Issues

error when trying to process GEDCOM file with python-gedcom v. 0.2.0.dev

I've downloaded and installed python-gedcom v. 0.2.0.dev.

I run it as follows:

from gedcom import Gedcom

file_path = '7q4425_661384sh82b72570424am5.ged' # Path to your `.ged` file
gedcom = Gedcom(file_path)

print(gedcom.element_list())

This GEDCOM file starts with

0 HEAD
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED

and I get the following error:

Traceback (most recent call last):
  File "script.py", line 4, in <module>
    gedcom = Gedcom(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 148, in __init__
    self.__parse(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 224, in __parse
    last_element = self.__parse_line(line_number, line.decode('utf-8'), last_element)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 263, in __parse_line
    raise SyntaxError(error_message)
SyntaxError: Line `1` of document violates GEDCOM format
See: http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gctoc.htm

What am I doing wrong? This GEDCOM file was exported from MyHeritage recently.

UPD: this is with Python 3.6 under Windows 10 x64

Words for structure of Gedcom file

It seems the names of classes and methods were updated in version 1.0.0, and it is good to use Elements for the parts of a Gedcom file. This differentiates between Records, which are about individuals or marriages, and Elements, which represent parts of a Gedcom file.
But why keep the names Child and Parent for Elements that are connected to a particular Element? For a genealogy library, these two words have quite specific meanings that do not necessarily describe the structure of a Gedcom file. One option would be to use Sub and Super, e.g. Element, SubElement and SuperElement.
I would like to hear your thoughts on this.
Great software by the way!

Adding individual documentation for `Gedcom` and `Element` classes

Depends on #15

enhancement: comparator / merger

I have multiple GEDCOM files with multiple edits. I would love to have a way to compare them (for example via __eq__) and maybe merge updates.

Access to NOTE fields

This gedcom python library seems to work fine (thanks a lot) but provides functions to access only some items of the gedcom files. In particular, I don't see how to retrieve the NOTE field for births, deaths, marriages etc., which I use a lot in my gedcoms.
Is there a generic function for accessing these other fields, or at least a specific function to access the NOTEs? Or should I look for another gedcom library, although I don't know which ones are really still maintained?

KR
Frç

Broken link to a GEDCOM format specification

With version 0.2.0, opening a GEDCOM file that was downloaded from myheritage.com only today produces this error:

    gedcom = Gedcom(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 148, in __init__
    self.__parse(file_path)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 224, in __parse
    last_element = self.__parse_line(line_number, line.decode('utf-8'), last_element)
  File "C:\Python3\lib\site-packages\gedcom\__init__.py", line 263, in __parse_line
    raise SyntaxError(error_message)
SyntaxError: Line `1` of document violates GEDCOM format
See: http://homepages.rootsweb.ancestry.com/~pmcbride/gedcom/55gctoc.htm

This link displays 403 Forbidden when I try to click it

The first line of the file is:

0 HEAD

Reading non UTF-8 files

I just noticed that the parser only accepts files which are UTF-8. It would be useful to pre-check the file's encoding and convert it to UTF-8 automatically.
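One way such a pre-check could look is sketched below. It is only an illustration using the standard library: the function name `decode_gedcom_bytes` and the fallback list are assumptions, not part of python-gedcom. Note that older GEDCOM files may declare ANSEL, for which Python ships no codec, so that case is not covered here.

```python
def decode_gedcom_bytes(raw, fallbacks=("cp1252", "latin-1")):
    """Decode raw GEDCOM bytes, trying UTF-8 first and then the fallbacks.

    Hypothetical helper for illustration. Returns (text, encoding_used).
    """
    # utf-8-sig behaves like utf-8 but also strips a leading BOM if present.
    try:
        return raw.decode("utf-8-sig"), "utf-8-sig"
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte, so this is only reached with custom fallbacks.
    raise ValueError("could not decode input with any candidate encoding")


# A name with an accented letter, encoded the way many Windows tools write it.
sample = "1 NAME Ren\u00e9 /Dupont/\n".encode("cp1252")
text, used = decode_gedcom_bytes(sample)
print(used, repr(text))
```

The decoded text can then be re-encoded as UTF-8 before it is handed to the parser. The fallback order is a guess; a file's `1 CHAR` header line, when present, is a better hint than trial and error.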

Can't install?

Hi,

This looks like a cool piece of code - so, first of all, thank you :)

I'm probably doing something wrong, but I installed python-gedcom as per instructions, ie. pip3 install python-Gedcom. It shows as being installed correctly. When I use your example code, I receive the following error message:

Traceback (most recent call last):
  File "Gedcom.py", line 3, in <module>
    from gedcom.element.individual import IndividualElement
  File "/home/james/Coding/gedcom/gedcom.py", line 3, in <module>
    from gedcom.element.individual import IndividualElement
ModuleNotFoundError: No module named 'gedcom.element'; 'gedcom' is not a package

I also commented out the: '#from gedcom.element.individual import IndividualElement', to use the strict parser as the source gedcom is an ancestry gedcom 5.5 file, and I receive the same message.

useful stats:
python version 3.7.3
Arch Linux ARM aarch64 (Odroid N2 SBC)
I'm quite new to python programming!

Any thoughts/advice would be gratefully received - thank you!

Update the project structure

Example project for this: https://github.com/pypa/sampleproject

trying to separate gedcom into maternal and paternal, error

Hello, I'm trying to separate out the maternal vs paternal lines in my gedcom. Here's the code as I have it now:

gedcom = Gedcom(file_path)
all_records = gedcom.get_root_child_elements()
home_person = gedcom.get_root_element()
parents = gedcom.get_parents(home_person)

and I get the following output:

Traceback (most recent call last):
  File "C:\Users\saman\Documents\ancestry\python-gedom-playground\bridgets.py", line 15, in <module>
    parents = gedcom.get_parents(home_person)
  File "C:\Python37\lib\site-packages\gedcom\__init__.py", line 456, in get_parents
    raise ValueError("Operation only valid for elements with %s tag." % GEDCOM_TAG_INDIVIDUAL)
ValueError: Operation only valid for elements with INDI tag.

What am I doing wrong?

pypi has the madprime version hosted.

pip install python-gedcom installs the 2012 code.

Allow parser to accept files directly, not just the filepath

Currently the Parser parse_file function accepts only a file path
parse_file(self, file_path, strict=True)
but it would be interesting to be able to pass data directly, to be compatible with a cloud storage service.

For some context, I am trying to use the package in a web application on GCS, but running into a fundamental issue I described on Stack Overflow.

Looking at the code, it appears it could work if we could pass gedcom_file as a data object directly to be parsed line by line with self.__parse_line, rather than requiring the file be loaded from disk using gedcom_file = open(file_path, 'rb')

But maybe there's another work-around I haven't thought of?
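As a rough sketch of the idea, the line-by-line loop does not need a path at all: any binary file-like object can be iterated. The helper name `iter_decoded_lines` and the sample payload below are made up for illustration and are not the library's API. Another issue in this list quotes a `parse(self, gedcom_stream, strict=True)` signature, which suggests a stream-based entry point may exist in some versions; check the installed version before relying on it.

```python
import io


def iter_decoded_lines(stream, encoding="utf-8-sig"):
    """Yield (line_number, text) from any binary file-like object.

    Hypothetical helper: it mirrors what a parser does after open(path, 'rb'),
    but works on in-memory data as well.
    """
    for line_number, raw in enumerate(stream, start=1):
        yield line_number, raw.decode(encoding).rstrip("\r\n")


# Bytes as they might arrive from a cloud storage download (made-up data).
payload = b"0 HEAD\r\n1 GEDC\r\n2 VERS 5.5.1\r\n0 TRLR\r\n"

lines = list(iter_decoded_lines(io.BytesIO(payload)))
for number, text in lines:
    print(number, text)
```

Wrapping downloaded bytes in `io.BytesIO` avoids writing a temporary file, which is usually the awkward part on hosted platforms.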

Suffixes and get_name

Any recommendations for handling suffixes in names? Looks like the parser is just splitting on /, but I could be missing something.

Currently the following names both return Jane and Lane:

1 NAME Jane /Lane/ 
1 NAME Jane /Lane/ Jr
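For what it is worth, the GEDCOM NAME value allows text after the closing slash, so a suffix can be recovered by matching three parts instead of splitting on `/`. The sketch below is standalone; `split_name` is a hypothetical helper, not something the library provides.

```python
import re

# given text, /surname/, trailing text (suffix); each part may be empty.
NAME_PATTERN = re.compile(r"^(?P<given>[^/]*)/(?P<surname>[^/]*)/(?P<suffix>.*)$")


def split_name(value):
    """Split a NAME value into (given, surname, suffix)."""
    match = NAME_PATTERN.match(value)
    if match is None:
        # No slashes at all: treat everything as the given name.
        return value.strip(), "", ""
    return tuple(part.strip() for part in match.group("given", "surname", "suffix"))


print(split_name("Jane /Lane/ "))
print(split_name("Jane /Lane/ Jr"))
```

Files that carry an explicit `2 NSFX` sub-line would be a more reliable source for the suffix than the NAME value itself, when it is present.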

Separate `init.py` into individual files (`Element` and `Gedcom`)

Depends on #18

[Bug] Non strict loading not working

Describe the bug
I have an example file whose encoding crashes the loading process.
gedcom_parser.parse_file(file_path, False) # Disable strict parsing
This line triggers the crash.
one_person_myheritage.rename to ged.log
The example file is attached.

To Reproduce

Download one_person_myheritage.rename to ged.log
Rename the file to one_person_myheritage.ged

Run this python lines:

gedcom_parser = Parser()    
gedcom_parser.parse_file( "one_person_myheritage.ged"  , False) # Disable strict parsing

The exception is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 206: invalid continuation byte

Expected behavior
With strict parsing disabled (False), this exception should not occur.

Additional context
The fix is expected in this line:
last_element = self.__parse_line(line_number, line.decode('utf-8-sig'), last_element, strict)
in the function def parse(self, gedcom_stream, strict=True):

Fix `setup.py` outputting correct markdown when reading the `README.md`

Depends on #18

Allow parsing files with UTF-8 BOM

I don't know what the gedcom 5.5 format says about this, but for the sake of simplicity, and because many text editors add it by default, this code should detect and ignore a UTF-8 BOM at the start of the file.

It is super complicated to understand why the loading failed because it only says: Line 1 of document violates GEDCOM format 5.5 and nothing more. Because these bytes are meant to be ignored, you can't see the issue on line 1 unless you load the file in python and print a representation of said line.

One option is to use the utf-8-sig codec instead.
https://docs.python.org/3/library/codecs.html#module-encodings.utf_8_sig
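The difference between the two codecs is easy to see on a single line. The snippet below is a small demonstration, not the parser's code: `LINE_START` is a deliberately simplified pattern standing in for whatever line check a parser applies.

```python
import re

# Simplified stand-in for a parser's line check: a level number, then a tag.
LINE_START = re.compile(r"^\d+ \w+")

raw_line = b"\xef\xbb\xbf0 HEAD\r\n"  # first line of a file saved with a BOM

with_bom = raw_line.decode("utf-8")         # keeps the invisible U+FEFF
without_bom = raw_line.decode("utf-8-sig")  # strips it

print(repr(with_bom))  # repr makes the hidden character visible
print(bool(LINE_START.match(with_bom)))
print(bool(LINE_START.match(without_bom)))
```

Printing the `repr` of the first line is also a quick way to diagnose this kind of failure, since the BOM does not show up in a normal print.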

Date handling

Dates may be specified as before, after, about, or a range. It's not clear how the year extraction methods handle such cases.

Perhaps add an additional method that returns the modifier.
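A minimal sketch of what such a method might do is shown below. It works on the raw DATE value and covers only the common keywords (ABT, CAL, EST, BEF, AFT, BET ... AND, FROM ... TO); the function name `split_date_modifier` and the return shape are assumptions made for this example.

```python
import re

# Single-keyword modifiers from the GEDCOM date grammar.
SINGLE_MODIFIERS = {"ABT", "CAL", "EST", "BEF", "AFT", "FROM", "TO"}

# Two-date forms: "BET x AND y" and "FROM x TO y".
PAIR_PATTERNS = (
    ("BET", re.compile(r"^BET\s+(.+?)\s+AND\s+(.+)$", re.IGNORECASE)),
    ("FROM", re.compile(r"^FROM\s+(.+?)\s+TO\s+(.+)$", re.IGNORECASE)),
)


def split_date_modifier(value):
    """Return (modifier, dates): modifier is None for a plain date."""
    text = value.strip()
    for modifier, pattern in PAIR_PATTERNS:
        match = pattern.match(text)
        if match:
            return modifier, [match.group(1), match.group(2)]
    keyword, _, rest = text.partition(" ")
    if keyword.upper() in SINGLE_MODIFIERS and rest.strip():
        return keyword.upper(), [rest.strip()]
    return None, [text]


for example in ("ABT 1850", "BET 1850 AND 1860", "BEF 3 MAR 1901", "25 JAN 1780"):
    print(example, "->", split_date_modifier(example))
```

Returning the dates as a list keeps ranges and single dates in one shape, so a year extraction method could decide explicitly what to do with a two-date value instead of silently picking one.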

invalid continuation byte

With a .ged file exported from geneanet, I get an encoding problem error:

21 input = input[3:]
22 prefix = 3
---> 23 (output, consumed) = codecs.utf_8_decode(input, errors, True)
24 return (output, consumed+prefix)
25

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 20: invalid continuation byte

Is this an error in the ged?

Hello,

I'm trying to use this module,

from gedcom import Gedcom
file_path = '/home/bokkie/Dropbox/Documents/Genealogie/1776\ Ballast.ged'
f = Gedcom(file_path)

but I get the following trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bokkie/.local/lib/python3.5/site-packages/gedcom/__init__.py", line 148, in __init__
    self.__parse(file_path)
  File "/home/bokkie/.local/lib/python3.5/site-packages/gedcom/__init__.py", line 224, in __parse
    last_element = self.__parse_line(line_number, line.decode('utf-8'), last_element)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 9: invalid continuation byte

Is there an issue with my gedcom file?

[Feature request] 5.5.n Compatibility?

Is your feature request related to a problem? Please describe.
5.5.1 has become the most commonly used standard. That said, Tamura Jones's 5.5.5, which is fully backwards compatible with 5.5.1, is likely to supplant it going forward. It would be great to see 5.5.1/5.5.5 parsing implemented. I can't seem to find discussion of this in the docs, so I am posting this as part query (Do you plan to...) and part request (Man, it would be great if...).

Describe the solution you'd like
See above.

Additional context
https://www.gedcom.org/gedcom.html

can't save gedcom file

Hello,
I am trying to read in a gedcom file, modify the data and then save a gedcom file.
It works OK until I try to save it using save_gedcom(): I only get an empty output file (0 bytes).
I assume I have to open a file using open(filename, "w") and then pass it to save_gedcom. Is this supposed to work?
It does not even work when I try to save the original, unmodified gedcom file, as in the code below.

file_in = "/home/xy/test1.ged"
file_out = "/home/xy/test1_out2.ged"

from gedcom.element.individual import IndividualElement
from gedcom.element.object import ObjectElement
from gedcom.parser import Parser

gedcom_parser = Parser()
gedcom_parser.parse_file(file_in)

outfile = open( file_out, "w" )
gedcom_parser.save_gedcom(outfile)
outfile.close()

Support for MyHeritage and Ancestry generated GEDCOM files

It seems that this is one of the only actively developed gedcom parsers in python these days (2018). Ancestry seems to produce gedcom files that break the parsing:

python3 parse_gedcom.py 
Traceback (most recent call last):
  File "parse_gedcom.py", line 4, in <module>
    gedcom = Gedcom(file_path)
  File "/usr/local/lib/python3.7/site-packages/python_gedcom-0.2.0.dev0-py3.7.egg/gedcom/__init__.py", line 148, in __init__
  File "/usr/local/lib/python3.7/site-packages/python_gedcom-0.2.0.dev0-py3.7.egg/gedcom/__init__.py", line 224, in __parse
  File "/usr/local/lib/python3.7/site-packages/python_gedcom-0.2.0.dev0-py3.7.egg/gedcom/__init__.py", line 262, in __parse_line
SyntaxError: Line `65692` of document violates GEDCOM format 5.5

the lines in question are:

4 TEXT DOREY – Ethel Marie, 84, of Liverpool, passed away peacefully on Wednesday, July 27, 2016 in Queens Manor, Liverpool.
Born in Western Head, Queens County, she was a daughter of the late William an
5 CONC d Hilda (Guest) Wolfe.
Ethel was a former waitress at the Mersey Hotel in the late forties. She was a member of the coffee bowling league for thirty years and was a volunteer with the Canadian Red Cr

Notice the carriage return in the TEXT data that puts the next line "Born in Western Head..." onto a line by itself.

I believe that this breaks the gedcom format (though I have not researched this extensively in the spec). That being said, Ancestry is one of the largest genealogy providers and I think it would be ideal to have a parser that can parse the output from this provider.

I'm wondering if there is any interest in handling this use case here? If so, I can try to work up a patch and submit a PR.

I think there is a need to have a gedcom parser that can read "real world" gedcom files.
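One possible pre-processing step, sketched here under stated assumptions, is to turn each stray line into a `CONT` line attached to the last structure that was not itself a `CONC`/`CONT`. The function `repair_stray_lines` is hypothetical and not part of the library. It also has a known blind spot: a stray line that happens to start with a number and a word (for example "84 years ...") looks like a valid GEDCOM line and is left untouched.

```python
import re

# level, optional @pointer@, tag
LEVEL_LINE = re.compile(r"^\s*(\d+) (@[^@ ]+@ )?(\w+)")


def repair_stray_lines(lines):
    """Rewrite lines lacking a level number as CONT lines of the previous value."""
    repaired = []
    parent_level = None  # level of the last line that was not CONC/CONT
    for line in lines:
        match = LEVEL_LINE.match(line)
        if match:
            level, tag = int(match.group(1)), match.group(3)
            if tag not in ("CONC", "CONT"):
                parent_level = level
            repaired.append(line)
        elif parent_level is None:
            # Nothing to attach to; keep it so a strict parser still reports it.
            repaired.append(line)
        else:
            repaired.append("%d CONT %s" % (parent_level + 1, line))
    return repaired


broken = [
    "4 TEXT DOREY, Ethel Marie, 84, of Liverpool.",
    "Born in Western Head, she was a daughter of the late William an",
    "5 CONC d Hilda (Guest) Wolfe.",
    "Ethel was a former waitress at the Mersey Hotel.",
]
fixed = repair_stray_lines(broken)
print("\n".join(fixed))
```

`CONT` is used rather than `CONC` because the stray line follows a real line break in the exported text, which is what `CONT` represents.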

Specifications

The GEDCOM specification links point to Chronoplex; they should instead point to https://gedcom.io/specs/

gedcom datestrings to dates, including for ISO 8601 dates

Some gedcom files have dates formatted in ISO 8601 format. Some genealogy software, for example Gramps, parses this correctly.
It would be nice if python-gedcom could parse datestrings to date objects, including datestrings in ISO 8601 format. Then several methods for working with dates would become available without having to be implemented again.
The necessary functionality is available in Python 3.7 but forcing an upgrade to this version seems premature. Using the package python-dateutil seems to be a viable option. It can do all of this and is available for Python 2.7 and up.
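To make the request concrete, here is a standard-library sketch rather than the python-dateutil approach mentioned above. It assumes Python 3.7+ for `date.fromisoformat`, handles only exact dates, and `parse_gedcom_date` is a made-up name. `%b` depends on the current locale, so the month abbreviations are assumed to be English.

```python
from datetime import date, datetime


def parse_gedcom_date(value):
    """Return a datetime.date for exact dates, or None if the value is not one."""
    text = value.strip()
    # ISO 8601 calendar dates such as 1780-01-25.
    try:
        return date.fromisoformat(text)
    except ValueError:
        pass
    # Standard GEDCOM exact dates such as "25 JAN 1780".
    try:
        return datetime.strptime(text.title(), "%d %b %Y").date()
    except ValueError:
        return None


print(parse_gedcom_date("1780-01-25"))
print(parse_gedcom_date("25 JAN 1780"))
print(parse_gedcom_date("ABT 1850"))
```

Returning `None` for anything that is not an exact date keeps approximate and partial values out of date arithmetic; those would need a separate representation.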

Bugs with get_name and get_marriages

I've created a PR (#8) that addresses the following bugs:

get_marriages. This method returns a Tuple that has the date and location. It was being assembled incorrectly. If there was a single marriage on Oct 5, 2000 in location Canada, then it was returning two marriages one with date="Oct 5, 2000", location="" and a second marriage with date="", location="Canada".
get_name was iterating through all the NAME records and returning the last one found. Typically the first NAME record is the "preferred" record, and it should be what is returned with get_name

Difference between `get_parent_element()` and `get_parents()`

Hi,
What is the purpose of the function
record.get_parent_element()

It seems we can only use: parents = gedcom.get_parents(record)

given/surname_match methods should be case insensitive

given_match and surname_match are currently case sensitive substring searches for their respective fields.

I propose that these should be case insensitive searches. It does not seem reasonable that a user would intend to rely on case sensitivity to distinguish between (for example)

John Smith
and
John SMITH

I propose case insensitivity because there are essentially two schools of thought on storing names. The "traditional" school of thought says to store all names in uppercase for ease of readability, rather than using the typical first-letter capitalization. Matching case-insensitively makes hits depend on the content of the fields rather than on their formatting.

Backwards compatibility is not broken; however, this could change the results users expect.

I will submit a patch for this, along with other changes, in a PR.
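The proposed behaviour amounts to normalising both sides before the substring test. The function below is a standalone illustration with a hypothetical name, not the patch itself; the real methods live on the library's classes and take different arguments.

```python
def substring_match_ci(field_value, query):
    """Case-insensitive substring test.

    casefold() is used instead of lower() because it also handles
    characters such as the German sharp s.
    """
    return query.casefold() in field_value.casefold()


print(substring_match_ci("SMITH", "Smith"))
print(substring_match_ci("Smith", "SMI"))
print(substring_match_ci("Smith", "Jones"))
```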

[Bug] 'FamilyElement' object has no attribute 'is_family'

When calling get_families, you receive the exception "'FamilyElement' object has no attribute 'is_family'"

To Reproduce
Steps to reproduce the behavior: (Example)

Download attached gedcom_bug_report.py.txt
Rename gedcom_bug_report.py.txt to gedcom_bug_report.py
Download attached korets_one_person.ged.txt
Rename korets_one_person.ged.txt to korets_one_person.ged
Put the latest gedcom folder in the same folder
Run gedcom_bug_report.py
See error "'FamilyElement' object has no attribute 'is_family'"

Expected behavior
Return all families without errors

Version
Latest source version (not pip version)

gedcom_bug_report.py.txt
korets_one_person.ged.txt

enhancement: return all_names

While the following snippet can return all names, having it built into IndividualElement would help:

all_names = [a.get_value() for a in individual.get_child_elements() if a.get_tag() == gedcom.tags.GEDCOM_TAG_NAME]

`get_birth_data` and `get_death_data` returning last `DATE` and `PLAC` found

get_birth_data right now will iterate through all BIRT records and then retrieve the DATE, PLAC and Sources for each BIRT.

This is an example record - the individual has two BIRT records with different values.

1 BIRT
2 DATE 25 Jan 1780
2 PLAC Liverpool, Queens, Nova Scotia, Canada
1 BIRT
2 DATE 1781
2 PLAC Nova Scotia, Canada

Currently the logic in python-gedcom is to iterate through each BIRT and then assign any found DATE and PLAC to the date and place variables. In the above example it would return "1781", "Nova Scotia, Canada"

Typically the first entry in the gedcom file is the "Preferred" entry. At least this is how it is done with Ancestry.com.

I wonder if we should return the FIRST BIRT record we find instead of the LAST one? I think returning all the Sources for all BIRT elements is fine as is.

BTW, all of the above applies for DEAT too. If we agree to make this change then I will submit a PR.
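The "first wins" behaviour can be sketched on raw record lines as follows. This is not the library's implementation: `first_event` is a hypothetical helper that scans the lines of one individual record and stops at the end of the first matching event block.

```python
def first_event(record_lines, event_tag="BIRT"):
    """Return (date, place) from the first level-1 block with the given tag."""
    date, place = "", ""
    inside = False
    for line in record_lines:
        level, _, rest = line.strip().partition(" ")
        tag, _, value = rest.partition(" ")
        if level == "1":
            if inside:
                break  # the first block has ended; ignore later ones
            inside = (tag == event_tag)
        elif inside and level == "2":
            if tag == "DATE" and not date:
                date = value
            elif tag == "PLAC" and not place:
                place = value
    return date, place


record = [
    "1 SEX M",
    "1 BIRT",
    "2 DATE 25 Jan 1780",
    "2 PLAC Liverpool, Queens, Nova Scotia, Canada",
    "1 BIRT",
    "2 DATE 1781",
    "2 PLAC Nova Scotia, Canada",
]
print(first_event(record))
print(first_event(record, "DEAT"))
```

The same function works for `DEAT` by passing a different tag, which matches the point that the change applies to both.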

element.get_census_data() and element.get_occupation() not working

These two methods are not returning any results, even though other software is able to read my .ged file correctly.

[Feature request] get INDI ID number

Is your feature request related to a problem? Please describe.
Hi, I want to get the ID number of INDI (like @I5465477880020118059@) in order to give a kind of unique tag for each name.

0 @I5465477880020118059@ INDI
1 NAME Chee Lin /TAN 陳/
2 GIVN Chee Lin
1 SEX M
1 EVEN
2 TYPE Misc Event
1 BIRT
2 DATE 1944

Describe the solution you'd like
I would like to know if there is some way to get the ID number of INDI (like @I5465477880020118059@)

I have tried this, but only got None for INDI_ID_number

from gedcom.element.individual import IndividualElement
from gedcom.parser import Parser
import gedcom

def generate_names_json(file_name):
    file_path = file_name
    gedcom_parser = Parser()
    gedcom_parser.parse_file(file_path)
    root_child_elements = gedcom_parser.get_root_child_elements()

    count_individual = 0

    for element in root_child_elements:
        curr_element_dict = {}
        if isinstance(element, IndividualElement):
            if element.get_tag() == gedcom.tags.GEDCOM_TAG_INDIVIDUAL:
                curr_element_dict['INDI_ID_number'] = element.get_value()
            curr_element_dict['name'] = element.get_name()
    print(curr_element_dict)

Any relevant ideas would be very helpful. Thanks very much for your help!
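A likely reason for the `None` is that, in a GEDCOM line, the identifier sits in the pointer position before the tag and is not the line's value. The standalone sketch below splits a line into its four parts to show this; `split_line` and the pattern are illustrative assumptions. In the library itself the identifier is probably exposed through a pointer accessor on the element (something like `get_pointer()`), but verify that name against the installed version.

```python
import re

# level, optional @pointer@, tag, optional value
GEDCOM_LINE = re.compile(
    r"^(?P<level>\d+) (?:(?P<pointer>@[^@]+@) )?(?P<tag>\w+)(?: (?P<value>.*))?$"
)


def split_line(line):
    """Return (level, pointer, tag, value) for one GEDCOM line."""
    match = GEDCOM_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a GEDCOM line: %r" % line)
    return (
        int(match.group("level")),
        match.group("pointer") or "",
        match.group("tag"),
        match.group("value") or "",
    )


print(split_line("0 @I5465477880020118059@ INDI"))
print(split_line("1 NAME Chee Lin /TAN \u9673/"))
```

For the INDI line the value part is empty, which matches the empty result seen in the attempt above.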

More test files

There are some test gedcom files available at
http://heiner-eichmann.de/gedcom/gedcom.htm
Python-gedcom parses some of these but not all. While I do not know how strictly they follow the standard, it may be useful to test against them.

Add tests for the parser to check and validate actual functionality

Depends on #15

[Feature request] documentation/example for print_gedcom and save_gedcom

Is your feature request related to a problem? Please describe.
I'm SO HAPPY to have found gedcom.parser! I'm making a family tree in Django with the idea of having a gedcom file be the source of truth, so I can periodically re-import the updated file to my family tree and to the other programs where I have trees. This has been great for parsing, so that part is all set. But in order for my master plan to work, I have a script to generate unique ids for each person and I want to also be able to write out an updated gedcom file with the unique ids added in one of the tags (maybe @alia@ maybe something else).

I'm starting to play around with print_gedcom and save_gedcom and it sounds like they only take one argument (self)- and I'm not getting it to output successfully yet. Do you have a usage example?

Describe the solution you'd like
A usage example added to documentation for print_gedcom and save_gedcom.

Describe alternatives you've considered
So far I'm just trying things :)