claroty / access_parser
A Python based parser for Microsoft Access database files
License: Apache License 2.0
Hello,
I am working on a JavaScript program that needs to parse Access DB files and used this extension (https://github.com/regrapes/access-db-parser). I used this extension to try to parse the IPEDS database (Download Link: Here).
The problem appears when trying to parse DRVC2021.
I can open other .mdb files without a problem.
Now, with a different .mdb, when I try to open the db with:
db = AccessParser("/mnt/doss/4/Zonda.mdb")
I get KeyError: 8192.
Why?
Hello,
I am working on a JavaScript program that needs to parse Access DB files and used this extension, which wraps this project as a Node package. I used this extension to try to parse the IPEDS database (Download Link: Here).
The problematic code parses the HD2018 table in the IPEDS201819 database.
When trying to parse it, I get a huge number of error messages in the following format:
Memo data inline
Parsing memo field ♂���077227858
Memo data inline
Parsing memo field ↔���Concordia University Irvine
Memo data inline
Parsing memo field ♂���076084946
Memo data inline
Parsing memo field ☻�
Memo data inline
As the table is massive, I deleted some print statements from the file to check whether any other errors were raised; it also emitted the following errors:
Overflow record flag is not present 2990
LVAL type 1
Overflow record flag is not present 2848
LVAL type 1
Overflow record flag is not present 2776
LVAL type 1
Overflow record flag is not present 2572
LVAL type 1
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
The result of the parsing yielded the following
[
'100654',
'1',
...Everything fine between here...
'119',
'1',
'Alabama A & M University',
'AAMU',
'4900 Meridian Street',
'Normal',
'AL',
'35762Dr. Andrew Hugine, Jr.President2563725000636001109 \x0B耀\x00\x00\x00\x00㤱㈷㘱㔴〵 ㈀ 眀眀眀⸀愀愀洀甀⸀攀搀甀⼀眀眀眀⸀愀愀洀甀⸀攀搀甀⼀䄀
搀洀椀猀猀椀漀渀猀⼀倀愀最攀猀⼀搀攀昀愀甀氀琀⸀愀猀瀀砀眀',
'',
'President',
'2563725000',
'636001109 ',
'197216455',
'00100200 ',
'www.aamu.edu/',
'www.aamu.edu/Admissions/Pages/default.aspxwww.aamu.edu/admissions/fincialaid/pages/default.aspxhttps://www.aamu.edu/Admissions/UndergraduateAdmissions/Pages/Apply%20Today',
'',
'https://www.aamu.edu/Admissions/UndergraduateAdmissions/Pages/Apply%20Today.aspxhttps://galileo.aamu.edu/NetPriceCalculator/npcalc.htm www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Services.aspxA-',
'',
' ',
' ',
'www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Services.aspx',
'A-2 -2 -2Madison County\x80\x00\x00\x00\x00\x00㘀⣮\x05\x00\x00\x00\x00\x00\x00삈ȒӭӜӋүҫ',
'',
'-2 ',
'-2',
'Madison County',
'�\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x006�(\x05',
''
]
As mentioned, these are just snippets of the errors; this error was thrown for what seemed like nearly every row of this multi-thousand-row database. I posted this as an issue on the Node project and was directed here, as this seems to be a logic issue. Please tell me if any more information is required, or if this is a duplicate issue I didn't notice.
Alex
When trying to access data from the Svenskt Pressregister 1903-1911 dataset (direct download link: Dataset (8991 kB)) I get some unexpected results for the characters … and ’. They instead become \0& (0x00 0x38) and \0\25 \0 (0x00 0x25 0x32 0x00). In Microsoft Access they have the expected values of … and ’ respectively. The following code
from access_parser import AccessParser
db_path = "SVEPDB.accdb"
db = AccessParser(db_path)
tables = db.catalog.keys()
concert_table = [x for x in tables if x.startswith("Konsert")][0]
if concert_table != "Konsert":
    print(f"Can't find Konsert table, found {concert_table}")
else:
    print("Found Konsert table")
table = db.parse_table("nskon")
ellipsis_value = table['ntit'][4338]
apostrophe_value = table['ntit'][14986]
if ellipsis_value != 'Det "fula" Stockholm …':
    print(f"Ellipsis not decoded correctly, got: {ellipsis_value}")
else:
    print("Ellipsis decoded correctly")
if apostrophe_value != 'Landsmålsbref. Tell ’n Stanialus':
    print(f"Apostrophe not decoded correctly, got: {apostrophe_value}")
else:
    print("Apostrophe decoded correctly")
outputs
WARNING:Could not find overflow record data page overflow pointer: 27
WARNING:Could not find overflow record data page overflow pointer: 27
Can't find Konsert table, found Konsert춢
Ellipsis not decoded correctly, got: Det "fula" Stockholm &
Apostrophe not decoded correctly, got: Landsmålsbref. Tell n Stanialus
instead of the expected
Found Konsert table
Ellipsis decoded correctly
Apostrophe decoded correctly
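For what it's worth, the garbled output looks consistent with the UTF-16-LE code units losing their high bytes (just a guess, illustrated with plain Python below):

```python
# U+2026 (…) encodes to the two bytes 0x26 0x20 in UTF-16-LE; byte 0x26
# alone is '&', so a decoder that drops or misplaces the high byte turns
# '…' into '&', matching the mangled output above
assert '…'.encode('utf_16_le') == b'\x26\x20'
assert b'\x26'.decode('latin1') == '&'
```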
The currency type is not retrieving correct values.
Hi, thank you very much for the amazing work!
Is there any plan for an unparse feature too?
How can I reconstruct a 19-digit datetime value (e.g. 4676213469643538432) read from the accdb file into an actual date?
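Not an official answer, but Access stores Date/Time as an 8-byte IEEE-754 double counting days (with a fractional time-of-day part) since 1899-12-30, and the parser reads the field with an integer unpack before converting it (see mdb_date_to_readable). A 19-digit value of that size is almost certainly the raw 64-bit bit pattern, so reinterpreting the bits as a double recovers the date; a minimal sketch (the helper name is my own):

```python
import struct
from datetime import datetime, timedelta

ACCESS_EPOCH = datetime(1899, 12, 30)  # day zero of the Jet/Access date format

def raw_int_to_datetime(raw: int) -> datetime:
    # Reinterpret the 8 raw bytes as an IEEE-754 double: the value counts
    # days since the epoch, with the fraction encoding the time of day
    days = struct.unpack("<d", struct.pack("<q", raw))[0]
    return ACCESS_EPOCH + timedelta(days=days)
```

The key step is reinterpreting the bits rather than converting the integer arithmetically; the 19-digit number is not itself a count of anything.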
Records with umlauts in "short_text" columns are being ignored.
Hi
I am trying to extract a macro from my Access .mdb file.
I know for certain there is one macro in it, but I have no idea how to extract it.
I found this article. The author says there is an MSysAccessObjects table in the Access db. However, the article is quite old (about 8 years), and in my .mdb there is no MSysAccessObjects table.
Maybe it has been renamed or removed?
Does anyone have any ideas how I can do this?
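Not an authoritative answer, but the MSysAccessObjects table from older Access versions appears to have been replaced in newer formats, while object metadata such as names and types lives in the MSysObjects system table. A starting point for exploring what your file actually contains, assuming access_parser's catalog dict (the helper below is hypothetical):

```python
def find_system_tables(table_names):
    # All Access system tables share the "MSys" prefix
    return sorted(n for n in table_names if n.startswith("MSys"))

# With access_parser (replace the path with your own .mdb):
#   db = AccessParser("my_database.mdb")
#   print(find_system_tables(db.catalog.keys()))
#   objects = db.parse_table("MSysObjects")  # object names/types, incl. macros
```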
Hello,
I'm working with your parser on Node.js.
I looked through the questions there; everywhere it says it's worth asking here.
db.parseTable returns an error:
parsed = uuid_1.default.stringify(buffer.slice(0, 16));
^
TypeError: Cannot read property 'stringify' of undefined
db.getTables(); works fine.
I can't understand what the problem is.
Can you please help?
Review these files to better understand my problem:
https://disk.yandex.ru/i/X_5XRmKTGaRmWA
https://disk.yandex.ru/i/qHNzL5Mi1fHuwg
Hi,
I'm trying to load an old database and I'm getting the following error when parsing a table.
========================================================================================== FAILURES ===========================================================================================
________________________________________________________________________________________ test_database ________________________________________________________________________________________
def test_database():
> database = Database('./tests/data/{redacted}.mdb')
tests\test_database.py:20:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
{redacted}\database.py:81: in __init__
redacted = self.access.parse_table('redacted')
.venv\Lib\site-packages\access_parser\access_parser.py:185: in parse_table
return access_table.parse()
.venv\Lib\site-packages\access_parser\access_parser.py:257: in parse
self._parse_row(record)
.venv\Lib\site-packages\access_parser\access_parser.py:293: in _parse_row
self._parse_fixed_length_data(record, column, null_table)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <access_parser.access_parser.AccessTable object at 0x000001D11021A1D0>
original_record = b'\x07\x00\x00\x00\x01\x00b\x00\x00\x00\xea\x01\x00\x00\xff\xfe{redactedreadactedredactedredac}\xff\xfe{redacted}\xff\xfe{redacted}R\x00R\x00R\x00R\x00A\x002\x00\x10\x00\x06\x00\xaf\x03'
column = Container(type=4, ver4_unknown_3=1625, column_id=8, variable_column_number=6, column_index=2, various=Container(max_nu...': '', 'Required': '', 'DisplayControl': 109, 'TextAlign': 0,
'AggregateType': -1, 'ResultType': 0, 'CurrencyLCID': 0})
null_table = [True, True, False, False, False, False, ...]
def _parse_fixed_length_data(self, original_record, column, null_table):
"""
Parse fixed-length data from record
:param original_record: unmodified record
:param column: column this data belongs to
:param null_table: null table of the row
"""
column_name = column.col_name_str
# The null table indicates null values in the row.
# The only exception is BOOL fields which are encoded in the null table
has_value = True
if column.column_id > len(null_table):
logging.warning("Invalid null table. Bool values may be wrong, deleted values may be shown in the db.")
if column.type == TYPE_BOOLEAN:
has_value = None
else:
> has_value = null_table[column.column_id]
E IndexError: list index out of range
.venv\Lib\site-packages\access_parser\access_parser.py:319: IndexError
=================================================================================== short test summary info ===================================================================================
FAILED tests/test_database.py::test_database - IndexError: list index out of range
====================================================================================== 1 failed in 0.22s ======================================================================================
Maybe I'm getting it wrong, but shouldn't this part of access_parser.py be like this?
At least for my use case it seems to solve the issue.
- if column.column_id > len(null_table):
+ if column.column_id >= len(null_table):
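The boundary case is easy to see with a toy null table (illustrative values only):

```python
null_table = [True, True, False]   # three entries, valid indices 0..2
column_id = 3                      # one past the end

# original guard: 3 > 3 is False, so execution falls through to
# null_table[3] and raises IndexError
assert not (column_id > len(null_table))

# the proposed >= guard catches the boundary case before indexing
assert column_id >= len(null_table)
```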
I'm unable to provide a file for repro, sorry...
Thanks!
access_parser reports empty number fields as zeroes, so there's no way to tell if the field is actually empty or if it's a literal zero.
Steps to reproduce:
Firstly, thanks for the work on this package. It's been ideal for my use case.
The only point it seemed to struggle with is when Unicode characters were part of a text field. I trawled through the Jackcess source code and found that a buffer starting with b'\xff\xfe' means the field uses "Unicode compression", which strips the b'\x00' padding from characters 0-255 in UTF-16-LE. Where it can't, it leaves the original encoding, so the padding needs to be pushed back in where necessary before decoding with UTF-16-LE.
So I tweaked the function below in utils
def parse_type(data_type, buffer, length=None, version=3, props=None):
    parsed = ""
    # Bool or int8
    if data_type == TYPE_INT8:
        parsed = struct.unpack_from("b", buffer)[0]
    elif data_type == TYPE_INT16:
        parsed = struct.unpack_from("h", buffer)[0]
    elif data_type == TYPE_INT32 or data_type == TYPE_COMPLEX:
        parsed = struct.unpack_from("i", buffer)[0]
    elif data_type == TYPE_MONEY:
        parsed = struct.unpack_from("q", buffer)[0]
        if props and "Format" in props:
            prop_format = props['Format']
            if parsed == 0:
                parsed = [y for x, y in FORMAT_TO_DEFAULT_VALUE.items() if prop_format.startswith(x)]
                if not parsed:
                    logging.warning(f"parse_type got unknown format while parsing money field {prop_format}")
                else:
                    parsed = parsed[0]
            else:
                parsed = parse_money_type(parsed, prop_format)
    elif data_type == TYPE_FLOAT32:
        parsed = struct.unpack_from("f", buffer)[0]
    elif data_type == TYPE_FLOAT64:
        parsed = struct.unpack_from("d", buffer)[0]
    elif data_type == TYPE_DATETIME:
        double_datetime = struct.unpack_from("q", buffer)[0]
        parsed = mdb_date_to_readable(double_datetime)
    elif data_type == TYPE_BINARY:
        parsed = buffer[:length]
        offset = length
    elif data_type == TYPE_OLE:
        parsed = buffer
    elif data_type == TYPE_GUID:
        parsed = buffer[:16]
        guid = uuid.UUID(parsed.hex())
        parsed = str(guid)
    elif data_type == TYPE_96_bit_17_BYTES:
        parsed = buffer[:17]
    elif data_type == TYPE_TEXT:
        # if version > 3:
        #     # Looks like if BOM is present text is already decoded
        #     if buffer.startswith(b"\xfe\xff") or buffer.startswith(b"\xff\xfe"):
        #         buff = buffer[2:]
        #         parsed = get_decoded_text(buff)
        #     else:
        #         parsed = buffer.decode("utf-16", errors='ignore')
        # else:
        #     parsed = get_decoded_text(buffer)
        # Use custom parsing functions derived from the Jackcess approach
        # (likely still needs work to support all JET versions, but works
        # well for .accdb 2007)
        parsed = decodeTextValue(buffer)
    else:
        logging.debug(f"parse_type - unsupported data type: {data_type}")
    return parsed
and adding the following functions that shamelessly attempt to mimic the approach taken in Jackcess (Java)
### Text type decoding functions
def decodeTextValue(data: bytes):
    '''Decodes a compressed or uncompressed text value'''
    # See if data is compressed: the 0xFF, 0xFE sequence indicates that
    # compression is used (sort of, see algorithm below)
    TEXT_COMPRESSION_HEADER = b'\xff\xfe'
    isCompressed = (len(data) > 1 and
                    data[0] == TEXT_COMPRESSION_HEADER[0] and
                    data[1] == TEXT_COMPRESSION_HEADER[1])
    if isCompressed:
        textBuf = ''
        dataStart = len(TEXT_COMPRESSION_HEADER)
        dataEnd = dataStart
        inCompressedMode = True
        while dataEnd < len(data):
            endByte = data[dataEnd:dataEnd + 1]
            if endByte == b'\x00':
                # A 0x00 byte toggles between compressed and uncompressed
                # segments; handle the current segment
                textBuf = textBuf + decodeTextSegment(data, dataStart, dataEnd, inCompressedMode)
                inCompressedMode = not inCompressedMode
                dataEnd += 1
                dataStart = dataEnd
            else:
                dataEnd += 1
        # Handle the last segment
        textBuf = textBuf + decodeTextSegment(data, dataStart, dataEnd, inCompressedMode)
        return textBuf
    return decodeUncompressedText(data, 0, len(data))  # should pass charset from jet version


def decodeTextSegment(data: bytes, dataStart: int, dataEnd: int, inCompressedMode: bool):
    '''
    Decodes a segment of a text value according to the given status of the
    segment (compressed/uncompressed).
    '''
    if dataEnd <= dataStart:
        # No data; return an empty string so callers can concatenate safely
        return ''
    dataLength = dataEnd - dataStart
    if inCompressedMode:
        # Re-insert the 0x00 padding after each character: the bytearray is
        # zero-initialized, so writing each byte at every other index
        # restores valid UTF-16-LE
        tmpData = bytearray(dataLength * 2)
        tmpIdx = 0
        for i in range(dataStart, dataEnd):
            tmpData[tmpIdx:tmpIdx + 1] = data[i:i + 1]
            tmpIdx += 2
        data = bytes(tmpData)
        dataStart = 0
        dataEnd = len(data)
    return decodeUncompressedText(data, dataStart, dataEnd)


def decodeUncompressedText(textBytes: bytes, dataStart: int, dataEnd: int):  # should include charset variable
    bytesToDecode = textBytes[dataStart:dataEnd]
    try:
        # utf_16_le appears to be the default for the JET versions I've
        # tested; may need to be variable based on the db charset
        decoded = bytesToDecode.decode('utf_16_le')
    except UnicodeDecodeError:
        try:
            decoded = bytesToDecode.decode('latin1')
        except UnicodeDecodeError:
            decoded = bytesToDecode.decode('utf-8', errors='ignore')
    return decoded
Seems to successfully parse whatever unicode characters I add to my sample .accdb files.
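As a quick self-contained illustration of the padding re-insertion idea (made-up bytes, not from a real database):

```python
# "Unicode compression" strips the 0x00 high byte from UTF-16-LE code
# units in the 0-255 range; re-inserting the padding restores valid
# UTF-16-LE, which is what the compressed-segment branch above does
compressed = b'Hej'  # payload after the 0xff 0xfe header
restored = bytes(b for ch in compressed for b in (ch, 0))
assert restored == b'H\x00e\x00j\x00'
assert restored.decode('utf_16_le') == 'Hej'
```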
I've no doubt they could be written more succinctly; I just tried to replicate what I found here: https://github.com/jahlborn/jackcess/blob/7b338ea7bfd9fb9677c7c1634d760f41b4a611ab/src/main/java/com/healthmarketscience/jackcess/impl/ColumnImpl.java#L1584C21-L1584C22
Hope it helps
Thanks again!
I'm trying to open this file. It can parse some tables fine, but some tables fail with the following:
IndexError: list index out of range
> /home/geoffrey/.local/lib/python3.10/site-packages/access_parser-0.0.4-py3.10.egg/access_parser/access_parser.py(344)_parse_dynamic_length_data()
342 if i in relative_record_metadata.variable_length_jump_table:
343 jump_table_addition += 0x100
--> 344 rel_start = relative_offsets[i]
345 # If this is the last one use var_len_count as end offset
346 if i + 1 == len(relative_offsets):
Seems like relative_offsets is empty at this point. I'm not sure what the cause could be; mdbtools is able to extract those tables without issue.
On this particular table (PAR-file version), I also got ERROR:Failed to parse bool field, Column not found in null_table column: Vrijgegeven , column id: 12 , null_table: [False, False, False, True, True, False, True, True], but it doesn't happen for all tables (Parameterlijst_V37).
In case this might be of help, here's a smaller file where the same problem happens.
I'm trying to use parse_msys_table to get column types, match them up with their corresponding access_parser.util.TYPE_* literal, and do some conditional processing. ResultType seemed promising, but it always seems to be 0 no matter the actual column type - I expected to see 3 (TYPE_INT16) for ID and 10 (TYPE_TEXT) for Creator.
>>> db.parse_msys_table()["table_log"]
{'ID': {'ColumnWidth': -1, 'ColumnOrder': 1, 'ColumnHidden': '', 'TextAlign': 0, 'AggregateType': -1, 'ResultType': 0, 'CurrencyLCID': 0, 'GUID': b"\xc4^'\xdc\x11\xb5\xdeD\xa2\x05xoK\x94\xca\xa4"}, 'Creation Date': {'GUID': b';\x95\xd7>\xf8\xa6\xfeI\x9f\xa1\x82\xe4\n\xbc/\xea', 'ColumnWidth': 1995, 'ColumnOrder': 0, 'ColumnHidden': '', 'Required': '', 'IMEMode': 0, 'IMESentenceMode': 3, 'TextAlign': 0, 'AggregateType': -1, 'ShowDatePicker': 1, 'ResultType': 0, 'CurrencyLCID': 0}, 'Creator': {'GUID': b'\xa4\xc7\x80=X\x7f\xc2E\x99\x19\x94\x08QR\x0cP', 'ColumnWidth': -1, 'ColumnOrder': 0, 'ColumnHidden': '', 'Required': '', 'AllowZeroLength': '', 'DisplayControl': 109, 'IMEMode': 0, 'IMESentenceMode': 3, 'UnicodeCompression': '', 'TextAlign': 0, 'AggregateType': -1, 'ResultType': 0, 'CurrencyLCID': 0}, 'Last modified date': {'GUID': b'\x96\xb1\xed;2k\x04J\x95+\xf0mO!\xd79', 'ColumnWidth': 2325, 'ColumnOrder': 0, 'ColumnHidden': '', 'Required': '', 'IMEMode': 0, 'IMESentenceMode': 3, 'TextAlign': 0, 'AggregateType': -1, 'ShowDatePicker': 1, 'ResultType': 0, 'CurrencyLCID': 0}, 'Last modified user': {'GUID': b'\x9bNy\x82\xb9\x8a\xc7I\x9a8\x11\xd1\xd0\xf7\x11\x84', 'ColumnWidth': 2220, 'ColumnOrder': 0, 'ColumnHidden': '', 'Required': '', 'AllowZeroLength': '', 'DisplayControl': 109, 'IMEMode': 0, 'IMESentenceMode': 3, 'UnicodeCompression': '', 'TextAlign': 0, 'AggregateType': -1, 'ResultType': 0, 'CurrencyLCID': 0}}