claroty / access_parser
A Python based parser for Microsoft Access database files
License: Apache License 2.0
Hello,
I am working on a JavaScript program that needs to parse Access DB files and used this extension (https://github.com/regrapes/access-db-parser). I used this extension to try to parse the IPEDS database (Download Link: Here).
The problem appears when trying to parse DRVC2021.
I can open other .mdb files without a problem.
Now, with a different .mdb, when I try to open the db with:
db = AccessParser("/mnt/doss/4/Zonda.mdb")
I get KeyError: 8192.
Why?
Hello,
I am working on a JavaScript program that needs to parse Access DB files and used this extension, which wraps this project as a Node package. I used this extension to try to parse the IPEDS database (Download Link: Here).
The problematic code parses the HD2018 table in the IPEDS201819 database.
When trying to parse it, I get a huge number of error messages in the following format:
Memo data inline
Parsing memo field ♂���077227858
Memo data inline
Parsing memo field ↔���Concordia University Irvine
Memo data inline
Parsing memo field ♂���076084946
Memo data inline
Parsing memo field ☻�
Memo data inline
As the table is massive, I deleted some print statements from the file to check whether any other errors were raised; it also emitted the following errors:
Overflow record flag is not present 2990
LVAL type 1
Overflow record flag is not present 2848
LVAL type 1
Overflow record flag is not present 2776
LVAL type 1
Overflow record flag is not present 2572
LVAL type 1
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
The result of the parsing yielded the following
[
'100654',
'1',
...Everything fine between here...
'119',
'1',
'Alabama A & M University',
'AAMU',
'4900 Meridian Street',
'Normal',
'AL',
'35762Dr. Andrew Hugine, Jr.President2563725000636001109 \x0B耀\x00\x00\x00\x00㤱㈷㘱㔴〵 ㈀ 眀眀眀⸀愀愀洀甀⸀攀搀甀⼀眀眀眀⸀愀愀洀甀⸀攀搀甀⼀䄀
搀洀椀猀猀椀漀渀猀⼀倀愀最攀猀⼀搀攀昀愀甀氀琀⸀愀猀瀀砀眀',
'',
'President',
'2563725000',
'636001109 ',
'197216455',
'00100200 ',
'www.aamu.edu/',
'www.aamu.edu/Admissions/Pages/default.aspxwww.aamu.edu/admissions/fincialaid/pages/default.aspxhttps://www.aamu.edu/Admissions/UndergraduateAdmissions/Pages/Apply%20Today',
'',
'https://www.aamu.edu/Admissions/UndergraduateAdmissions/Pages/Apply%20Today.aspxhttps://galileo.aamu.edu/NetPriceCalculator/npcalc.htm www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Services.aspxA-',
'',
' ',
' ',
'www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Services.aspx',
'A-2 -2 -2Madison County\x80\x00\x00\x00\x00\x00㘀⣮\x05\x00\x00\x00\x00\x00\x00삈ȒӭӜӋүҫ',
'',
'-2 ',
'-2',
'Madison County',
'�\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x006�(\x05',
''
]
As mentioned, these are just snippets of the errors; this error was thrown for what seemed like nearly every row of this multi-thousand-row database. I posted this as an issue on the Node project and was directed here, as this seems to be a logic issue. Please tell me if any more information is required, or if this is a duplicate issue I didn't notice.
Alex
When trying to access data from the Svenskt Pressregister 1903-1911 dataset (direct download link: Dataset (8991 kB)) I get some unexpected results for the characters … and ’. They instead become \0& (0x00 0x38) and \0\25 \0 (0x00 0x25 0x32 0x00). In Microsoft Access they have the expected values of … and ’ respectively. The following code
from access_parser import AccessParser
db_path = "SVEPDB.accdb"
db = AccessParser(db_path)
tables = db.catalog.keys()
concert_table = [x for x in tables if x.startswith("Konsert")][0]
if concert_table != "Konsert":
    print(f"Can't find Konsert table, found {concert_table}")
else:
    print("Found Konsert table")
table = db.parse_table("nskon")
ellipsis_value = table['ntit'][4338]
apostrophe_value = table['ntit'][14986]
if ellipsis_value != 'Det "fula" Stockholm …':
    print(f"Ellipsis not decoded correctly, got: {ellipsis_value}")
else:
    print("Ellipsis decoded correctly")
if apostrophe_value != 'Landsmålsbref. Tell ’n Stanialus':
    print(f"Apostrophe not decoded correctly, got: {apostrophe_value}")
else:
    print("Apostrophe decoded correctly")
outputs
WARNING:Could not find overflow record data page overflow pointer: 27
WARNING:Could not find overflow record data page overflow pointer: 27
Can't find Konsert table, found Konsert춢
Ellipsis not decoded correctly, got: Det "fula" Stockholm &
Apostrophe not decoded correctly, got: Landsmålsbref. Tell n Stanialus
instead of the expected
Found Konsert table
Ellipsis decoded correctly
Apostrophe decoded correctly
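For what it's worth, the garbled output looks consistent with the UTF-16-LE code units losing their high bytes (just a guess, illustrated with plain Python below):

```python
# U+2026 (…) encodes to the two bytes 0x26 0x20 in UTF-16-LE; byte 0x26
# alone is '&', so a decoder that drops or misplaces the high byte turns
# '…' into '&', matching the mangled output above
assert '…'.encode('utf_16_le') == b'\x26\x20'
assert b'\x26'.decode('latin1') == '&'
```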
The currency type is not retrieving correct values.
Hi, thank you very much for the amazing work!
Is there any plan for an unparse feature too?
How can I reconstruct a 19-digit datetime value (e.g. 4676213469643538432) read from the accdb file into an actual date?
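Not an official answer, but Access stores Date/Time as an 8-byte IEEE-754 double counting days (with a fractional time-of-day part) since 1899-12-30, and the parser reads the field with an integer unpack before converting it (see mdb_date_to_readable). A 19-digit value of that size is almost certainly the raw 64-bit bit pattern, so reinterpreting the bits as a double recovers the date; a minimal sketch (the helper name is my own):

```python
import struct
from datetime import datetime, timedelta

ACCESS_EPOCH = datetime(1899, 12, 30)  # day zero of the Jet/Access date format

def raw_int_to_datetime(raw: int) -> datetime:
    # Reinterpret the 8 raw bytes as an IEEE-754 double: the value counts
    # days since the epoch, with the fraction encoding the time of day
    days = struct.unpack("<d", struct.pack("<q", raw))[0]
    return ACCESS_EPOCH + timedelta(days=days)
```

The key step is reinterpreting the bits rather than converting the integer arithmetically; the 19-digit number is not itself a count of anything.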
Records with umlauts in "short_text" columns are being ignored.
Hi
I am trying to extract a macro from my Access .mdb file.
I know for certain there is one macro in it, but I have no idea how to extract it.
I found this article. The author says there is an MSysAccessObjects table in the Access db. However, the article is quite old (about 8 years), and in my .mdb there is no MSysAccessObjects table.
Maybe it has been renamed or removed?
Does anyone have any ideas how I can do this?
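Not an authoritative answer, but the MSysAccessObjects table from older Access versions appears to have been replaced in newer formats, while object metadata such as names and types lives in the MSysObjects system table. A starting point for exploring what your file actually contains, assuming access_parser's catalog dict (the helper below is hypothetical):

```python
def find_system_tables(table_names):
    # All Access system tables share the "MSys" prefix
    return sorted(n for n in table_names if n.startswith("MSys"))

# With access_parser (replace the path with your own .mdb):
#   db = AccessParser("my_database.mdb")
#   print(find_system_tables(db.catalog.keys()))
#   objects = db.parse_table("MSysObjects")  # object names/types, incl. macros
```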
Hello,
I'm working with your parser on Node.js.
I looked through the questions there; everywhere it says it's worth asking here.
db.parseTable returns an error:
parsed = uuid_1.default.stringify(buffer.slice(0, 16));
^
TypeError: Cannot read property 'stringify' of undefined
db.getTables(); works fine.
I can't understand what the problem is.
Can you please help?
Review these files to better understand my problem:
https://disk.yandex.ru/i/X_5XRmKTGaRmWA
https://disk.yandex.ru/i/qHNzL5Mi1fHuwg
Hi,
I'm trying to load an old database and I'm getting the following error when parsing a table.
========================================================================================== FAILURES ===========================================================================================
________________________________________________________________________________________ test_database ________________________________________________________________________________________
def test_database():
> database = Database('./tests/data/{redacted}.mdb')
tests\test_database.py:20:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
{redacted}\database.py:81: in __init__
redacted = self.access.parse_table('redacted')
.venv\Lib\site-packages\access_parser\access_parser.py:185: in parse_table
return access_table.parse()
.venv\Lib\site-packages\access_parser\access_parser.py:257: in parse
self._parse_row(record)
.venv\Lib\site-packages\access_parser\access_parser.py:293: in _parse_row
self._parse_fixed_length_data(record, column, null_table)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <access_parser.access_parser.AccessTable object at 0x000001D11021A1D0>
original_record = b'\x07\x00\x00\x00\x01\x00b\x00\x00\x00\xea\x01\x00\x00\xff\xfe{redactedreadactedredactedredac}\xff\xfe{redacted}\xff\xfe{redacted}R\x00R\x00R\x00R\x00A\x002\x00\x10\x00\x06\x00\xaf\x03'
column = Container(type=4, ver4_unknown_3=1625, column_id=8, variable_column_number=6, column_index=2, various=Container(max_nu...': '', 'Required': '', 'DisplayControl': 109, 'TextAlign': 0,
'AggregateType': -1, 'ResultType': 0, 'CurrencyLCID': 0})
null_table = [True, True, False, False, False, False, ...]
def _parse_fixed_length_data(self, original_record, column, null_table):
"""
Parse fixed-length data from record
:param original_record: unmodified record
:param column: column this data belongs to
:param null_table: null table of the row
"""
column_name = column.col_name_str
# The null table indicates null values in the row.
# The only exception is BOOL fields which are encoded in the null table
has_value = True
if column.column_id > len(null_table):
logging.warning("Invalid null table. Bool values may be wrong, deleted values may be shown in the db.")
if column.type == TYPE_BOOLEAN:
has_value = None
else:
> has_value = null_table[column.column_id]
E IndexError: list index out of range
.venv\Lib\site-packages\access_parser\access_parser.py:319: IndexError
=================================================================================== short test summary info ===================================================================================
FAILED tests/test_database.py::test_database - IndexError: list index out of range
====================================================================================== 1 failed in 0.22s ======================================================================================
Maybe I'm getting it wrong, but shouldn't this part of access_parser.py be like this?
At least for my use case it seems to solve the issue.
- if column.column_id > len(null_table):
+ if column.column_id >= len(null_table):
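The boundary case is easy to see with a toy null table (illustrative values only):

```python
null_table = [True, True, False]   # three entries, valid indices 0..2
column_id = 3                      # one past the end

# original guard: 3 > 3 is False, so execution falls through to
# null_table[3] and raises IndexError
assert not (column_id > len(null_table))

# the proposed >= guard catches the boundary case before indexing
assert column_id >= len(null_table)
```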
I'm unable to provide a file for repro, sorry...
Thanks!
access_parser reports empty number fields as zeroes, so there's no way to tell if the field is actually empty or if it's a literal zero.
Steps to reproduce:
Firstly, thanks for the work on this package. It's been ideal for my use case.
The only point it seemed to struggle with is when Unicode characters were part of a text field. I trawled through the Jackcess source code and found that a buffer starting with b'\xff\xfe' means the field uses "Unicode compression", which strips the b'\x00' padding from characters 0-255 in UTF-16-LE. Where it can't, it leaves the original encoding, so the padding needs to be pushed back in where necessary before decoding with UTF-16-LE.
So I tweaked the function below in utils
def parse_type(data_type, buffer, length=None, version=3, props=None):
    parsed = ""
    # Bool or int8
    if data_type == TYPE_INT8:
        parsed = struct.unpack_from("b", buffer)[0]
    elif data_type == TYPE_INT16:
        parsed = struct.unpack_from("h", buffer)[0]
    elif data_type == TYPE_INT32 or data_type == TYPE_COMPLEX:
        parsed = struct.unpack_from("i", buffer)[0]
    elif data_type == TYPE_MONEY:
        parsed = struct.unpack_from("q", buffer)[0]
        if props and "Format" in props:
            prop_format = props['Format']
            if parsed == 0:
                parsed = [y for x, y in FORMAT_TO_DEFAULT_VALUE.items() if prop_format.startswith(x)]
                if not parsed:
                    logging.warning(f"parse_type got unknown format while parsing money field {prop_format}")
                else:
                    parsed = parsed[0]
            else:
                parsed = parse_money_type(parsed, prop_format)
    elif data_type == TYPE_FLOAT32:
        parsed = struct.unpack_from("f", buffer)[0]
    elif data_type == TYPE_FLOAT64:
        parsed = struct.unpack_from("d", buffer)[0]
    elif data_type == TYPE_DATETIME:
        double_datetime = struct.unpack_from("q", buffer)[0]
        parsed = mdb_date_to_readable(double_datetime)
    elif data_type == TYPE_BINARY:
        parsed = buffer[:length]
        offset = length
    elif data_type == TYPE_OLE:
        parsed = buffer
    elif data_type == TYPE_GUID:
        parsed = buffer[:16]
        guid = uuid.UUID(parsed.hex())
        parsed = str(guid)
    elif data_type == TYPE_96_bit_17_BYTES:
        parsed = buffer[:17]
    elif data_type == TYPE_TEXT:
        # if version > 3:
        #     # Looks like if BOM is present text is already decoded
        #     if buffer.startswith(b"\xfe\xff") or buffer.startswith(b"\xff\xfe"):
        #         buff = buffer[2:]
        #         parsed = get_decoded_text(buff)
        #     else:
        #         parsed = buffer.decode("utf-16", errors='ignore')
        # else:
        #     parsed = get_decoded_text(buffer)
        # Use custom parsing functions derived from the Jackcess approach
        # (likely still needs work to support all JET versions, but works
        # well for .accdb 2007)
        parsed = decodeTextValue(buffer)
    else:
        logging.debug(f"parse_type - unsupported data type: {data_type}")
    return parsed
and adding the following functions that shamelessly attempt to mimic the approach taken in Jackcess (Java)
### Text type decoding functions
def decodeTextValue(data: bytes):
    '''Decodes a compressed or uncompressed text value'''
    # See if data is compressed: the 0xFF, 0xFE sequence indicates that
    # compression is used (sort of, see algorithm below)
    TEXT_COMPRESSION_HEADER = b'\xff\xfe'
    isCompressed = (len(data) > 1 and
                    data[0] == TEXT_COMPRESSION_HEADER[0] and
                    data[1] == TEXT_COMPRESSION_HEADER[1])
    if isCompressed:
        textBuf = ''
        dataStart = len(TEXT_COMPRESSION_HEADER)
        dataEnd = dataStart
        inCompressedMode = True
        while dataEnd < len(data):
            endByte = data[dataEnd:dataEnd + 1]
            if endByte == b'\x00':
                # A 0x00 byte toggles between compressed and uncompressed
                # segments; handle the current segment
                textBuf = textBuf + decodeTextSegment(data, dataStart, dataEnd, inCompressedMode)
                inCompressedMode = not inCompressedMode
                dataEnd += 1
                dataStart = dataEnd
            else:
                dataEnd += 1
        # Handle the last segment
        textBuf = textBuf + decodeTextSegment(data, dataStart, dataEnd, inCompressedMode)
        return textBuf
    return decodeUncompressedText(data, 0, len(data))  # should pass charset from jet version


def decodeTextSegment(data: bytes, dataStart: int, dataEnd: int, inCompressedMode: bool):
    '''
    Decodes a segment of a text value according to the given status of the
    segment (compressed/uncompressed).
    '''
    if dataEnd <= dataStart:
        # No data; return an empty string so callers can concatenate safely
        return ''
    dataLength = dataEnd - dataStart
    if inCompressedMode:
        # Re-insert the 0x00 padding after each character: the bytearray is
        # zero-initialized, so writing each byte at every other index
        # restores valid UTF-16-LE
        tmpData = bytearray(dataLength * 2)
        tmpIdx = 0
        for i in range(dataStart, dataEnd):
            tmpData[tmpIdx:tmpIdx + 1] = data[i:i + 1]
            tmpIdx += 2
        data = bytes(tmpData)
        dataStart = 0
        dataEnd = len(data)
    return decodeUncompressedText(data, dataStart, dataEnd)


def decodeUncompressedText(textBytes: bytes, dataStart: int, dataEnd: int):  # should include charset variable
    bytesToDecode = textBytes[dataStart:dataEnd]
    try:
        # utf_16_le appears to be the default for the JET versions I've
        # tested; may need to be variable based on the db charset
        decoded = bytesToDecode.decode('utf_16_le')
    except UnicodeDecodeError:
        try:
            decoded = bytesToDecode.decode('latin1')
        except UnicodeDecodeError:
            decoded = bytesToDecode.decode('utf-8', errors='ignore')
    return decoded
Seems to successfully parse whatever unicode characters I add to my sample .accdb files.
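As a quick self-contained illustration of the padding re-insertion idea (made-up bytes, not from a real database):

```python
# "Unicode compression" strips the 0x00 high byte from UTF-16-LE code
# units in the 0-255 range; re-inserting the padding restores valid
# UTF-16-LE, which is what the compressed-segment branch above does
compressed = b'Hej'  # payload after the 0xff 0xfe header
restored = bytes(b for ch in compressed for b in (ch, 0))
assert restored == b'H\x00e\x00j\x00'
assert restored.decode('utf_16_le') == 'Hej'
```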
I've no doubt they could be written more succinctly; I just tried to replicate what I found here: https://github.com/jahlborn/jackcess/blob/7b338ea7bfd9fb9677c7c1634d760f41b4a611ab/src/main/java/com/healthmarketscience/jackcess/impl/ColumnImpl.java#L1584C21-L1584C22
Hope it helps
Thanks again!
I'm trying to open this file. It can parse some tables fine, but some tables fail with the following:
IndexError: list index out of range
> /home/geoffrey/.local/lib/python3.10/site-packages/access_parser-0.0.4-py3.10.egg/access_parser/access_parser.py(344)_parse_dynamic_length_data()
342 if i in relative_record_metadata.variable_length_jump_table:
343 jump_table_addition += 0x100
--> 344 rel_start = relative_offsets[i]
345 # If this is the last one use var_len_count as end offset
346 if i + 1 == len(relative_offsets):
Seems like relative_offsets is empty at this point. I'm not sure what the cause could be; mdbtools is able to extract those tables without issue.
On this particular table (PAR-file version), I also got ERROR:Failed to parse bool field, Column not found in null_table column: Vrijgegeven , column id: 12 , null_table: [False, False, False, True, True, False, True, True], but it doesn't happen for all tables (Parameterlijst_V37).
In case this might be of help, here's a smaller file where the same problem happens.
I'm trying to use parse_msys_table to get column types, match them up with their corresponding access_parser.util.TYPE_* literal, and do some conditional processing. ResultType seemed promising, but it always seems to be 0 no matter the actual column type - I expected to see 3 (TYPE_INT16) for ID and 10 (TYPE_TEXT) for Creator.
>>> db.parse_msys_table()["table_log"]
{'ID': {'ColumnWidth': -1, 'ColumnOrder': 1, 'ColumnHidden': '', 'TextAlign': 0, 'AggregateType': -1, 'ResultType': 0, 'CurrencyLCID': 0, 'GUID': b"\xc4^'\xdc\x11\xb5\xdeD\xa2\x05xoK\x94\xca\xa4"}, 'Creation Date': {'GUID': b';\x95\xd7>\xf8\xa6\xfeI\x9f\xa1\x82\xe4\n\xbc/\xea', 'ColumnWidth': 1995, 'ColumnOrder': 0, 'ColumnHidden': '', 'Required': '', 'IMEMode': 0, 'IMESentenceMode': 3, 'TextAlign': 0, 'AggregateType': -1, 'ShowDatePicker': 1, 'ResultType': 0, 'CurrencyLCID': 0}, 'Creator': {'GUID': b'\xa4\xc7\x80=X\x7f\xc2E\x99\x19\x94\x08QR\x0cP', 'ColumnWidth': -1, 'ColumnOrder': 0, 'ColumnHidden': '', 'Required': '', 'AllowZeroLength': '', 'DisplayControl': 109, 'IMEMode': 0, 'IMESentenceMode': 3, 'UnicodeCompression': '', 'TextAlign': 0, 'AggregateType': -1, 'ResultType': 0, 'CurrencyLCID': 0}, 'Last modified date': {'GUID': b'\x96\xb1\xed;2k\x04J\x95+\xf0mO!\xd79', 'ColumnWidth': 2325, 'ColumnOrder': 0, 'ColumnHidden': '', 'Required': '', 'IMEMode': 0, 'IMESentenceMode': 3, 'TextAlign': 0, 'AggregateType': -1, 'ShowDatePicker': 1, 'ResultType': 0, 'CurrencyLCID': 0}, 'Last modified user': {'GUID': b'\x9bNy\x82\xb9\x8a\xc7I\x9a8\x11\xd1\xd0\xf7\x11\x84', 'ColumnWidth': 2220, 'ColumnOrder': 0, 'ColumnHidden': '', 'Required': '', 'AllowZeroLength': '', 'DisplayControl': 109, 'IMEMode': 0, 'IMESentenceMode': 3, 'UnicodeCompression': '', 'TextAlign': 0, 'AggregateType': -1, 'ResultType': 0, 'CurrencyLCID': 0}}