codereverser / casparser
Parser for Consolidated Account Statements (CAS) generated from CAMS/Karvy/Kfintech
License: MIT License
For long mutual fund scheme names that span more than one row, only the first row is read.
Example name:
"""
My long mutual fund scheme name ELSS -
Direct growth plan
"""
Only first row will be read: "My long mutual fund scheme name ELSS -"
Thanks for your script. It is very useful. I am trying to build some automation on top of this to analyze my MF investments.
Got this exception when parsing the amount of an IDCW payout transaction. Let me know if I can collect any more debug info to help. I will also see if I can debug further.
casparser==0.4.6
python 3.8.3
File "C:\Sathish\python\mf_statement_parser\venv2\lib\site-packages\casparser\parsers\__init__.py", line 35, in read_cas_pdf
processed_data = process_cas_text("\u2029".join(partial_cas_data.lines))
File "C:\Sathish\python\mf_statement_parser\venv2\lib\site-packages\casparser\process\__init__.py", line 28, in process_cas_text
return process_detailed_text(text)
File "C:\Sathish\python\mf_statement_parser\venv2\lib\site-packages\casparser\process\cas_detailed.py", line 167, in process_detailed_text
amt = Decimal(m.group(3).replace(",", "_").replace("(", "-"))
decimal.InvalidOperation: [<class 'decimal.ConversionSyntax'>]
I ran in debug mode and noticed that it fails when attempting to parse the following transaction in pdf
In the debugger console I found that m.group(3) holds a "." instead of what should presumably be the amount:
m.group(3)
'.'
m.group(3).replace(",", "_").replace("(", "-")
'.'
amt = Decimal(m.group(3).replace(",", "_").replace("(", "-"))
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2020.3.5\plugins\python-ce\helpers\pydev_pydevd_bundle\pydevd_exec2.py", line 1, in Exec
def Exec(exp, global_vars, local_vars=None):
File "", line 1, in
decimal.InvalidOperation: [<class 'decimal.ConversionSyntax'>]
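A minimal sketch of a more defensive conversion, assuming the goal is to skip values that are not numbers rather than crash. The helper name parse_amount is hypothetical and is not part of casparser; it strips commas instead of replacing them with underscores, which is a deliberate simplification.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(text: Optional[str]) -> Optional[Decimal]:
    """Convert a CAS amount such as '1,234.50' or '(202.98)' to a Decimal.

    Returns None when the text is missing or is not a number (for example a lone '.').
    Hypothetical helper, not casparser's own implementation.
    """
    if text is None:
        return None
    cleaned = text.strip().replace(",", "")
    # Statements show negative values in parentheses, e.g. (202.98).
    negative = cleaned.startswith("(")
    cleaned = cleaned.strip("()")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value
```

A caller could then treat a None result as "no amount in this column" and decide separately how to record the transaction.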
The JSON output has no formatting: transactions have no keys, and the summary is also missing some keys when parsing on my local system.
Hi
When I try to run the example shown in README.md, I get the error below:
raise CASParseError("Layout Error! Scheme found before folio entry.")
casparser.exceptions.CASParseError: Layout Error! Scheme found before folio entry.
Not sure if I have missed anything; the statement was downloaded from CAMS on 13/01/2021.
Adding the schemes json snippet for reference :
"schemes": [
    {
        "scheme": "HSBC Medium Duration Fund - Regular Growth (Formerly",
        "advisor": "N/A",
        "rta_code": "OLRCBG",
        "rta": "CAMS",
        "isin": null,
        "amfi": null,
        "type": "N/A",
        "open": "xxx",
        "close": "xxx",
        "valuation": {
            "date": "2023-01-12",
            "nav": "16.9027",
            "value": "2637.70"
        },
        "transactions": []
    }
How can XIRR be calculated from the parsed CAS data?
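One way to approach this, as a sketch rather than anything casparser provides: build a list of dated cash flows and solve for the rate at which their net present value is zero. The functions xnpv and xirr below are hypothetical helpers. The sign convention is an assumption: money invested is negative, money received (redemptions, payouts, and the current valuation on the final date) is positive, so parsed amounts may need to be negated first.

```python
from datetime import date
from typing import List, Tuple

CashFlow = Tuple[date, float]


def xnpv(rate: float, flows: List[CashFlow]) -> float:
    """Net present value of dated cash flows, using a 365-day year."""
    start = min(d for d, _ in flows)
    return sum(amount / (1.0 + rate) ** ((d - start).days / 365.0) for d, amount in flows)


def xirr(flows: List[CashFlow], low: float = -0.99, high: float = 10.0, tol: float = 1e-9) -> float:
    """Annualised return found by bisection on xnpv (hypothetical helper)."""
    f_low = xnpv(low, flows)
    f_high = xnpv(high, flows)
    if f_low * f_high > 0:
        raise ValueError("no sign change in the search interval; check the cash-flow signs")
    for _ in range(200):
        mid = (low + high) / 2.0
        f_mid = xnpv(mid, flows)
        if abs(f_mid) < tol or (high - low) < tol:
            return mid
        if f_low * f_mid < 0:
            high = mid          # root lies in the lower half
        else:
            low, f_low = mid, f_mid  # root lies in the upper half
    return (low + high) / 2.0


# Illustrative figures: invest 1000, worth 1100 exactly 365 days later.
flows = [(date(2021, 1, 1), -1000.0), (date(2022, 1, 1), 1100.0)]
print(round(xirr(flows), 4))  # prints 0.1
```

Bisection is used here only because it is simple and robust; a Newton iteration or a library solver would also work.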
Thank you for a great parser.
It took me a few tries to understand which report the parser picks up, as CAMS offers many (so confusing! :-)
Finally, it worked for me.
The table printed is great.
But what I, as a user, would also like to see is the total valuation of my fund.
The table only gives the Open/Close balances, but the CAMS file also has:
Valuation on 06-Nov-2020 INR XX,XXX.XX
Can we pick that up too?
Thanks.
Hi, I am checking whether there are plans to support parsing of CAS generated by CDSL, as it is much richer in information (it contains all stock holdings along with mutual funds).
If there are no near-term plans, has there been any effort in this direction? I could pitch in, or pick up from wherever some work has already been done.
Hi,
The folio is not parsed in the case below. Transactions get mapped to the previously parsed folio.
Below are the details of the PDF elements and lines, for debugging:
[28.93000030517578, 93.44519805908203, 553.7244873046875, 103.9654769897461, 'Date\t\tTransaction\t\tAmount\t\tUnits\t\tPrice\t\tUnit']
[358.6300048828125, 102.31519317626953, 566.5147705078125, 124.0643310546875, '(INR)\t\t(INR)\t\tBalance\nKYC: OK']
[28.93000030517578, 113.60517120361328, 99.20275115966797, 124.12545013427734, 'Folio No: 99999999']
'Date\t\tTransaction\t\tAmount\t\tUnits\t\tPrice\t\tUnit'
'Folio No: 99999999\t\t(INR)\t\t(INR)\t\tBalance\nKYC: OK'
CAS pdf files are generated primarily based on the email address and may occasionally contain multiple PAN numbers depending upon the filters used during the generation. To handle such cases, the capital gains report should have an extra column for the PAN number and preferably group the entries based on it.
Getting the following error
File "\lib\site-packages\casparser\analysis\gains.py", line 192, in merge_transactions
merged_transactions[dt].units += txn["units"]
TypeError: unsupported operand type(s) for +=: 'decimal.Decimal' and 'NoneType'
Dividend payout transactions have nothing in the "Units" column as shown in screenshot below (Only "Amount" column)
Hi Team
The Franklin Mutual Fund house changed their registrar to CAMS.
Therefore the data in the PDF statement may also have changed.
The latest CAS statement shows the Franklin schemes, but the data is not returned by the casparser package.
Hi,
Do you have support for multiple PDF files on your roadmap?
For example, I have 2 different reports: one from Karvy and the other from CAMS.
I can run the parser twice and see the results.
But, in the end, I would like to see my complete portfolio in one place.
So, running 2 scripts with output on the command line and then combining the results could be done away with if the parser supported multiple PDFs.
I know each PDF can have a different password, so that needs to be handled as well.
Or do you want the parser to be agnostic to this, with the person running the code handling it at their end?
Let me know.
Thanks!
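Until something like this exists in the parser, the combining step can be done on the caller's side. The sketch below assumes each parsed statement is a dict with a "folios" list whose entries carry a "schemes" list, and that each scheme has a "valuation" with a "value" field (as in the JSON snippets elsewhere on this page). The function name combine_statements and the "total_valuation" key are hypothetical.

```python
from decimal import Decimal
from typing import Any, Dict, Iterable, List


def combine_statements(statements: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge folios from several parsed statements and total their valuations.

    Assumes the dict layout described above; not part of casparser.
    """
    folios: List[Dict[str, Any]] = []
    total = Decimal("0")
    for data in statements:
        for folio in data.get("folios", []):
            folios.append(folio)
            for scheme in folio.get("schemes", []):
                total += Decimal(str(scheme["valuation"]["value"]))
    return {"folios": folios, "total_valuation": total}


# Usage idea (assumption: each file is parsed separately with its own password):
# statements = [casparser.read_cas_pdf(path, password) for path, password in files]
# portfolio = combine_statements(statements)
```

Keeping the passwords with the caller, as in the usage idea, leaves the parser itself agnostic to how many files are involved.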
When we parse files and anything fails, we currently need to catch the exception in a generic way and return a response.
If the classes in exceptions.py were exported instead, it would be easier for users to catch these exceptions.
Hello team,
Franklin Templeton created a few Segregated Portfolios for some stressed mutual funds. The data is read incorrectly in some cases. In the example given below there are 2 segregation records, one for qty 215931.176 and a second for qty 0.008, but the parser reads the 2nd one as qty 215931.184 ...
{"scheme": "Franklin India Credit Risk Fund- Segregated Portfolio 1 (8.25% Vodafone Idea Ltd-10JUL20-Growth Plan)",
"advisor": "ICICIRON",
"rta_code": "FTI880", "type": "DEBT", "rta": "CAMS", "isin": "INF090I01TJ6", "amfi": "147954", "open": "0.000", "close": "0.000", "close_calculated": "215931.176", "valuation": {"date": "2020-07-17", "value": "0.00", "nav": "0.0818"},
"transactions": [
{"date": "2020-01-24", "description": "Creation of units - Segregated Portfolio\t\t215,931.176", "amount": "0", "units": "215931.176", "nav": "0", "balance": "215931.176", "type": "SEGREGATION", "dividend_rate": null},
{"date": "2020-01-24", "description": "Creation of units - Segregated Portfolio\t\t0.008", "amount": "0", "units": "215931.184", "nav": "0", "balance": "215931.184", "type": "SEGREGATION", "dividend_rate": null},
{"date": "2020-06-15", "description": "Payment - Units Extinguished", "amount": "-1338.33", "units": "-16360.996", "nav": "0.0818", "balance": "199570.188", "type": "REDEMPTION", "dividend_rate": null},
{"date": "2020-07-10", "description": "Payment - Units Extinguished", "amount": "-16324.84", "units": "-199570.188", "nav": "0.0818", "balance": "0.000", "type": "REDEMPTION", "dividend_rate": null}]}
This is a bug in pdfminer/mupdf, but I thought it would be useful to document (since the implications are somewhat critical if you rely on the output of casparser).
If you have pages that look like this across page boundaries, the transaction at the start of the second page seems to be counted in the previous page as well. For me, the *** Stamp Duty*** transaction at the start of the second page is counted twice (once as part of the previous page, page 4, and again where it is actually first encountered, on page 5).
My guess is that the mediabox (used by pdfminer to determine page boundaries) of the page is larger than necessary and extends into the second one.
Hi,
CAMS CAS has Dividend Payout transactions like below.
TRANSACTION_RE doesn't match since "units", "nav" and "balance" columns are missing in these entries.
Can the parser be updated to handle the Dividend Payout transaction? Not sure if Karvy CAS has similar format.
Also, the Dividend amount may need to be negated for XIRR calculations.
Can we parse eCAS from NSDL using this parser? Are there any plans to support it in the near future?
In the current code, CASParseError("Incorrect PDF password!") is raised when the password is wrong.
casparser/casparser/parsers/mupdf.py
Line 200 in e507c53
So you have to do ugly things like:
try:
    read_cas_pdf("pdf", "password")
except CASParseError as err:
    if err.args:
        if 'incorrect pdf password' in err.args[0].lower():
            raise InvalidPasswordError
    raise
One possible solution could be to create a separate Exception for a wrong password, inheriting from CASParseError. Or a code attribute could be set in the CASParseError class, whose value could be something like incorrect_password (or something else depending on the context where it is raised), which you can check for when handling the exception.
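The first option could look roughly like the sketch below. CASParseError is redefined here only to keep the example self-contained; IncorrectPasswordError, open_pdf and try_open are hypothetical names, not casparser API.

```python
class CASParseError(Exception):
    """Stand-in for casparser.exceptions.CASParseError (redefined for this sketch)."""


class IncorrectPasswordError(CASParseError):
    """Hypothetical subclass raised only when the PDF password is wrong."""


def open_pdf(password: str, expected: str = "secret") -> str:
    """Hypothetical opener that mimics the library's password check."""
    if password != expected:
        raise IncorrectPasswordError("Incorrect PDF password!")
    return "opened"


def try_open(password: str) -> str:
    """Callers can now handle the password case without inspecting err.args."""
    try:
        return open_pdf(password)
    except IncorrectPasswordError:
        return "wrong password"
    except CASParseError:
        return "parse error"
```

Because the new class inherits from CASParseError, existing code that catches the base exception would keep working unchanged.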
If you don't have the bandwidth, I can make a PR for the same this weekend.
Generated file from here
https://www.camsonline.com/Investors/Statements/Consolidated-Account-Statement
still this error
There seems to be an assumption that unit balance is never negative. While this assumption seems reasonable, I have a statement in which unit balance is shown as negative (some slight rounding error by AMC). This causes parsing to fail. I believe the fix is simply applying the same logic to unit balance as is applied to units.
In the advisor field, only "ARN" is coming through. The ARN number is required to identify the associated advisor.
The library looks good, but if I use an HTTPS link (public URL) to a PDF, it throws the error "No such file or directory exists".
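A possible workaround, sketched under the assumption that the parser expects a path on the local filesystem: download the PDF to a temporary file first and pass that path on. The helper download_to_temp is hypothetical and uses only the Python standard library.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_to_temp(url: str, suffix: str = ".pdf") -> Path:
    """Fetch the bytes at url and store them in a temporary file.

    Returns the local path; the caller is responsible for deleting the file.
    """
    with urllib.request.urlopen(url) as response:
        payload = response.read()
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as handle:
        handle.write(payload)
    return Path(handle.name)


# Usage idea (assumption: the parser accepts a local path string):
# local_path = download_to_temp("https://example.com/statement.pdf")
# data = casparser.read_cas_pdf(str(local_path), "password")
```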
Currently, when the detailed summary is exported, the transaction type with key "type" consists of the string version of the TransactionType Enum.
casparser/casparser/process/cas_detailed.py
Line 200 in e2f14a0
While this is the right design, if someone wants to reuse the TransactionType Enum elsewhere (like I am) on the exported data, this becomes a slight nuisance, as JSON parsers like pydantic will not automatically parse the string into the Enum (as TransactionType is an Enum of ints).
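On the consuming side, one workaround is to look the member up by name. The sketch below uses a small stand-in Enum; its members are illustrative and do not reproduce casparser's actual TransactionType. The helper parse_transaction_type is hypothetical.

```python
from enum import Enum, auto


class TransactionType(Enum):
    """Stand-in Enum of ints; the member list is illustrative only."""
    PURCHASE = auto()
    REDEMPTION = auto()
    SEGREGATION = auto()


def parse_transaction_type(name: str) -> TransactionType:
    """Map the exported string (the member name) back to the Enum member."""
    try:
        return TransactionType[name]
    except KeyError:
        raise ValueError(f"unknown transaction type: {name!r}") from None
```

A function like this could be wired into a custom validator in whichever JSON model library is in use, so the conversion happens while loading the exported data.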
I am trying to use your library. I followed all the steps listed on your PyPI page, but it always shows the error "module casparser has no attribute":
data = casparser.read_cas_pdf('/home/path.pdf', 'pwd')
AttributeError: module 'casparser' has no attribute 'read_cas_pdf'
Code:
import casparser
data = casparser.read_cas_pdf('/home/path.pdf', 'pwd')
Great work regardless thank you.
This is the code I am using to get the parsed data
import casparser

def main():
    data = casparser.read_cas_pdf('./demo2/JUL2020_AA03773313_TXN.pdf', 'FVXPK2945F', output="json")
    # data = casparser.read_cas_pdf('./demo2/MAR2021_AA06997817_TXN.pdf', password='BCDPJ0121K', force_pdfminer=True)
    print()

if __name__ == '__main__':
    main()
and this is what the error is
Traceback (most recent call last):
File "/home/usharab/.local/lib/python3.8/site-packages/casparser/parser.py", line 163, in read_cas_pdf
investor_info = parse_investor_info(layout, *page.mediabox[2:])
File "/home/usharab/.local/lib/python3.8/site-packages/casparser/parser.py", line 55, in parse_investor_info
raise CASParseError("Unable to parse investor data")
casparser.exceptions.CASParseError: Unable to parse investor data
The version of casparser I am using is '0.2.1' and before this version I was using version '0.5.3' and that version gave the same error. Can anyone guide me what could be the issue?
I have also tried force_pdfminer, and that returned the same error.
Hi Team
First of all, many thanks for the great package your team has created.
I am author of repo and using your package to parse the CAS PDF for my project.
I have a requirement to classify funds by type (debt/equity) and by subtype, such as large cap/small cap, etc.
Would it be possible to integrate this feature in your package?
Hi,
Currently the table prints something like the below:
https://raw.githubusercontent.com/codereverser/casparser/main/assets/demo.jpg
However, the PDF also gives the NAV as on a given date.
If the NAV is printed, we can easily multiply it by 'close calculated' to get the final value of the fund.
I know you recently added the fund value...
This is important because the only thing changing here daily is the NAV. If SIPs are still going out, 'close calculated' changes too, but that change is less frequent.
Thoughts ?
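The arithmetic involved is just units times NAV. A small sketch with illustrative figures, assuming values are rounded to two decimal places with half-up rounding (the rounding rule is an assumption, and the function name valuation is hypothetical):

```python
from decimal import Decimal, ROUND_HALF_UP


def valuation(units: str, nav: str) -> Decimal:
    """Fund value = closing units x NAV, rounded to paise (assumed rounding rule)."""
    value = Decimal(units) * Decimal(nav)
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# 7.350 units at a NAV of 30.3779 -> 223.277565, rounded to two places.
print(valuation("7.350", "30.3779"))  # prints 223.28
```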
Hi,
While this parses the CAMS report correctly, I have had no luck getting it to parse the Kfintech report.
I generated it from the URL you mention in your README.
I get the report and am able to open it fine.
When I run it,
➜ casparser (main) ✗ casparser karvy2.pdf
Enter PDF password:
Error parsing pdf file :: Error parsing CAS header
Thanks.
So, I have a unique problem. My CAS has two discrete entries for the same folio number, due to a switch from the regular plan to the direct plan. The Switch In is listed before the Switch Out. What ends up happening is that the program parses the closing unit balance of 7.350 first, and the closing unit balance of 0.00 from the next entry overwrites it. So the parsed closing unit balance is 0.00 and the calc_close is 7.350. This shows up as an error in the CLI version, but raises no error in the normal call. I saw your TODO comment about adding this validation as well, so here's something you can test it against.
If you don't mind me suggesting options, you could maybe add them up instead of replacing, or include calc_close in the dict too.
Also, here is the parsed data of the pdf for your convenience:
'Folio No: 0000000 / 00\t\tPAN: XXXXX0000X\t\tKYC: OK PAN: OK', 'GD65-IDFC Low Duration Fund-Growth-(Direct Plan) (Advisor: INA000000000)\t\tRegistrar : CAMS', 'Opening Unit Balance: 0.000', '14-Aug-2019\t\tNORMAL SWITCH - From IDFC Low Duration Fund-Gr-(Reg Pln)-BSE -\t\t202.98\t\t7.350\t\t27.6156\t\t7.350', 'Closing Unit Balance: 7.350\t\tNAV on 22-Dec-2020: INR 30.3779\t\tValuation on 22-Dec-2020: INR 223.28', '"Entry Load: Nil - Exit Load : Nil W.E.F 29/June/2012 . Please refer the Offer Document / Addendum issued from time to time"', 'Folio No: 0000000 / 00\t\tPAN: XXXX0000X\t\tKYC: OK PAN: OK', 'G65-IDFC Low Duration Fund-Growth-(Regular Plan) (Advisor: ARN-000000)\t\tRegistrar : CAMS', 'Opening Unit Balance: 0.000', '18-Jun-2019\t\tPurchase\t\t200.00\t\t7.424\t\t26.9379\t\t7.424', '19-Jun-2019 ***Address Updated from KRA Data***', '19-Jun-2019 ***Registration of Nominee***', '14-Aug-2019\t\tSwitch Out - To IDFC Low Duration Fund-Gr-(Dir Pln)-BSE -\t\t(202.98)\t\t(7.424)\t\t27.3406\t\t0.000', '30-Sep-2020 ***Address Updated from KRA Data***', 'Closing Unit Balance: 0.000\t\tNAV on 22-Dec-2020: INR 29.9852\t\tValuation on 22-Dec-2020: INR 0.00', '"Entry Load: Nil - Exit Load : Nil W.E.F 29/June/2012 . Please refer the Offer Document / Addendum issued from time to time"
I generated a consolidated report (CAS - CAMS + KFintech) at https://www.camsonline.com/Investors/Statements/Consolidated-Account-Statement
Executing the casparser CLI utility does not complete successfully.
Is this expected?
Please note the error and the exclamation marks in the image below.
Command executed,
$ casparser <filename>.pdf -p '<$password$>'
File Type details,
File Type : FileType.CAMS
CAS Type : CASFileType.DETAILED
Also, the ISIN database is up to date:
(.venv_py310) iceman@pop-os ~/D/M/Statements> casparser-isin --update
2023-08-31 00:26:23,325 - INFO - Fetching remote isin db metadata
2023-08-31 00:26:24,283 - INFO - Local db version : 2023.8.18
2023-08-31 00:26:24,283 - INFO - Remote db version : 2023.8.18
2023-08-31 00:26:24,283 - INFO - casparser-isin database is already upto date
Some scheme descriptions have commas in them:
***IDCW @ Rs.2.95000000 per unit (TDS :138.70, TDS Rate: 7.50%)***
Redemption less TDS, STT
Lateral Shift Out less TDS, STT
Redemption Less STT -BSE - - UTR # CITIN24422132375 , less STT
These cause a problem when reading the CSV file.
Possible solutions:
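One option, sketched with Python's standard csv module, is to make sure fields that contain commas are quoted on output and to read the file back with a CSV-aware reader rather than splitting on commas. The column names, the date and the amount below are illustrative; to_csv and from_csv are hypothetical helpers.

```python
import csv
import io
from typing import List

rows = [
    ["date", "description", "amount"],
    ["2020-06-15", "Redemption less TDS, STT", "100.00"],
]


def to_csv(table: List[List[str]]) -> str:
    """Write rows as CSV; QUOTE_MINIMAL quotes any field containing a comma."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(table)
    return buffer.getvalue()


def from_csv(text: str) -> List[List[str]]:
    """Read CSV text back into rows, honouring quoted fields."""
    return list(csv.reader(io.StringIO(text)))
```

With this approach the description survives a round trip intact, commas included.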
For funds with SWITCH_IN_MERGER transactions, the sale units have to be matched with the purchase transactions of the original fund from which the units were transferred.
HeaderParseError Traceback (most recent call last)
in ()
----> 1 json_str = data = casparser.read_cas_pdf("33220217220210621ZFBF290265631DC70CPIMBCP130542292.pdf", "abcd1234")
2 frames
/usr/local/lib/python3.7/dist-packages/casparser/process.py in parse_header(text)
17 if m:
18 return m.groupdict()
---> 19 raise HeaderParseError("Error parsing CAS header")
20
21
HeaderParseError: Error parsing CAS header
Code:
json_str = data = casparser.read_cas_pdf("33220217220210621ZFBF290265631DC70CPIMBCP130542292.pdf", "xyz")
While running casparser, I get the following error:
data = casparser.read_cas_pdf("CAMS_pranshu766.pdf", "pranshu766")
Deprecation: 'getTextPage' removed from class 'Page' after v1.19.0 - use 'get_textpage'.
Traceback (most recent call last):
File "", line 1, in
File "/home2/ajitup/anaconda3/lib/python3.8/site-packages/casparser/parsers/__init__.py", line 33, in read_cas_pdf
partial_cas_data = cas_pdf_to_text(filename, password)
File "/home2/ajitup/anaconda3/lib/python3.8/site-packages/casparser/parsers/mupdf.py", line 213, in cas_pdf_to_text
investor_info = parse_investor_info(page_dict)
File "/home2/ajitup/anaconda3/lib/python3.8/site-packages/casparser/parsers/mupdf.py", line 145, in parse_investor_info
raise CASParseError("Unable to parse investor data")
casparser.exceptions.CASParseError: Unable to parse investor data
Please help!