Welcome to Doctor, Free Law Project's microservice for converting, extracting and modifying documents and audio files.
At a high level, this service provides you with high-performance HTTP endpoints that can:
- Extract text from various types of documents
- Convert audio files from one format to another while stripping messy metadata
- Create thumbnails of PDFs
- Provide metadata about PDFs
Under the hood, Doctor uses gunicorn to connect to a Django service. The Django service uses carefully configured implementations of ffmpeg, pdftotext, tesseract, ghostscript, and a number of other converters.
Assuming you have Docker installed, run:
docker run -d -p 5050:5050 freelawproject/doctor:latest
This will expose the endpoints on port 5050 with one gunicorn worker. This is usually ideal because it allows you to horizontally scale Doctor using an orchestration system like Kubernetes.
If you are not using a system that supports horizontal scaling, you may wish to have more gunicorn workers so that Doctor can handle more simultaneous tasks. To set that up, simply set the DOCTOR_WORKERS environment variable:
docker run -d -p 5050:5050 -e DOCTOR_WORKERS=16 freelawproject/doctor:latest
If you are doing OCR or audio conversion, scaling through a system like Kubernetes or by giving Doctor many workers becomes particularly important. If no worker is available, your call to Doctor will probably time out.
After the image is running, you should be able to test that you have a working environment by running
curl http://localhost:5050
which should return a text response:
Heartbeat detected.
The service currently supports the following tools:
- Extract text from PDF, RTF, DOC, DOCX, WPD, HTML, or TXT files.
- OCR text from a scanned PDF.
- Get page count for a PDF document.
- Check for bad redactions in a PDF document.
- Convert audio files from wma, ogg, or wav to MP3.
- Create a thumbnail of the first page of a PDF (for use in Open Graph tags)
- Convert an image or images to a PDF.
- Identify the mime type of a file.
A brief description and curl command for each endpoint is provided below.
Given a document, extract the text and assorted metadata. Supports the following document types:
- pdf: Adobe portable document format files, via pdftotext
- doc: Word document files, via antiword
- docx: Office Open XML files, via docx2txt
- html: HTML files, via lxml.html.clean.Cleaner. Strips out dangerous tags and hoists their contents to their parent. Hoisted tags include: a, body, font, noscript, and img.
- txt: Text files. This attempts to normalize all encoding questions to utf-8. First, we try cp1251, then utf-8, ignoring errors.
- wpd: WordPerfect files, via wpd2html, followed by cleaning the HTML as above.
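The txt encoding fallback described above can be sketched in a few lines (a minimal illustration with a hypothetical helper name, not Doctor's actual implementation; note that cp1251 is a single-byte encoding that decodes most inputs, so the utf-8 fallback rarely fires):

```python
def decode_text(raw: bytes) -> str:
    """Normalize bytes to a str: try cp1251 first, then fall back
    to utf-8, ignoring undecodable bytes, as described above."""
    try:
        return raw.decode("cp1251")
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="ignore")
```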
curl 'http://localhost:5050/extract/doc/text/' \
-X 'POST' \
-F "file=@doctor/test_assets/vector-pdf.pdf"
Parameters:
ocr_available: Whether Doctor should use tesseract to provide OCR services for the document. OCR is always possible in Doctor, but sometimes you won't want to use it, since it can be slow. To leave it disabled for a request, omit this optional parameter. To enable it, set ocr_available to True:
curl 'http://localhost:5050/extract/doc/text/?ocr_available=True' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
Magic:
- The mimetype of the file will be determined by the name of the file you pass in. For example, if you pass in medical_assessment.pdf, the pdf extractor will be used.
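You can approximate this name-based routing with the standard-library mimetypes module (a rough analogy only; pick_extractor is a hypothetical helper, not part of Doctor's API):

```python
import mimetypes

def pick_extractor(filename: str) -> str:
    """Guess which extractor applies from the file name alone."""
    mime, _ = mimetypes.guess_type(filename)
    return {
        "application/pdf": "pdf",
        "application/msword": "doc",
        "text/html": "html",
        "text/plain": "txt",
    }.get(mime, "unknown")

pick_extractor("medical_assessment.pdf")  # -> "pdf"
```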
Valid requests will receive a JSON response with the following keys:
- content: The utf-8 encoded text of the file.
- err: An error message, if one occurred.
- extension: The sniffed extension of the file.
- extracted_by_ocr: Whether OCR was needed and used during processing.
- page_count: The number of pages, if applicable.
Given a RECAP PDF, extract the text using pdfplumber, OCR, or a combination of the two.
Parameters:
strip_margin: Whether Doctor should crop the edges of the RECAP document during processing. With pdfplumber, it ignores the traditional one-inch margin. With OCR, it lowers the threshold for discarding OCR gibberish. To enable it, set strip_margin to True:
curl 'http://localhost:5050/extract/recap/text/?strip_margin=True' \
-X 'POST' \
-F "file=@doctor/recap_extract/gov.uscourts.cacd.652774.40.0.pdf"
Valid requests will receive a JSON response with the following keys:
- content: The utf-8 encoded text of the file.
- extracted_by_ocr: Whether OCR was needed and used during processing.
This method takes a document and returns the page count.
curl 'http://localhost:5050/utils/page-count/pdf/' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
This will return an HTTP response with page count. In the above example it would return 2.
This method takes a document and returns the bounding boxes of bad redactions as well as any discovered text.
curl 'http://localhost:5050/utils/check-redactions/pdf/' \
-X 'POST' \
-F "file=@doctor/test_assets/x-ray/rectangles_yes.pdf"
This returns a JSON response with the bounding box(es) and recovered text:
{
"error": false,
"results": {
"1": [
{
"bbox": [
412.54998779296875,
480.6099853515625,
437.8699951171875,
494.39996337890625
],
"text": "“No”"
},
{
"bbox": [
273.3500061035156,
315,
536.8599853515625,
328.79998779296875
],
"text": "“Yes”, but did not disclose all relevant medical history"
},
{
"bbox": [
141.22999572753906,
232.20001220703125,
166.54998779296875,
246
],
"text": "“No”"
}
]
}
}
The "error" field is set if there was an issue processing the PDF.
If "results" is empty, no bad redactions were found; otherwise it maps each page number to a list of bounding boxes along with the text recovered.
See: https://github.com/freelawproject/x-ray/#readme
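Since "results" is keyed by page number, walking it is a nested loop; a small sketch (summarize_redactions is a hypothetical helper, and the sample is trimmed from the response above):

```python
def summarize_redactions(response: dict) -> list[tuple[str, str]]:
    """Flatten the per-page x-ray results into (page, recovered_text) pairs."""
    if response["error"]:
        raise ValueError("x-ray failed to process the PDF")
    return [
        (page, box["text"])
        for page, boxes in response["results"].items()
        for box in boxes
    ]

sample = {
    "error": False,
    "results": {"1": [{"bbox": [412.55, 480.61, 437.87, 494.4], "text": "“No”"}]},
}
```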
This method takes a document and returns the mime type.
curl 'http://localhost:5050/utils/mime-type/?mime=False' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
This returns a JSON response identifying the document type:
{"mimetype": "PDF document, version 1.3"}
and
curl 'http://localhost:5050/utils/mime-type/?mime=True' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf"
This returns a JSON response identifying the document type:
{"mimetype": "application/pdf"}
Another example:
curl 'http://localhost:5050/utils/mime-type/?mime=True' \
-X 'POST' \
-F "file=@doctor/test_assets/word-doc.doc"
returns
{"mimetype": "application/msword"}
This method is useful for identifying a document's type and for flagging incorrect or unusual documents.
This method takes an image PDF and returns the PDF with transparent text overlaid on the document. This allows users to copy and paste (more or less) from our OCR'd text.
curl 'http://localhost:5050/utils/add/text/pdf/' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf" \
-o image-pdf-with-embedded-text.pdf
This endpoint returns the duration of an MP3 file.
curl 'http://localhost:5050/utils/audio/duration/' \
-X 'POST' \
-F "file=@doctor/test_assets/1.mp3"
This method takes a document from the federal filing system and returns its document entry number.
curl 'http://localhost:5050/utils/document-number/pdf/' \
-X 'POST' \
-F "file=@doctor/test_assets/recap_documents/ca2_1-1.pdf"
This will return an HTTP response with the document number. In the above example it would return 1-1.
Given an image of indeterminate length, this endpoint will convert it to a pdf with reasonable page breaks. This is meant for extremely long images that represent multi-page documents, but can be used to convert a smaller image to a one-page PDF.
curl 'http://localhost:5050/convert/image/pdf/' \
-X 'POST' \
-F "file=@doctor/test_assets/long-image.tiff" \
--output test-image-to-pdf.pdf
Keep in mind that this curl will write the file to the current directory.
Given a list of urls for images, this endpoint will convert them to a pdf. This can be used to convert multiple images to a multi-page PDF. We use this to convert financial disclosure images to simple PDFs.
curl 'http://localhost:5050/convert/images/pdf/?sorted_urls=%5B%22https%3A%2F%2Fcom-courtlistener-storage.s3-us-west-2.amazonaws.com%2Ffinancial-disclosures%2F2011%2FA-E%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11_Page_1.tiff%22%2C+%22https%3A%2F%2Fcom-courtlistener-storage.s3-us-west-2.amazonaws.com%2Ffinancial-disclosures%2F2011%2FA-E%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11_Page_2.tiff%22%5D' \
-X POST \
-o image.pdf
This returns the binary data of the pdf.
Thumbnail takes a pdf and returns a png thumbnail of the first page.
curl 'http://localhost:5050/convert/pdf/thumbnail/' \
-X 'POST' \
-F "file=@doctor/test_assets/image-pdf.pdf" \
-o test-thumbnail.png
This returns the binary data of the thumbnail.
Keep in mind that this curl will also write the file to the current directory.
Given a PDF and a range of pages, this endpoint will return a zip file containing thumbnails for each page requested. It also takes an optional parameter, max_dimension, which scales the long side of each thumbnail (width for landscape pages, height for portrait pages) to fit the specified number of pixels.
For example, if you want thumbnails for the first four pages:
curl 'http://localhost:5050/convert/pdf/thumbnails/' \
-X 'POST' \
-F "file=@doctor/test_assets/vector-pdf.pdf" \
-F 'pages="[1,2,3,4]"' \
-F 'max_dimension=350' \
-o thumbnails.zip
This will return four thumbnails in a zip file.
This endpoint takes an audio file and converts it to an MP3 file. This is used to convert different audio formats from courts across the country and standardizes the format for our end users.
This endpoint also adds the seal of the court to the MP3 file and updates the metadata to reflect our changes.
curl 'http://localhost:5050/convert/audio/mp3/?audio_data=%7B%22court_full_name%22%3A+%22Testing+Supreme+Court%22%2C+%22court_short_name%22%3A+%22Testing+Supreme+Court%22%2C+%22court_pk%22%3A+%22test%22%2C+%22court_url%22%3A+%22http%3A%2F%2Fwww.example.com%2F%22%2C+%22docket_number%22%3A+%22docket+number+1+005%22%2C+%22date_argued%22%3A+%222020-01-01%22%2C+%22date_argued_year%22%3A+%222020%22%2C+%22case_name%22%3A+%22SEC+v.+Frank+J.+Custable%2C+Jr.%22%2C+%22case_name_full%22%3A+%22case+name+full%22%2C+%22case_name_short%22%3A+%22short%22%2C+%22download_url%22%3A+%22http%3A%2F%2Fmedia.ca7.uscourts.gov%2Fsound%2Fexternal%2Fgw.15-1442.15-1442_07_08_2015.mp3%22%7D' \
-X 'POST' \
-F "file=@doctor/test_assets/1.wma"
This returns the audio file as a file response.
This endpoint takes an audio file and converts it to an OGG file. The conversion process downsizes files by using a single audio channel and fixing the sampling rate to 8 kHz.
This endpoint also optimizes the output for voice over IP applications.
curl 'http://localhost:5050/convert/audio/ogg/' \
-X 'POST' \
-F "file=@doctor/test_assets/1.wma"
This returns the audio file as a file response.
Testing is designed to be run with the docker-compose.dev.yml file. To learn more about testing, check out the DEVELOPING.md file.
For debugging purposes, it's possible to set your Sentry DSN to send events to Sentry. By default, no SENTRY_DSN is set and no events will be sent to Sentry. To use Sentry, set the SENTRY_DSN environment variable to your DSN. Using Docker, you can set it with:
docker run -d -p 5050:5050 -e SENTRY_DSN=<https://your-sentry-dsn> freelawproject/doctor:latest
doctor's Issues
No need for clever get_audio_binary junk
This whole mess can be vastly simplified. It was only needed during the terrible times, when ffmpeg and avconv were duking it out while remaining API-compatible (it was dumb).
These days, I imagine you installed ffmpeg, and we can just forget the whole thing happened.
We should be able to x-ray PDFs
We should have a microservice for looking for redactions in PDFs
All requests.get and .post requests must always have timeouts
This line shows a requests.get call that doesn't have a timeout parameter. Can you grep the code and check that there aren't any other instances?
Thank you.
Changelog versions and releases to docker hub are untethered
I guess this is no surprise, but we need to do something to make our release notes match up with our releases to docker hub.
For example, a few days ago the changelog says we released version 0.3.0. That's cool, but we stopped tagging our docker builds that way and instead use git hashes now.
This doesn't affect us in prod because we just deploy the correct tag from docker when we deploy, BUT for anybody reading our release notes, this is not good, since they can't choose which version they want.
I guess we need to either:
- Go all in on the git hashes and start putting them into the release notes (and thus give up semantic versioning).
- Figure out how to put the version number into the repo somehow and make sure that we remember to update it when needed. Also, add the tag to the builds we push to docker.
Some PDFs use symbol fonts
One such example opinion illustrates the issue:
remanded for a new trial, holding that defendant had Aadequately
alleged plain error@ where the trial court abused its discretion in
…
442, 446. We granted the State=s petition for leave to appeal under
Open double quotes are A, closing double quotes are @, and curly apostrophes are =. Ideally these would be transcribed using their Unicode counterparts.
The upstream PDF, both the original and the CL copy, display correctly:
But they don't copy-paste correctly! The highlighted section pastes as Aadequately alleged plain error@, exactly as described. So what's going on?
Those incorrect characters are rendered using a font named ABCGOP+WPTypographicSymbols, which, as the name indicates, is in fact a symbol font:
Some component inside CL converts PDFs to plain text. That component should be extended to recognize this font, and then to apply a translation table mapping the symbols in this font to their Unicode equivalents.
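That translation table could look something like this (the three pairs come from the examples above; a real WP Typographic Symbols mapping has many more entries, and it must only be applied to text runs rendered in that font, since A, @, and = are ordinary characters elsewhere):

```python
# Stand-ins used by the WPTypographicSymbols font, per the issue above:
# A -> left double quote, @ -> right double quote, = -> right single quote.
WP_SYMBOLS = str.maketrans({"A": "\u201c", "@": "\u201d", "=": "\u2019"})

def fix_wp_symbols(text: str) -> str:
    """Map symbol-font stand-ins to their Unicode counterparts."""
    return text.translate(WP_SYMBOLS)
```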
Everything is slow
BTE runs slow. Too slow. For a process that should take under 8 seconds, it takes nearly 118.
Requirements needs to be updated because of LXML changes
doctor | File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
doctor | File "/opt/app/doctor/urls.py", line 3, in <module>
doctor | from . import views
doctor | File "/opt/app/doctor/views.py", line 34, in <module>
doctor | from doctor.tasks import (
doctor | File "/opt/app/doctor/tasks.py", line 18, in <module>
doctor | from lxml.html.clean import Cleaner
doctor | File "/usr/local/lib/python3.10/site-packages/lxml/html/clean.py", line 18, in <module>
doctor | raise ImportError(
doctor | ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
doctor | Install lxml[html_clean] or lxml_html_clean directly.
I'm updating Doctor to better handle images and annotations inside a PDF, but I came across our new friend: the removal of Cleaner functionality from lxml.
We should remove and replace this code. I assume it's already causing issues that we haven't noticed yet, or may soon enough.
IndexError getting PACER doc number from PDF headers
Sentry Issue: DOCTOR-P
IndexError: list index out of range
File "django/core/handlers/exception.py", line 47, in inner
response = get_response(request)
File "django/core/handlers/base.py", line 181, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "doctor/views.py", line 350, in get_document_number
document_number = get_document_number_from_pdf(fp)
File "doctor/tasks.py", line 596, in get_document_number_from_pdf
document_number = [dn for dn in document_number_matches[0] if dn]
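A defensive fix is to check the match list before indexing into it (a sketch; the regex and helper name here are illustrative, not Doctor's actual pattern):

```python
import re

def first_document_number(text: str):
    """Return the first PACER-style document number in the text,
    or None instead of raising IndexError when nothing matches."""
    matches = re.findall(r"Document\s+(\d+(?:-\d+)?)", text)
    return matches[0] if matches else None
```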
Be lenient in audio metadata we accept
We do some fun stuff in the set_mp3_meta_data function to make the file better, but in doing so, we make our code fragile. If small pieces of data are missing, we crash and fail to make an MP3. For example, the error below is because we don't have the date argued.
Sentry Issue: DOCTOR-V
KeyError: 'date_argued'
File "doctor/views.py", line 353, in convert_audio
set_mp3_meta_data(audio_data, filepath)
File "doctor/tasks.py", line 496, in set_mp3_meta_data
date_argued = audio_data["date_argued"]
Lame.
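A more lenient approach treats every tag as optional, so a missing key degrades to an absent tag instead of a KeyError (a hypothetical sketch; the real set_mp3_meta_data does much more than this):

```python
def build_tags(audio_data: dict) -> dict:
    """Collect only the metadata fields that are actually present."""
    wanted = ("date_argued", "case_name", "court_full_name", "download_url")
    return {key: audio_data[key] for key in wanted if key in audio_data}
```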
Seals Rookery req for audio processing
Need to include seals rookery which is a separate docker image.
This should be solved using docker-compose and linking volumes correctly. Fortunately, seals-rookery is Python 2/3 compatible, although the installer currently is not.
build deploy pipeline breaks when merging prs
Went to pull in the new version after you merged #188 and noticed that the Docker Hub build hadn't been updated. When I checked the deploy action, it seems to fail consistently on merge, missing the credentials needed to push to Docker Hub. https://github.com/freelawproject/doctor/actions/workflows/deploy.yml
Not sure if this is intentional but wanted to flag it.
Improve Readme
The readme is getting better but still needs some work.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 8811: invalid start byte
Sentry Issue: DOCTOR-E
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 8811: invalid start byte
File "doctor/views.py", line 101, in extract_doc_content
content, err, returncode = extract_from_html(fp)
File "doctor/tasks.py", line 339, in extract_from_html
content = f.read()
This is linked to the courtlistener Sentry issue https://freelawproject.sentry.io/issues/5017932231/?project=5257254, the events were registered at almost the same time.
This is also one of the causes of freelawproject/courtlistener#3811.
Filed by @grossir
Timeouts in tests
Need to add timeouts to tests in Doctor.
We had timeouts that were too short in CL, and we would've known this if we had added timeouts to Doctor's tests.
DependencyError: PyCryptodome is required for AES algorithm
Sentry Issue: DOCTOR-D
DependencyError: PyCryptodome is required for AES algorithm
(8 additional frame(s) were not displayed)
...
File "PyPDF2/_reader.py", line 1146, in get_object
retval = self._encryption.decrypt_object(
File "PyPDF2/_encryption.py", line 741, in decrypt_object
return cf.decrypt_object(obj)
File "PyPDF2/_encryption.py", line 182, in decrypt_object
obj[dictkey] = self.decrypt_object(value)
File "PyPDF2/_encryption.py", line 176, in decrypt_object
data = self.strCrypt.decrypt(obj.original_bytes)
File "PyPDF2/_encryption.py", line 141, in decrypt
raise DependencyError("PyCryptodome is required for AES algorithm")
Unable to convert audio file due to encoding issue
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 162: ordinal not in range(256)
Sentry Issue: DOCTOR-N
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 162: ordinal not in range(256)
(3 additional frame(s) were not displayed)
...
File "doctor/tasks.py", line 479, in set_mp3_meta_data
audio_file.tag.audio_source_url = audio_data["download_url"]
File "eyed3/id3/tag.py", line 797, in audio_source_url
self._setUrlFrame(frames.URL_AUDIOSRC_FID, url)
File "eyed3/id3/tag.py", line 759, in _setUrlFrame
self.frame_set[fid] = frames.UrlFrame(fid, url)
File "eyed3/id3/frames.py", line 421, in __init__
self.url = url
File "eyed3/id3/frames.py", line 432, in url
url.encode(ISO_8859_1) # Likewise, it must encode
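ID3 URL frames are restricted to ISO-8859-1, so one workaround is to percent-encode any characters outside that range before handing the URL to eyed3 (a sketch of the idea, not a patch to eyed3):

```python
from urllib.parse import quote

def latin1_safe_url(url: str) -> str:
    """Percent-encode characters outside latin-1 so the result fits
    in an ISO-8859-1-only ID3 URL frame.  The safe set keeps URL
    delimiters and existing percent-escapes intact."""
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")
```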
Docs are unclear, types are over-broad, cleanup needed
Over in https://github.com/freelawproject/courtlistener/pull/2312/files#diff-6b97c7d7c910daaa051cf22884b960484e47ee033e28656eaa1cbd3aa885ed36L205, @albertisfu made a change from:
content = response.content
To:
content = response.text
That made me wonder if doctor returns utf-8-encoded text or something else. I dug in. The docs didn't say, the code wasn't clear, and I saw some commented out stuff that shouldn't be in our finished work.
Improvements to text extraction needed
The needs-OCR function needs to be improved. Currently we do this to determine if something that is OCR-eligible should be OCR'd:
The Situation
if content.strip() == "" or pdf_has_images(path):
return True
The content is generated from pdftotext using this code:
process = subprocess.Popen(
["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
shell=False,
stdout=subprocess.PIPE,
stderr=subprocess.DEVNULL,
)
content, err = process.communicate()
return content.decode(), err, process.returncode
Later, downstream on CL, we take the content and ask: are we sure we didn't need to OCR this? We do this:
for line in content.splitlines():
line = line.strip()
if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
continue
elif line:
# We found a line with good content. No OCR needed.
return False
# We arrive here if no line was found containing good content.
return True
Here we look for any row that doesn't appear to be a Bates stamp, and as long as we find any text, garbled or otherwise, we say we are good to go.
Unfortunately, this leads to some seriously garbled plain text in our RECAP archive, and potentially our opinion DB.
Examples
I don't want to rag on pdftotext; it has done an admirable job for the most part, but I do not think it is the best way to approach what we are dealing with now. For one, we are attempting to extract content and place it into a plain-text DB field. This is challenging because a good number of documents contain PDF objects, such as /widgets, /annotations, /freetext, /Stamp, and /Popup. Although this is not an exhaustive list, we also see links and signatures, and I'm sure more types.
In addition to the complexity of handling documents that contain PDF stream objects, we also have to deal with images inserted into PDFs, or even worse, the first or last page being a rasterized PDF page while the middle 30-odd pages are vector PDFs.
In this case, our checks fail, and we have no way to catch them because, after we iterate beyond the Bates stamp on page 2, we get good text. See: gov.uscourts.nysd.411264.100.0.pdf
This also fails when, for example, a free-text widget is added onto the PDF page of an image and crosses out or adds content to the page.
Here is an example of a non-image PDF page containing a FreeText widget (a widget, I think; it could be something different) meant to cross out the PROPOSED part.
This is not the perfect example, because the underlying content appears to contain text, but it is corrupted and looks like this:
In fact, williams-v-t-mobile
Side by Side comparison of Williams v T-Mobile
Note that PROPOSED is incorrectly added to the text, frustrating the adjustment made by the court, which is noted in the document itself.
Angled, Circular, and Sideways Text
Not to be outdone, many judges (👋 CAND) like to use stamps with circular text. These stamps are often at the end of the document, but not exclusively. In doing so, the courts introduce gibberish into our documents when we extract the text or OCR them.
For example, gov.uscourts.cand.16711.1203.0.pdf and another file have them adjacent to the text. One of these is stamped into an image PDF and the other is in a regular PDF, garbling it.
In both cases, the content that is generated makes the needs-OCR test fail to identify a needed OCR.
Sideways Text
We also run into a problem where pdftotext does an amazing job of figuring out text set on its side and writing it into the output. This is just a fancy thing some courts, and some firms, like to do.
But look at the result: it unnaturally expands the plain text and frustrates plain-text searches, in this case and in others (see below).
Margin Text
Occasionally the use of margin text in a small font causes some weird creations in the text, which again produce extra-wide text that is hard to view and display, and which I think makes it hard to query or search for the content you may be looking for.
Final complaint (Bates Stamps)
Bates stamps on every page are ingested into the content and don't reflect the document that was generated. I would not expect to see Bates stamps or sidebar content in a published book, so I don't think we should display them in the plain text.
What should we do
If you've read this far, @mlissner, I know you must be dying to hear what I think the solution is.
We should drop (I think) pdftotext for, you guessed it, pdfplumber.
pdfplumber can better sample the PDFs to determine whether an entire page is likely an image, while correctly guessing that lines or signatures are in the document and leaving them be. Additionally, we can easily extract the pure text of the document while avoiding the pitfalls described above.
We should also drop the check in CL and make all of those assessments here in Doctor.
Solutions coming in the next post.
Add Sentry to Doctor
We had Sentry for BTE, but I don't think it made the transfer over to Doctor. This is probably a mistake.
`fetch_audio_duration` is not working properly
I am copying this graph from freelawproject/courtlistener#440, which shows that for the same duration, we get different file sizes when querying the actual bucket (and checking the length of the downloaded bytes).
Another, more colorful graph that takes the year from date_created shows that the problem runs from late 2019 to the present.
Examples of wrong and correct durations:
- one created in 2023 with duration 3028 (3028/60 = 50.46), but it lasts 58:46 on the audio player
- a correct one created in 2014 with duration 3029 that lasts 50:25 on the audio player (which roughly matches 3029/60 = 50.48)
- one created in 2023 with duration 2000 (2000/60 = 33.33), but it lasts 38:49 on the audio player
- a correct one created in 2014 with duration 2001 that lasts 33:18 on the audio player
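For quick sanity checks, converting stored seconds to the m:ss form the player displays is just divmod arithmetic (nothing Doctor-specific):

```python
def mmss(seconds: int) -> str:
    """Render seconds as the m:ss display an audio player shows."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"

mmss(3028)  # -> "50:28", nowhere near the 58:46 the player reports
```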
Code that needs correcting:
Line 354 in 4009f00
MP3: Encoding issues with non latin-1 compatible inputs
Sentry Issue: DOCTOR-X
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 162: ordinal not in range(256)
(3 additional frame(s) were not displayed)
...
File "doctor/views.py", line 353, in convert_audio
set_mp3_meta_data(audio_data, filepath)
File "doctor/tasks.py", line 503, in set_mp3_meta_data
audio_file.tag.audio_source_url = audio_data["download_url"]
UTF-8 Encoding Latin-1 Bug in HTML files
Unfortunately, we are experiencing occasional crashes on CL from the inability to extract HTML from certain HTML files.
These occur mostly around HTML file downloads from the NY lower courts, with messages like *Error: 'utf-8' codec can't decode byte 0xe9 in position 437. Having gone through a number of them, I think this is a latin-1/utf-8 encoding bug.
Soo... I've reviewed the HTML text extraction and found at least one bug in the encode/decode part of the loop: if utf-8 fails, it doesn't bother looping through the other encodings. But I'm not sure that is the source of the error. So, for now, I've updated the error messages, fixed that bug, and I expect to wait and see if my improved error messages can catch the bug more directly.
ResourceWarnings cont.
ResourceWarning: Enable tracemalloc to get the object allocation traceback
57 Resource Warnings appeared after the switch to py3.8. We had the same issue in the switch to Python 3 for CL.
/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/unittest/case.py:630: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, raddr=/var/run/docker.sock>
Doctor not identifying PACER headers very well
I noticed a couple cases today where the OCR didn't trigger but should have:
https://www.courtlistener.com/docket/63348437/1/navarro-v-pelosi/
https://www.courtlistener.com/docket/5319662/1/unicorn-investment-bank-v-kuruvilla/
In both cases, the text that's extracted looks like:
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 1 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 2 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 3 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 4 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 5 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 6 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 7 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 8 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 9 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 10 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 11 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 12 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 13 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 14 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 15 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 16 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 17 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 18 of 18
We used to have a kinda flakey regex for this, but perhaps it's not working? Or needs a tweak?
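A stricter pattern, built on the same leading words the CL heuristic already skips (Case, Appellate, Appeal, USCA), might look like this (a hypothetical regex, not the one Doctor ships):

```python
import re

# Matches PACER stamp lines such as:
#   Case 3:05-cv-00281-GCM  Document 1  Filed 06/17/05  Page 1 of 18
PACER_HEADER = re.compile(
    r"^(?:Case|Appellate|Appeal|USCA)\b.*"
    r"Document\s+\d+.*Page\s+\d+\s+of\s+\d+",
    re.IGNORECASE,
)

def is_pacer_header(line: str) -> bool:
    """True when a line looks like a PACER stamp rather than body text."""
    return bool(PACER_HEADER.search(line.strip()))
```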
Support generating multiple thumbnails from a PDF
When we finally get back to working on the Big Cases Bot, we'll probably want to be able to generate multiple thumbnails for a single PDF. The old bot would tweet with four thumbnails from the PDF.
I guess the two ways to do this would be to either:
- Have an API where you say which pages you want and it somehow gives you back multiple pages. (Does HTTP support this?)
- Have an API where you say which page (singular) you want. If you don't say, it gives you page one. If you do, it gives you the page you requested.
Gunicorn only has one worker and thus not much performance
I might be wrong about this, but... Sentry isn't very happy with us right now because, unless I'm missing something, Doctor's gunicorn config is set up with only one worker to serve requests:
https://github.com/freelawproject/doctor/blob/main/docker/docker-entrypoint.sh#L2
In CL, we have 48 workers configured to serve the website, and Doctor certainly needs more than one, particularly since it's often tied up with tesseract stuff.
Here's the config: https://docs.gunicorn.org/en/stable/settings.html#workers
The Sentry issue is: COURTLISTENER-2BA
It's triggering now because we're doing a lot of crawling at the moment.