
Doctor

Welcome to Doctor, Free Law Project's microservice for converting, extracting and modifying documents and audio files.

At a high level, this service provides you with high-performance HTTP endpoints that can:

  • Extract text from various types of documents
  • Convert audio files from one format to another while stripping messy metadata
  • Create thumbnails of PDFs
  • Provide metadata about PDFs

Under the hood, Doctor uses gunicorn to serve a Django service. The Django service uses carefully configured implementations of ffmpeg, pdftotext, tesseract, ghostscript, and a number of other converters.

Quick Start

Assuming you have Docker installed, run:

docker run -d -p 5050:5050 freelawproject/doctor:latest

This will expose the endpoints on port 5050 with one gunicorn worker. This is usually ideal because it allows you to horizontally scale Doctor using an orchestration system like Kubernetes.

If you are not using a system that supports horizontal scaling, you may wish to have more gunicorn workers so that Doctor can handle more simultaneous tasks. To set that up, simply set the DOCTOR_WORKERS environment variable:

docker run -d -p 5050:5050 -e DOCTOR_WORKERS=16 freelawproject/doctor:latest

If you are doing OCR or audio conversion, scaling through a system like Kubernetes or by giving Doctor many workers becomes particularly important. If no worker is available, your call to Doctor will probably time out.

After the image is running, you should be able to test that you have a working environment by running

curl http://localhost:5050

which should return a text response:

Heartbeat detected.
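The same check can be scripted. Below is a minimal sketch using only Python's standard library; the helper name and the base URL are our assumptions, matching the docker run above:

```python
from urllib.request import urlopen

DOCTOR_URL = "http://localhost:5050"  # assumed from the docker run above


def is_healthy(base_url=DOCTOR_URL, timeout=5.0):
    """Return True if Doctor answers with its heartbeat message."""
    try:
        with urlopen(base_url, timeout=timeout) as resp:
            return b"Heartbeat detected" in resp.read()
    except OSError:
        # Connection refused, DNS failure, timeout, etc.
        return False


if __name__ == "__main__":
    print("Doctor is up" if is_healthy() else "Doctor is down")
```

This is handy as a readiness probe in an orchestration system.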

ENDPOINTS

Overview

The service currently supports the following tools:

  1. Extract text from PDF, RTF, DOC, DOCX, WPD, HTML, or TXT files.
  2. OCR text from a scanned PDF.
  3. Get page count for a PDF document.
  4. Check for bad redactions in a PDF document.
  5. Convert audio files from wma, ogg, wav to MP3.
  6. Create a thumbnail of the first page of a PDF (for use in Open Graph tags).
  7. Convert an image or images to a PDF.
  8. Identify the mime type of a file.

A brief description and curl command for each endpoint is provided below.

Extractors

Endpoint: /extract/doc/text/

Given a document, extract out the text and assorted metadata. Supports the following document types:

  • pdf - Adobe portable document format files, via pdftotext.
  • doc - Word document files, via antiword.
  • docx - Open Office XML files, via docx2txt.
  • html - HTML files, via lxml.html.clean.Cleaner. Strips out dangerous tags and hoists their contents to their parent. Hoisted tags include: a, body, font, noscript, and img.
  • txt - Text files. This attempts to normalize all encoding questions to utf-8. First, we try cp1251, then utf-8, ignoring errors.
  • wpd - Word Perfect files, via wpd2html followed by cleaning the HTML as above.
curl 'http://localhost:5050/extract/doc/text/' \
  -X 'POST' \
  -F "file=@doctor/test_assets/vector-pdf.pdf"

Parameters:

  • ocr_available: Whether doctor should use tesseract to provide OCR services for the document. OCR is always possible in doctor, but sometimes you won't want to use it, since it can be slow. If you want it disabled for this request, omit this optional parameter. To enable it, set ocr_available to True:
curl 'http://localhost:5050/extract/doc/text/?ocr_available=True' \
  -X 'POST' \
  -F "file=@doctor/test_assets/image-pdf.pdf"

Magic:

  • The mimetype of the file will be determined by the name of the file you pass in. For example, if you pass in medical_assessment.pdf, the pdf extractor will be used.

Valid requests will receive a JSON response with the following keys:

  • content: The utf-8 encoded text of the file
  • err: An error message, if one should occur.
  • extension: The sniffed extension of the file.
  • extracted_by_ocr: Whether OCR was needed and used during processing.
  • page_count: The number of pages, if it applies.
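A sketch of calling this endpoint from Python and sanity-checking the documented keys (the helper name is ours; `requests` is a third-party dependency):

```python
# Keys documented for /extract/doc/text/ responses.
EXPECTED_KEYS = {"content", "err", "extension", "extracted_by_ocr", "page_count"}


def validate_extract_response(payload):
    """True if an /extract/doc/text/ response carries all documented keys."""
    return EXPECTED_KEYS.issubset(payload)


if __name__ == "__main__":
    import requests  # third-party: pip install requests

    with open("doctor/test_assets/vector-pdf.pdf", "rb") as f:
        resp = requests.post(
            "http://localhost:5050/extract/doc/text/", files={"file": f}
        )
    data = resp.json()
    assert validate_extract_response(data)
    print(data["content"][:200])
```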

Endpoint: /extract/recap/text/

Given a RECAP PDF, extract the text using pdfplumber, OCR, or a combination of the two.

Parameters:

  • strip_margin: Whether Doctor should crop the edges of the RECAP document during processing. With pdfplumber, it ignores the traditional one-inch margin. With OCR, it lowers the threshold for hiding OCR gibberish. To enable it, set strip_margin to True:
curl 'http://localhost:5050/extract/recap/text/?strip_margin=True' \
  -X 'POST' \
  -F "file=@doctor/recap_extract/gov.uscourts.cacd.652774.40.0.pdf"

Valid requests will receive a JSON response with the following keys:

  • content: The utf-8 encoded text of the file
  • extracted_by_ocr: Whether OCR was needed and used during processing.

Utilities

Endpoint: /utils/page-count/pdf/

This method takes a document and returns the page count.

curl 'http://localhost:5050/utils/page-count/pdf/' \
 -X 'POST' \
 -F "file=@doctor/test_assets/image-pdf.pdf"

This will return an HTTP response with the page count. In the above example it would return 2.

Endpoint: /utils/check-redactions/pdf/

This method takes a document and returns the bounding boxes of bad redactions as well as any discovered text.

curl 'http://localhost:5050/utils/check-redactions/pdf/' \
  -X 'POST' \
  -F "file=@doctor/test_assets/x-ray/rectangles_yes.pdf"

This returns a JSON response with the bounding box(es) and recovered text:

{
  "error": false,
  "results": {
    "1": [
      {
        "bbox": [
          412.54998779296875,
          480.6099853515625,
          437.8699951171875,
          494.39996337890625
        ],
        "text": "“No”"
      },
      {
        "bbox": [
          273.3500061035156,
          315,
          536.8599853515625,
          328.79998779296875
        ],
        "text": "“Yes”, but did not disclose all relevant medical history"
      },
      {
        "bbox": [
          141.22999572753906,
          232.20001220703125,
          166.54998779296875,
          246
        ],
        "text": "“No”"
      }
    ]
  }
}

The "error" field is set if there was an issue processing the PDF.

If "results" is empty, no bad redactions were found; otherwise it maps page numbers to lists of bounding boxes along with the recovered text.

See: https://github.com/freelawproject/x-ray/#readme
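A small sketch of consuming this response in Python (the function name is ours):

```python
def flatten_redactions(payload):
    """Flatten a /utils/check-redactions/pdf/ response into
    (page_number, bbox, recovered_text) tuples."""
    if payload.get("error"):
        raise ValueError("Doctor reported an error while processing the PDF")
    flat = []
    for page, boxes in payload.get("results", {}).items():
        for box in boxes:
            flat.append((int(page), box["bbox"], box["text"]))
    return flat
```

With the sample response above, this yields three tuples, all on page 1.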

Endpoint: /utils/mime-type/

This method takes a document and returns the mime type.

curl 'http://localhost:5050/utils/mime-type/?mime=False' \
 -X 'POST' \
 -F "file=@doctor/test_assets/image-pdf.pdf"

returns a JSON response identifying the document type:

{"mimetype": "PDF document, version 1.3"}

and

curl 'http://localhost:5050/utils/mime-type/?mime=True' \
 -X 'POST' \
 -F "file=@doctor/test_assets/image-pdf.pdf"

returns a JSON response identifying the document type:

{"mimetype": "application/pdf"}

Another example

curl 'http://localhost:5050/utils/mime-type/?mime=True' \
 -X 'POST' \
 -F "file=@doctor/test_assets/word-doc.doc"

returns

{"mimetype": "application/msword"}

This method is useful for identifying a document's type, and for spotting incorrect or unusual documents.

Endpoint: /utils/add/text/pdf/

This method will take an image PDF and return the PDF with transparent text overlaid on the document. This allows users to copy and paste (more or less) from our OCR'd text.

curl 'http://localhost:5050/utils/add/text/pdf/' \
 -X 'POST' \
 -F "file=@doctor/test_assets/image-pdf.pdf" \
 -o image-pdf-with-embedded-text.pdf

Endpoint: /utils/audio/duration/

This endpoint returns the duration of an MP3 file.

curl 'http://localhost:5050/utils/audio/duration/' \
 -X 'POST' \
 -F "file=@doctor/test_assets/1.mp3"

Endpoint: /utils/document-number/pdf/

This method takes a document from the federal filing system and returns its document entry number.

curl 'http://localhost:5050/utils/document-number/pdf/' \
 -X 'POST' \
 -F "file=@doctor/test_assets/recap_documents/ca2_1-1.pdf"

This will return an HTTP response with the document number. In the above example it would return 1-1.

Converters

Endpoint: /convert/image/pdf/

Given an image of indeterminate length, this endpoint will convert it to a pdf with reasonable page breaks. This is meant for extremely long images that represent multi-page documents, but can be used to convert a smaller image to a one-page PDF.

curl 'http://localhost:5050/convert/image/pdf/' \
 -X 'POST' \
 -F "file=@doctor/test_assets/long-image.tiff" \
  --output test-image-to-pdf.pdf

Keep in mind that this curl will write the file to the current directory.

Endpoint: /convert/images/pdf/

Given a list of urls for images, this endpoint will convert them to a pdf. This can be used to convert multiple images to a multi-page PDF. We use this to convert financial disclosure images to simple PDFs.

curl 'http://localhost:5050/convert/images/pdf/?sorted_urls=%5B%22https%3A%2F%2Fcom-courtlistener-storage.s3-us-west-2.amazonaws.com%2Ffinancial-disclosures%2F2011%2FA-E%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11_Page_1.tiff%22%2C+%22https%3A%2F%2Fcom-courtlistener-storage.s3-us-west-2.amazonaws.com%2Ffinancial-disclosures%2F2011%2FA-E%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11%2FArmstrong-SB%2520J3.%252009.%2520CAN_R_11_Page_2.tiff%22%5D' \
    -X POST \
    -o image.pdf

This returns the binary data of the pdf.

Endpoint: /convert/pdf/thumbnail/

Thumbnail takes a pdf and returns a png thumbnail of the first page.

curl 'http://localhost:5050/convert/pdf/thumbnail/' \
 -X 'POST' \
 -F "file=@doctor/test_assets/image-pdf.pdf" \
 -o test-thumbnail.png

This returns the binary data of the thumbnail.

Keep in mind that this curl will also write the file to the current directory.

Endpoint: /convert/pdf/thumbnails/

Given a PDF and a range of pages, this endpoint will return a zip file containing a thumbnail for each page requested. This endpoint also takes an optional parameter, max_dimension, which scales the long side of each thumbnail (width for landscape pages, height for portrait pages) to fit within the specified number of pixels.

For example, if you want thumbnails for the first four pages:

curl 'http://localhost:5050/convert/pdf/thumbnails/' \
 -X 'POST' \
 -F "file=@doctor/test_assets/vector-pdf.pdf" \
 -F 'pages="[1,2,3,4]"' \
 -F 'max_dimension=350' \
 -o thumbnails.zip

This will return four thumbnails in a zip file.
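The returned archive can be unpacked with the standard library. A sketch (the helper name is ours, and the entry names inside the zip are whatever Doctor writes):

```python
import io
import zipfile
from pathlib import Path


def unpack_thumbnails(zip_bytes, out_dir):
    """Extract every entry of the returned zip; return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            zf.extract(name, out)
            written.append(str(out / name))
    return written
```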

Endpoint: /convert/audio/mp3/

This endpoint takes an audio file and converts it to an MP3 file. This is used to convert different audio formats from courts across the country and standardizes the format for our end users.

This endpoint also adds the seal of the court to the MP3 file and updates the metadata to reflect our changes.

curl 'http://localhost:5050/convert/audio/mp3/?audio_data=%7B%22court_full_name%22%3A+%22Testing+Supreme+Court%22%2C+%22court_short_name%22%3A+%22Testing+Supreme+Court%22%2C+%22court_pk%22%3A+%22test%22%2C+%22court_url%22%3A+%22http%3A%2F%2Fwww.example.com%2F%22%2C+%22docket_number%22%3A+%22docket+number+1+005%22%2C+%22date_argued%22%3A+%222020-01-01%22%2C+%22date_argued_year%22%3A+%222020%22%2C+%22case_name%22%3A+%22SEC+v.+Frank+J.+Custable%2C+Jr.%22%2C+%22case_name_full%22%3A+%22case+name+full%22%2C+%22case_name_short%22%3A+%22short%22%2C+%22download_url%22%3A+%22http%3A%2F%2Fmedia.ca7.uscourts.gov%2Fsound%2Fexternal%2Fgw.15-1442.15-1442_07_08_2015.mp3%22%7D' \
 -X 'POST' \
 -F "file=@doctor/test_assets/1.wma"

This returns the audio file as a file response.

Endpoint: /convert/audio/ogg/

This endpoint takes an audio file and converts it to an OGG file. The conversion process downsizes files by using a single audio channel and fixing the sampling rate to 8 kHz.

This endpoint also optimizes the output for voice over IP applications.

curl 'http://localhost:5050/convert/audio/ogg/' \
 -X 'POST' \
 -F "file=@doctor/test_assets/1.wma"

This returns the audio file as a file response.

Testing

Testing is designed to be run with the docker-compose.dev.yml file. To learn more about testing, check out the DEVELOPING.md file.

Sentry Logging

For debugging purposes, it's possible to set your Sentry DSN to send events to Sentry. By default, no SENTRY_DSN is set and no events will be sent to Sentry. To use Sentry set the SENTRY_DSN environment variable to your DSN. Using Docker you can set it with:

docker run -d -p 5050:5050 -e SENTRY_DSN=<https://your-sentry-dsn> freelawproject/doctor:latest

doctor's People

Contributors

albertisfu, dependabot-preview[bot], drewsilcock, erosendo, flooie, grossir, johnludwigm, mlissner, pre-commit-ci[bot], quevon24, trashhalo, troglodite2


doctor's Issues

Changelog versions and releases to docker hub are untethered

I guess this is no surprise, but we need to do something to make our release notes match up with our releases to docker hub.

For example, a few days ago the changelog says we released version 0.3.0. That's cool, but we stopped tagging our docker builds that way and instead use git hashes now.

This doesn't affect us in prod because we just deploy the correct tag from docker when we deploy, BUT for anybody reading our release notes, this is not good, since they can't choose which version they want.

I guess we need to either:

  • Go all in on the git hashes and start putting them into the release notes (and thus give up semantic versioning).
  • Figure out how to put the version number into the repo somehow and make sure that we remember to update it when needed. Also, add the tag to the builds we push to docker.

Some PDFs use symbol fonts

One such example opinion illustrates the issue:

remanded for a new trial, holding that defendant had Aadequately
alleged plain error@ where the trial court abused its discretion in
…
442, 446. We granted the State=s petition for leave to appeal under

Open double quotes are A, closing double quotes are @, curly apostrophes are =. Ideally these would be transcribed using their Unicode counterparts.

The upstream PDF, both the original and the CL copy, display correctly:

But they don't copy-paste correctly! The highlighted section pastes as Aadequately alleged plain error@, exactly as described. So what's going on?

Those incorrect characters are rendered using a font named ABCGOP+WPTypographicSymbols, which, as the name indicates, is in fact a symbol font:

Some component inside CL converts PDFs to plain text. That component should be extended to recognize this font, and then to apply a translation table mapping the symbols in this font to their Unicode equivalents.

Everything is slow

BTE runs slow. Too Slow.

For a process that should take < 8 seconds, it takes nearly 118.

Requirements needs to be updated because of LXML changes

doctor          |   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
doctor          |   File "/opt/app/doctor/urls.py", line 3, in <module>
doctor          |     from . import views
doctor          |   File "/opt/app/doctor/views.py", line 34, in <module>
doctor          |     from doctor.tasks import (
doctor          |   File "/opt/app/doctor/tasks.py", line 18, in <module>
doctor          |     from lxml.html.clean import Cleaner
doctor          |   File "/usr/local/lib/python3.10/site-packages/lxml/html/clean.py", line 18, in <module>
doctor          |     raise ImportError(
doctor          | ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
doctor          | Install lxml[html_clean] or lxml_html_clean directly.

I'm updating doctor to better handle images and annotations inside a PDF, but I came across our new friend: the removal of the Cleaner functionality from lxml.

We should remove and replace this code - I assume it's already causing issues that we haven't noticed yet or may soon enough.

IndexError getting PACER doc number from PDF headers

Sentry Issue: DOCTOR-P

IndexError: list index out of range
  File "django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "doctor/views.py", line 350, in get_document_number
    document_number = get_document_number_from_pdf(fp)
  File "doctor/tasks.py", line 596, in get_document_number_from_pdf
    document_number = [dn for dn in document_number_matches[0] if dn]

Be lenient in audio metadata we accept

We do some fun stuff in the set_mp3_meta_data function to make the file better, but in doing so, we make our code fragile. If small pieces of data are missing, we crash and fail to make an mp3. For example, the error below is because we don't have the date argued.

Sentry Issue: DOCTOR-V

KeyError: 'date_argued'
  File "doctor/views.py", line 353, in convert_audio
    set_mp3_meta_data(audio_data, filepath)
  File "doctor/tasks.py", line 496, in set_mp3_meta_data
    date_argued = audio_data["date_argued"]

Lame.

Seals Rookery req for audio processing

Need to include seals rookery which is a separate docker image.

This should be solved using docker-compose and linking volumes correctly. Fortunately seals rookery is python 2/3 compatible, although the installer currently is not.

Improve Readme

The readme is getting better but still needs some work.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 8811: invalid start byte

Sentry Issue: DOCTOR-E

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 8811: invalid start byte
  File "doctor/views.py", line 101, in extract_doc_content
    content, err, returncode = extract_from_html(fp)
  File "doctor/tasks.py", line 339, in extract_from_html
    content = f.read()

This is linked to the courtlistener Sentry issue https://freelawproject.sentry.io/issues/5017932231/?project=5257254, the events were registered at almost the same time.

Also, is one of the causes of this issue freelawproject/courtlistener#3811

Filed by @grossir

Timeouts in tests

Need to add timeouts to tests in Doctor.

We had timeouts that were too short in CL and we would've known this if we had added timeouts to Doctor Tests.

DependencyError: PyCryptodome is required for AES algorithm

Sentry Issue: DOCTOR-D

DependencyError: PyCryptodome is required for AES algorithm
(8 additional frame(s) were not displayed)
...
  File "PyPDF2/_reader.py", line 1146, in get_object
    retval = self._encryption.decrypt_object(
  File "PyPDF2/_encryption.py", line 741, in decrypt_object
    return cf.decrypt_object(obj)
  File "PyPDF2/_encryption.py", line 182, in decrypt_object
    obj[dictkey] = self.decrypt_object(value)
  File "PyPDF2/_encryption.py", line 176, in decrypt_object
    data = self.strCrypt.decrypt(obj.original_bytes)
  File "PyPDF2/_encryption.py", line 141, in decrypt
    raise DependencyError("PyCryptodome is required for AES algorithm")

Unable to convert audio file due to encoding issue

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 162: ordinal not in range(256)

Sentry Issue: DOCTOR-N

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 162: ordinal not in range(256)
(3 additional frame(s) were not displayed)
...
  File "doctor/tasks.py", line 479, in set_mp3_meta_data
    audio_file.tag.audio_source_url = audio_data["download_url"]
  File "eyed3/id3/tag.py", line 797, in audio_source_url
    self._setUrlFrame(frames.URL_AUDIOSRC_FID, url)
  File "eyed3/id3/tag.py", line 759, in _setUrlFrame
    self.frame_set[fid] = frames.UrlFrame(fid, url)
  File "eyed3/id3/frames.py", line 421, in __init__
    self.url = url
  File "eyed3/id3/frames.py", line 432, in url
    url.encode(ISO_8859_1)  # Likewise, it must encode

Improvements to text extraction needed

The needs-OCR function needs to be improved. Currently we do this to determine whether something that is OCR-eligible should be OCR'd.

The Situation

if content.strip() == "" or pdf_has_images(path):
    return True

The content is generated from pdftotext, using this code:

import subprocess

def pdf_to_text(path):
    process = subprocess.Popen(
        ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
        shell=False,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    content, err = process.communicate()
    return content.decode(), err, process.returncode

Later, downstream on CL, we take the content and ask: are we sure we didn't need to OCR this? We do this:

def needs_ocr(content):
    for line in content.splitlines():
        line = line.strip()
        if line.startswith(("Case", "Appellate", "Appeal", "USCA")):
            continue
        elif line:
            # We found a line with good content. No OCR needed.
            return False

    # We arrive here if no line was found containing good content.
    return True

Here we look for any row that doesn't appear to be a Bates stamp, and as long as we find any text, garbled or otherwise, we say we are good to go.

This unfortunately leads to some seriously garbled plain text in RECAP, and potentially in our opinion DB.

Examples

I don't want to rag on pdftotext; it has done an admirable job for the most part, but I do not think it is the best way to approach what we're dealing with now. For one, we are attempting to extract content and place it into a plain-text DB field. This is challenging because a good number of documents contain PDF objects such as /widgets, /annotations, /freetext, /Stamp, and /Popup. This is not an exhaustive list; we also see links and signatures, and I'm sure more types.

In addition to the complexity of handling documents that contain PDF stream objects, we also have to deal with images inserted into PDFs, or, even worse, the first, or maybe just the last, page being rasterized while the middle 30-odd pages are vector PDF pages.

In this case our checks fail, and we have no way to catch it, because after we iterate past the Bates stamp on page 2 we get good text. See: gov.uscourts.nysd.411264.100.0.pdf

This also fails when, for example, a free-text widget that crosses out content or adds content is overlaid on a PDF page that is an image.

Here is an example of a non-image PDF page containing a free-text widget (a widget, I think; it could be something different) meant to cross out the PROPOSED part.

This is not the perfect example, because the underlying content appears to contain text, but that text is corrupted.

In fact, williams-v-t-mobile

Side by Side comparison of Williams v T-Mobile

Note that PROPOSED is incorrectly added to the text here, frustrating the adjustment made by the court, which is noted in the document itself.


Angled, Circular, and Sideways Text

Not to be outdone, many judges (👋 CAND) like to use stamps with circular text. These stamps are often, but not exclusively, at the end of the document. In doing so, the courts introduce gibberish into our documents when we extract the text or OCR them.

For example, gov.uscourts.cand.16711.1203.0.pdf and another file have them adjacent to the text. One is stamped into an image PDF and the other into a regular PDF, which it garbles.


In both cases, the generated content makes the OCR test fail to identify that OCR is needed.

Sideways Text

We also run into a problem where pdftotext does an amazing job of figuring out sideways text and writing it into the output; this is just a fancy thing some courts, and some firms, like to do. But look at the result: it unnaturally expands the plain text and certainly frustrates plain-text searches.

The same happens in other cases; see below.

Margin Text

Occasionally, margin text in a small font causes some weird artifacts in the output, which again produce extra-wide text that is hard to view and display, and which I think makes it hard to query or search for the content you are looking for.


Final complaint (Bates Stamps)

Bates stamps on every page are ingested into the content and don't reflect the document that was generated. I would not expect to see Bates stamps or sidebar content in a published book, so I don't think we should display them in the plain text.

What should we do

If you've read this far, @mlissner, I know you must be dying to hear what I think the solution is.

We should drop (I think) pdftotext for, you guessed it, pdfplumber.

pdfplumber can better sample the PDFs to determine whether an entire page is likely an image, while correctly guessing that lines or signatures are in the document and leaving them be. Additionally, we can easily extract the pure text of the document while avoiding the pitfalls above.
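The page-sampling idea can be sketched as a pure coverage check. The helper, threshold, and bbox convention below are our assumptions; pdfplumber reports image bounding boxes per page, from which coverage can be computed:

```python
def page_is_mostly_image(page_width, page_height, image_bboxes, threshold=0.8):
    """Guess whether a page is a scan: do its images cover most of its area?

    image_bboxes is a list of (x0, top, x1, bottom) tuples, the coordinate
    convention pdfplumber uses for page objects (an assumption here).
    """
    page_area = page_width * page_height
    if not page_area:
        return False
    covered = sum((x1 - x0) * (bottom - top) for x0, top, x1, bottom in image_bboxes)
    return covered / page_area >= threshold
```

In pdfplumber terms, the bboxes would come from page.images and the text check from page.extract_text(); a page with near-total image coverage and no extractable text would be routed to OCR.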

We should drop the check in CL and make all of these assessments here in Doctor as well.

Solutions coming in the next post.

Add Sentry to Doctor

We had Sentry for BTE, but I don't think it made the transfer over to Doctor. This is probably a mistake.

`fetch_audio_duration` is not working properly

I am copying this graph from freelawproject/courtlistener#440, which shows that for the same duration we get different file sizes when querying the actual bucket (and checking the length of the downloaded bytes).

Another, more colorful graph that takes the year from date_created shows that the problem runs from late 2019 to the present.

Examples of wrong and correct durations:

  • one created in 2023 with duration 3028 (3028/60=50.46), but lasts 58:46 on the audio player

  • a correct one created in 2014 with duration 3029 that lasts 50:25 on the audio player (which roughly matches 3029/60 = 50.48)

  • one created in 2023 with duration 2000 (2000/60=33.33) but lasts 38:49 on the audio player.

  • a correct one created in 2014 with duration 2001 that lasts 33:18 minutes in the audio player

Code that needs correcting:

def fetch_audio_duration(request) -> HttpResponse:
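For comparison, one way to get a duration that matches what the audio player shows is to ask ffmpeg's ffprobe for the container duration rather than inferring it. A sketch (the helper names are ours; ffprobe must be on PATH):

```python
import subprocess


def probe_duration_seconds(path):
    """Ask ffprobe for the container's duration, in seconds."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def format_duration(seconds):
    """Render seconds as M:SS, for comparison with an audio player's display."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"
```

For the first example above, a player display of 58:46 corresponds to roughly 3526 seconds, which format_duration renders back as "58:46".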

MP3: Encoding issues with non latin-1 compatible inputs

Sentry Issue: DOCTOR-X

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 162: ordinal not in range(256)
(3 additional frame(s) were not displayed)
...
  File "doctor/views.py", line 353, in convert_audio
    set_mp3_meta_data(audio_data, filepath)
  File "doctor/tasks.py", line 503, in set_mp3_meta_data
    audio_file.tag.audio_source_url = audio_data["download_url"]

UTF-8 Encoding Latin-1 Bug in HTML files

Unfortunately, we are experiencing occasional crashes on CL from the inability to extract HTML from certain HTML files.

These occur mostly around HTML files downloads from the NY lower courts, with messages like

Error: 'utf-8' codec can't decode byte 0xe9 in position 437. Having gone through a number of them, I think this is a latin-1/utf-8 encoding bug.

Soo...

I've reviewed the HTML text extraction and found at least one bug in the encode/decode part of the loop: if utf-8 fails, it doesn't bother looping through the other encodings. But I'm not sure that is the source of the error. So, for now, I've updated the error messages, fixed that bug, and I expect to wait and see if my improved error messages can catch the bug more directly.

ResourceWarnings cont.

ResourceWarning: Enable tracemalloc to get the object allocation traceback

57 ResourceWarnings appeared after the switch to Python 3.8. We had the same issue in the switch to Python 3 for CL.

/opt/hostedtoolcache/Python/3.8.6/x64/lib/python3.8/unittest/case.py:630: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, raddr=/var/run/docker.sock>

Doctor not identifying PACER headers very well

I noticed a couple cases today where the OCR didn't trigger but should have:

https://www.courtlistener.com/docket/63348437/1/navarro-v-pelosi/

https://www.courtlistener.com/docket/5319662/1/unicorn-investment-bank-v-kuruvilla/

In both cases, the text that's extracted looks like:

Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 1 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 2 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 3 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 4 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 5 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 6 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 7 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 8 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 9 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 10 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 11 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 12 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 13 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 14 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 15 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 16 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 17 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 18 of 18

We used to have a kinda flaky regex for this, but perhaps it's not working? Or needs a tweak?
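A hedged sketch of such a check (the pattern and helper are ours, written against the sample lines above):

```python
import re

# Matches PACER-stamped header lines like:
# "Case 3:05-cv-00281-GCM  Document 1  Filed 06/17/05  Page 1 of 18"
PACER_HEADER = re.compile(
    r"^(Case|Appellate Case|Appeal|USCA).*?"
    r"(Document|Doc\.?)\s+#?\d+.*?Page\s+\d+\s+of\s+\d+",
    re.IGNORECASE,
)


def only_pacer_headers(text):
    """True when every non-blank line looks like a PACER stamp, i.e. the
    extraction recovered nothing but headers and the PDF likely needs OCR."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return bool(lines) and all(PACER_HEADER.match(ln) for ln in lines)
```

On the sample text above, every line matches, so the document would be routed to OCR; a single line of real body text flips the answer.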

Support generating multiple thumbnails from a PDF

When we finally get back to working on the Big Cases Bot, we'll probably want to be able to generate multiple thumbnails for a single PDF. The old bot would tweet with four thumbnails from the PDF.

I guess the two ways to do this would be to either:

  1. Have an API where you say which pages you want and it somehow gives you back multiple pages. (Does HTTP support this?)
  2. Have an API where you say which page (singular) you want. If you don't say, it gives you page one. If you do, it gives you the page you requested.

Gunicorn only has one worker and thus not much performance

I might be wrong about this, but....Sentry isn't very happy with us right now because, unless I'm missing something, Doctor's gunicorn config is set up with only one worker to serve requests:

https://github.com/freelawproject/doctor/blob/main/docker/docker-entrypoint.sh#L2

In CL, we have 48 workers configured to serve the website, and Doctor certainly needs more than one, particularly since it's often tied up with tesseract stuff.

Here's the config: https://docs.gunicorn.org/en/stable/settings.html#workers

The Sentry issue is: COURTLISTENER-2BA

It's triggering now because we're doing a lot of crawling at the moment.
