dhmit / computation_hist

Archival History of the MIT Computation Center

License: BSD 3-Clause "New" or "Revised" License

CSS 4.03% HTML 2.26% JavaScript 77.89% Python 7.44% Shell 0.15% Jinja 8.22%

computation_hist's Issues

Create a document display page

We need a page that displays both the document pdf and the document metadata.

Started the code:
The folder-level PDFs need to be split into PDF and PNG files, stored in an auto-generated directory. There should be a separate PNG and PDF for each page and a separate PDF for each document.
The directory layout should be as follows:

/data/web_test_set/
    /'folder1 name'/
        /doc1/
            doc1.pdf
            /page1/
                page1.png
                page1.pdf

The code below has not been pushed because it doesn't complete the task, but it creates all of the directories. It can also take a PDF and split it down to the individual page_i.png files.

@ -1,7 +1,12 @@
import os
from pathlib import Path, PurePath
from django.db import models
from common import make_searchable_pdf
import sys
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdf2image import convert_from_path


base_path = Path(os.path.abspath(os.path.dirname(__file__)))
sys.path.insert(0, PurePath.joinpath(base_path.parent, "djweb"))

@ -12,12 +17,14 @@ from dj_comp_hist.models import Person, Document, Box, Folder, Organization, Pag
path_to_boxes = PurePath.joinpath(base_path.parent,"computation_hist","data","web_test_set")



def create_sub_folders(path_to_boxes, foldername_short):
   """
   To run this code :
   sys.path
   sys.path.insert(0, '/Users/ifeademolu-odeneye/Documents/GitHub/computation_hist
   /computation_hist') - replace with your file path
   import dj_comp_hist
   from  dj_comp_hist import models
   import sys
   sys.path.insert(0, '/Users/ifeademolu-odeneye/Documents/GitHub/computation_hist/computation_hist')
   import sort_pdfs
   sort_pdfs.create_sub_folders(sort_pdfs.path_to_boxes, "rockefeller")

@ -27,18 +34,53 @@ def create_sub_folders(path_to_boxes, foldername_short):
   :return:
   """

   box = str(models.Folder.objects.get(name=foldername_short).box) #not this is a string

   box = str(models.Folder.objects.get(name=foldername_short).box) #note this is a string
   root = PurePath.joinpath(path_to_boxes,  box)
   if not os.path.exists(root):
       Path.mkdir(root)
   path_folder_pdf = PurePath.joinpath(root, foldername_short)
   if not os.path.exists(path_folder_pdf):
       Path.mkdir(path_folder_pdf)
   associated_documents = models.Folder.objects.get(name=foldername_short).document_set.all()


   split_folder_to_doc(path_folder_pdf, associated_documents, foldername_short)

   root = PurePath.joinpath(root, foldername_short)
   if not os.path.exists(root):
       Path.mkdir(root)

   for doc in models.Folder.objects.get(name=foldername_short).document_set.all():
       Path.mkdir(PurePath.joinpath(root, "doc_" + str(doc.id)))
       for i in range(1,doc.number_of_pages+1):
           Path.mkdir(PurePath.joinpath(root, "doc_" + str(doc.id), "page_"+str(i)))
def split_doc_to_page(pdf_path, folder_name):
   print("********************")
   print(PurePath.joinpath(pdf_path,"1_08_raw_rockefeller.pdf"))
   # ------------------to be changed next line
   pages = convert_from_path(PurePath.joinpath(pdf_path,"1_08_raw_rockefeller.pdf"))

   for page in range(1, len(pages)+1):
       Path.mkdir(PurePath.joinpath(pdf_path, "page_" + str(page)))

       page_path = PurePath.joinpath(pdf_path, "page_" + str(page), folder_name + '_' + str(page) + '.png')
       pages[page-1].save(page_path, 'PNG')#saves page to the directory


def split_folder_to_doc(pdf_path, associated_documents, folder_name):
   """

   :param pdf_path: the path up to the folder containing the pdfs
   :param associated_documents:
   :return:
   """
   #splits the folder pdfs
   start_pages = []
   pdf_location = PurePath.joinpath(pdf_path, "1_08_raw_rockefeller.pdf")
   for single_doc in associated_documents:
       start_pages.append(single_doc.first_page)
   # sort once, after all start pages have been collected
   start_pages.sort()
   folder_pdf = PdfFileReader(open(pdf_location, "rb"))
   for doc in associated_documents:
       if not os.path.exists(PurePath.joinpath(pdf_path, "doc_" + str(doc.id))):
           Path.mkdir(PurePath.joinpath(pdf_path, "doc_" + str(doc.id)))
       output = PdfFileWriter()
       for i in range(doc.first_page,doc.last_page):
           output.addPage(folder_pdf.getPage(i))
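       # write the document pdf into its doc_<id> directory rather than the working directory
       doc_pdf_path = PurePath.joinpath(pdf_path, "doc_" + str(doc.id), "doc" + str(doc.id) + ".pdf")
       with open(doc_pdf_path, "wb") as outputStream: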
           output.write(outputStream)

       split_doc_to_page(pdf_path, folder_name)
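For reference, here is a minimal sketch of just the directory-creation step, using only pathlib. It is not the code from the diff above: the function name build_layout and the page_counts mapping are hypothetical, and it assumes the doc_<id>/page_<n> naming that the diff uses.

```python
from pathlib import Path
import tempfile


def build_layout(root, folder_name, page_counts):
    """Create root/folder_name/doc_<id>/page_<n>/ for every document.

    page_counts is a hypothetical mapping of document id -> number of pages.
    Returns the page directories that were created, in sorted order.
    """
    created = []
    for doc_id, n_pages in sorted(page_counts.items()):
        for page in range(1, n_pages + 1):
            page_dir = Path(root) / folder_name / f"doc_{doc_id}" / f"page_{page}"
            # parents=True creates missing ancestors; exist_ok=True makes reruns safe
            page_dir.mkdir(parents=True, exist_ok=True)
            created.append(page_dir)
    return created


with tempfile.TemporaryDirectory() as tmp:
    dirs = build_layout(tmp, "rockefeller", {1: 2, 2: 1})
    layout = [d.relative_to(tmp).as_posix() for d in dirs]
    print(layout)
```

Using mkdir(parents=True, exist_ok=True) would remove the need for the repeated os.path.exists checks.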

Pop Up for Syntax Errors on Simulations

Navbar Design

How should we design the navbar(s?) for the Digital Archive, Simulations, and other parts of the site?

Integrate pdf.js to display pdfs on the website

In the medium term, we want to move from displaying the documents as png image files to displaying them as pdf files to mak
PDF.js seems like a good candidate for this: https://mozilla.github.io/pdf.js/

Network Visualization - Develop matrix of associations

Develop a graph/chart/visualization of associations that show which people corresponded in circles the most.

See https://bost.ocks.org/mike/miserables/ for examples
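A rough sketch of how the underlying matrix could be computed, assuming the correspondence is available as (author, recipient) pairs. The function name and the sample pairs are made up for illustration; "Person A" is a placeholder, not a name from the corpus.

```python
from collections import Counter


def association_matrix(letters):
    """Build a symmetric matrix of letter counts between people.

    letters: iterable of (author, recipient) pairs.
    Returns (names, matrix) where matrix[i][j] is the number of letters
    exchanged between names[i] and names[j], in either direction.
    """
    counts = Counter()
    for author, recipient in letters:
        if author != recipient:
            counts[frozenset((author, recipient))] += 1

    names = sorted({name for pair in counts for name in pair})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for pair, count in counts.items():
        a, b = tuple(pair)
        matrix[index[a]][index[b]] = count
        matrix[index[b]][index[a]] = count
    return names, matrix


sample = [("Morse", "Verzuh"), ("Verzuh", "Morse"), ("Morse", "Person A")]
names, matrix = association_matrix(sample)
print(names)
print(matrix)
```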

Make general IBM Assembler nicer looking

Meta-Issue: Make Issues

Everyone continuing from last term: make at least one issue that you know needs to be done.
IAP group: make two!

Assign either yourself or someone else (w/ their permission) or leave "Assignees" blank.

Choose one or more appropriate Labels from the "Labels" group.

Make assembly simulator object-oriented

Integrate World Map of Letters

Develop a map that allows users to sort for letters from a certain place (i.e. letters from professors at other universities, areas, etc.)

Populate from Metadata docs

Need docs on how to populate from Metadata

Merge "unknown" and "No name" in the organization field of the metadata

"No name" should be "unknown"
This can probably be done directly in the google sheet.
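If it turns out to be easier to do this at import time instead, a small normalization helper might look like the sketch below. The function name is hypothetical, and treating an empty cell as "unknown" is an assumption.

```python
# Values that should all collapse to "unknown" (compared case-insensitively).
# Including the empty string here is an assumption, not something decided above.
UNKNOWN_ALIASES = {"", "no name", "unknown"}


def normalize_organization(value):
    """Return "unknown" for any alias of a missing organization, else the trimmed value."""
    cleaned = (value or "").strip()
    if cleaned.lower() in UNKNOWN_ALIASES:
        return "unknown"
    return cleaned


print(normalize_organization("No name"))
print(normalize_organization(" MIT "))
```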

Advanced Search + Basic Search functionality

@Carol217 — @felixtran39 #182 changed the Advanced Search hyperlink so that it would not throw an error, but now search functionality is redundant across basic and advanced search. Can you pull the basic Search back to not have the Advanced Search boilerplate on top?

Assemble metadata

We need help from everyone in the lab in assembling the metadata of our corpus.
You can find a guide on how to enter metadata here: https://github.com/dhmit/computation_hist/blob/master/computation_hist/documentation/metadata.md
However, if you're new in the lab, feel free to ask one of the continuing lab members to show you what to do--there are a lot of edge cases.

create an advanced search webpage

Create an advanced search option that allows you to search model fields (author, document, folder, etc.) for specific inputs. Ex: search for 'Verzuh' in the recipient field only.

AND/OR

create a filter-able page of texts (ex: by adjusting preferred page length, so only documents with 3 or more pages show up, etc)

Annotations about Pages

Potentially adding relevant info about the page itself that can't be portrayed in a text file (ex. crossed out by hand, handwritten notes on the page, etc)

Recognizing Similar Names

Ideally, this would happen in the metadata import step: names that are similar would be changed to the correct name. I noticed there are different variations of the same name.

For example:

"Morse, Philip M."
"Morse, Philip"

This function will identify those two names as being the same and modify the second one to Morse, Philip M.

Another example: if a person accidentally spelled "Phillip" instead of "Philip", this function will recognize that they are the same person.
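One possible approach, sketched with the standard library's difflib: compare each raw name against a list of known-good names and replace it when the similarity is high enough. The function name, the idea of keeping a canonical list, and the 0.85 threshold are all assumptions that would need tuning on the real metadata.

```python
from difflib import SequenceMatcher


def match_name(raw, canonical_names, threshold=0.85):
    """Return the closest canonical name if it is similar enough, else raw unchanged."""

    def similarity(candidate):
        return SequenceMatcher(None, raw.lower(), candidate.lower()).ratio()

    best = max(canonical_names, key=similarity)
    return best if similarity(best) >= threshold else raw


canonical = ["Morse, Philip M."]
for name in ["Morse, Philip", "Morse, Phillip M.", "Verzuh"]:
    print(name, "->", match_name(name, canonical))
```

A purely string-based match like this can also merge two different people with similar names, so a manual review step is probably still needed.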

Metadata number of rows tracker

Could someone write a little script on the .csv metadata file that makes a list of people and number of scans? Like:

stephan: 56
lisa: 54
myke: 20
bob: 1

this would help staff to figure out how we're doing (and nice bragging rights for the top five :-) ).
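Something along these lines might do it. The column name "scanned_by" is a guess, since the actual header in the metadata .csv isn't given here, and the sample rows are invented.

```python
import csv
import io
from collections import Counter


def count_rows_by_person(csv_file, column="scanned_by"):
    """Count metadata rows per person, most rows first.

    column is a hypothetical header name; replace it with the real one.
    Rows where the column is empty or missing are skipped.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        person = (row.get(column) or "").strip().lower()
        if person:
            counts[person] += 1
    return counts.most_common()


sample_csv = io.StringIO("scanned_by,title\nstephan,a\nlisa,b\nstephan,c\n,d\n")
tally = count_rows_by_person(sample_csv)
for person, n in tally:
    print(f"{person}: {n}")
```

For the real file, pass open(path, newline="") instead of the StringIO sample.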

Pseudoinstructions For Storing Numbers in IBM 704

Currently numbers are just placed in memory by Javascript. Would be nice to have pseudoinstructions like DEC to demonstrate that.
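As a starting point, here is a sketch in Python (not the simulator's Javascript) of what a DEC-style pseudoinstruction would have to produce: a 36-bit word with one sign bit followed by 35 magnitude bits. The function names are hypothetical and this only covers integers.

```python
MAGNITUDE_BITS = 35  # a 704 word is 1 sign bit + 35 magnitude bits


def dec_to_word(value):
    """Encode a decimal integer as a 36-character sign-magnitude bit string."""
    magnitude = abs(value)
    if magnitude >= 1 << MAGNITUDE_BITS:
        raise ValueError(f"{value} does not fit in {MAGNITUDE_BITS} magnitude bits")
    sign = 1 if value < 0 else 0
    return format((sign << MAGNITUDE_BITS) | magnitude, "036b")


def word_to_dec(word):
    """Decode a 36-character sign-magnitude bit string back to an integer."""
    magnitude = int(word[1:], 2)
    return -magnitude if word[0] == "1" else magnitude


print(dec_to_word(5))
print(dec_to_word(-5))
```

Because the format is sign-magnitude rather than two's complement, 5 and -5 differ only in the leading bit.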

Pressing Clear on Some Demos Breaks the Demo

Add IBM Simulator to Django Config

Add full text to document model

Modify the document model to store the full text of the document so it can be used in a fulltext search.

?Presumably? (@mscuthbert ) this means telling django to store the text in an FTS4 table.

Sample code to read the text for one file:

txt_path = get_file_path(box=box_id, folder=folder_id, foldername_short=foldername_short,
                         doc_id=doc_id, path_type='absolute', file_type='txt')
try:
    with open(txt_path, 'r') as f:
        text = f.read()
except FileNotFoundError:
    print(f'skipped {txt_path}')
    continue
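To get a feel for what an FTS4 table does, here is a standalone sketch using Python's sqlite3 module directly, outside of Django. It assumes the local SQLite build has FTS3/FTS4 enabled; the table name, column names, and sample rows are made up, and how to wire this into the Django model is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Virtual table whose text column is indexed for full-text search.
conn.execute("CREATE VIRTUAL TABLE document_text USING fts4(doc_id, text)")
conn.executemany(
    "INSERT INTO document_text (doc_id, text) VALUES (?, ?)",
    [
        ("doc_1", "Computation Center budget Rockefeller"),
        ("doc_2", "Notes on the IBM 704"),
    ],
)

# MATCH runs a full-text query; the default tokenizer is case-insensitive for ASCII.
rows = conn.execute(
    "SELECT doc_id FROM document_text WHERE text MATCH ? ORDER BY doc_id",
    ("budget",),
).fetchall()
matches = [row[0] for row in rows]
print(matches)
conn.close()
```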

Simulator STO function is slightly wrong

STO function is storing literal binary with truncation, instead of the fixed point property of the accumulator, which will probably cause problems with negative numbers.

Where is the scan of Box 1, Folder 13, (Computation Center budget Rockefeller)?

Folder 1.13 seems to be the only one missing on Google Drive. Does anyone know what happened to it? @elenaboal maybe?

Site Favicon

Display implemented operations on general assembler page

Unknown/No Name organizations

Currently, documents that are not associated with any organization can be accessed through the organizations "unknown" and "no name" in the list of organizations. This should not be possible; access to these documents should be through other methods like author or recipient. (Make unknown/no name not show up when listing organizations.)

Survey similar sites for design possibilities

Looking at further, related projects might provide more ideas for features and layout.

Some examples could be:

Mapping the Republic of Letters (Stanford):
http://republicofletters.stanford.edu

Six Degrees of Francis Bacon:
http://www.sixdegreesoffrancisbacon.com/?ids=10000473&min_confidence=60&type=network

"Mitford's Worlds," in Digital Mitford: The Mary Russell Mitford Archive:
http://digitalmitford.org/visual.html

Vincent van Gogh: The Letters:
http://vangoghletters.org/vg/

Will keep adding to this list!

Learn (MSC + Staff) and demo Django in PyCharm

MSC will learn how to better integrate Django into Pycharm.

Simulations Home Page

Would anyone be willing to create a rudimentary home page for the simulations app that automatically lists links to all the simulations? Also a link in the navbar to these simulations. @elenaboal and @ktmurray1999 , @srisi mentioned you guys currently are relatively free; if not, I can cover it.

Community Documentation

Improved readme file (adding a history section, like in https://github.com/tesseract-ocr/tesseract), opportunity to consider the ambiguity of new_developers.md file for both in-lab and public collaborators, bring in code of conduct, consider renaming important_info directory, contributing file, potential to include tutorials in the future

Categorize documents by their type (letters, floor plans, etc.)

Make IBM Simulator Handle Tags

Info on how tags work with the index registers in this document:
http://bitsavers.org/pdf/ibm/704/24-6661-2_704_Manual_1955.pdf

Track Mentions of People

Use the OCR text to track mentions of people, not just author and correspondents. Possibly use NLTK analysis like in Gender Novels to compare how they are described, referred to?
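Before bringing in NLTK, a first pass could simply count surname occurrences in the OCR text. This is only a sketch: the function name and the sample sentence are invented, and plain surname matching will miss initials, misspellings, and OCR errors.

```python
import re
from collections import Counter


def count_mentions(text, surnames):
    """Count whole-word, case-insensitive occurrences of each surname in text."""
    counts = Counter()
    for surname in surnames:
        pattern = re.compile(r"\b" + re.escape(surname) + r"\b", re.IGNORECASE)
        counts[surname] = len(pattern.findall(text))
    return counts


sample_text = "Prof. Morse wrote to Verzuh. Verzuh replied to Morse's office."
mentions = count_mentions(sample_text, ["Morse", "Verzuh"])
print(mentions)
```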

CLA doesn't store negative numbers properly

I have already fixed this on my own branch; it will be fixed on master after the next pull request.

Root Directory

I think it would make the most sense to have it redirect to /dj_comp_hist/

how to clearly distinguish transcript from original on website

Build Cubbies

IKEA box on the floor -- MechEng folks!

Floating Point Operations on IBM 704

Network visualization — Force-directed graph

Create network diagram to map all of the key players in correspondence

D3 example: https://observablehq.com/@d3/force-directed-graph
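The D3 example reads a JSON object with "nodes" and "links" lists, so one way to feed it would be to export the correspondence in that shape. A sketch, assuming the letters are available as (author, recipient) pairs; the function name and sample data are hypothetical, and "Person A" is a placeholder.

```python
import json
from collections import Counter


def to_node_link(letters):
    """Convert (author, recipient) pairs into a nodes/links dict for a force-directed graph.

    Direction is ignored; "value" is the number of letters between the two people.
    """
    weights = Counter(
        tuple(sorted(pair)) for pair in letters if pair[0] != pair[1]
    )
    names = sorted({name for pair in weights for name in pair})
    return {
        "nodes": [{"id": name} for name in names],
        "links": [
            {"source": a, "target": b, "value": weight}
            for (a, b), weight in sorted(weights.items())
        ],
    }


sample = [("Morse", "Verzuh"), ("Verzuh", "Morse"), ("Morse", "Person A")]
graph = to_node_link(sample)
print(json.dumps(graph, indent=2))
```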

IBM 704 Simulator Tutorial

Loop example for Assembly Simulator

Implement a transfer function and create a demo of a loop for the assembly simulator.

Scan and upload folder-level pdfs on Google Drive

Currently missing folders on google drive (we may still have some unprocessed scans on USB sticks):

Box 1

Folder 13
Folders 22-26
Folder 28

Box 2

Complete

Box 3

Folders 10-31
Folders 33-37

Find and Solve Common OCR Mistakes

OCR makes our lives a lot easier, but even the best documents can sometimes confuse it. Very common phrases and words often get jumbled: "M.I.T." is sometimes read as "M.1.T.", for instance. It would help to have a quasi-reliable way of catching some of the most common mistakes in order to support our analysis.
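A simple starting point would be a list of regex replacements applied to the OCR text. In this sketch only the "M.1.T." case comes from what we've actually seen; the lowercase-l and pipe variants are guesses at similar confusions, and the names are hypothetical.

```python
import re

# (pattern, replacement) pairs. Only "M.1.T." is a confirmed mistake;
# "l" and "|" in the middle position are assumed look-alikes for "I".
OCR_FIXES = [
    (re.compile(r"\bM\.[1l|]\.T\."), "M.I.T."),
]


def fix_common_ocr_mistakes(text):
    """Apply each known replacement to text and return the corrected string."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text


print(fix_common_ocr_mistakes("Sent to M.1.T. today"))
```

New rules can be appended to OCR_FIXES as more recurring mistakes are found.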
