dhmit / computation_hist

Archival History of the MIT Computation Center

License: BSD 3-Clause "New" or "Revised" License

CSS 4.03% HTML 2.26% JavaScript 77.89% Python 7.44% Shell 0.15% Jinja 8.22%

computation_hist's People

Forkers

mgallegos3015 emilycaragay ifeife123 ltagliaferri sophiazhi samimak37 elenaboal aculber elsamobi cminsky honey-ai hdacosta400 kmerrill18 asselism ssundaram21 felixtran39 erica02139

computation_hist's Issues

create an advanced search webpage

Create an advanced search option that lets you search specific model fields (author, document, folder, etc.) for specific inputs. Ex: search for 'Verzuh' in the recipient field only.

AND/OR

Create a filterable page of texts (ex: by setting a minimum page count, so that only documents with 3 or more pages show up, etc.)
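A minimal sketch of the two ideas above (field-restricted search and a page-count filter), written against plain Python objects rather than the Django models. The `Doc` class, its field names, and the sample records are assumptions for illustration only; in the real app the same logic would presumably be expressed as queryset filters.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical stand-in for the Document model; field names are assumptions.
    title: str
    author: str
    recipient: str
    number_of_pages: int


def search_field(docs, field, term):
    """Return the docs whose given field contains term (case-insensitive)."""
    term = term.lower()
    return [d for d in docs if term in getattr(d, field).lower()]


def filter_min_pages(docs, min_pages):
    """Return the docs that have at least min_pages pages."""
    return [d for d in docs if d.number_of_pages >= min_pages]


# Made-up sample records.
docs = [
    Doc("Sample memo", "Morse, Philip M.", "Verzuh", 4),
    Doc("Sample note", "Verzuh", "Morse, Philip M.", 1),
]

hits = search_field(docs, "recipient", "Verzuh")  # recipient field only
long_hits = filter_min_pages(hits, 3)             # AND: 3 or more pages
```

Chaining the two functions gives the AND case; an OR case would take the union of two `search_field` results.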

Create a document display page

We need a page that displays both the document pdf and the document metadata.

Pop Up for Syntax Errors on Simulations

Annotations about Pages

Potentially adding relevant info about the page itself that can't be portrayed in a text file (ex. crossed out by hand, handwritten notes on the page, etc)

Simulator STO function is slightly wrong

The STO function stores the accumulator's literal binary contents with truncation instead of respecting the accumulator's fixed-point representation, which will probably cause problems with negative numbers.
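A rough sketch of what a sign-aware store could look like, assuming the 704's 36-bit sign-magnitude word (one sign bit followed by 35 magnitude bits) and an accumulator modelled as a separate sign and magnitude. The function and parameter names are hypothetical and do not come from the simulator's code, which is not written in Python.

```python
MAG_BITS = 35                    # magnitude bits in a 36-bit word
MAG_MASK = (1 << MAG_BITS) - 1


def sto(acc_sign, acc_magnitude):
    """Build the stored word from the accumulator's sign and low 35 magnitude bits.

    Any accumulator bits above the 35 magnitude bits (overflow positions)
    are dropped, but the sign is carried over explicitly.
    """
    return (acc_sign << MAG_BITS) | (acc_magnitude & MAG_MASK)


def word_to_int(word):
    """Interpret a 36-bit sign-magnitude word as a Python integer."""
    sign = (word >> MAG_BITS) & 1
    magnitude = word & MAG_MASK
    return -magnitude if sign else magnitude


stored = sto(acc_sign=1, acc_magnitude=5)
value = word_to_int(stored)
```

The point of the sketch is that the sign travels separately from the magnitude, so truncating the magnitude never flips or loses it.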

Meta-Issue: Make Issues

Everyone continuing from last term: make at least one issue that you know needs to be done.
IAP group: make two!

Assign either yourself or someone else (w/ their permission) or leave "Assignees" blank.

Choose one or more appropriate Labels from the "Labels" group.

We need help from everyone in the lab in assembling the metadata of our corpus.
You can find a guide on how to enter metadata here: https://github.com/dhmit/computation_hist/blob/master/computation_hist/documentation/metadata.md
However, if you're new in the lab, feel free to ask one of the continuing lab members to show you what to do--there are a lot of edge cases.

Categorize documents by their type (letters, floor plans, etc.)

Make assembly simulator object-oriented

Storing PDFs

Started the code:
Each folder-level pdf needs to be split into pdf and png files. These files should be stored in an auto-generated directory. There should be a separate png and pdf for each page and a separate pdf for each document.
The directory should be laid out as follows:

/data/web_test_set/
    /'folder1 name'/
        /doc1/
            doc1.pdf
            /page1/
                page1.png
                page1.pdf
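For reference, the layout above can be generated with `pathlib` alone. This is only a sketch: the function name, the `docs` mapping (document name to page count), and the temporary root directory are assumptions, and it creates directories without writing any pdf or png files.

```python
import tempfile
from pathlib import Path


def make_document_dirs(root, folder_name, docs):
    """Create root/folder_name/<doc>/page<i>/ for every page of every doc.

    docs maps a document name to its number of pages (hypothetical input).
    Returns the list of page directories that were created.
    """
    page_dirs = []
    for doc_name, number_of_pages in docs.items():
        for i in range(1, number_of_pages + 1):
            page_dir = Path(root) / folder_name / doc_name / f"page{i}"
            # parents=True also creates the folder and doc directories.
            page_dir.mkdir(parents=True, exist_ok=True)
            page_dirs.append(page_dir)
    return page_dirs


with tempfile.TemporaryDirectory() as tmp:
    created = make_document_dirs(tmp, "rockefeller", {"doc1": 2})
    relative = sorted(p.relative_to(tmp).as_posix() for p in created)
```

Using `mkdir(parents=True, exist_ok=True)` removes the need for the repeated `os.path.exists` checks.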

The code below has not been pushed because it doesn't complete the task, but it creates all of the directories. It can take a pdf and split it down to the individual page_i.png files.

@@ -1,7 +1,12 @@
import os
from pathlib import Path, PurePath
from django.db import models
from common import make_searchable_pdf
import sys
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdf2image import convert_from_path


base_path = Path(os.path.abspath(os.path.dirname(__file__)))
sys.path.insert(0, PurePath.joinpath(base_path.parent, "djweb"))

@ -12,12 +17,14 @@ from dj_comp_hist.models import Person, Document, Box, Folder, Organization, Pag
path_to_boxes = PurePath.joinpath(base_path.parent,"computation_hist","data","web_test_set")



def create_sub_folders(path_to_boxes, foldername_short):
   """
   To run this code :
   sys.path
   sys.path.insert(0, '/Users/ifeademolu-odeneye/Documents/GitHub/computation_hist
   /computation_hist') - replace with your file path
   import dj_comp_hist
   from  dj_comp_hist import models
   import sys
   sys.path.insert(0, '/Users/ifeademolu-odeneye/Documents/GitHub/computation_hist/computation_hist')
   import sort_pdfs
   sort_pdfs.create_sub_folders(sort_pdfs.path_to_boxes, "rockefeller")

@ -27,18 +34,53 @@ def create_sub_folders(path_to_boxes, foldername_short):
   :return:
   """

   box = str(models.Folder.objects.get(name=foldername_short).box) #note this is a string
   root = PurePath.joinpath(path_to_boxes,  box)
   if not os.path.exists(root):
       Path.mkdir(root)
   path_folder_pdf = PurePath.joinpath(root, foldername_short)
   if not os.path.exists(path_folder_pdf):
       Path.mkdir(path_folder_pdf)
   associated_documents = models.Folder.objects.get(name=foldername_short).document_set.all()


   split_folder_to_doc(path_folder_pdf, associated_documents, foldername_short)

   root = PurePath.joinpath(root, foldername_short)
   if not os.path.exists(root):
       Path.mkdir(root)

   for doc in models.Folder.objects.get(name=foldername_short).document_set.all():
       Path.mkdir(PurePath.joinpath(root, "doc_" + str(doc.id)))
       for i in range(1,doc.number_of_pages+1):
           Path.mkdir(PurePath.joinpath(root, "doc_" + str(doc.id), "page_"+str(i)))
def split_doc_to_page(pdf_path, folder_name):
   print("********************")
   print(PurePath.joinpath(pdf_path,"1_08_raw_rockefeller.pdf"))
   # ------------------to be changed next line
   pages = convert_from_path(PurePath.joinpath(pdf_path,"1_08_raw_rockefeller.pdf"))

   for page in range(1, len(pages)+1):
       Path.mkdir(PurePath.joinpath(pdf_path, "page_" + str(page)))

       page_path = PurePath.joinpath(pdf_path, "page_" + str(page), folder_name + '_' + str(page) + '.png')
       pages[page-1].save(page_path, 'PNG')#saves page to the directory


def split_folder_to_doc(pdf_path, associated_documents, folder_name):
   """

   :param pdf_path: the path up to the folder containing the pdfs
   :param associated_documents:
   :return:
   """
   #splits the folder pdfs
   start_pages = []
   pdf_location = PurePath.joinpath(pdf_path, "1_08_raw_rockefeller.pdf")
   for single_doc in associated_documents:
       start_pages.append(single_doc.first_page)
       list.sort(start_pages)
   folder_pdf = PdfFileReader(open(pdf_location, "rb"))
   for doc in associated_documents:
       if not os.path.exists(PurePath.joinpath(pdf_path, "doc_" + str(doc.id))):
           Path.mkdir(PurePath.joinpath(pdf_path, "doc_" + str(doc.id)))
       output = PdfFileWriter()
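       # Assumes first_page/last_page are 1-indexed and inclusive; getPage() is 0-indexed.
       for i in range(doc.first_page - 1, doc.last_page):
           output.addPage(folder_pdf.getPage(i))
       # Write the document pdf into its doc_<id> directory, not the working directory.
       doc_pdf_path = PurePath.joinpath(pdf_path, "doc_" + str(doc.id), "doc" + str(doc.id) + ".pdf")
       with open(doc_pdf_path, "wb") as outputStream:
           output.write(outputStream)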

       split_doc_to_page(pdf_path, folder_name)

IBM 704 Instruction Labels

Provide short descriptions of highlighted instructions at top of page while running.

Integrate World Map of Letters

Develop a map that allows users to sort for letters from a certain place (i.e. letters from professors at other universities, areas, etc.)

Let IBM 704 simulator Handle Operations with Negative Codes

Site Favicon

Track Mentions of People

Use the OCR text to track mentions of people, not just author and correspondents. Possibly use NLTK analysis like in Gender Novels to compare how they are described, referred to?
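A first pass at mention tracking could be a plain whole-word search for known surnames in the OCR text, before bringing in NLTK. The sketch below assumes a list of surnames is already available from the metadata; the function name and the sample sentence are made up.

```python
import re
from collections import Counter


def count_mentions(text, surnames):
    """Count whole-word, case-insensitive occurrences of each surname."""
    counts = Counter()
    for surname in surnames:
        pattern = r"\b" + re.escape(surname) + r"\b"
        counts[surname] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# Made-up text standing in for one document's OCR output.
sample_text = "Prof. Morse wrote to Verzuh. Verzuh replied to Morse and to MORSE's office."
mentions = count_mentions(sample_text, ["Morse", "Verzuh"])
```

Surname matching will over-count when two people share a surname, so this is a starting point rather than a full solution.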

Merge "unknown" and "No name" in the organization field of the metadata

"No name" should be "unknown"
This can probably be done directly in the google sheet.
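If the fix ends up in the import code instead of the sheet, a small normalization step would do it. This is a sketch; the function name is hypothetical, and treating an empty cell as "unknown" is an assumption that goes beyond what the issue asks for.

```python
# Values treated as "no organization given". The empty string is an assumption.
UNKNOWN_ALIASES = {"no name", "unknown", ""}


def normalize_organization(value):
    """Map "No name" (and similar) to "unknown"; leave real names untouched."""
    cleaned = (value or "").strip()
    if cleaned.lower() in UNKNOWN_ALIASES:
        return "unknown"
    return cleaned


examples = [normalize_organization(v) for v in ["No name", "unknown", " MIT ", None]]
```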

Network visualization — Force-directed graph

Create network diagram to map all of the key players in correspondence

D3 example: https://observablehq.com/@d3/force-directed-graph
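D3 force-directed examples typically read a JSON object with a `nodes` list and a `links` list. A sketch of producing that shape from (author, recipient) pairs follows; the function name and the sample pairs are made up, and the pairs would come from the document metadata in practice.

```python
import json
from collections import Counter


def build_graph(letters):
    """Turn (author, recipient) pairs into a nodes/links dict.

    Direction is ignored: A->B and B->A count toward the same link.
    """
    weights = Counter(tuple(sorted(pair)) for pair in letters)
    people = sorted({name for pair in weights for name in pair})
    return {
        "nodes": [{"id": name} for name in people],
        "links": [
            {"source": a, "target": b, "value": n}
            for (a, b), n in sorted(weights.items())
        ],
    }


# Made-up correspondence pairs.
letters = [("Morse", "Verzuh"), ("Verzuh", "Morse"), ("Morse", "Person X")]
graph = build_graph(letters)
graph_json = json.dumps(graph, indent=2)
```

The `value` field carries the number of letters exchanged, which the front end can map to link thickness.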

Make IBM Simulator Handle Tags

Info on how tags work with the index registers in this document:
http://bitsavers.org/pdf/ibm/704/24-6661-2_704_Manual_1955.pdf
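As a starting point, here is a sketch of effective-address calculation under my reading of the manual: the three tag bits select index registers A, B, and C (values 1, 2, and 4), the contents of all selected registers are ORed together, and the result is subtracted from the address field. The names below are hypothetical and the behavior should be checked against the manual before it goes into the simulator.

```python
ADDRESS_MASK = 0o77777  # 15-bit address field


def effective_address(address, tag, index_registers):
    """Return the address after index-register modification.

    tag is the 3-bit tag field; bit value 1 selects register "A",
    2 selects "B", 4 selects "C". A tag of 0 leaves the address unchanged.
    """
    combined = 0
    for bit, name in ((1, "A"), (2, "B"), (4, "C")):
        if tag & bit:
            combined |= index_registers[name]
    # Subtraction wraps around within the 15-bit address space.
    return (address - combined) & ADDRESS_MASK


registers = {"A": 3, "B": 0, "C": 0}
untagged = effective_address(100, 0, registers)
tagged = effective_address(100, 1, registers)
```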

Pressing Clear on Some Demos Breaks the Demo

Recognizing Similar Names

Ideally, this would happen during the metadata import step: names that are similar would be changed to the correct name. I noticed that there are different variations of the same name.

For example:

"Morse, Philip M."
"Morse, Philip"

This function will identify those two names as being the same and change the second one to "Morse, Philip M."

Another example: if a person accidentally typed "Phillip" instead of "Philip", this function will recognize that both spellings refer to the same person.
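One possible approach uses `difflib` from the standard library to map each incoming name to the closest known name. The function name and the 0.8 threshold are assumptions and would need tuning against the real metadata, since distinct people with similar names could be merged by mistake.

```python
from difflib import SequenceMatcher


def canonical_name(name, known_names, threshold=0.8):
    """Return the most similar known name, or name itself if none is close."""
    best, best_score = None, 0.0
    for known in known_names:
        score = SequenceMatcher(None, name.lower(), known.lower()).ratio()
        if score > best_score:
            best, best_score = known, score
    return best if best_score >= threshold else name


known = ["Morse, Philip M."]
short_form = canonical_name("Morse, Philip", known)
misspelled = canonical_name("Morse, Phillip M.", known)
unrelated = canonical_name("Verzuh", known)
```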

IBM 704 Simulator Tutorial

Display implemented operations on general assembler page

Community Documentation

Improve the readme file (add a history section, like in https://github.com/tesseract-ocr/tesseract), address the ambiguity of the new_developers.md file for both in-lab and public collaborators, bring in a code of conduct, consider renaming the important_info directory, add a contributing file, and potentially include tutorials in the future.

Survey similar sites for design possibilities

Looking at further, related projects might provide more ideas for features and layout.

Some examples could be:

Mapping the Republic of Letters (Stanford):
http://republicofletters.stanford.edu

Six Degrees of Francis Bacon:
http://www.sixdegreesoffrancisbacon.com/?ids=10000473&min_confidence=60&type=network

"Mitford's Worlds," in Digital Mitford: The Mary Russell Mitford Archive:
http://digitalmitford.org/visual.html

Vincent van Gogh: The Letters:
http://vangoghletters.org/vg/

Will keep adding to this list!

Where is the scan of Box 1, Folder 13, (Computation Center budget Rockefeller)?

Folder 1.13 seems to be the only one missing on Google Drive. Does anyone know what happened to it? @elenaboal maybe?

Root Directory

I think it would make the most sense to have it redirect to /dj_comp_hist/

Scan and upload folder-level pdfs on Google Drive

Currently missing folders on google drive (we may still have some unprocessed scans on USB sticks):

Box 1

Folder 13
Folders 22-26
Folder 28

Box 2

Complete

Box 3

Folders 10-31
Folders 33-37

CLA doesn't store negative numbers properly

I have already fixed this on my own branch; it will be fixed on master after the next pull request.

Simulations Home Page

Would anyone be willing to create a rudimentary home page for the simulations app that automatically lists links to all the simulations? Also a link in the navbar to these simulations. @elenaboal and @ktmurray1999 , @srisi mentioned you guys currently are relatively free; if not, I can cover it.

Make general IBM Assembler nicer looking

Unknown/No Name organizations

Currently, documents that are not associated with any organization can be accessed through the "unknown" and "no name" entries in the list of organizations. This should not be possible; access to these documents should go through other routes, such as author or recipient. (Make unknown/no name not show up when listing organizations.)

Getting Floating Point Numbers on IBM 704 Words Always Returns Positive

Just noticed this and fixed it on my branch. This will be fixed in the next pull request.
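For reference, a sketch of sign-aware decoding, assuming the 704 floating-point layout of one sign bit, an 8-bit characteristic in excess-128 form, and a 27-bit fraction. This is not the code from the branch mentioned above, and the function name is hypothetical.

```python
FRACTION_BITS = 27


def decode_float(word):
    """Decode a 36-bit floating-point word into a Python float."""
    sign = -1.0 if (word >> 35) & 1 else 1.0
    characteristic = (word >> FRACTION_BITS) & 0xFF
    fraction = word & ((1 << FRACTION_BITS) - 1)
    # The fraction is a binary fraction below 1; the exponent is stored excess-128.
    return sign * (fraction / (1 << FRACTION_BITS)) * 2.0 ** (characteristic - 128)


plus_one = (129 << FRACTION_BITS) | (1 << 26)  # fraction 0.5, exponent +1
minus_one = plus_one | (1 << 35)               # same word with the sign bit set
```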

Create a timeline

Create an interactive timeline that captures the dates of the various documents in our archive.

Possible model: https://www.darwinproject.ac.uk/letters/darwins-letters-timeline

Add IBM Simulator to Django Config

Network Visualization - Develop matrix of associations

Develop a graph/chart/visualization of associations that show which people corresponded in circles the most.

See https://bost.ocks.org/mike/miserables/ for examples
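The data behind a matrix view like that one is a symmetric co-occurrence table. A sketch of building it from (author, recipient) pairs is below; the function name and sample pairs are made up.

```python
def association_matrix(letters):
    """Return (people, matrix) where matrix[i][j] counts letters between i and j."""
    people = sorted({name for pair in letters for name in pair})
    index = {name: i for i, name in enumerate(people)}
    matrix = [[0] * len(people) for _ in people]
    for author, recipient in letters:
        i, j = index[author], index[recipient]
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return people, matrix


# Made-up correspondence pairs.
people, matrix = association_matrix(
    [("Morse", "Verzuh"), ("Verzuh", "Morse"), ("Morse", "Person X")]
)
```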

Learn (MSC + Staff) and demo Django in PyCharm

MSC will learn how to better integrate Django into PyCharm.

Add full text to document model

Modify the document model to store the full text of the document so it can be used in a fulltext search.

?Presumably? (@mscuthbert ) this means telling django to store the text in an FTS4 table.

Sample code to read the text for one file:

# Runs inside a loop over documents (hence the `continue`).
txt_path = get_file_path(box=box_id, folder=folder_id, foldername_short=foldername_short,
                         doc_id=doc_id, path_type='absolute', file_type='txt')
try:
    with open(txt_path, 'r') as f:
        text = f.read()
except FileNotFoundError:
    print(f'skipped {txt_path}')
    continue
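To get a feel for what an FTS4 table offers, here is a standalone sketch using the standard-library `sqlite3` module rather than Django. It assumes the local SQLite build has FTS4 enabled; the table name, column names, and sample rows are made up, and how to wire this into the Django model is still an open question.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS4 table and load (doc_id, text) rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE document_text USING fts4(doc_id, text)")
    conn.executemany(
        "INSERT INTO document_text (doc_id, text) VALUES (?, ?)", rows
    )
    return conn


def search(conn, query):
    """Return the doc_ids whose text matches the full-text query."""
    cursor = conn.execute(
        "SELECT doc_id FROM document_text WHERE text MATCH ?", (query,)
    )
    return [row[0] for row in cursor.fetchall()]


# Made-up rows standing in for the text files read above.
conn = build_index([
    ("doc_1", "The Computation Center budget for next year."),
    ("doc_2", "Notes on the IBM 704 installation."),
])
budget_hits = search(conn, "budget")
```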

Advanced Search + Basic Search functionality

@Carol217 — @felixtran39 #182 changed the Advanced Search hyperlink so that it would not throw an error, but now search functionality is redundant across basic and advanced search. Can you pull the basic Search back to not have the Advanced Search boilerplate on top?

track buzzwords over time

Populate from Metadata docs

Need docs on how to populate from Metadata

Build Cubbies

IKEA box on the floor -- MechEng folks!

Floating Point Operations on IBM 704

Metadata number of rows tracker

Could someone write a little script on the .csv metadata file that makes a list of people and number of scans? Like:

stephan: 56
lisa: 54
myke: 20
bob: 1

this would help staff to figure out how we're doing (and nice bragging rights for the top five :-) ).
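Something along these lines would probably do. The column name `entered_by` is a guess, since the issue does not say which column of the metadata file records who entered a row, and the sample data is made up.

```python
import csv
import io
from collections import Counter


def rows_per_person(csv_file, column="entered_by"):
    """Count metadata rows per person, ignoring case and blank cells."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        person = (row.get(column) or "").strip().lower()
        if person:
            counts[person] += 1
    return counts


# Made-up data standing in for the real .csv metadata file.
sample = io.StringIO("doc_id,entered_by\n1,stephan\n2,lisa\n3,Stephan\n")
counts = rows_per_person(sample)
for person, n in counts.most_common():
    print(f"{person}: {n}")
```

For the real file, pass `open(path, newline="")` instead of the `StringIO` object.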

Find Photos

Look for photos of different main characters.
This can be put on their webpage.

Find and Solve Common OCR Mistakes

OCR makes our lives a lot easier, but even the best documents can sometimes confuse it. Often, very common phrases or words get jumbled: "M.I.T." is sometimes read as "M.1.T.", for instance. It would help to have a quasi-reliable way of catching some of the most common mistakes in order to help our analysis.
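One simple way to start is a hand-maintained table of regex corrections applied to the OCR text. In the sketch below only the "M.1.T." case comes from the example above; the lowercase-l and vertical-bar variants are guesses at similar confusions, and the names are hypothetical.

```python
import re

# (pattern, replacement) pairs. Only the "1" variant is a confirmed mistake;
# "l" and "|" are assumed look-alikes for the capital I.
OCR_FIXES = [
    (re.compile(r"\bM\.[1l|]\.T\."), "M.I.T."),
]


def fix_common_ocr_errors(text):
    """Apply every known correction to the text and return the result."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text


fixed = fix_common_ocr_errors("Sent from M.1.T. today")
```

New corrections can be added to `OCR_FIXES` as they are spotted, without touching the function.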