dhmit / computation_hist
Archival History of the MIT Computation Center
License: BSD 3-Clause "New" or "Revised" License
We need a page that displays both the document pdf and the document metadata.
Started the code:
The pdfs need to be split into pdfs and png files. These files should be stored in a directory that is auto-generated. There should be a separate png and pdf for each page and a separate pdf for each document.
The directory layout should be as follows:
/data/web_test_set/
    /'folder1 name'/
        /doc1/
            doc1.pdf
            /page1/
                page1.png
                page1.pdf
The code below has not been pushed because it doesn't complete the task, but it creates all of the directories. It can take a pdf and split it down to the individual page PNG files.
@ -1,7 +1,12 @@
import os
from pathlib import Path, PurePath
from django.db import models
from common import make_searchable_pdf
import sys
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdf2image import convert_from_path
base_path = Path(os.path.abspath(os.path.dirname(__file__)))
sys.path.insert(0, PurePath.joinpath(base_path.parent, "djweb"))
@ -12,12 +17,14 @@ from dj_comp_hist.models import Person, Document, Box, Folder, Organization, Pag
path_to_boxes = PurePath.joinpath(base_path.parent,"computation_hist","data","web_test_set")
def create_sub_folders(path_to_boxes, foldername_short):
    """
    To run this code :
    sys.path
    sys.path.insert(0, '/Users/ifeademolu-odeneye/Documents/GitHub/computation_hist
    /computation_hist') - replace with your file path
    import dj_comp_hist
    from dj_comp_hist import models
    import sys
    sys.path.insert(0, '/Users/ifeademolu-odeneye/Documents/GitHub/computation_hist/computation_hist')
    import sort_pdfs
    sort_pdfs.create_sub_folders(sort_pdfs.path_to_boxes, "rockefeller")
@ -27,18 +34,53 @@ def create_sub_folders(path_to_boxes, foldername_short):
    :return:
    """
    box = str(models.Folder.objects.get(name=foldername_short).box) #not this is a string
    box = str(models.Folder.objects.get(name=foldername_short).box) #note this is a string
    root = PurePath.joinpath(path_to_boxes, box)
    if not os.path.exists(root):
        Path.mkdir(root)
    path_folder_pdf = PurePath.joinpath(root, foldername_short)
    if not os.path.exists(path_folder_pdf):
        Path.mkdir(path_folder_pdf)
    associated_documents = models.Folder.objects.get(name=foldername_short).document_set.all()
    split_folder_to_doc(path_folder_pdf, associated_documents, foldername_short)
    root = PurePath.joinpath(root, foldername_short)
    if not os.path.exists(root):
        Path.mkdir(root)
    for doc in models.Folder.objects.get(name=foldername_short).document_set.all():
        Path.mkdir(PurePath.joinpath(root, "doc_" + str(doc.id)))
        for i in range(1,doc.number_of_pages+1):
            Path.mkdir(PurePath.joinpath(root, "doc_" + str(doc.id), "page_"+str(i)))
def split_doc_to_page(pdf_path, folder_name):
    print("********************")
    print(PurePath.joinpath(pdf_path,"1_08_raw_rockefeller.pdf"))
    # ------------------to be changed next line
    pages = convert_from_path(PurePath.joinpath(pdf_path,"1_08_raw_rockefeller.pdf"))
    for page in range(1, len(pages)+1):
        Path.mkdir(PurePath.joinpath(pdf_path, "page_" + str(page)))
        page_path = PurePath.joinpath(pdf_path, "page_" + str(page), folder_name + '_' + str(page) + '.png')
        pages[page-1].save(page_path, 'PNG')#saves page to the directory
def split_folder_to_doc(pdf_path, associated_documents, folder_name):
    """
    :param pdf_path: the path up to the folder containing the pdfs
    :param associated_documents:
    :return:
    """
    #splits the folder pdfs
    start_pages = []
    pdf_location = PurePath.joinpath(pdf_path, "1_08_raw_rockefeller.pdf")
    for single_doc in associated_documents:
        start_pages.append(single_doc.first_page)
    list.sort(start_pages)
    folder_pdf = PdfFileReader(open(pdf_location, "rb"))
    for doc in associated_documents:
        if not os.path.exists(PurePath.joinpath(pdf_path, "doc_" + str(doc.id))):
            Path.mkdir(PurePath.joinpath(pdf_path, "doc_" + str(doc.id)))
        output = PdfFileWriter()
        for i in range(doc.first_page,doc.last_page):
            output.addPage(folder_pdf.getPage(i))
        with open("doc" + str(doc.id) + ".pdf", "wb") as outputStream:
            output.write(outputStream)
    split_doc_to_page(pdf_path, folder_name)
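The directory-creation part of the task above could be sketched on its own, independent of Django and of the PDF libraries. This is only an illustration: `make_doc_dirs` and its `doc_page_counts` argument (a mapping from document id to page count) are hypothetical names, not part of the repository. The `doc_<id>` and `page_<n>` naming follows the diff above.

```python
from pathlib import Path


def make_doc_dirs(root, folder_name, doc_page_counts):
    """Create <root>/<folder_name>/doc_<id>/page_<n>/ for every page.

    doc_page_counts is a hypothetical mapping {document id: number of pages}.
    Returns the list of page directories, in creation order.
    """
    page_dirs = []
    for doc_id, n_pages in doc_page_counts.items():
        for page in range(1, n_pages + 1):
            page_dir = Path(root) / folder_name / f"doc_{doc_id}" / f"page_{page}"
            # parents=True creates the intermediate directories as needed;
            # exist_ok=True makes it safe to run the script more than once.
            page_dir.mkdir(parents=True, exist_ok=True)
            page_dirs.append(page_dir)
    return page_dirs
```

Using `mkdir(parents=True, exist_ok=True)` would remove the need for the repeated `os.path.exists` checks, at the cost of silently accepting directories left over from an earlier run.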
How should we design the navbar(s?) for the Digital Archive, Simulations, and other parts of the site?
In the medium term, we want to move from displaying the documents as png image files to displaying them as pdf files to mak
PDF.js seems like a good candidate for this: https://mozilla.github.io/pdf.js/
Develop a graph/chart/visualization of associations that show which people corresponded in circles the most.
See https://bost.ocks.org/mike/miserables/ for examples
Everyone continuing from last term: make at least one issue that you know needs to be done.
IAP group: make two!
Assign either yourself or someone else (w/ their permission) or leave "Assignees" blank.
Choose one or more appropriate Labels from the "Labels" group.
Develop a map that allows users to sort for letters from a certain place (i.e. letters from professors at other universities, areas, etc.)
Need docs on how to populate from Metadata
"No name" should be "unknown"
This can probably be done directly in the google sheet.
@Carol217 — @felixtran39 #182 changed the Advanced Search hyperlink so that it would not throw an error, but now search functionality is redundant across basic and advanced search. Can you pull the basic Search back to not have the Advanced Search boilerplate on top?
We need help from everyone in the lab in assembling the metadata of our corpus.
You can find a guide on how to enter metadata here: https://github.com/dhmit/computation_hist/blob/master/computation_hist/documentation/metadata.md
However, if you're new in the lab, feel free to ask one of the continuing lab members to show you what to do--there are a lot of edge cases.
Create an advanced search option that allows you to search model fields (author, document, folder, etc.) for specific inputs. Ex: search for 'Verzuh' in the recipient field only.
AND/OR
Create a filterable page of texts (ex: by adjusting preferred page length, so only documents with 3 or more pages show up, etc.)
Potentially adding relevant info about the page itself that can't be portrayed in a text file (ex. crossed out by hand, handwritten notes on the page, etc)
Ideally, this would happen in the metadata import step: names that are close variants of each other would be changed to the correct name. I noticed there are several variations of the same name.
For example:
"Morse, Philip M."
"Morse, Philip"
This function will identify those two names as being the same and change the second one to "Morse, Philip M."
Another example: if a person accidentally typed "Phillip" instead of "Philip", this function will recognize that they are the same person.
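One possible way to approach this is fuzzy string matching with the standard library's `difflib`. The sketch below is not the project's importer: `canonical_name`, `normalize`, and the 0.85 threshold are assumptions chosen for illustration, and the threshold would need tuning against the real metadata.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and drop periods and commas so punctuation is ignored."""
    return " ".join(name.lower().replace(".", "").replace(",", " ").split())


def canonical_name(name, known_names, threshold=0.85):
    """Return the closest entry in known_names, or name itself if none is close.

    threshold is an assumed cut-off on SequenceMatcher.ratio() (0.0 to 1.0).
    """
    best, best_score = name, 0.0
    for known in known_names:
        score = SequenceMatcher(None, normalize(name), normalize(known)).ratio()
        if score > best_score:
            best, best_score = known, score
    return best if best_score >= threshold else name


known = ["Morse, Philip M."]
print(canonical_name("Morse, Philip", known))
print(canonical_name("Morse, Phillip M.", known))
```

A similarity cut-off can also merge two different people with similar names, so it may be safer to log proposed merges for a human to confirm than to rewrite names automatically.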
Could someone write a little script on the .csv metadata file that makes a list of people and number of scans? Like:
stephan: 56
lisa: 54
myke: 20
bob: 1
this would help staff to figure out how we're doing (and nice bragging rights for the top five :-) ).
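A minimal sketch of such a script, using only the standard library. The column name `scanned_by` is an assumption (the issue does not name the real column in the metadata .csv), and the sample rows are invented. The sketch counts one scan per row.

```python
import csv
import io
from collections import Counter


def scans_per_person(csv_file, column="scanned_by"):
    """Count rows per person in an open CSV file, most scans first.

    column is a hypothetical header name; replace it with the real one.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        person = (row.get(column) or "").strip().lower()
        if person:  # skip rows where the column is empty or missing
            counts[person] += 1
    return counts.most_common()


# Invented sample data standing in for the metadata file.
sample = io.StringIO(
    "filename,scanned_by\n"
    "a.pdf,stephan\n"
    "b.pdf,lisa\n"
    "c.pdf,Stephan\n"
)
for person, n in scans_per_person(sample):
    print(f"{person}: {n}")
```

For the real file, the `io.StringIO` sample would be replaced by `open(path, newline="")` on the metadata .csv.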
Currently numbers are just placed in memory by Javascript. Would be nice to have pseudoinstructions like DEC to demonstrate that.
Modify the document model to store the full text of the document so it can be used in a fulltext search.
?Presumably? (@mscuthbert ) this means telling django to store the text in an FTS4 table.
Sample code to read the text for one file:
txt_path = get_file_path(box=box_id, folder=folder_id, foldername_short=foldername_short,
doc_id=doc_id, path_type='absolute', file_type='txt')
try:
with open(txt_path, 'r') as f:
text = f.read()
except FileNotFoundError:
print(f'skipped {txt_path}')
continue
STO function is storing literal binary with truncation, instead of the fixed point property of the accumulator, which will probably cause problems with negative numbers.
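To illustrate the difference, here is a sketch of sign-magnitude storage, assuming the simulator models the IBM 704's 36-bit word (one sign bit followed by a 35-bit magnitude). It is written in Python for illustration only and is not the simulator's own code; `to_word` and `from_word` are hypothetical names.

```python
MAGNITUDE_BITS = 35
MAGNITUDE_MASK = (1 << MAGNITUDE_BITS) - 1


def to_word(value):
    """Encode a Python int as a 36-bit sign-magnitude word.

    The sign is stored separately from the magnitude; magnitude bits
    beyond the 35th are dropped, as they do not fit in the word.
    """
    sign = 1 if value < 0 else 0
    magnitude = abs(value) & MAGNITUDE_MASK
    return (sign << MAGNITUDE_BITS) | magnitude


def from_word(word):
    """Decode a 36-bit sign-magnitude word back into a Python int."""
    magnitude = word & MAGNITUDE_MASK
    return -magnitude if (word >> MAGNITUDE_BITS) & 1 else magnitude


print(format(to_word(5), "036b"))
print(format(to_word(-5), "036b"))
```

Masking a negative number's two's-complement bits directly would produce a different bit pattern from the one above, which is the kind of problem the issue anticipates for negative values.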
Folder 1.13 seems to be the only one missing on Google Drive. Does anyone know what happened to it? @elenaboal maybe?
Currently, documents that are not associated with any organization can be accessed through the "unknown" and "no name" entries in the list of organizations. This should not be possible; access to those documents should be through other routes, such as author or recipient. (Make unknown/no name not show up when listing organizations.)
Looking at further, related projects might provide more ideas for features and layout.
Some examples could be:
Mapping the Republic of Letters (Stanford):
http://republicofletters.stanford.edu
Six Degrees of Francis Bacon:
http://www.sixdegreesoffrancisbacon.com/?ids=10000473&min_confidence=60&type=network
"Mitford's Worlds," in Digital Mitford: The Mary Russell Mitford Archive:
http://digitalmitford.org/visual.html
Vincent van Gogh: The Letters:
http://vangoghletters.org/vg/
Will keep adding to this list!
MSC will learn how to better integrate Django into Pycharm.
Would anyone be willing to create a rudimentary home page for the simulations app that automatically lists links to all the simulations? Also a link in the navbar to these simulations. @elenaboal and @ktmurray1999 , @srisi mentioned you guys currently are relatively free; if not, I can cover it.
Improved readme file (adding a history section, like in https://github.com/tesseract-ocr/tesseract), opportunity to consider the ambiguity of new_developers.md file for both in-lab and public collaborators, bring in code of conduct, consider renaming important_info directory, contributing file, potential to include tutorials in the future
Info on how tags work with the index registers in this document:
http://bitsavers.org/pdf/ibm/704/24-6661-2_704_Manual_1955.pdf
Use the OCR text to track mentions of people, not just author and correspondents. Possibly use NLTK analysis like in Gender Novels to compare how they are described, referred to?
I have already fixed this on my own branch; it will be fixed on master after the next pull request.
I think it would make the most sense to have it redirect to /dj_comp_hist/
IKEA box on the floor -- MechEng folks!
Create network diagram to map all of the key players in correspondence
D3 example: https://observablehq.com/@d3/force-directed-graph
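Before any drawing, the correspondence data has to be reduced to nodes and weighted links. A sketch of that step, assuming each letter is available as an (author, recipient) pair: the function names and the sample letters are invented for illustration, and the `nodes`/`links` layout follows the shape commonly used in D3 force-directed examples.

```python
import json
from collections import Counter


def correspondence_edges(letters):
    """Count letters exchanged between each unordered pair of people."""
    edges = Counter()
    for author, recipient in letters:
        if author and recipient and author != recipient:
            # Sort the pair so A->B and B->A count toward the same edge.
            edges[tuple(sorted((author, recipient)))] += 1
    return edges


def to_node_link(edges):
    """Convert edge counts into a {"nodes": [...], "links": [...]} dictionary."""
    people = sorted({person for pair in edges for person in pair})
    return {
        "nodes": [{"id": person} for person in people],
        "links": [
            {"source": a, "target": b, "value": count}
            for (a, b), count in sorted(edges.items())
        ],
    }


# Invented sample letters; real pairs would come from the Document model.
letters = [
    ("Morse, Philip M.", "Verzuh"),
    ("Verzuh", "Morse, Philip M."),
    ("Morse, Philip M.", "Example, Person"),
]
graph = to_node_link(correspondence_edges(letters))
print(json.dumps(graph, indent=2))
```

Treating the pairs as unordered gives one link per pair of correspondents. Keeping the direction instead would show who wrote to whom, but would double the number of links.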
Implement a transfer function and create a demo of a loop for the assembly simulator.
Currently missing folders on google drive (we may still have some unprocessed scans on USB sticks):
Folder 13
Folders 22-26
Folder 28
Complete
Folders 10-31
Folders 33-37
OCR makes our lives a lot easier, but even the best documents can sometimes confuse it. Very common phrases and words often get jumbled: "M.I.T." is sometimes read as "M.1.T.", for instance. It would help to have a quasi-reliable way of catching some of the most common mistakes, to support our analysis.
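One low-tech option is a hand-maintained table of regular-expression fixes applied to the OCR text. In the sketch below, only the "M.1.T." case comes from this issue; the second pattern is a hypothetical example of the same kind of confusion, and `fix_common_ocr_errors` is an illustrative name.

```python
import re

# (pattern, replacement) pairs; extend as new recurring errors are spotted.
OCR_FIXES = [
    # "M.1.T.", "M.l.T.", "M.|.T." -> "M.I.T." (from the issue)
    (re.compile(r"\bM\.[1l|]\.T\."), "M.I.T."),
    # Hypothetical: a lowercase L misread for a capital I at the start of a word.
    (re.compile(r"\blnstitute\b"), "Institute"),
]


def fix_common_ocr_errors(text):
    """Apply every known fix; return the corrected text and the number of fixes."""
    total = 0
    for pattern, replacement in OCR_FIXES:
        text, n = pattern.subn(replacement, text)
        total += n
    return text, total


fixed, n_fixes = fix_common_ocr_errors("Sent to M.1.T. and the lnstitute.")
print(fixed, n_fixes)
```

Returning the number of substitutions makes it possible to report which documents were most affected, which could help decide whether a scan should be re-OCRed.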
Just noticed this and fixed it on my branch. This will be fixed in the next pull request.
Create an interactive timeline that captures the dates of the various documents in our archive.
Possible model: https://www.darwinproject.ac.uk/letters/darwins-letters-timeline
Provide short descriptions of highlighted instructions at top of page while running.
Look for photos of different main characters.
This can be put on their webpage.