computation_hist's Issues

create an advanced search webpage

Create an advanced search option that lets you search specific model fields (author, document, folder, etc.) for specific inputs. Example: search for 'Verzuh' in the recipient field only.

AND/OR

Create a filterable page of texts (e.g., by adjusting the preferred page length so that only documents with 3 or more pages show up).
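
A minimal sketch of the two ideas above (field-specific search and a page-count filter), written in plain Python so it runs without the project. The `DocumentRecord` class, its field names, and the sample records are assumptions for illustration, not the actual models; with Django the same idea would probably be expressed through queryset filters such as `icontains` lookups.

```python
from dataclasses import dataclass


@dataclass
class DocumentRecord:
    # Hypothetical stand-in for the Document model; field names are assumptions.
    title: str
    author: str
    recipient: str
    number_of_pages: int


def advanced_search(records, min_pages=0, **field_queries):
    """Return records that have at least min_pages pages and whose named
    fields each contain the given query text (case-insensitive)."""
    results = []
    for record in records:
        if record.number_of_pages < min_pages:
            continue
        if all(query.lower() in str(getattr(record, field)).lower()
               for field, query in field_queries.items()):
            results.append(record)
    return results


# Made-up sample data
records = [
    DocumentRecord("Budget memo", "Morse, Philip M.", "Verzuh", 4),
    DocumentRecord("Reply to memo", "Verzuh", "Morse, Philip M.", 1),
]

# 'Verzuh' in the recipient field only
recipient_matches = advanced_search(records, recipient="Verzuh")
# only documents with 3 or more pages
long_documents = advanced_search(records, min_pages=3)
```

Passing several keyword arguments combines them with AND; an OR mode would need an extra flag that switches `all` to `any`.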

Annotations about Pages

Potentially add relevant information about the page itself that can't be represented in a text file (e.g., text crossed out by hand, handwritten notes on the page).

Simulator STO function is slightly wrong

The STO function stores the literal binary value with truncation instead of respecting the fixed-point representation of the accumulator, which will probably cause problems with negative numbers.

Meta-Issue: Make Issues

Everyone continuing from last term: make at least one issue that you know needs to be done.
IAP group: make two!

Assign either yourself or someone else (w/ their permission) or leave "Assignees" blank.

Choose one or more appropriate Labels from the "Labels" group.

Storing PDFs

Started the code:
The folder pdfs need to be split into pdf and png files. These files should be stored in a directory that is auto-generated. There should be a separate png and pdf for each page and a separate pdf for each document.
The directory should be laid out as follows:

/data/web_test_set/
    /'folder1 name'/
        /doc1/
            doc1.pdf
            /page1/
                page1.png
                page1.pdf
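
A rough sketch of how the directory tree above could be generated with `pathlib`. The function name `build_layout`, the `pages_per_doc` mapping, and the `doc_<id>` / `page_<n>` naming are assumptions for illustration; the example writes into a temporary directory rather than /data/web_test_set/.

```python
import tempfile
from pathlib import Path


def build_layout(root, folder_name, pages_per_doc):
    """Create <root>/<folder_name>/doc_<id>/page_<n>/ for every page.

    pages_per_doc maps a document id to its number of pages (hypothetical input).
    Returns the list of page directories that now exist.
    """
    page_dirs = []
    for doc_id, page_count in pages_per_doc.items():
        doc_dir = Path(root) / folder_name / f"doc_{doc_id}"
        for page_number in range(1, page_count + 1):
            page_dir = doc_dir / f"page_{page_number}"
            # parents=True creates the folder and doc directories as needed;
            # exist_ok=True makes the function safe to run twice
            page_dir.mkdir(parents=True, exist_ok=True)
            page_dirs.append(page_dir)
    return page_dirs


with tempfile.TemporaryDirectory() as tmp:
    created = build_layout(tmp, "rockefeller", {1: 2, 2: 1})
    relative_paths = [p.relative_to(tmp).as_posix() for p in created]
    all_exist = all(p.is_dir() for p in created)
```

The pdf and png files for each page would then be written into the matching `page_<n>` directory.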

The code below has not been pushed because it doesn't complete the task, but it creates all of the directories. It can take a pdf and split it down to the individual pagei.png files.

@@ -1,7 +1,12 @@
import os
from pathlib import Path, PurePath
from django.db import models
from common import make_searchable_pdf
import sys
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdf2image import convert_from_path


base_path = Path(os.path.abspath(os.path.dirname(__file__)))
# sys.path entries must be strings; path objects are ignored during import
sys.path.insert(0, str(PurePath.joinpath(base_path.parent, "djweb")))

@@ -12,12 +17,14 @@ from dj_comp_hist.models import Person, Document, Box, Folder, Organization, Pag
path_to_boxes = PurePath.joinpath(base_path.parent,"computation_hist","data","web_test_set")



def create_sub_folders(path_to_boxes, foldername_short):
   """
   To run this code :
   sys.path
   sys.path.insert(0, '/Users/ifeademolu-odeneye/Documents/GitHub/computation_hist/computation_hist') - replace with your file path
   import dj_comp_hist
   from  dj_comp_hist import models
   import sys
   sys.path.insert(0, '/Users/ifeademolu-odeneye/Documents/GitHub/computation_hist/computation_hist')
   import sort_pdfs
   sort_pdfs.create_sub_folders(sort_pdfs.path_to_boxes, "rockefeller")

@@ -27,18 +34,53 @@ def create_sub_folders(path_to_boxes, foldername_short):
   :return:
   """

   box = str(models.Folder.objects.get(name=foldername_short).box) #note this is a string
   root = PurePath.joinpath(path_to_boxes,  box)
   if not os.path.exists(root):
       Path.mkdir(root)
   path_folder_pdf = PurePath.joinpath(root, foldername_short)
   if not os.path.exists(path_folder_pdf):
       Path.mkdir(path_folder_pdf)
   associated_documents = models.Folder.objects.get(name=foldername_short).document_set.all()


   split_folder_to_doc(path_folder_pdf, associated_documents, foldername_short)

   root = PurePath.joinpath(root, foldername_short)
   if not os.path.exists(root):
       Path.mkdir(root)

   for doc in models.Folder.objects.get(name=foldername_short).document_set.all():
       Path.mkdir(PurePath.joinpath(root, "doc_" + str(doc.id)))
       for i in range(1,doc.number_of_pages+1):
           Path.mkdir(PurePath.joinpath(root, "doc_" + str(doc.id), "page_"+str(i)))


def split_doc_to_page(pdf_path, folder_name):
   # ------------------to be changed: the source pdf name is still hard-coded
   source_pdf = PurePath.joinpath(pdf_path, "1_08_raw_rockefeller.pdf")
   print(source_pdf)
   pages = convert_from_path(str(source_pdf))

   # page directories and file names are numbered from 1
   for page_number, page in enumerate(pages, start=1):
       page_dir = Path(pdf_path, "page_" + str(page_number))
       page_dir.mkdir(exist_ok=True)  # don't fail if the directory already exists

       page_path = page_dir / (folder_name + '_' + str(page_number) + '.png')
       page.save(str(page_path), 'PNG')  # saves page to the directory


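def split_folder_to_doc(pdf_path, associated_documents, folder_name):
   """
   Splits the folder pdf into one pdf per document.

   :param pdf_path: the path up to the folder containing the pdfs
   :param associated_documents: the documents that belong to the folder
   :param folder_name: short folder name, used to name the page images
   :return:
   """
   # collected in sorted order, not used yet
   start_pages = sorted(single_doc.first_page for single_doc in associated_documents)
   # the source pdf name is still hard-coded
   pdf_location = PurePath.joinpath(pdf_path, "1_08_raw_rockefeller.pdf")
   # keep the source file open until every document pdf has been written
   with open(pdf_location, "rb") as folder_file:
       folder_pdf = PdfFileReader(folder_file)
       for doc in associated_documents:
           doc_dir = Path(pdf_path, "doc_" + str(doc.id))
           doc_dir.mkdir(exist_ok=True)
           output = PdfFileWriter()
           # NOTE: getPage is 0-indexed; if first_page/last_page are 1-indexed and
           # inclusive, this range should be (doc.first_page - 1, doc.last_page)
           for i in range(doc.first_page, doc.last_page):
               output.addPage(folder_pdf.getPage(i))
           # write the document pdf into its own directory, not the working directory
           with open(doc_dir / ("doc" + str(doc.id) + ".pdf"), "wb") as outputStream:
               output.write(outputStream)

   # split into page images once, after all document pdfs have been written
   split_doc_to_page(pdf_path, folder_name)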

Integrate World Map of Letters

Develop a map that allows users to sort for letters from a certain place (i.e. letters from professors at other universities, areas, etc.)

Track Mentions of People

Use the OCR text to track mentions of people, not just author and correspondents. Possibly use NLTK analysis like in Gender Novels to compare how they are described, referred to?
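
As a starting point before any NLTK analysis, mentions could be counted with a plain regular-expression search over the OCR text. This is only a sketch: the function name `count_mentions`, the sample sentence, and matching on last names alone are assumptions.

```python
import re
from collections import Counter


def count_mentions(ocr_text, names):
    """Count whole-word, case-insensitive occurrences of each name in the text."""
    counts = Counter()
    for name in names:
        # re.escape keeps punctuation in names from being read as regex syntax
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, ocr_text, flags=re.IGNORECASE))
    return counts


# Made-up sample text
sample_text = "Dear Professor Morse, Dr. Verzuh asked whether Morse had seen the report."
mentions = count_mentions(sample_text, ["Morse", "Verzuh", "Smith"])
```

Matching on last names alone will confuse people who share a surname, so a real version would likely need the surrounding context or the full name.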

Recognizing Similar Names

Ideally, this happens in the metadata import step: names that are similar to a known name are changed to the correct name. I noticed there are several variations of the same name.

For example:

"Morse, Philip M."
"Morse, Philip"

This function will identify those two names as being the same and modify the second one to Morse, Philip M.

Another example: if a person accidentally spelled "Phillip" instead of "Philip", this function will recognize that they are the same person.
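
One possible approach, sketched with the standard library's `difflib`: compare each incoming name against a list of canonical names and replace it when the similarity is above a cutoff. The function name `normalize_name`, the canonical list, and the 0.85 cutoff are assumptions that would need tuning on the real metadata.

```python
from difflib import get_close_matches


def normalize_name(name, canonical_names, cutoff=0.85):
    """Return the closest canonical name, or the input unchanged if none is close enough."""
    matches = get_close_matches(name, canonical_names, n=1, cutoff=cutoff)
    return matches[0] if matches else name


canonical_names = ["Morse, Philip M."]
raw_names = ["Morse, Philip", "Morse, Phillip M.", "Verzuh"]
normalized = [normalize_name(raw, canonical_names) for raw in raw_names]
```

A cutoff that is too low could merge two different people with similar names, so borderline matches are probably better flagged for manual review than changed automatically.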

Community Documentation

Improve the readme file (add a history section, like in https://github.com/tesseract-ocr/tesseract); resolve the ambiguity of the new_developers.md file for both in-lab and public collaborators; bring in a code of conduct; consider renaming the important_info directory; add a contributing file; potentially include tutorials in the future.

Survey similar sites for design possibilities

Looking at further, related projects might provide more ideas for features and layout.

Some examples could be:

Mapping the Republic of Letters (Stanford):
http://republicofletters.stanford.edu

Six Degrees of Francis Bacon:
http://www.sixdegreesoffrancisbacon.com/?ids=10000473&min_confidence=60&type=network

"Mitford's Worlds," in Digital Mitford: The Mary Russell Mitford Archive:
http://digitalmitford.org/visual.html

Vincent van Gogh: The Letters:
http://vangoghletters.org/vg/

Will keep adding to this list!

Root Directory

I think it would make the most sense to have it redirect to /dj_comp_hist/

Simulations Home Page

Would anyone be willing to create a rudimentary home page for the simulations app that automatically lists links to all the simulations? Also a link in the navbar to these simulations. @elenaboal and @ktmurray1999 , @srisi mentioned you guys currently are relatively free; if not, I can cover it.

Unknown/No Name organizations

Currently, documents that are not associated with any organization can be accessed through the "unknown" and "no name" entries in the list of organizations. This should not be possible; access to these documents should go through other routes such as author or recipient. (Make unknown/no name not show up when listing organizations.)

Add full text to document model

Modify the document model to store the full text of the document so it can be used in a fulltext search.

?Presumably? (@mscuthbert ) this means telling django to store the text in an FTS4 table.

Sample code to read the text for one file:

# runs inside a loop over the documents, hence the `continue`
txt_path = get_file_path(box=box_id, folder=folder_id, foldername_short=foldername_short,
                         doc_id=doc_id, path_type='absolute', file_type='txt')
try:
    with open(txt_path, 'r') as f:
        text = f.read()
except FileNotFoundError:
    print(f'skipped {txt_path}')
    continue
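
For reference, a standalone sketch of what an FTS4 table looks like in SQLite, using Python's built-in `sqlite3` module. It does not show how Django would be wired to such a table; the table name `document_text`, its columns, and the sample rows are assumptions, and it requires a SQLite build with FTS4 enabled.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# FTS4 virtual table; raises sqlite3.OperationalError if FTS4 is not compiled in
connection.execute("CREATE VIRTUAL TABLE document_text USING fts4(doc_id, full_text)")

# Made-up sample rows
connection.executemany(
    "INSERT INTO document_text (doc_id, full_text) VALUES (?, ?)",
    [
        ("1", "Letter about the budget for the computation center"),
        ("2", "Notes on a seminar"),
    ],
)

# MATCH runs a full-text query against the indexed columns
rows = connection.execute(
    "SELECT doc_id FROM document_text WHERE document_text MATCH ?",
    ("budget",),
).fetchall()
connection.close()
```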

Metadata number of rows tracker

Could someone write a little script on the .csv metadata file that makes a list of people and number of scans? Like:

stephan: 56
lisa: 54
myke: 20
bob: 1

this would help staff to figure out how we're doing (and nice bragging rights for the top five :-) ).
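
A small sketch of such a script using `csv` and `collections.Counter`. The column name `scanned_by` is an assumption, since the actual header of the metadata file is not given here; the example reads from an in-memory string instead of the real .csv file.

```python
import csv
import io
from collections import Counter


def count_scans(csv_file, column="scanned_by"):
    """Count rows per person in the given column (hypothetical column name)."""
    reader = csv.DictReader(csv_file)
    # skip rows where the column is missing or empty
    return Counter(row[column].strip().lower() for row in reader if row.get(column))


# Made-up sample data standing in for the metadata file
sample_csv = io.StringIO("doc_id,scanned_by\n1,stephan\n2,lisa\n3,stephan\n")
scan_counts = count_scans(sample_csv)

# most_common() sorts from the highest count to the lowest
for person, count in scan_counts.most_common():
    print(f"{person}: {count}")
```

For the real file, `open(path, newline="")` would replace the `io.StringIO` object.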

Find Photos

Look for photos of the different main characters.
These can be put on their webpages.

Find and Solve Common OCR Mistakes

OCR makes our lives a lot easier, but even the best documents can sometimes confuse it. Often times, very common phrases/words can get jumbled - "M.I.T." is sometimes translated as "M.1.T.", for instance. It would help if there is a quasi-reliable way of catching some of the most common mistakes in order to help our analysis.
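
One simple way to catch mistakes like this is a table of regular expressions paired with their corrections. The sketch below handles only the "M.I.T." case; the `OCR_CORRECTIONS` table and the function name are assumptions, and the list would grow as more common errors are found.

```python
import re

# Hypothetical correction table: (compiled pattern, replacement)
OCR_CORRECTIONS = [
    # the middle letter "I" is often read as the digit 1 or a lowercase l
    (re.compile(r"\bM\.[1lI]\.T\."), "M.I.T."),
]


def fix_common_ocr_mistakes(text):
    """Apply every known correction to the text and return the result."""
    for pattern, replacement in OCR_CORRECTIONS:
        text = pattern.sub(replacement, text)
    return text


fixed = fix_common_ocr_mistakes("Sent from M.1.T. to the M.l.T. library.")
```

Patterns should stay narrow, since a broad rule (for example, replacing every "1" with "I") would damage dates and page numbers.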

Navbar Design

How should we design the navbar(s?) for the Digital Archive, Simulations, and other parts of the site?
