LabTwinChallenge

Explanations and instructions

The answers to the questions are in Labtwin answers.pdf.
All the code is in labtwin_data_normalization.py.

The output from when I ran it on my computer is in papers/papers_aggregated_result.txt.

Running instructions:

It uses the following libraries. Only BeautifulSoup and inflect need to be installed; the others are part of the Python standard library:

  • codecs
  • unicodedata
  • BeautifulSoup
  • functools
  • os
  • re
  • inflect
  • string

At the top of the script, there are two variables that represent the output file (initially empty) and the folder it is in. Their values might need to be changed by the user before running the program:

    result_name = "papers_aggregated.txt"
    # 'result_name' is the name of the file which will contain the aggregate of the files
    # initially it should be an empty file

    path = "C:/Users/Valdi/Documents/papers/"
    # 'path' is the folder which contains the papers and an initially empty text file whose name is the value of 'result_name'

Then it can be run from the command line with "python labtwin_data_normalization.py".
After running, the output should be in path/result_name.
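
Before running, it can help to check that the third-party libraries are importable. The following is a minimal sketch, not part of the script: the helper name `missing_modules` is made up for illustration, and the import names `bs4` and `inflect` are assumptions about how the script imports BeautifulSoup and inflect.

```python
import importlib.util

# Assumed import names for the two third-party libraries; the actual
# script may import BeautifulSoup under a different module name.
THIRD_PARTY = ["bs4", "inflect"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(THIRD_PARTY)
    if missing:
        print("Missing libraries: " + ", ".join(missing))
    else:
        print("All third-party libraries are available.")
```

The standard-library modules in the list (codecs, unicodedata, functools, os, re, string) do not need this check.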

Explanation of the script:

- The script has 7 functions that perform string operations; each has a docstring with a detailed explanation.

- It contains one loop that goes through the directory in 'path' and parses each html file in the directory in alphabetical order. 

- Inside the loop, the program
*	first reads each HTML file using codecs,
*	then uses BeautifulSoup to extract the paragraphs from it,
*	changes the numbers to spoken form using the helper functions and inflect,
*	joins the paragraphs back into one string using the reduce operation from functools and a helper function,
*	converts the result into printable ASCII using unicodedata,
*	throws away URLs using a regular expression,
*	removes punctuation using a helper function,
*	and finally writes the result to the file, along with the following separator before each paper:
-		"\n" + "*** Original paper file name: " + file_name  + " ***\n\n"
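
The string-cleaning steps in the list above can be sketched with the standard library alone. This is an illustration under assumptions, not the code from labtwin_data_normalization.py: the function names, the URL pattern, and the choice of NFKD normalization are all hypothetical, and the BeautifulSoup and inflect steps are left out.

```python
import re
import string
import unicodedata
from functools import reduce

# Hypothetical URL pattern; the script's actual regular expression may differ.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def join_paragraphs(paragraphs):
    """Fold a list of paragraph strings into one newline-separated string."""
    return reduce(lambda acc, p: acc + "\n" + p if acc else p, paragraphs, "")


def to_ascii(text):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def remove_urls(text):
    """Delete anything that looks like a URL."""
    return URL_PATTERN.sub("", text)


def remove_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def normalize(paragraphs):
    """Apply the steps in the same order as the list above."""
    text = join_paragraphs(paragraphs)
    text = to_ascii(text)
    text = remove_urls(text)
    return remove_punctuation(text)
```

URLs are removed before punctuation on purpose: once the colons, slashes and dots are gone, a URL can no longer be recognized by a pattern like this one.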

More details are in the comments in the script.
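
The outer loop and the separator can be sketched as follows. This is a hedged outline rather than the script itself: the function name `aggregate`, the `process` placeholder (standing in for the per-paper cleaning steps), the ".html" extension filter, and the UTF-8 encoding are assumptions.

```python
import codecs
import os


def aggregate(path, result_name, process=lambda text: text):
    """Write every .html file in `path`, in sorted order, to the result file.

    `process` is a placeholder for the cleaning steps applied to each paper.
    """
    result_path = os.path.join(path, result_name)
    with codecs.open(result_path, "w", encoding="utf-8") as out:
        # sorted() gives a case-sensitive alphabetical order of file names.
        for file_name in sorted(os.listdir(path)):
            if not file_name.endswith(".html"):
                continue  # skips the result file and anything that is not a paper
            file_path = os.path.join(path, file_name)
            with codecs.open(file_path, "r", encoding="utf-8") as paper:
                text = paper.read()
            # Separator written before each paper, as described in the README.
            out.write("\n" + "*** Original paper file name: " + file_name + " ***\n\n")
            out.write(process(text))
    return result_path
```

A call such as `aggregate(path, result_name)` with the two variables from the top of the script would then produce path/result_name. Unlike the script as described, this sketch creates or overwrites the result file itself, so the file does not need to exist beforehand.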

