Data Carpentry with NLTK and IPython


This is the repository for teaching materials and additional resources used by Research Platforms Services to teach Python, IPython and the Natural Language Toolkit (NLTK).

All the materials used in NLTK workshops are in this repository. In fact, cloning this repository will be our first activity together as a group. To do that, just open your terminal and type/paste:

git clone https://github.com/resbaz/nltk.git

Though we'll be working with blank notebooks in our training sessions, everything we cover lives as a complete notebook in the resources/completed-notebooks directory. These notebooks are useful for remembering or extending what you learned during training. They may also be useful for those who cannot attend our sessions face-to-face.

Below is a basic overview of the four-session lesson plan. You can click the headings to view complete versions of the IPython Notebooks we'll be using in each session. The materials are always evolving, and pull requests are always welcome.

In this session, you will learn how to use IPython Notebooks, as well as how to complete basic tasks with Python/NLTK.

  • Getting up and running
  • What exactly are Python, IPython and NLTK?
  • Introduction to IPython Notebook
  • Overview of basic Python concepts: significant whitespace, input/output types, commands and arguments, etc.
  • Introduction to NLTK
  • Quickstart: US Inaugural Addresses Corpus
  • Plot key terms in the inaugural addresses longitudinally
  • Discussion: Why might we want to use NLTK? What are its limitations?
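To give a feel for what "plotting key terms longitudinally" involves, here is a minimal sketch in plain Python. It is not the code from the workshop notebook: the function name `term_counts_by_year` and the toy texts are invented for illustration, and the real session works with the US Inaugural Addresses Corpus that ships with NLTK.

```python
def term_counts_by_year(texts_by_year, term):
    """Map each year to the number of times `term` occurs in that year's text."""
    term = term.lower()
    return {
        year: text.lower().split().count(term)
        for year, text in sorted(texts_by_year.items())
    }

# Invented toy data standing in for a real, dated corpus
toy_corpus = {
    1801: "liberty and union and liberty",
    1901: "industry and union",
    2001: "freedom and liberty",
}

counts = term_counts_by_year(toy_corpus, "liberty")
for year, count in counts.items():
    print(year, count)
```

The resulting year-to-count mapping is the kind of data you would hand to a plotting library to draw a term's use over time.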

In this session, we devote more time to the fundamentals of Python, learning how to create and manipulate different kinds of data. In the first half of the session, we discuss:

  • Working with variables
  • Writing functions
  • Creating frequency distributions
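The three topics above fit together in a few lines. The sketch below uses only the Python standard library, with `collections.Counter` standing in for a frequency distribution; NLTK's own `FreqDist` offers a similar interface. The function name and the toy sentence are assumptions made for this example.

```python
from collections import Counter

def frequency_distribution(tokens):
    """Count how often each lowercased token occurs."""
    return Counter(token.lower() for token in tokens)

# A variable holding a toy sentence, split on whitespace for simplicity
tokens = "the cat sat on the mat and the dog sat too".split()

fdist = frequency_distribution(tokens)
print(fdist.most_common(2))  # the two most frequent tokens with their counts
```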

In the second half of the session, we put our existing skills to work in order to investigate the corpora that come bundled with NLTK. The major kinds of analysis we cover are:

  • Sentence splitting
  • Tokenisation
  • Keywords
  • n-grams
  • Collocates
  • Concordancing
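As a rough illustration of some of these analyses, the sketch below implements naive sentence splitting, tokenisation, n-grams and concordancing in plain Python. All function names here are invented for the example and the regular expressions are deliberately simplistic; in the session itself we rely on NLTK's own tools, which handle many more edge cases.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenise(sentence):
    """Return lowercased word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", sentence.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, word, width=2):
    """Return (left context, word, right context) for each occurrence of word."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = "The cat sat on the mat. The dog sat on the cat!"
sentences = split_sentences(text)
tokens = [tok for s in sentences for tok in tokenise(s)]

print(sentences)
print(ngrams(tokens, 2)[:3])
print(concordance(tokens, "sat"))
```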

By this point, we're familiar with what NLTK is and how to use it. It's time to put it to work on a novel dataset. We've chosen a corpus of Malcolm Fraser's speeches. In this session, we begin by:

  • Introducing the corpus
  • Exploring corpus metadata
  • Data structuring by metadata feature
  • Getting keywords, n-grams, and collocates
  • Part-of-speech tagging and parsing the data
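Structuring data by a metadata feature usually means grouping texts that share a value, such as the year a speech was given. The following is a hypothetical sketch: the record layout (`year`, `title`, `text`), the placeholder speeches and the function name are all assumptions, not the actual format of the Fraser corpus.

```python
from collections import defaultdict

def group_by_feature(records, feature):
    """Group the text of each record under its value for a metadata feature."""
    groups = defaultdict(list)
    for record in records:
        groups[record[feature]].append(record["text"])
    return dict(groups)

# Invented placeholder records; real metadata fields may differ
speeches = [
    {"year": 1975, "title": "Speech A", "text": "first placeholder text"},
    {"year": 1975, "title": "Speech B", "text": "second placeholder text"},
    {"year": 1980, "title": "Speech C", "text": "third placeholder text"},
]

by_year = group_by_feature(speeches, "year")
for year, texts in sorted(by_year.items()):
    print(year, len(texts))
```

Each group can then be treated as a subcorpus and analysed separately, which is what makes comparisons across years possible.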

Next, we'll use some purpose-built tools called corpkit to look for longitudinal changes in the language of Malcolm Fraser's speeches. These tools help us with:

  • Searching syntax trees
  • Interrogating each subcorpus
  • Visualising results
  • Viewing and editing results

We'll leave some time at the end of this session for exploring the Fraser Corpus, and for discussing what we found.

So, we've learned some great skills! Now we need to know how to put these skills into practice within our own work. In this final session, we discuss:

  • What kind(s) of data we're all working with
  • Storing your data and results
  • Using what you've learned here
  • Developing your skills further
  • Summarising and saying goodbye
