igp_2023's People

Contributors

gromule, justs-viduss, omegatro

igp_2023's Issues

Project meeting 4

Agenda

  • Module progress report
  • Meeting with the project coordinator: 21, 23, 24 November (email the coordinator) - @Gromule

Project meeting 1

Agenda

  • Introduction to GitHub
  • Introduction to Zotero
  • Initial literature review and project proposal

Preemptive meeting

Agenda

  • Project choice
  • Using Git/GitHub as a collaboration platform
  • Initial data review
  • Potential downsampling
  • List of useful notebooks and other shared resources

Establish uniform structure of the project code

To keep the code well organized, it is useful to agree on a template structure. This will make the code easier to manage. Specifically, having a dedicated place where each kind of functional unit (variables, functions, and classes) is expected to live should reduce the time needed to find and fix bugs.

An initial structure proposal will be placed in the project_structure branch of the repository.

Project meeting 9

Agenda

  • Finalize LightFM
  • Finalize Implicit
  • Finalize project portfolio

Git setup

Ensure that team members can access GitHub and use Git to collaborate on the project.

Project meeting 3

Agenda

  • Finalizing project proposal
  • Finalizing literature review
  • Selecting libraries to study

Sharing literature through Zotero

Zotero is bibliography manager software that lets a group of people share a common library of publications.

To add resources to the library and use Zotero to create citations, please follow the guide here:

Project meeting 2

Agenda

  • Project proposal progress report
  • Literature review progress report

Develop methods to work with batches of data, without loading the entire dataset into RAM

Working with a dataset this large requires a special approach to cope with the limited memory of a local machine. One option is batch processing: load a fraction of the full dataset, run the required operations, and store the results in a file that downstream steps can use later without repeating the upstream workflow. Different data types will require different methods.

Methods

  • load_batch - load a batch of <= N records/texts/images from the dataset into a Python object (e.g. a DataFrame, NumPy array, or list)
  • serialize_batch - convert data from the format produced by a processing step into a format that can be saved to a file
  • deserialize_batch - convert serialized data from a file into a Python data structure that can be passed to a downstream task
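
A minimal sketch of the three methods listed above, using only the Python standard library. The signatures, the choice of pickle as the file format, and the generator-based design of load_batch are assumptions made for illustration; the issue does not specify them, and a real implementation would likely differ per data type (for example, Parquet for tabular data).

```python
import pickle
from itertools import islice
from pathlib import Path
from typing import Any, Iterable, Iterator, List


def load_batch(records: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield lists of at most `batch_size` records.

    `records` can be any lazy iterable (an open file, a database cursor),
    so the full dataset is never held in memory at once.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def serialize_batch(batch: List[Any], path: Path) -> None:
    """Write one batch to `path` (pickle is an assumed format choice)."""
    with open(path, "wb") as handle:
        pickle.dump(batch, handle, protocol=pickle.HIGHEST_PROTOCOL)


def deserialize_batch(path: Path) -> List[Any]:
    """Read a batch previously written by `serialize_batch`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Because load_batch is a generator, only one batch is alive at a time as long as the caller does not collect the batches into a list. Pickle files should only be loaded from trusted sources, which is one reason a team might prefer a different format for shared batches.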

Expected effect on the workflow

  • Each step of the workflow (preprocessing, feature engineering, model training, and evaluation) will operate on batches
  • Batches can be passed directly from the previous step (e.g. load_batch > preprocess > extract_features > train_model)
  • Batches can be passed from files that were pre-generated by previous steps (e.g. run the entire dataset through preprocessing and feature extraction, save the results to files, then train the model by passing batches of pre-processed data from those files)
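
The two ways of passing batches described above can be sketched as follows. This is an illustration only: preprocess and extract_features are toy placeholders, and the helper names (iter_batches, run_direct, run_to_files, read_from_files) and the JSON file layout are hypothetical, not part of the project code.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` items from a lazy iterable."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def preprocess(batch: List[str]) -> List[str]:
    # Placeholder step: strip whitespace and lowercase each text.
    return [text.strip().lower() for text in batch]


def extract_features(batch: List[str]) -> List[int]:
    # Placeholder step: use text length as a single toy feature.
    return [len(text) for text in batch]


def run_direct(texts: Iterable[str], batch_size: int) -> Iterator[List[int]]:
    """Pass each batch straight from one step to the next, in memory."""
    for batch in iter_batches(texts, batch_size):
        yield extract_features(preprocess(batch))


def run_to_files(texts: Iterable[str], batch_size: int, out_dir: Path) -> List[Path]:
    """Run the upstream steps once and save each batch of features to a file."""
    paths = []
    for index, features in enumerate(run_direct(texts, batch_size)):
        path = out_dir / f"features_{index:05d}.json"
        path.write_text(json.dumps(features))
        paths.append(path)
    return paths


def read_from_files(paths: Iterable[Path]) -> Iterator[List[int]]:
    """Feed pre-generated batches to a downstream step such as training."""
    for path in paths:
        yield json.loads(path.read_text())
```

A downstream step written against an iterator of batches does not need to know which mode produced them, so the same training loop can consume either run_direct(...) or read_from_files(...).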

Pros (apart from RAM benefit)

  • Modularity - each step of the workflow can be executed and tested without the need to rerun previous steps
  • Batches of test data are easier to share when collaborating
  • Easier to identify specific test cases, e.g. if a model has trouble predicting values in a specific batch
  • Easier train-test split using different batches
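
As one possible illustration of the last point, whole batches can be assigned to the training or test set by identifier, so no individual records need to be loaded to perform the split. The function name split_batches, the seeded shuffle, and the batch identifiers are assumptions for this sketch.

```python
import random
from typing import List, Sequence, Tuple


def split_batches(
    batch_ids: Sequence[str], test_fraction: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Assign whole batches to train or test, reproducibly.

    Returns (train_ids, test_ids). At least one batch always goes to test.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(batch_ids)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Splitting at batch level is only a fair split if records were distributed across batches without a systematic order (for example, not sorted by user or date); otherwise the test batches may not be representative.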

Cons

  • Takes time to develop
  • Takes additional space to store intermediate processing results
  • If a preprocessing step changes, the saved results of that step and of all downstream steps need to be regenerated
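
One way to soften the last drawback is to make stale results detectable. The sketch below is an assumed approach, not something the issue prescribes: each saved batch file carries a short fingerprint of the step's name, a manually maintained version string, and its parameters, so a changed step simply misses the cache and recomputes. All names here (step_fingerprint, cached_path, load_or_compute) are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Dict


def step_fingerprint(step_name: str, version: str, params: Dict[str, Any]) -> str:
    """Return a short, stable hash identifying one configuration of a step."""
    payload = json.dumps(
        {"step": step_name, "version": version, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def cached_path(cache_dir: Path, batch_index: int, fingerprint: str) -> Path:
    """Build the file name for one batch produced by one step configuration."""
    return cache_dir / f"batch_{batch_index:05d}_{fingerprint}.json"


def load_or_compute(
    cache_dir: Path,
    batch_index: int,
    fingerprint: str,
    compute: Callable[[], Any],
) -> Any:
    """Reuse a saved result if the fingerprint matches, otherwise recompute."""
    path = cached_path(cache_dir, batch_index, fingerprint)
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.write_text(json.dumps(result))
    return result
```

To propagate a change downstream, a later step could include the upstream fingerprint in its own params, so that changing preprocessing also changes the fingerprint of feature extraction. Old files are not deleted by this sketch, so the extra storage cost remains and needs occasional cleanup.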

Review the list of useful notebooks
