Giter Club home page Giter Club logo

gufi's Introduction

This file is part of GUFI, which is part of MarFS, which is released
under the BSD license.


Copyright (c) 2017, Los Alamos National Security (LANS), LLC
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

-----
NOTE:
-----

GUFI uses the C-Thread-Pool library.  The original version, written by
Johan Hanssen Seferidis, is found at
https://github.com/Pithikos/C-Thread-Pool/blob/master/LICENSE, and is
released under the MIT License.  LANS, LLC added functionality to the
original work.  The original work, plus LANS, LLC added functionality is
found at https://github.com/jti-lanl/C-Thread-Pool, also under the MIT
License.  The MIT License can be found at
https://opensource.org/licenses/MIT.


From Los Alamos National Security, LLC:
LA-CC-15-039

Copyright (c) 2017, Los Alamos National Security, LLC All rights reserved.
Copyright 2017. Los Alamos National Security, LLC. This software was produced
under U.S. Government contract DE-AC52-06NA25396 for Los Alamos National
Laboratory (LANL), which is operated by Los Alamos National Security, LLC for
the U.S. Department of Energy. The U.S. Government has rights to use,
reproduce, and distribute this software.  NEITHER THE GOVERNMENT NOR LOS
ALAMOS NATIONAL SECURITY, LLC MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR
ASSUMES ANY LIABILITY FOR THE USE OF THIS SOFTWARE.  If software is
modified to produce derivative works, such modified software should be
clearly marked, so as not to confuse it with the version available from
LANL.

THIS SOFTWARE IS PROVIDED BY LOS ALAMOS NATIONAL SECURITY, LLC AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL LOS ALAMOS NATIONAL SECURITY, LLC OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.








--- development

Initial development was done at LANL via an internal git-hosting service.
As of rev 0.1.0, we are moving GUFI to github.  We invite anyone to
contribute suggestions, new features, bug-reports, bug-fixes,
feature-requests, etc, through github:

    https://github.com/mar-file-system/GUFI

We will attempt to respond to requests, but we can't make promises about
the level of resources that will be dedicated to GUFI maintenace.

All bug-reports, issues, requests, etc, should go through github.
For other conversation about GUFI (and MarFS), there is:

    [email protected]

See other components of MarFS at:

    https://github.com/mar-file-system.




--- requirements

sqlite3 for all
mysql   for bfm*
db2     for bfd*



--- builds ok on OSX and Linux

The build seems to be working on OSX (10.12.6), and Linux (CentOS 7).



--- directory structure

/ .c .h makefile, README, Notes, TBD for only the currently supported things
/misc  all old stuff/unsupported stuff
/test a test directory with all the current functional tests and a testdir input tree etc.
/test/old all old tests
/scripts  handy scripts for doing various things
/C-Thread-Pool thread pool package



--- build

  # Only needed once  (C-Thread-Pool is a git submodule)
  git submodule init
  git submodule update

  # suggested:
  make clean

  # build (into local dir)
  make        -> libgufi.a, bfwi, bfti, bfq, querydb, querydbn, make_testdirs

  # if desired, do any/all of these:
  make mysql  -> bfmi
  make dfw    -> dfw
  make tools  -> querydb, querydbn, make_testdirs

  # for debugging:
  make clean
  make <some_target> DEBUG=1

  # to reset the 'test' directory (e.g. after testing):
  make clean_test



--- run

  # example_run builds the GUFI software, then generates a dummy
  # directory-tree in test/in, extracts the corresponding GUFI-tree, and
  # runs a simple query.

  ./example_run


The following are detailed descriptions of the individual tools:

...........................................................................


bfwi - breadth first walk of input tree to list the tree, or create a GUFI index-tree


Usage: bfwi [options] input_dir to_dir
options:
  -h              help
  -H              show assigned input values (debugging)
  -p              print files as they are encountered
  -n <threads>    number of threads
  -d <delim>      delimiter (one char)
  -x              pull xattrs from source file-sys into GUFI
  -P              print directories as they are encountered
  -b              build GUFI index tree
  -o <out_fname>  output file (one-per-thread, with thread-id suffix)

future options:
  -U              create by user summary per directory
  -G              create by group summary per directory


Flow:
input directory is put on a queue
output file(s) are opened one per thread
threads are started
loop assigning work (directories) from queue to threads
each thread lists the directory readdir/stat and xattr if called for
  if directory put it on the queue and duplicate the directory if making a gufi
  if link or file print it to screen or out file 
      and build an entries table with entries and keep a sum for the directory
  close directory
  write directory summary table
end
close output files if needed
you can end up with an output file per thread



Location of GUFI-tree:

    bfwi will re-create the dir-path of <input_dir> underneath <to_dir>.
    For example, if <input_dir> is /a/b/c and <to_dir> is /q/r/s, the GUFI-tree will
    be created at /q/r/s/a/b/c.

    This would be the path to provide to other commands that take a
    GUFI-tree as input.  We have debated whether the GUFI-tree should just
    be build at /q/r/s/c.  That may come in a future release.  We're open
    to discussion.


...........................................................................

bfmi - walk robinhood mysql db and list tree and/or create output gufi tree

# Usage: bfmi [options] robin_in
# options:
#   -h              help
#   -H              show assigned input values (debugging)
#   -p              print file-names
#   -n <threads>    number of threads
#   -d <delim>      delimiter (one char)  [use 'x' for 0x1E]
#   -x              pull xattrs from source file-sys into GUFI
#   -P              print directories as they are encountered
#   -b              build GUFI index tree
#   -t <to_dir>     dir GUFI-tree should be built
#   -o <out_fname>  output file (one-per-thread, with thread-id suffix)
# 
# robin_in          file containing custom RobinHood parameters
#                      example contents:
#                      /path  - top dir-path (RH doesnt have a name for the root)
#                      0x200004284:0x11:0x0 - fid of the root
#                      20004284110          - inode of root
#                      16877                - mode of root
#                      1500000000           - atime=mtime=ctime of root
#                      localhost            - host of mysql
#                      ggrider              - user of mysql
#                      mypassword           - password for user of mysql
#                      institutes           - name of db of mysql
# 
# future options:
#   -U              create by user summary record
#   -G              create by group summary record
# 


Flow:
open robinhood input file that has how to communicate with mysql and info
 about root directory
root directory is put on a queue
output file(s) are opened one per thread
mysql connections are made, on for each thread
threads are started
loop assigning work (directories) from queue to threads
each thread processes a directory by querying all records with parentid=id 
 for that directory
  if directory put it on the queue and duplicate the directory if making a gufi
  if link or file print it to screen or out file 
      and build an entries table with entries and keep a sum for the directory
  close directory
  write directory summary table
end
close output files if needed
close mysql connections
you can end up with an output file per thread



...........................................................................

bfti - walks breadth first below the input directory path and summarizes
    all directories below it into a tree summary table record by reading
    all the directory summaries in the tree below

# Usage: bfti [options] GUFI_tree
# options:
#   -h              help
#   -H              show assigned input values (debugging)
#   -P              print directories as they are encountered
#   -n <threads>    number of threads
#   -s              generate tree-summary table (in top-level DB)
# 
# GUFI_tree         path to GUFI tree-dir
# 
# future options:
#   -U              create by user summary record
#   -G              create by group summary record
# 


Flow:
input directory is put on a queue
threads are started
loop assigning work (directories) from queue to threads
each thread reads the directory and the  summary table for each the dir 
  if directory put it on the queue 
  accumulate each directory summary into a global summary for all directories below the input directory
  close directory
end
open/create and write tree summary record into treesummary table that summarizes all the directories below it


NOTE: The input <GUFI_tree> should've already been created via 'bfwi'.
    See the note under bfwi, regarding location of created GUFI-trees.


...........................................................................

bfq - breadth first walk of a GUFI tree.  We optionally perform queries
    against the tree-summary, the directory-summary, and/or the directory
    contents tables, per directory encountered (obeys POSIX permissions for
    access to directory info).  You supply your own SQL statements for
    tree, summary, and entries, and select AND/OR logic between
    tree/directory/entry queries.  The traversal can write its output to
    stdout, one file per thread, or one database output per thread. SQL
    init allows you to run an SQL statement per thread associated to the
    output db before starting the walk, and an SQL statement per thread on
    the output db after the walk.


# Usage: bfq [options] GUFI_tree
# options:
#   -h              help
#   -H              show assigned input values (debugging)
#   -T <SQL_tsum>   SQL for tree-summary table
#   -S <SQL_sum>    SQL for summary table
#   -E <SQL_ent>    SQL for entries table
#   -P              print directories as they are encountered
#   -a              AND/OR (SQL query combination)
#   -p              print file-names
#   -n <threads>    number of threads
#   -o <out_fname>  output file (one-per-thread, with thread-id suffix)
#   -d <delim>      delimiter (one char)  [use 'x' for 0x1E]
#   -O <out_DB>     output DB
#   -I <SQL_init>   SQL init
#   -F <SQL_fin>    SQL cleanup
# 
# GUFI_tree         find GUFI index-tree here
# 


Flow:
input directory is put on a queue
output file(s) are opened one per thread if needed
output dbs are opened if needed one per thread
if init SQL provided run once per thread
threads are started
loop assigning work (directories) from queue to threads
each thread lists the directory and queries the gufi tables for that directory 
  if directory put it on the queue 
  if treesql input run query on treesummary table
  and/or applied on whether to continue
  if dirsql input run query on summary table - if printdir - print, if output to db do that
  and/or applied on whether to continue
  if entsql input run query on entries table - if print - print, if output to db to that
  close directory
end
close output files if needed
if fin SQL provided run that per thread
close outputdb
you can end up with an output file per thread and/or a dbfile per thread



...........................................................................


dfw - depth first walk of tree using readdir, or readdir/stat, or
      readdir/xattr or readdir/stat/xattr based on passed-in flags,
      just writes output to stdout

[Not recently maintained.]

# Usage: dfw inputpath statflag(0/1) xattrflag(0/1)


Flow:
single threaded depth first walk, using readdir+, or stat depending on input parms and optionally get xattrs, write output to stdout



=== scripts

Some useful scripts
[Not recently maintained.]


deluidgidsummaryrecs - delete uid and gid summary records in summary tables.

gengidsummaryavoidentriesscan - generate gid summary records in summary
    tables using shortcut if all files in directory are in same group

genuidsummaryavoidentriesscan - generate uid summary records in summary
    tables using shortcut if all files in directory are in same user

generategidsummary - generate gid summary records via entries scan

generateuidsummary - generate uid summary records via entries scan

listschemadb - display schema of a gufi db

listtablesdb - display tables in a gufi db

groupfilespacehogusesummary - generate gid summary records via summary and entries

userfilespacehogusesummary - generiate uid summary records via summary and entries

groupfilespacehog - do parallel query to create output dbs and list group file space hogs

userfilespacehog - do parallel query to create output dbs and list user file space hog

oldbigfiles - do parallel query to craete output dbs and list old/big files



=== SQL info

-- whenever you can, provide SQL statements to programs like bfq and querydb.
   the following functions are not normal SQL functions but can be used

path() - path you are working in

epath() - last dirname in the path you are working in

uidtouser(uid) - converts uid to username - GUFI stores in uid

gidtogroup(gid) - coinverts gid to groupname - GUFI stores gid



-- a useful built-in sqlite function is

datetime(mtime,'unixepoch') 

GUFI stores its date/time in unixepoch form and datetime() will return a
human readable date/time - of course there are many formats of date/time
output that can be used

When providing SQL statements to bfq and querydb you can put more than one
SQL statement in the same string using semicolons at the end of each
statement, however the only SQL statement that will have output displayed
if you have chosen to display output is the last SQL statement in the
string



=== schemas

-- This is the schema for the entries table (one per dir) one record for each file/link

NOTE: We don't yet support incremental additions to a GUFI index tree.
    Rerunning bfwi on top of an existing tree would not remove entries that
    had been deleted, etc.  In order to prevent loss of old data by an
    ill-advised rerun of bfwi, we suppress the table-drop from the
    following table-create SQL.  The result is that rerunning bfwi on top
    of an existing tree will exit with an error.

char *esql = // "DROP TABLE IF EXISTS entries;"
             "CREATE TABLE entries(name TEXT PRIMARY KEY, type TEXT, inode INT64, mode INT64, nlink INT64, uid INT64, gid INT64, size INT64, blksize INT64, blocks INT64, atime INT64, mtime INT64, ctime INT64, linkname TEXT, xattrs TEXT, crtime INT64, ossint1 INT64, ossint2 INT64, ossint3 INT64, ossint4 INT64, osstext1 TEXT, osstext2 TEXT);";


-- This is the schema for the directory/user/group summary records (summary per directory) the ossint* fields are for use by non posix storage systems

NOTE: Tools other than bfwi are reading from entries in DBs in an existing
    GUFI tree and generating summary data.  Unlike the <esql> query above,
    performaing these operations against an existing GUFI tree should
    simply recreate the original summary data.  Therefore, we don't
    suppress the table-drop, or view-drop.

char *ssql = "DROP TABLE IF EXISTS summary;"
             "CREATE TABLE summary(name TEXT PRIMARY KEY, type TEXT, inode INT64, mode INT64, nlink INT64, uid INT64, gid INT64, size INT64, blksize INT64, blocks INT64, atime INT64, mtime INT64, ctime INT64, linkname TEXT, xattrs TEXT, totfiles INT64, totlinks INT64, minuid INT64, maxuid INT64, mingid INT64, maxgid INT64, minsize INT64, maxsize INT64, totltk INT64, totmtk INT64, totltm INT64, totmtm INT64, totmtg INT64, totmtt INT64, totsize INT64, minctime INT64, maxctime INT64, minmtime INT64, maxmtime INT64, minatime INT64, maxatime INT64, minblocks INT64, maxblocks INT64, totxattr INT64,depth INT64, mincrtime INT64, maxcrtime INT64, minossint1 INT64, maxossint1 INT64, totossint1 INT64, minossint2 INT64, maxossint2 INT64, totossint2 INT64, minossint3 INT64, maxossint3 INT64, totossint3 INT64,minossint4 INT64, maxossint4 INT64, totossint4 INT64, rectype INT64, pinode INT64);";

The directory summary table can have several records
1 rectype=0 is the overall summary for the directory
N rectype=1 is the summary per user
M rectype=2 is the summary per group
there are views below that assist with using this multifunction table


-- This is the schema for the tree directory/user/group summary records (summary representing tree below) the ossint* fields are for use by non posix storage systems

char *tsql = "DROP TABLE IF EXISTS treesummary;"
             "CREATE TABLE treesummary(totsubdirs INT64, maxsubdirfiles INT64, maxsubdirlinks INT64, maxsubdirsize INT64, totfiles INT64, totlinks INT64, minuid INT64, maxuid INT64, mingid INT64, maxgid INT64, minsize INT64, maxsize INT64, totltk INT64, totmtk INT64, totltm INT64, totmtm INT64, totmtg INT64, totmtt INT64, totsize INT64, minctime INT64, maxctime INT64, minmtime INT64, maxmtime INT64, minatime INT64, maxatime INT64, minblocks INT64, maxblocks INT64, totxattr INT64,depth INT64, mincrtime INT64, maxcrtime INT64, minossint1 INT64, maxossint1 INT64, totossint1 INT64, minossint2 INT64, maxossint2 INT64, totossint2 INT64, minossint3 INT64, maxossint3 INT64, totossint3 INT64, minossint4 INT64, maxossint4 INT64, totossint4 INT64,rectype INT64, uid INT64, gid INT64);";

The tree summary table can have several records
1 rectype=0 is the overall summary for the directories
N rectype=1 is the summary per user
M rectype=2 is the summary per group
there are views below that assist with using this multifunction table


-- this is a vew that gets parent inode for entries table queries.  pinode is NOT on each entries table record because that would be a rename nightmare

char *vesql = "DROP VIEW IF EXISTS pentries;"
              "create view pentries as select entries.*, summary.inode as pinode from entries, summary where rectype=0;";

-- this is the directory summary table view for overall directory summary
char *vssqldir = "DROP VIEW IF EXISTS vsummarydir;"
                 "create view vsummarydir as select * from summary where rectype=0;";

-- this is the directory summary table view for per user directory summary
char *vssqluser = "DROP VIEW IF EXISTS vsummaryuser;"
                  "create view vsummaryuser as select * from summary where rectype=1;";

-- this is the directory summary table view for per group directory summary
char *vssqlgroup = "DROP VIEW IF EXISTS vsummarygroup;"
                   "create view vsummarygroup as select * from summary where rectype=2;";

-- this is the tree summary table view for overall directory summary
char *vtssqldir = "DROP VIEW IF EXISTS vtsummarydir;"
                  "create view vtsummarydir as select * from treesummary where rectype=0;";

-- this is the tree summary table view for per user directory summary
char *vtssqluser = "DROP VIEW IF EXISTS vtsummaryuser;"
                   "create view vtsummaryuser as select * from treesummary where rectype=1;";

-- this is the tree summary table view for per group directory summary
char *vtssqlgroup = "DROP VIEW IF EXISTS vtsummarygroup;"
                    "create view vtsummarygroup as select * from treesummary where rectype=2;";



--- tools

make_testdirs

# Usage: make_testdirs [ options ] dirname
# options:
#   -d <n_dirs>    number of subdirs
#   -f <n_files>   number of files per subdir
#   -h             help



querydb - run an SQL query against a table in a GUFI db - just point it at
    the directory containing the db

# Usage: querydb [options] [-s] DB_path SQL
# options:
#   -h              help
#   -H              show assigned input values (debugging)
#   -N              print column-names (header) for DB results
#   -V              print column-values (rows) for DB results
# 
# DB_path           path to dir containinng db.db.*
# SQL               arbitrary SQL on DB




querydbn - run an SQL query against an output db from bfq, it can be one
    per thread, so provide prefix of the dbname, the program will creat a
    union of the tables from each of the db files and it will get a view
    name adding a v to the beginning of your table name

# Usage: querydbn [options] [-s] DB_path DB_count SQL tabname
# options:
#   -h              help
#   -H              show assigned input values (debugging)
#   -N              print column-names (header) for DB results
#   -V              print column-values (rows) for DB results
#   -p              print file-names
# 
#   -s              dir-summary (currently-unused internal functionality)
# 
# DB_path           path to dir containinng db.db.*
# DB_count          number of DBs (should match thread-count used in 'bfq')
# SQL               arbitrary SQL on each DB (unified into single view)
# table_name        name of view table = 'v<table_name>'



--- testtools

runbfmi - run bfmi to read from a robinhood mysql db and list and/or create
    a gufi tree - requires an input file on how to talk to mysql


-- The following functional tests all work together, create a gufi index
   from input testdir into testdirdup and run queries/etc.

runbfwi - run bfwi and create gufi testdirdup from input testdir
runbfti - run bfti and create a tree index at the top of testdirdup
runbfq - run various queries on testdirdup/testdir gufi tree including output files and output dbs
runquerydb - do various queries on one of the gufi dbs in testdirdup/testdir  
runquerydbn - do various queries on outdb.* output dbs using the union capability of querydbn

rundfw - do various walks on the testdir tree
runlistschemadb - list the schema in a gufi db
runlisttablesdb - list the tables in a gufi db
rungroupfilespacehogusesummary - generate gid summary records via summary and entries
runuserfilespacehogusesummary - generiate uid summary records via summary and entries
rungroupfilespacehog - do parallel query to create output dbs and list group file space hogs
runuserfilespacehog - do parallel query to create output dbs and list user file space hog
runoldbigfiles - do parallel query to craete output dbs and list old/big files

gufi's People

Contributors

jti-lanl avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.