liuliu / s4nnc

Swift for NNC

Home Page: https://libnnc.org

License: BSD 3-Clause "New" or "Revised" License

Starlark 0.82% Swift 98.09% Shell 0.48% Python 0.54% C 0.02% Objective-C 0.05%

s4nnc's Introduction

Swift for NNC

s4nnc is a Swift interface for the libnnc library.

From the very start, libnnc was meant to be a common runtime that supports many language bindings. It became apparent during development that, for deep learning, a raw C interface would be unwieldy and complex to use in practice. For example, the training loop of a transformer model for sentiment analysis takes more than 400 lines of code: https://github.com/liuliu/ccv/blob/unstable/bin/nnc/imdb.c#L268

A high-level language that delegates most of the work to the C runtime seems to be a win in brevity and usefulness. The same training loop of a transformer model in Swift takes fewer than 100 lines: https://github.com/liuliu/s4nnc/blob/main/examples/imdb/main.swift#L197

Because the heavy lifting is done in the libnnc library, the Swift portion can be light and even automatically generated. At the moment, we have about 3,000 lines of Swift code to run quite a few models on GPU, complete with data feeders, model specifications and optimizers.

Live Documentation

https://liuliu.github.io/s4nnc/documentation/nnc

Runtime API

Currently, s4nnc works best under Linux with CUDA 11, cuDNN and NCCL. The s4nnc API wraps the Level-4 and Level-5 C APIs.

Tensor

public struct Tensor<Element> {
  init(_ kind: DeviceKind, _ shapeFormat: TensorShapeFormat)
  init<S: Sequence>(_ sequence: S, _ kind: DeviceKind, _ shapeFormat: TensorShapeFormat) where S.Element == Element
}

The first initializer creates a raw tensor with the given dimensions that resides on either CPU or GPU. Alternatively, you can initialize a tensor from a native Swift array. Basic usage looks like this:

var tensor = Tensor<Float>(.CPU, .HWC(1, 1, 2))
tensor[0, 0, 0] = 1
tensor[0, 0, 1] = 2

Raw tensors offer very limited functionality. Mostly, you can only call reshaped or toGPU / toCPU on them.

DynamicGraph

DynamicGraph is where you associate most computations with tensors. The DynamicGraph operates on tensor variables / constants, not the raw tensors. Initializing a tensor variable / constant is very similar to initializing a raw tensor:

let graph = DynamicGraph()
let variable: DynamicGraph.Tensor<Float> = graph.variable(.CPU, .HWC(1, 1, 2))

A tensor variable can participate in computations, for example:

let x: DynamicGraph.Tensor<Float> = graph.variable(.CPU, .C(1))
let y: DynamicGraph.Tensor<Float> = graph.variable(.CPU, .C(1))
x[0] = 2
y[0] = -1
let z = x .* y
print(z[0])

Because these are tensor variables, you can also do automatic differentiation:

x.requiresGrad = true
z.backward(to: [x])
print(DynamicGraph.Tensor<Float>(x.grad!)[0])

In the code above, requiresGrad merely indicates that the grad property of x needs to be populated. Unlike in PyTorch, it carries no other significance.
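As a sanity check on what the gradient should be: for z = x * y, the derivative dz/dx equals y, so with y = -1 the value read from x.grad is expected to be -1. The following is a minimal plain-Swift sketch that confirms this with a central finite difference. It does not use s4nnc at all, and the function names product and numericalGradX are hypothetical helpers introduced only for illustration.

```swift
// Plain Swift, no NNC dependency. Hypothetical helpers for illustration only.

// The same scalar computation as z = x .* y in the example above.
func product(_ x: Float, _ y: Float) -> Float {
  return x * y
}

// Central finite-difference approximation of dz/dx at (x, y).
func numericalGradX(_ x: Float, _ y: Float, eps: Float = 1e-2) -> Float {
  return (product(x + eps, y) - product(x - eps, y)) / (2 * eps)
}

// Analytic gradient: dz/dx = y.
let analyticGrad: Float = -1
let numericGrad = numericalGradX(2, -1)
print("analytic:", analyticGrad, "numeric:", numericGrad)
```

The two values should agree up to floating-point error, which is a cheap way to cross-check gradients when debugging.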

Memory management for tensor variables is automatic. Once a tensor variable is no longer referenced (that is, no pending automatic differentiation requires its participation), its memory is freed. Hence, unlike in PyTorch, you don't need to worry about no_grad annotations most of the time.

Model and Optimizer

Computations on DynamicGraph with tensor variables are stateless. s4nnc also provides a stateful Model that contains trainable parameters. You can use Model to construct complex computation units and train them.

func TwoLayerLinearModel() -> Model {
  let x = Input()
  let y = Dense(count: 2)(x)
  let z = Dense(count: 1)(y)
  return Model([x], [z])
}

let twoLayerModel = TwoLayerLinearModel()
let z = twoLayerModel(inputs: x)
print(z)

You can train the model with optimizers.

var sgd = SGDOptimizer(graph, nesterov: false, rate: 0.0001, scale: 1, decay: 0, momentum: 0.9, dampening: 0)
sgd.parameters = [twoLayerModel.parameters]
for _ in 0..<100 {
  let x: DynamicGraph.Tensor<Float> = graph.variable(.CPU, .C(1))
  x[0] = 1
  let target: DynamicGraph.Tensor<Float> = graph.variable(.CPU, .C(1))
  target[0] = 0
  let z = twoLayerModel(inputs: x)
  let binaryLoss = SigmoidBinaryCrossEntropyLoss()
  let loss = binaryLoss(z, target: target)
  loss[0].backward(to: [x])
  sgd.step()
}

Because Model can express complex computations statically, it is recommended to have most of your computations expressed as Model.

ModelBuilder

Sometimes, your Model can change its shape based on the inputs. ModelBuilder can take the inputs and generate an appropriate model. However, the generated models need to match on parameters. For example, if the text inputs to your transformer model vary in length, ModelBuilder can be helpful.

DataFrame

DataFrame provides an easy way to construct a data feeder for your computation. The data feeder is memory and computation efficient, meaning that for each column you try to pull, only that column will be materialized. Hence, if you loop through a list of file names and declare a column to be the loaded images, only one image is loaded at a time when you loop through the DataFrame.

let df = DataFrame(from: [filename1, filename2, filename3])
df["image"] = df["0"]!.toLoadImage()
for tensor in df["image", Tensor<UInt8>.self] {
  print(tensor)
}

We only load one image at a time, and the previous image is freed as soon as the next image is pulled in.
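The on-demand behavior described above is similar in spirit to Swift's own lazy sequences. The sketch below is only an analogy in plain Swift and says nothing about how DataFrame is implemented; loadImage is a hypothetical stand-in that counts how many times it is called instead of decoding a real file.

```swift
// Plain Swift analogy, no NNC dependency.
// loadImage is a hypothetical stand-in for an expensive image load.
var loadCount = 0
func loadImage(_ filename: String) -> [UInt8] {
  loadCount += 1
  return Array(filename.utf8)  // Pretend these bytes are pixel data.
}

let filenames = ["a.png", "b.png", "c.png"]

// Declaring the derived column does no work yet.
let images = filenames.lazy.map(loadImage)
let loadsBeforeIteration = loadCount

// Each element is produced only when the loop asks for it.
for image in images {
  print("loaded \(image.count) bytes")
}
let loadsAfterIteration = loadCount
print(loadsBeforeIteration, loadsAfterIteration)
```

No load happens when the lazy column is declared, and exactly one load happens per iteration step, which mirrors the one-image-at-a-time behavior described for DataFrame.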

The DataFrame object also provides basic functionality for loading from a CSV file. The CSV reader is considered to be the fastest multi-core reader at the moment.

StreamContext

Unlike PyTorch, s4nnc doesn't implicitly associate computations with an asynchronous stream when executing on GPU. To leverage an asynchronous stream to improve computation efficiency, you can associate a StreamContext explicitly.

let computeStream = StreamContext(.GPU(0))
var z: DynamicGraph.Tensor<Float>? = nil
graph.withStream(computeStream) {
  let x: DynamicGraph.Tensor<Float> = graph.variable(.CPU, .C(1))
  let y: DynamicGraph.Tensor<Float> = graph.variable(.CPU, .C(1))
  x[0] = 2
  y[0] = -1
  z = x .+ y
}
computeStream.joined()
print(z![0]) // The result is only available after joining the computeStream

Storing Models and Tensors

s4nnc provides a simple SQLite-based data store. It is a key-value store for tensors, tensor variables and models. You can use it like this:

graph.openStore("filePath") { store in
  let aTensor = store.read("a")
  store.write("b", variable: z)
  store.write("2layer", model: twoLayerModel)
}

Group

Multiple tensor variables can be grouped together for computations.

let xGroup = DynamicGraph.Group(x0, x1)
let yGroup = DynamicGraph.Group(y0, y1)
let zGroup = xGroup .* yGroup

This is useful because when the tensor variables reside on different GPUs, the computations can run simultaneously. With Model and Optimizer, it is a transparent way to apply data parallelism to speed up your training loop.

Example

Below is the training loop for a sentiment analysis transformer model with s4nnc. It trains the model on multiple GPUs. You can find comparable PyTorch code in Transformers from Scratch. You can find the rest of the code at https://github.com/liuliu/s4nnc/blob/main/examples/imdb/main.swift.

Setup the Data Feeder Pipeline

var trainData = dataFromDisk(filePath: trainListFile)
// Extract tensors from ImdbText struct.
trainData["tensor"] = trainData["main", ImdbText.self].map(\.tensor)
trainData["mask"] = trainData["main", ImdbText.self].map(\.mask)
trainData["c"] = trainData["main", ImdbText.self].map(\.c)
// Create one hot tensor out of the scalar.
trainData["oneHot"] = trainData["c", Int.self].toOneHot(Float32.self, count: 2)

let deviceCount = DeviceKind.GPUs.count

// Batching tensors together. 
var batchedTrainData = trainData["tensor", "mask", "oneHot"].combine(size: batchSize, repeating: deviceCount)
for i in 0..<deviceCount {
  batchedTrainData["truncTensor_\(i)"] = batchedTrainData["tensor_\(i)"]!.toTruncate(batchedTrainData["mask_\(i)"]!)
  batchedTrainData["squaredMask_\(i)"] = batchedTrainData["mask_\(i)"]!.toOneSquared(maxLength: maxLength)
  // Move the tensors from CPU to GPU.
  let toGPUTrain = batchedTrainData["truncTensor_\(i)", "oneHot_\(i)", "squaredMask_\(i)"].toGPU(i)
  batchedTrainData["tensorGPU_\(i)"] = toGPUTrain["truncTensor_\(i)"]
  batchedTrainData["oneHotGPU_\(i)"] = toGPUTrain["oneHot_\(i)"]
  batchedTrainData["squaredMaskGPU_\(i)"] = toGPUTrain["squaredMask_\(i)"]
}
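For readers unfamiliar with the one-hot step in the pipeline above, here is a minimal plain-Swift sketch of the encoding, assuming the conventional definition (1 at the label's index, 0 elsewhere). The oneHot function is a hypothetical helper written for illustration and is not the s4nnc toOneHot implementation.

```swift
// Plain Swift, no NNC dependency. Hypothetical helper for illustration only.

// Encodes a class label as a vector with 1 at the label's index and 0 elsewhere.
func oneHot(_ label: Int, count: Int) -> [Float32] {
  precondition(label >= 0 && label < count, "label out of range")
  var encoded = [Float32](repeating: 0, count: count)
  encoded[label] = 1
  return encoded
}

print(oneHot(0, count: 2))  // [1.0, 0.0]
print(oneHot(1, count: 2))  // [0.0, 1.0]
```

With two classes, comparing element 1 against element 0 recovers the original label, which is the comparison the training loop uses to compute accuracy.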

The Training Loop

let graph = DynamicGraph()

let vocabVec: DynamicGraph.Group<DynamicGraph.Tensor<Float32>> = DynamicGraph.Group((0..<deviceCount).map { graph.variable(.GPU($0), .NC(vocabSize, embeddingSize)) })
let seqVec: DynamicGraph.Group<DynamicGraph.Tensor<Float32>> = DynamicGraph.Group((0..<deviceCount).map { graph.variable(.GPU($0), .NC(maxLength, embeddingSize)) })
vocabVec.rand(-1...1)
seqVec.rand(-1...1)
var adamOptimizer = AdamOptimizer(graph, rate: 0.0001, betas: (0.9, 0.98), decay: 0, epsilon: 1e-9)
adamOptimizer.parameters = [vocabVec, seqVec, transformer.parameters]
var overallAccuracy = 0.0
for epoch in 0..<10 {
  batchedTrainData.shuffle()
  var columns = [String]()
  for i in 0..<deviceCount {
    columns += ["tensorGPU_\(i)", "oneHotGPU_\(i)", "squaredMaskGPU_\(i)"]
  }
  let computeStream = StreamContext(.GPU(0))
  for (i, batch) in batchedTrainData[columns].enumerated() {
    adamOptimizer.rate = 0.0001 * min(Float(adamOptimizer.step - 1) / (10000.0 / Float(batchSize)), 1) * Float(deviceCount)
    let tensorGPU = (0..<deviceCount).map { batch[$0 * 3] as! Tensor<Int32> }
    let oneHotGPU = (0..<deviceCount).map { batch[$0 * 3 + 1] as! Tensor<Float32> }
    let squaredMaskGPU = (0..<deviceCount).map { batch[$0 * 3 + 2] as! Tensor<Int32> }
    let batchLength = tensorGPU[0].dimensions[1]
    let output = graph.withStream(computeStream) { () -> DynamicGraph.Group<DynamicGraph.AnyTensor> in
      let wordIndices = graph.variable(tensorGPU.reshaped(.C(batchSize * batchLength)))
      let wordVec = Functional.indexSelect(input: vocabVec, index: wordIndices)
      var seqIndicesCPU = Tensor<Int32>(.CPU, .C(batchSize * batchLength))
      for i in 0..<batchSize {
        for j in 0..<batchLength {
          seqIndicesCPU[i * batchLength + j] = Int32(j)
        }
      }
      let seqIndicesGPU = (0..<deviceCount).map { seqIndicesCPU.toGPU($0) }
      let seqIndices = graph.constant(seqIndicesGPU)
      let posVec = Functional.indexSelect(input: seqVec, index: seqIndices)
      let selectVec = wordVec + posVec
      let inputVec = selectVec.reshaped(.CHW(batchSize, batchLength, embeddingSize))
      let masked = graph.constant(squaredMaskGPU.reshaped(.CHW(batchSize, batchLength, batchLength)))
      let output = transformer(inputs: inputVec, masked)[0]
      let softmaxLoss = SoftmaxCrossEntropyLoss()
      let target = graph.variable(oneHotGPU)
      let loss = softmaxLoss(output, target: target)
      loss.backward(to: [vocabVec, seqVec])
      adamOptimizer.step()
      return output
    }
    computeStream.joined()
    var correct = 0
    for k in 0..<deviceCount {
      let oneHot = oneHotGPU[k].toCPU()
      let output = DynamicGraph.Tensor<Float32>(output[k]).toCPU()
      for i in 0..<batchSize {
        let truth = oneHot[i, 1] > oneHot[i, 0]
        let prediction = output[i, 1] > output[i, 0]
        if truth == prediction {
          correct += 1
        }
      }
    }
    let accuracy = Double(correct) / Double(batchSize * deviceCount)
    overallAccuracy = overallAccuracy * 0.9 + accuracy * 0.1
    if adamOptimizer.step % 50  == 0 {
      print("epoch \(epoch) (\(i)/\(batchedTrainData.count)), training accuracy \(overallAccuracy)")
    }
  }
}
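The learning-rate expression in the loop above is a linear warmup: the rate grows from 0 over the first 10000 / batchSize steps, then stays at the base rate, and the whole thing is scaled by the number of devices. The sketch below restates that arithmetic as a standalone plain-Swift function. warmupRate is a hypothetical helper, and the batch size and device count used in the usage lines are made-up values, since the excerpt does not show them.

```swift
// Plain Swift, no NNC dependency. Hypothetical helper mirroring the
// rate expression used in the training loop.
func warmupRate(step: Int, batchSize: Int, deviceCount: Int, baseRate: Float = 0.0001) -> Float {
  // Number of optimizer steps over which the rate ramps up linearly.
  let warmupSteps = 10000.0 / Float(batchSize)
  // Fraction of the warmup completed, capped at 1.
  let progress = min(Float(step - 1) / warmupSteps, 1)
  // Scale by the device count, as the loop does for multi-GPU training.
  return baseRate * progress * Float(deviceCount)
}

// Assumed values for illustration: batchSize 100, 2 devices.
let rateAtStart = warmupRate(step: 1, batchSize: 100, deviceCount: 2)
let rateHalfway = warmupRate(step: 51, batchSize: 100, deviceCount: 2)
let rateAfterWarmup = warmupRate(step: 1001, batchSize: 100, deviceCount: 2)
print(rateAtStart, rateHalfway, rateAfterWarmup)
```

Under these assumed values the warmup lasts 100 steps, so the rate is 0 at step 1, half of the scaled base rate at step 51, and constant from step 101 onward.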

s4nnc's Issues

Does ScaledDotProductAttention support backward pass?

How to assign values to a tensor?

Say ten is a Tensor<Float16>. How do I assign some values to it?
I tried

ten.storage[0,0,0,0] = 0.0

I get

a.swift:118:35: error: cannot convert value of type 'Int' to expected argument type 'Element.Type'
                  ten.storage[0,0,0,0] = 0.0
                                  ^
a.swift:118:42: error: cannot assign value of type 'Double' to subscript of type 'AnyTensorStorage'
                  ten.storage[0,0,0,0] = 0.0

Quantizing q8p

I was quantizing weights using :

graph.openStore(
  full_f16_path, flags: .truncateWhenClose
) { store in
  let keys = store.keys
  graph.openStore(
    f8_path,
    flags: .truncateWhenClose
  ) {
    for key in keys {
      guard let tensor = store.read(key) else { continue }

      print("quantizing  \(key) \(tensor)")

      $0.write(key, tensor: tensor, codec: [.q8p ])
    }
  }
}

but it looks like some params with fewer elements are not being quantized, like layers with 320 params.
Did you add a check or something? Where is it?

How to use it in a project without Bazel

Hi,

I want to use s4nnc in a project but I don't want to use Bazel. Is there any way to do that if I am building my project using the swiftc command?
I wanted to build a .dylib file using swiftc -emit-library but the Bazel rules do not support that.

If I can't use swiftc, is there any way that I can use Xcode to build my project which uses s4nnc?

ANE support?

s4nnc seems a bit slower compared to Core ML. Any plans to support ANE?

An easy interface to add custom accelerators / backends?

There should be some easy framework so that I can easily add my ops for a custom accelerator / framework.
I wanted to see if I can easily port it to a proprietary chip aiming to outcompete the M1 GPU.

I really like TF's framework for adding custom backends but it's too big.

Loading the model with quantized weights two times corrupts the model

To reproduce:

Call the load weights function two times and run the model. You get NaNs.
This does not happen with normal fp16/32 weights.


graph.openStore(sdxl_model_path) {
    $0.read("unet", model: unet , codec: [.q6p, .q8p, .jit, .ezm7] )
  }

graph.openStore(sdxl_model_path) {
    $0.read("unet", model: unet , codec: [.q6p, .q8p, .jit, .ezm7] )
  }

How to see the parameters of a model

If I have loaded in a model using DynamicGraph.read(), how can I print the weights of the model for debugging purposes?

For example, I loaded a textModel and when I print

print(textModel.parameters)

I get NNC.Model.Parameters

But is there a way to see the actual tensor values for each layer in the model? or something similar to PyTorch where I can get a dictionary of key: values for each parameter in the model?

Thank you!

How to copy data to a variable from a tensor?

i tried

graph!.variable(Tensor<Float16>(ten))[0..<320, 0..<320, 0..<3, 0..<3] =  graph!.variable(Tensor<Float16>(cached_weights[name]))[0..<320, 0..<320, 0..<3, 0..<3]

both ten and cached_weights[name] are any tensors.

I get this error:
examples/dylib_interface.swift:114:56: error: cannot assign through subscript: function call returns immutable value

How to load weights to a model from a dictionary in memory

I have a dictionary mapping from key -> Data . How can I load weights to the model from the dictionary?
We can assume that we have keys with the same names as in the SQLite datastore, and we also have the dims of each weight.

Is there any snippet where I can set the i'th layer / parameter of a Model given some data in memory?

Metal Flash Attention min OS version?

What is the minimum OS version for Metal Flash Attention?
It does not seem to work with macOS 12 or 11.

How to combine models and save weights of the single combined model

Say i have

enc = Enc()
enc.compile()
enc.load_weights()

dec= Enc()
dec.compile()
dec.load_weights()

inp = Input()
x = enc(inp)
out = dec(x)

auto_enc = Model(inp , out )

auto_enc.save_weights()

is there any way to do this in NNC?

Scaled dot product attention backward does not work with MacOS

snippet:

  let keys = tokeys(c).reshaped(.NHWC(b, t, h, k)).identity()
  var queries = (  toqueries(x)).reshaped(.NHWC(b, hw, h, k)).identity().identity()
  var values = tovalues(c).reshaped(.NHWC(b, t, h, k))
  
  let scaledDotProductAttention = ScaledDotProductAttention(
    scale: 1.0 / Float(k).squareRoot(), multiHeadOutputProjectionFused: true)
  var out = scaledDotProductAttention(queries, keys, values).reshaped([b, hw, h * k])

error when MFA is enabled :

loc("mps_matmul_1"("(mpsFileLoc): ../Files/MPSGraphUtilities.mm":39:0)): error: incompatible dimensions
loc("mps_matmul_1"("(mpsFileLoc): ..Files/MPSGraphUtilities.mm":39:0)): error: invalid shape

error when mfa is disabled

Assertion failed: (isStaticMPSType(type)), function setStaticJITypeForValue, file MPSRuntime_Project.h, line 501.

fwd works fine but in backward i get error

How to finetune a model with some weights frozen and some weights having a different learning rate?

Is there any example?

Critical Error!! , setting a variable does not work

import Foundation
import NNC
import PNG

let graph = DynamicGraph()
let  xIn_temp = graph.variable(.GPU(0), .NC(1,1), of: Float16.self )
xIn_temp.full(1) 
print( xIn_temp[0,0] )

i get all sorts of values when i print this. whaaat?
running with MPS

How to multiply a large tensor with a smaller tensor

I have

a = Input()
b = Input()

c = a * b

After building the model, I pass
a -> .NCHW(2, 3, 4, 5)
b -> .N(1)

I get a shape mismatch error.

Basically I want to build a model which takes a scalar and a tensor as input, multiplies them, and returns a tensor.
How to do that?

SPM support

It would be nice to have s4nnc available through SPM. How much work do you think it would take?

Unable to include s4nnc inside bazel project

When adding the dependency to s4nnc in my bazel project for iOS, my compilation errors with

ld: Undefined symbols:
  _cblas_dgemm, referenced from:
      _ccv_gemm in libccv_algebra.a[2](ccv_algebra.o)
  _cblas_sgemm, referenced from:
      __ccv_nnc_gbmm_and_bias in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gbmm_and_bias in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gbmm in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      __ccv_nnc_gemm_back_cpu_sys in libcmd.a[10](_ccv_nnc_gemm_cpu_sys.o)
      ...

I am running on a Mac M2, and my WORKSPACE includes the following

git_repository(
    name = "s4nnc",
    commit = "f4a6c3255163a2acc9b75e8e850fd8e346667bae",
    remote = "https://github.com/liuliu/s4nnc.git"
)
load("@s4nnc//:deps.bzl", "s4nnc_deps")
s4nnc_deps()

My BUILD file is as follows:

load("@build_bazel_rules_apple//apple:ios.bzl", "ios_application")
load("@build_bazel_rules_swift//swift:swift.bzl", "swift_library")


swift_library(
    name = "diffusion",
    srcs = glob(["src/*.swift"]),
    module_name = "Diffusion",
    visibility = ["//visibility:public"],
    deps = [
        "@SwiftNumerics//:Numerics",
        "@s4nnc//nnc",
    ],
)

ios_application(
    name = "iOSApp",
    bundle_id = "build.bazel.rules-apple-example",
    families = [
        "iphone",
        "ipad",
    ],
    infoplists = ["Resources/Info.plist"],
    minimum_os_version = "17.0",
    visibility = ["//visibility:public"],
    deps = [":diffusion"],
)

any tips for debugging this would be greatly appreciated, thank you!

Feature request

Python bindings for s4nnc.
Looks like it won't be hard to implement.
A Python wrapper on top of the Swift wrapper seems like a good idea.

How to initialize a tensor from a data pointer?

I have an Array (or an unsafe pointer) containing data. How do I set the data of a Tensor from that memory location? Using a for loop is slow. Is there any faster way?
