
intel-analytics / ipex-llm

6.0K stars · 243 watchers · 1.2K forks · 226.34 MB

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, etc.

Home Page: https://ipex-llm.readthedocs.io

License: Apache License 2.0

Shell 2.24% Python 97.20% Dockerfile 0.34% PowerShell 0.13% Batchfile 0.09%
Topics: pytorch, llm, transformers, gpu

ipex-llm's People

Contributors

cyita, dding3, gc-fu, hkvision, hoshibara, hzjane, jason-dai, jasonzzt, jenniew, jerryyanwan, jinbridger, lalalapotter, le-zheng, leonardozcm, liu-shaojun, meousker77, oscilloscope98, plusbang, psyyz10, qiuxin2012, qiyuangong, rnwang04, sgwhat, shane-huang, theaperdeng, uxito-ada, weiguanghan, yangw1234, zhengjin-wang, zhentaocc

ipex-llm's Issues

make-dist.sh error

1. rm -r $DIST_DIR/* fails if there is nothing under this folder
2. cp $BASEDIR/scripts/bigdlvars.sh $BIN_DIR/ fails because the shell file doesn't exist

Add a noop CompressedTensor

Implement a noop CompressedTensor (that can be used in place of FP16CompressedTensor & FP16SplitsCompressedTensor) that does no compression at all.
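
A minimal sketch of what such a class could look like, assuming a CompressedTensor-style trait with compress/deCompress methods (the trait and method signatures below are illustrative, not the actual BigDL API):

    // Hypothetical no-op implementation: it simply keeps the original float
    // values instead of encoding them into a compressed buffer.
    trait SimpleCompressedTensor {
      def compress(src: Array[Float]): Unit
      def deCompress(dst: Array[Float]): Unit
    }

    class NoopCompressedTensor extends SimpleCompressedTensor {
      private var data: Array[Float] = _

      // "Compression" is just a copy of the source values
      override def compress(src: Array[Float]): Unit = {
        data = src.clone()
      }

      // Decompression copies the stored values back out unchanged
      override def deCompress(dst: Array[Float]): Unit = {
        System.arraycopy(data, 0, dst, 0, math.min(data.length, dst.length))
      }
    }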

Make Module:backward() final

We should make Module:backward final, and override the updateGradInput and accGradParameters methods in its subclasses instead.
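
A hedged sketch of the resulting pattern, using simplified placeholder types (plain float arrays stand in for the real activity types; this is not the actual Module code):

    // backward is final and always delegates to the two overridable hooks,
    // so subclasses can no longer bypass or reorder its contract.
    abstract class SimpleModule {
      var gradInput: Array[Float] = Array.emptyFloatArray

      final def backward(input: Array[Float], gradOutput: Array[Float]): Array[Float] = {
        updateGradInput(input, gradOutput)
        accGradParameters(input, gradOutput)
        gradInput
      }

      // Subclasses override these instead of backward()
      def updateGradInput(input: Array[Float], gradOutput: Array[Float]): Array[Float]
      def accGradParameters(input: Array[Float], gradOutput: Array[Float]): Unit = {}
    }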

Support MKL2017 DNN API

Intel MKL's 2017 release contains a DNN API, which provides DNN operations optimized for the IA architecture. We will add new layers that leverage these new APIs to get better performance on CPU.

Tensor apply3 on a non-contiguous tensor throws an ArrayIndexOutOfBoundsException

Test code:

    val x = Tensor[Float](2, 1).fill(2f)
    val y = Tensor(Storage(Array(1f, 2, 3, 4, 5, 6)), 1, Array(2, 3))
    x.expandAs(y)
    val z = Tensor[Float](2, 3).zero()
    z.cmul(x, y)  //will call apply3

Exception:

java.lang.ArrayIndexOutOfBoundsException: 2
    at scala.runtime.ScalaRunTime$.array_apply(ScalaRunTime.scala:76)
    at com.intel.analytics.sparkdl.tensor.DenseTensorMath$$anon$32.apply(DenseTensorMath.scala:60)
    at com.intel.analytics.sparkdl.tensor.DenseTensorApply$.apply3(DenseTensorApply.scala:177)
    at com.intel.analytics.sparkdl.tensor.DenseTensorMath$.cmul(DenseTensorMath.scala:63)
    at com.intel.analytics.sparkdl.tensor.DenseTensor.cmul(DenseTensor.scala:841)

Save model should not save its buffers

I found that when saving a GoogLeNet model, the model file is 5.7 GB, while its parameters should only be on the order of megabytes. The root cause is that the model's buffers are also saved to the file. We should mark these buffer fields as transient so they are not part of the persisted model file.
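
A minimal sketch of the idea, with hypothetical field names (not the actual layer code):

    // Buffers marked @transient are skipped by Java serialization, so only
    // the real parameters (weight/bias) end up in the persisted model file.
    class ConvLayerState(val weight: Array[Float], val bias: Array[Float]) extends Serializable {
      // Intermediate buffer used only during forward/backward; safe to drop on save
      @transient private var fInput: Array[Float] = _

      // Re-create the buffer on demand after deserialization
      def buffer(size: Int): Array[Float] = {
        if (fInput == null || fInput.length < size) fInput = new Array[Float](size)
        fInput
      }
    }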

Add checks to give a more precise exception instead of an NPE

val outputWidth = (inputWidth + 2 * padW - kernelW) / strideW + 1
val outputHeight = (inputHeight + 2 * padH - kernelH) / strideH + 1

Add checks for these two values, which might be negative if the user passes incorrect parameters; otherwise the failure surfaces later as the NullPointerException below (see the sketch after the stack trace).

Exception in thread "main" java.lang.NullPointerException
at com.intel.analytics.bigdl.tensor.DenseTensor.fill(DenseTensor.scala:226)
at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:123)
at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:30)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:32)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:117)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:111)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:111)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
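
A hedged sketch of such a check (the helper name is hypothetical; in practice the require would sit in SpatialConvolution.updateOutput right after the two formulas above):

    // Fail fast with a descriptive message instead of letting a later
    // allocation fail with a NullPointerException.
    def checkOutputSize(inputWidth: Int, inputHeight: Int,
                        kernelW: Int, kernelH: Int,
                        strideW: Int, strideH: Int,
                        padW: Int, padH: Int): (Int, Int) = {
      val outputWidth = (inputWidth + 2 * padW - kernelW) / strideW + 1
      val outputHeight = (inputHeight + 2 * padH - kernelH) / strideH + 1
      require(outputWidth >= 1 && outputHeight >= 1,
        s"Calculated output size ($outputHeight x $outputWidth) is too small; " +
          s"check kernel ($kernelH x $kernelW), padding ($padH, $padW) and stride ($strideH, $strideW) " +
          s"against input ($inputHeight x $inputWidth)")
      (outputWidth, outputHeight)
    }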

Support GoogleNet_v2 in Spark-DL

The GoogLeNet v2 model is described in this paper.

A Caffe model is available here.

We should achieve 80% of intel-caffe's single-node performance, and match the GPU version's top-1 (68%) and top-5 (88%) results in both local mode and cluster mode.

Code refactoring for Engine, Parameter, Optimizer, DataSet, etc.

  1. Engine
    We should implement an Engine object to represent the environment for the training, including:

    • Type of the underlying execution engine: MKL-BLAS vs. MKL-DNN. We can then provide factory functions that automatically create the appropriate version of each module (that is, using MKL-DNN or not) based on the specified type; for now, the factory functions can throw an error if MKL-DNN is specified but no such version of the module exists.

    • Configurations for distributed training: partition#, worker#, core#, batch size, batchPerPartition, batchPerWorker, batchPerCore, OMP parameters, etc. We can also perform various checks on these configurations; e.g., batch size should be a multiple of worker# × core# when using MKL-BLAS.

    • Pool of threads for running multiple tasks in a worker: we shouldn’t expose the multi-threading code in the application logic; instead, we can encapsulate the multi-threading code in the Engine object, and its users (such as the Optimizer) can simply call something like

      Engine.parallelInvoke(0 until coreNumber) { i =>
        // per-core task body
      }
  2. Parameters
    We should rename the ps package to parameters, and implement a Parameter class which represents the shared variables (containing both weights and gradients) for both local and distributed modes (somewhat similar to the Broadcast variable). It should provide the following support (a hedged interface sketch follows after this list):

    • User-specified serializer object (e.g., FP16) and update method

    • getWeights/getGradients methods: these should be non-blocking methods that return a FutureResult; the user can then fetch the value through something like FutureResult.getValue(). In this way, the sync weight operations can be overlapped with other operations as follows:

      val W = parameter.getWeights()
      Engine.parallelInvoke(0 until coreNumber) { i =>
        models(i).zeroGradient()
        // ... other per-core preparation ...
        W.getValue(localWeights(i))
        // ... forward/backward using the synced weights ...
      }
    • putWeights/putGradients methods for updating the Parameter

  3. Optimizer and DataSet
    We should implement an Optimizer that exposes only DL-related concepts to the users; it can be constructed using DataSet, Module, Criterion, OptimMethod, etc. On top of it, we can implement LocalOptimizer and DistributedOptimizer that accept LocalDataSet and DistributedDataSet respectively. Inside the Optimizer, it can create Parameter objects (LocalParameter or DistributedParameter) to manage local or distributed training respectively.
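
A hedged sketch of the non-blocking Parameter interface described in item 2, using Scala Futures as a stand-in for the proposed FutureResult (all names here are illustrative, not the final API):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration

    // A FutureResult-like wrapper: the fetch is started eagerly and the
    // caller blocks only when it actually needs the value.
    class FutureResult[T](f: Future[T]) {
      def getValue(): T = Await.result(f, Duration.Inf)
    }

    trait Parameter[T] {
      // Non-blocking: kick off the fetch and return immediately
      def getWeights(): FutureResult[T]
      def getGradients(): FutureResult[T]
      // Push updated values back to the shared store
      def putWeights(value: T): Unit
      def putGradients(value: T): Unit
    }

A local implementation could back this with in-memory arrays while a distributed one fetches remote blocks; the point is that getWeights can overlap with zeroGradient and other per-core setup, as in the pseudocode in item 2.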

Give a more general name for the output log

We should not restrict BigDL to only accept Image as input

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: input image smaller than kernel size
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.nn.SpatialMaxPooling.updateOutput(SpatialMaxPooling.scala:55)
at com.intel.analytics.bigdl.nn.SpatialMaxPooling.updateOutput(SpatialMaxPooling.scala:30)

Support ResNet

We need to support ResNet in spark-dl, including single-node performance testing and tuning, plus convergence testing on single node and multi-node.

We can start with ResNet-50. The model topology should reference the Facebook Torch model and the Caffe model. The only difference between these two models is some convolution stride parameters.

The single-node performance goal is to achieve 80% of intel-caffe.

The convergence goal is to achieve the same error rates (top-1 and top-5) as the reference models.

Need a Flatten layer

Currently we only provide a Reshape layer, which requires the user to manually calculate the size from the previous layer.

It would be better if we could have a layer similar to Keras':
x = Flatten()(x)

nelements_pre_layer/batch is the size the user needs to give if we rely on the current Reshape layer.
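
A hedged sketch of what a Flatten could do, flattening everything except the batch dimension (plain arrays stand in for the real Tensor type; this is not the actual module code):

    // Flatten an input of shape (batch, d1, d2, ...) into (batch, d1*d2*...),
    // so the user never has to compute the product of the previous layer's dims.
    class Flatten {
      def outputShape(inputShape: Array[Int]): Array[Int] = {
        require(inputShape.length >= 2, s"expected at least 2 dims, got ${inputShape.length}")
        Array(inputShape.head, inputShape.tail.product)
      }

      // A reshape is just a metadata change; the underlying data is unchanged.
      def updateOutput(data: Array[Float], inputShape: Array[Int]): (Array[Float], Array[Int]) =
        (data, outputShape(inputShape))
    }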

Refactor ImageClassifier example

  1. change the folder name to ImageClassification
  2. change googlenet to inception
  3. change utils.File to utils.TorchFile
  4. support the case where the input images have no labels
  5. converge the local mode and spark mode implementations

It might be better if we can provide a general batching/shuffling transformer

Batching is quite common logic; it would be good if the user didn't need to be bothered by it.
We might be able to hide it from the user, or just provide a general Batch transformer.

Ideally, all the user needs to provide is a simple Iterator[Sample]; we take care of the batching and shuffling internally (or this kind of logic can be an out-of-the-box component which can be picked up easily).
i.e.
Iterator[Sample] --batching--shuffling--> training/validation

Related code:
https://github.com/intel-analytics/BigDL/blob/0b095036ef3e3b45913d0209ee617e7836e40974/dl/src/main/scala/com/intel/analytics/bigdl/dataset/image/RGBImgToBatch.scala#L33
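
A hedged sketch of such a generic batching transformer over Iterator[Sample] (Sample, Batch and MiniBatcher are simplified placeholder types, not the real dataset classes):

    import scala.util.Random

    case class Sample(features: Array[Float], label: Float)
    case class Batch(features: Array[Array[Float]], labels: Array[Float])

    object MiniBatcher {
      // Shuffle (optionally) and group samples into fixed-size batches so that
      // callers only ever deal with Iterator[Sample] -> Iterator[Batch].
      def apply(samples: Iterator[Sample], batchSize: Int, shuffle: Boolean = true): Iterator[Batch] = {
        val source = if (shuffle) Random.shuffle(samples.toSeq).iterator else samples
        source.grouped(batchSize).map { group =>
          Batch(group.map(_.features).toArray, group.map(_.label).toArray)
        }
      }
    }

A real implementation would shuffle lazily (e.g., within a buffer window) rather than materializing the whole iterator, but the interface is the point here.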

In SpatialConvolution, weight address changed, but weightMM not

In SpatialConvolution, weightMM is a view of weight, and it is only re-created when it is empty.

When the getParameters function is called, Module.flatten changes the memory address (underlying storage) of weight.

So if we call forward/backward and obtain weight via getParameters, changes to weight won't affect the non-empty weightMM.

The rule for updating weightMM needs to change.
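
One possible direction, sketched with simplified placeholder types (not the actual SpatialConvolution code): rebuild the view whenever the backing storage of weight no longer matches the one weightMM was built from.

    // Simplified sketch: a "view" records which backing array it was built on,
    // and updateOutput rebuilds it if the parameter has been re-flattened.
    class WeightView(var backing: Array[Float]) {
      def isStale(current: Array[Float]): Boolean = !(backing eq current)
    }

    class ConvSketch(var weight: Array[Float]) {
      private var weightMM: WeightView = _

      def updateOutput(): Unit = {
        // Re-create the view not only when it is empty, but also when the
        // weight storage it points to has been replaced (e.g., by getParameters).
        if (weightMM == null || weightMM.isStale(weight)) {
          weightMM = new WeightView(weight)
        }
        // ... use weightMM for the matrix-multiply based convolution ...
      }
    }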

Enrich the exception info for the require statement

There are lots of require statements without enough information for debugging.
e.g. require(1 == this.nDimension, "invalid size")
A better version:
require(1 == this.nDimension, s"nDimension size: ${this.nDimension} should be 1")

Unify the data loading interface of different datasets

We support using different datasets (ImageNet, CIFAR-10 and MNIST) to train and test models, but the code is hard to maintain. We should do some refactoring.

The target is to unify the data loading and transform functions of the different datasets in both local and Spark cluster mode.
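
A hedged sketch of what a unified interface could look like (all names are illustrative, not a proposal of the final API):

    // One interface for every dataset, in both local and cluster mode:
    // "load" produces raw samples, "transform" turns them into model input.
    trait UnifiedDataSource[Raw, Input] {
      def load(path: String): Iterator[Raw]
      def transform(raw: Iterator[Raw]): Iterator[Input]
    }

    // A concrete dataset only has to say how its files are parsed;
    // batching, shuffling and distribution stay in shared code.
    class MnistSource extends UnifiedDataSource[Array[Byte], Array[Float]] {
      def load(path: String): Iterator[Array[Byte]] = Iterator.empty          // placeholder parser
      def transform(raw: Iterator[Array[Byte]]): Iterator[Array[Float]] =
        raw.map(bytes => bytes.map(b => (b & 0xff) / 255.0f))                  // normalize to [0, 1]
    }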

Refactor SparkML example

  1. Instead of using DataSet on the driver side, we should use only RDD or DataFrame in the driver
  2. In DLClassifier.process, we may transform the Iterator[Row] to a LocalDataSet in mapPartitions
  3. In DLClassifier.process, we cannot share models between different partitions; we need to clone a new localModel in mapPartitions (see the sketch after this list)
  4. Move spark.ml.DLClassifier to the utils folder
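
A hedged sketch of item 3, cloning the shared model inside each partition so partitions never mutate a common instance (Model, cloneModel and modelBroadcast are placeholders, not the real classes):

    // Each partition clones its own copy of the model before running
    // inference, instead of sharing and mutating one instance.
    case class Model(weights: Array[Float]) {
      def cloneModel(): Model = Model(weights.clone())
      def predict(x: Array[Float]): Float = weights.zip(x).map(p => p._1 * p._2).sum
    }

    def process(rows: Iterator[Array[Float]], shared: Model): Iterator[Float] = {
      val localModel = shared.cloneModel()   // per-partition copy, never shared
      rows.map(localModel.predict)
    }

    // Inside DLClassifier this would be invoked roughly as:
    //   dataFrameRDD.mapPartitions(iter => process(iter, modelBroadcast.value))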

Performance of Concat layer

Testing GoogLeNet v1 on sparkdl suggests there may be a performance issue in the Concat layer.

The time distribution of one iteration is:

  • total: 1471.642277 ms
  • forward: 333.476 ms
  • backward: 475.903 ms

The remaining time (neither forward nor backward) is 662.351 ms, which is almost entirely the cost of the Concat layer.

In WebscaleML, I implemented the concat copy with some repeated and dirty code, as a special case of apply2 in DenseTensorApply: when tensor1's stride and tensor2's stride are both 1, it uses System.arraycopy instead of copying one element at a time, as sketched below.
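
A hedged sketch of that fast path (simplified; not the actual DenseTensorApply code): when both sides are unit-stride, fall back to a single bulk copy instead of applying the operation element by element.

    // Simplified apply2-style copy: take the bulk-copy fast path when both
    // tensors are contiguous with stride 1 and the operation is a plain copy.
    def copy2(dst: Array[Float], dstOffset: Int, dstStride: Int,
              src: Array[Float], srcOffset: Int, srcStride: Int,
              n: Int): Unit = {
      if (dstStride == 1 && srcStride == 1) {
        // Contiguous on both sides: one bulk copy instead of n element copies
        System.arraycopy(src, srcOffset, dst, dstOffset, n)
      } else {
        var i = 0
        while (i < n) {
          dst(dstOffset + i * dstStride) = src(srcOffset + i * srcStride)
          i += 1
        }
      }
    }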
