
Comments (17)

vbezgachev commented on August 20, 2024

Hello @R-Miner,
sorry for the delayed reply. That worked for me too. First I read the files and loaded the images into a list:

    from os import listdir
    from os.path import isfile, join

    # Collect all image files from the 'performance' directory
    path = 'performance'
    filenames = [join(path, f) for f in listdir(path) if isfile(join(path, f))]

    # Read the raw image bytes into a list, one entry per file
    imagedata = []
    for filename in filenames:
        with open(filename, 'rb') as f:
            imagedata.append(f.read())

and then called the prediction:

    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2

    # stub: the PredictionService gRPC stub created beforehand (not shown here)
    print('In batch mode')
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'inception'
    request.model_spec.signature_name = 'predict_images'

    # Pack all image strings into a single 1-D string tensor, one element per image
    request.inputs['images'].CopyFrom(
        tf.contrib.util.make_tensor_proto(imagedata, shape=[len(imagedata)]))
    result = stub.Predict(request, 10.0)  # 10 secs timeout

What kind of error did you get?

R-Miner commented on August 20, 2024

vbezgachev commented on August 20, 2024

Could you give me a hint on how you exported the model?

R-Miner commented on August 20, 2024

github.com/tensorflow/serving/issues/878#issuecomment-389160104

The above link takes you to an issue I opened when I ran into problems converting the string input to float32.

That is how I do the export... I don't use map_fn. Still not getting the right predictions, though. I am converting the string input to float because my model input should be of float32 type.

vbezgachev commented on August 20, 2024

I suspect you need to create an input placeholder and parse the input as described here: https://github.com/tensorflow/serving/blob/7c7fc37878265bda84a857aa45798a16c2617c35/tensorflow_serving/example/inception_saved_model.py#L69-L76.
To my understanding, you need to call tf.parse_example() for this to work properly.
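
For reference, a minimal sketch of the export-side input handling that the linked inception_saved_model.py follows (the feature key and image size come from that example; adapt them to your own model):

    import tensorflow as tf

    # The client sends serialized tf.Example protos; parse them into a
    # batch of encoded image strings first.
    serialized_tf_example = tf.placeholder(tf.string, name='tf_example')
    feature_configs = {'image/encoded': tf.FixedLenFeature(shape=[], dtype=tf.string)}
    tf_example = tf.parse_example(serialized_tf_example, feature_configs)
    jpegs = tf_example['image/encoded']

    # Decode each image string and convert it to the float32 tensor the model expects
    def preprocess_image(encoded_image):
        image = tf.image.decode_jpeg(encoded_image, channels=3)
        image = tf.image.convert_image_dtype(image, dtype=tf.float32)
        return tf.image.resize_images(image, [299, 299])

    images = tf.map_fn(preprocess_image, jpegs, dtype=tf.float32)

Note that with this kind of export the client sends serialized tf.Example protos rather than raw image bytes.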

R-Miner commented on August 20, 2024

vbezgachev commented on August 20, 2024

I have also updated the export of my own model and the client to call the server in batch mode. If it helps you, please take a look: 8407dd1

Regarding your model chain - why do you want to do that? I needed something like this when I wanted to use a pre-trained DenseNet model with my own classifier on top of it. Is your case similar?

pharrellyhy commented on August 20, 2024

Hi @Vetal1977

What if we have 100 different users and each of them sends one request at the same time? In that case, do we have to wait until all the requests have arrived and then batch them?

Could you give some advice on how our model can handle more requests simultaneously? In my case, processing 500 requests takes about 10s, including preparing the data, converting it to a tensor_proto and running inference.

BTW, do you know how to create a TF Serving warmup data file? When launching TF Serving, No warmup data file found at /tensorflow-serving/finger-detection-serving/a4-versions/3/assets.extra/tf_serving_warmup_requests is printed on screen, but I can't find any resources on how to create one. Thanks!

vbezgachev commented on August 20, 2024

Hi @pharrellyhy

The GPU is the bottleneck. Although the HTTP or gRPC server can serve simultaneous requests from multiple users, you run the inference on a single GPU.
Batching is definitely a way to improve performance since you stick images together and send them to the GPU for prediction at once. In your case, I would prepare batches (say 16 or 32 images) on the client side and send them to the TensorFlow server, for example along the lines of the sketch below.
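
A minimal client-side batching sketch, reusing the imagedata list and the stub from the snippets above (the batch size of 32 is only an assumption; tune it to your GPU memory and latency budget):

    BATCH_SIZE = 32  # assumed value, adjust as needed

    for start in range(0, len(imagedata), BATCH_SIZE):
        batch = imagedata[start:start + BATCH_SIZE]

        request = predict_pb2.PredictRequest()
        request.model_spec.name = 'inception'
        request.model_spec.signature_name = 'predict_images'
        request.inputs['images'].CopyFrom(
            tf.contrib.util.make_tensor_proto(batch, shape=[len(batch)]))

        # One gRPC call per batch instead of one call per image
        result = stub.Predict(request, 10.0)
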
I don't know what a warmup data file is. Where does it come from?

pharrellyhy commented on August 20, 2024

Hi @Vetal1977
Thanks. TensorFlow Serving checks for this 'warmup' file, and you can see the output on the command line once it has started. If I understand correctly, it does the warmup step at the very beginning, so our first request will take the same time as the following requests.

Let's stick to a single GPU for now. Since the GPU has its own computation power, running one model for inference will have the same performance as running multiple models. Am I right?

Another question: can we run multiple TF Serving instances in different processes? I tried to use Gunicorn to do that but failed. Do you have any thoughts on this? Thanks!

pharrellyhy commented on August 20, 2024

Hi, @Vetal1977
I'm running load tests and I found that GPU utilization is only around 10%. Do you know how to increase it? Thanks!

vbezgachev commented on August 20, 2024

Hi @pharrellyhy

I think this answer clarifies the usage of a single GPU for running multiple models simultaneously. In short, it does not make much sense because you have to split the GPU memory between processes, which in turn slows down training and execution.
You can run tf_serving in different processes - I did that with uWSGI. Here you can find a Dockerfile and here the parameters for uWSGI.

GPU utilization - how do you measure it? I created a sample application (Angular app + Node.js API + TensorFlow Serving). If I send the images one by one without waiting for the results, relying on a callback instead, I get a GPU utilization of around 85-90%; see the sketch below for the idea.
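
A minimal sketch of that callback-based, one-by-one sending on the Python client, again reusing the stub and request setup from the earlier snippets (Predict.future and add_done_callback are standard gRPC features; the callback body is only illustrative):

    def on_prediction_done(result_future):
        # Called by gRPC when the answer arrives; the send loop is never blocked
        exception = result_future.exception()
        if exception is not None:
            print('Prediction failed:', exception)
        else:
            print('Got result for one image')

    for data in imagedata:
        request = predict_pb2.PredictRequest()
        request.model_spec.name = 'inception'
        request.model_spec.signature_name = 'predict_images'
        request.inputs['images'].CopyFrom(
            tf.contrib.util.make_tensor_proto([data], shape=[1]))

        # Issue the request asynchronously and continue with the next image
        result_future = stub.Predict.future(request, 10.0)
        result_future.add_done_callback(on_prediction_done)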

Regarding the warm-up data: you are right, it is a step at the beginning to prepare the server (otherwise the first request takes much longer than the following ones). It is an asset that should be saved along with the model. I didn't find in the documentation how to specify assets.extra or which format it has. The closest answer I found is here - see assets_collection
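
For what it's worth, a hedged sketch of how such a warmup file can be written, assuming the SavedModel warmup format TF Serving expects (a TFRecord file of PredictionLog protos named tf_serving_warmup_requests inside the model version's assets.extra directory; sample_image_bytes below is a placeholder):

    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_log_pb2

    # Build one representative request, using the same fields as the live client
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'inception'
    request.model_spec.signature_name = 'predict_images'
    request.inputs['images'].CopyFrom(
        tf.contrib.util.make_tensor_proto([sample_image_bytes], shape=[1]))  # placeholder bytes

    # Wrap it into a PredictionLog and write it as a TFRecord file into
    # <model_base_path>/<version>/assets.extra/tf_serving_warmup_requests
    log = prediction_log_pb2.PredictionLog(
        predict_log=prediction_log_pb2.PredictLog(request=request))
    with tf.python_io.TFRecordWriter('tf_serving_warmup_requests') as writer:
        writer.write(log.SerializeToString())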

pharrellyhy commented on August 20, 2024

Hi, @Vetal1977
Thanks for your kind reply, as always.

I'm using Gunicorn as the WSGI server and it looks similar to uWSGI. I also checked your repository and then got lost. I know uWSGI can create multiple processes, but how does it run /serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=gan --model_base_path=/serving/gan-export &> gan_log & to start TF Serving multiple times? I can't see clearly how this command is being executed. I'm not quite familiar with Docker :( (I'm going to check the Docker Compose docs.)

Currently I'm using Locust for load testing. When I run the load test at ~200 RPS, the GPU utilization is around 10% (I'm using nvidia-smi to measure it).

vbezgachev commented on August 20, 2024

Hi @pharrellyhy

Sorry, my bad, I didn't understand you - I'm using uWSGI on top of a Flask application.

I just made a simple experiment - I ran 2 Docker containers with TensorFlow Serving and started the servers in parallel, so I had 2 processes that were able to receive requests. The problem was that one of them always failed since they were competing for the GPU. I do not see much sense in running multiple TensorFlow Serving processes on a machine with a single GPU. Furthermore, I suppose TensorFlow uses it in exclusive mode for its computational graphs. If we had 2 GPUs, we could start 2 TensorFlow servers, each using a dedicated GPU. Then we could put Nginx in front of them and load-balance the traffic to the servers; see here.

I used my own test client and just added a loop that issues requests over a longer period. I got utilization over 75% in one-by-one mode and over 85% in batch mode. What do you do in the Locust task? Do you just issue a PredictRequest or a bit more (load an image file, for instance)? And you should keep in mind that nvidia-smi has a monitoring cycle. That means it summarizes statistics over a short period of time (say 500-1000 ms), so if you have a utilization peak for 100 ms, you won't see it clearly in the statistics.

pharrellyhy commented on August 20, 2024

Hi @Vetal1977
Sorry if I didn't make it clear.

So, what you are saying is that if we want to serve 2 TF servers, we have to run 2 Docker containers, each running its own TF server and listening to the same port, right? For now, I tried to run, say, 2 TF servers in the same container. I'm not quite sure whether it is working or not, since I can't make TF Serving log more useful information. How do you set up logging for TF Serving?

vbezgachev commented on August 20, 2024

Hi @pharrellyhy

Not exactly - you start 2 Docker containers, but they publish different ports, say -p 9000:9000 and -p 9001:9000. In both containers, you start TensorFlow Serving in the same way: tensorflow_model_server --port=9000.
You can successfully start 2 instances of TensorFlow Serving in the same container. But when you issue a request from the client application, one of them crashes. I had errors such as E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_blas.cc:459] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED and Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR.
If you want more logging, you can set the environment variable in the running Docker container, export TF_CPP_MIN_VLOG_LEVEL=3, as described here

pharrellyhy commented on August 20, 2024

Thanks @Vetal1977. It's really helpful. I'm going to give it a try and see what I get. Thanks again!
