I see multiple options on the internet for optimizing inference, and I don't know which would be the best fit for me. My goal is to maximize throughput on the GPU and, ideally, to reduce GPU memory usage as well.
I have a reinforcement learning project where multiple CPU processes generate input data in batches and send it to a single GPU for inference. Each process loads the same ResNet model with two different weight configurations at a time. The weights get updated about every 30 minutes and are distributed to the processes. I use Python and TensorFlow 2.7 on Windows (don't judge), and the only optimization I use right now is the built-in XLA compilation. My GPU does not support FP16.
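For context, here is a minimal sketch of the pattern I described: several CPU worker processes generate batches and push them onto a shared queue, while the single process that owns the GPU drains the queue and runs inference. The actual model call is stubbed out here (in the real project it's the ResNet forward pass); names like `make_batch`, `run_inference`, and `BATCHES_PER_WORKER` are just illustrative, not real API.

```python
import multiprocessing as mp

BATCHES_PER_WORKER = 4  # illustrative; real workers run until the job ends

def make_batch(worker_id):
    # Stand-in for the RL environment producing a batch of observations.
    return [worker_id] * 8

def producer(worker_id, queue):
    # One CPU process: generate batches and send them to the GPU process.
    for _ in range(BATCHES_PER_WORKER):
        queue.put(make_batch(worker_id))
    queue.put(None)  # sentinel: this worker is done

def run_inference(batch):
    # Stand-in for model(batch) on the GPU.
    return sum(batch)

def serve(num_workers=2):
    # The single GPU-owning process: drain the queue until every
    # worker has sent its sentinel, running inference on each batch.
    queue = mp.Queue()
    workers = [mp.Process(target=producer, args=(w, queue))
               for w in range(num_workers)]
    for w in workers:
        w.start()
    results, done = [], 0
    while done < num_workers:
        batch = queue.get()
        if batch is None:
            done += 1
        else:
            results.append(run_inference(batch))
    for w in workers:
        w.join()
    return results

if __name__ == "__main__":
    # 2 workers * 4 batches each = 8 inference results
    print(len(serve()))
```

On Windows the `if __name__ == "__main__":` guard is required, since new processes are spawned rather than forked.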
I have seen TensorRT suggested for optimizing inference, I have also seen TensorFlow Lite, Intel has an optimization tool too, and then there is TensorFlow Serving. Which option do you think would fit my needs best?