Correct way to prune a model for faster CPU inference?

I have been following the TensorFlow examples on how to set up a model for pruning and quantise it in order to speed up inference. However, I noticed that:

1) the sparse model resulting from pruning shows no inference speedup;

2) quantisation makes the model even slower (I know this is probably due to TFLite not being optimised for x86).
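Observation 1 is expected with unstructured magnitude pruning: the zeroed weights are still stored in a dense tensor, and a dense matmul kernel multiplies zeros like any other value, so the FLOP count is unchanged. A minimal NumPy sketch (hypothetical, not the TF API, just the concept) of why this happens:

```python
import numpy as np

# Hypothetical sketch of unstructured magnitude pruning: zero out the
# smallest-magnitude weights of a layer, as the tfmot pruning schedule
# does conceptually per step.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

sparsity = 0.8  # target fraction of weights set to zero
threshold = np.quantile(np.abs(W), sparsity)
mask = np.abs(W) >= threshold
W_pruned = W * mask

print(f"sparsity achieved: {1 - mask.mean():.2f}")

# The pruned matrix is still a dense array: same memory, and a dense
# matmul over it performs the same number of multiply-adds. Without a
# sparse storage format (e.g. CSR) and a sparse-aware kernel, pruning
# alone cannot speed up CPU inference.
x = rng.standard_normal((256,)).astype(np.float32)
y = W_pruned @ x  # identical FLOPs to W @ x
print(W_pruned.nbytes == W.nbytes)  # dense storage is unchanged: True
```

To actually see a speedup you would need to convert the pruned weights to a sparse representation and run them through a kernel that skips zeros, which standard TensorFlow CPU inference does not do by default.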

What is the method you use to prune your models?

submitted by /u/ats678
