I am trying to fine-tune a BERT model on the well-known Movie Review dataset, running on an M1 chip.
The ETA for one epoch is about 10 hours when training all 66M parameters.
To reduce the ETA, I set the first two layers to `trainable=False`, which leaves only about 2K trainable parameters.
Even though the number of trainable parameters dropped, nothing changed: the ETA is still 10 hours.
Is this normal, or is something wrong on my side?
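
The freezing step described above can be sketched roughly as follows, assuming a Keras (`tf.keras`) setup. The small `Sequential` model, its layer names, and the `count_params` helper are hypothetical stand-ins for illustration, not the actual BERT model or training script:

```python
import math

import tensorflow as tf

# Hypothetical stand-in model: two large-ish layers plus a small classifier head.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu", name="body_1"),
    tf.keras.layers.Dense(32, activation="relu", name="body_2"),
    tf.keras.layers.Dense(2, name="classifier"),
])

# Freeze the first two layers; in Keras this should be done before compile().
for layer in model.layers[:2]:
    layer.trainable = False


def count_params(weights):
    """Sum the number of scalar entries across a list of weight variables."""
    return sum(math.prod(int(d) for d in w.shape) for w in weights)


trainable_count = count_params(model.trainable_weights)
frozen_count = count_params(model.non_trainable_weights)

print("trainable parameters:", trainable_count)
print("frozen parameters:   ", frozen_count)
```

Here only the classifier head stays trainable, which mirrors the situation in the question: the trainable count drops sharply while the total number of parameters in the model is unchanged.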