Hi everyone, I keep reading online that RNNs cannot be trained in parallel because of their inherently sequential nature, but today common sense finally kicked in and I began to wonder why.
Consider data parallelism first: I can see that a map-reduce style step can aggregate the gradients computed on each shard of a batch and average them, which gives the same gradient I would have gotten by processing the whole batch on a single machine.
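To make the data-parallel case concrete, here is a minimal NumPy sketch of the idea, not a real distributed setup. It assumes a tiny tanh RNN with a squared-error loss on the final hidden state, hand-written backpropagation through time, and two equal-sized shards standing in for workers. The function `rnn_loss_and_grads` and all shapes and variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

def rnn_loss_and_grads(params, x, y):
    """Tiny tanh RNN (illustrative only).
    x: (batch, time, in_dim), y: (batch, hid_dim).
    Loss is the batch mean of squared error on the final hidden state."""
    W_x, W_h = params
    batch, steps, _ = x.shape
    h = np.zeros((batch, W_h.shape[0]))    # hidden state starts at zero for every batch
    hs = [h]
    for t in range(steps):                 # the time loop itself is sequential
        h = np.tanh(x[:, t] @ W_x + h @ W_h)
        hs.append(h)
    loss = np.mean(np.sum((h - y) ** 2, axis=1))

    # Backpropagation through time, walking the steps in reverse.
    dW_x = np.zeros_like(W_x)
    dW_h = np.zeros_like(W_h)
    dh = 2.0 * (h - y) / batch
    for t in reversed(range(steps)):
        da = dh * (1.0 - hs[t + 1] ** 2)   # gradient through tanh
        dW_x += x[:, t].T @ da
        dW_h += hs[t].T @ da               # hs[t] is the state entering step t
        dh = da @ W_h.T
    return loss, (dW_x, dW_h)

rng = np.random.default_rng(0)
in_dim, hid_dim, steps, batch = 3, 4, 5, 8
params = (0.5 * rng.normal(size=(in_dim, hid_dim)),
          0.5 * rng.normal(size=(hid_dim, hid_dim)))
x = rng.normal(size=(batch, steps, in_dim))
y = rng.normal(size=(batch, hid_dim))

# Gradient from the whole batch on "one machine".
_, full_grads = rnn_loss_and_grads(params, x, y)

# "Map": each hypothetical worker computes gradients on an equal-sized shard.
shard_grads = [rnn_loss_and_grads(params, xs, ys)[1]
               for xs, ys in zip(np.split(x, 2), np.split(y, 2))]

# "Reduce": average the per-worker gradients, parameter by parameter.
avg_grads = tuple(np.mean(g, axis=0) for g in zip(*shard_grads))

grads_match = all(np.allclose(a, f) for a, f in zip(avg_grads, full_grads))
print(grads_match)
```

The averaged shard gradients should agree with the full-batch gradient only because the shards are the same size and the loss is a mean over examples; with unequal shards the average would need to be weighted by shard size. Note that the loop over time steps inside each worker is still sequential, so this sketch only parallelizes across examples.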
The same seems true for model parallelism: it makes sense for the gradients to flow through each part of the model, as long as the RNNs remain stateless.
Are my assertions incorrect? If yes, can anyone please share resources for this?