I have a somewhat unusual issue that I cannot solve because nothing relevant comes up on Google.
My data are one-hot encoded DNA sequences of varying length. These are easy to store as a jagged collection, e.g. a Python list (or object-dtype NumPy array) of n arrays, each of shape (4 x m), where n = number of samples and m = sequence length (which varies). However, the memory required after zero-padding the entire dataset to the maximum sequence length is enormous, and I need to avoid doing that.
The solution I have thought up is as follows:
- Generate the jagged NumPy array (varying input lengths)
- Extract k sequences from this large array where k = batch size
- Zero-pad the batch
- Pass to model
- Repeat from step 2
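For what it's worth, here is a minimal sketch of the batching scheme described above. The dataset is a plain Python list of (4, m) arrays (names like `padded_batches` are my own, not from any library), and each batch is zero-padded only to the longest sequence *in that batch*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical jagged dataset: a list of one-hot arrays, one per sequence,
# each of shape (4, length) with varying length. (NumPy has no true jagged
# array type, so a list or an object-dtype array is the usual container.)
sequences = [rng.integers(0, 2, size=(4, length)).astype(np.float32)
             for length in rng.integers(50, 500, size=10)]

def padded_batches(sequences, batch_size):
    """Yield zero-padded batches of shape (batch, 4, max_len_in_batch)."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(seq.shape[1] for seq in batch)
        out = np.zeros((len(batch), 4, max_len), dtype=batch[0].dtype)
        for i, seq in enumerate(batch):
            out[i, :, :seq.shape[1]] = seq  # left-aligned, zeros pad the tail
        yield out

for batch in padded_batches(sequences, batch_size=4):
    print(batch.shape)  # padded only to this batch's own max length
```

A common refinement, if your model tolerates it, is to sort (or bucket) the sequences by length before batching, so each batch contains similarly sized sequences and wastes even less padding.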
Any help would be greatly appreciated. Thanks!