I have a somewhat unusual issue that I cannot solve because nothing relevant comes up on Google.
My data are one-hot encoded DNA sequences of varying length. These are easy to store as a jagged collection, e.g. a Python list (or object-dtype NumPy array) of n arrays, each of shape (4 x m), where n = number of samples and m = sequence length (which varies). However, the memory required after zero-padding the entire dataset to the maximum sequence length is enormous, and I need to avoid doing that.
The solution I have thought up is as follows:
- Generate the jagged NumPy array (varying input lengths)
- Extract k sequences from this large array where k = batch size
- Zero-pad the batch
- Pass to model
- Repeat from step 2
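For what it's worth, here is a minimal sketch of the batching scheme described above. The dataset is a plain Python list of (4, m) arrays (names like `padded_batches` are my own, not from any library), and each batch is zero-padded only to the longest sequence *in that batch*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical jagged dataset: a list of one-hot arrays, one per sequence,
# each of shape (4, length) with varying length. (NumPy has no true jagged
# array type, so a list or an object-dtype array is the usual container.)
sequences = [rng.integers(0, 2, size=(4, length)).astype(np.float32)
             for length in rng.integers(50, 500, size=10)]

def padded_batches(sequences, batch_size):
    """Yield zero-padded batches of shape (batch, 4, max_len_in_batch)."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(seq.shape[1] for seq in batch)
        out = np.zeros((len(batch), 4, max_len), dtype=batch[0].dtype)
        for i, seq in enumerate(batch):
            out[i, :, :seq.shape[1]] = seq  # left-aligned, zeros pad the tail
        yield out

for batch in padded_batches(sequences, batch_size=4):
    print(batch.shape)  # padded only to this batch's own max length
```

A common refinement, if your model tolerates it, is to sort (or bucket) the sequences by length before batching, so each batch contains similarly sized sequences and wastes even less padding.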
Any help would be greatly appreciated. Thanks!