Hey all, I trained a “RoastBot” a while ago using a dataset I scraped from /r/RoastMe. The inputs are images of people and the outputs are highly rated comments that “roast” the people in them.
I use InceptionV3 to preprocess the images into latent vectors, and then I use a recurrent decoder with visual attention to generate the output sequences. This works well enough to come up with something decent every now and again, but it seems like the model would do better if it started training already knowing about grammar and syntax.
I was thinking I could replace my decoder with a pre-trained BERT model, but BERT and other transformer models only take text as input, right? I think BERT at least preprocesses the text, though I’m not sure how.
My latent tensors are of shape (8, 8, 2048), and I imagine that the input text tensors for BERT are of shape (num_tokens, 1). I guess I could flatten my tensor to shape (8 * 8 * 2048, 1), i.e. (131072, 1), but I also don’t know if BERT will even do a good job going from image data to text…
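To make the shape arithmetic concrete, here’s a quick sketch in plain Python. Flattening (8, 8, 2048) completely gives 8 * 8 * 2048 = 131072 elements. The second option, treating each of the 64 grid cells as one 2048-dimensional “token”, is only an assumption on my part about what a sequence model might prefer, not something I’ve tried. The variable names are made up for illustration.

```python
from math import prod

# Shape of the InceptionV3 feature map described above:
# an 8x8 spatial grid with a 2048-dimensional vector per cell.
latent_shape = (8, 8, 2048)

# Option 1: flatten everything into one long column, as proposed above.
flat_shape = (prod(latent_shape), 1)

# Option 2 (assumption, untested): keep the channel axis and treat each
# of the 8 * 8 = 64 grid cells as one "token" with 2048 features.
grid_h, grid_w, channels = latent_shape
sequence_shape = (grid_h * grid_w, channels)

print(flat_shape)      # (131072, 1)
print(sequence_shape)  # (64, 2048)
```

Option 1 throws away the grouping of values into per-region feature vectors, while option 2 keeps it, which is closer to what my current visual attention already looks at.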
A large pre-trained image-captioning model would be perfect for fine-tuning, but I don’t think one exists.