I don't understand how it works. To auto-caption, it would have to analyze the images first, caption them, and then train, but I don't think it's doing that. Maybe it's the way it organizes its weights, and it auto-detects that the patterns relate to the weights that overlap with the concept of man, but really, idk.