I'm trying to use observational memory with multi-modal chats, and I'm noticing that token counting goes wrong with base64-encoded images, which causes observation to kick in immediately. It looks like the character length of the message text may be being used as the token count?
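To illustrate why this blows up, here's a minimal sketch (not using any particular library's API) of how a character-count-based heuristic misjudges a base64 image payload. The byte size and token numbers are hypothetical, just to show the scale of the error:

```python
import base64

# Hypothetical ~100 KB image payload (dummy bytes, for illustration only).
image_bytes = b"\x00" * 100_000
b64 = base64.b64encode(image_bytes).decode()

# base64 inflates the payload by ~4/3, so the string is ~133 KB of text.
print(len(b64))  # 133336 characters

# A naive "tokens ~= characters / 4" estimate then reports ~33k tokens
# for a single image, even though vision models typically charge a much
# smaller, roughly fixed per-image cost.
naive_tokens = len(b64) // 4
print(naive_tokens)  # 33334
```

So one or two attached images can look like they fill the entire context window, which would explain observation triggering right away.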