Neon · 2y ago
rival-black

INSERT Statement Truncation?

What: When inserting rows with larger text columns into the DB (using Kysely, the Neon serverless driver, and the kysely-neon dialect), I get errors that appear to be caused by my SQL statements being truncated.

Why: I am storing document embeddings for an LLM application.

How: I am using the Langchain Document model in conjunction with Unstructured.io to clean and partition my document texts into a number of rows representing the full contents of a given document (PDF, docx, etc.). The inserted rows consist of strings of text (possibly in the form of HTML tables), vector embeddings, and optional metadata, and I insert them within a transaction using Kysely and the other technologies mentioned above. Notably, when I chunk my insertions into multiple calls to the DB, the problem no longer occurs, yet the number of rows I can insert without error is surprisingly low (~10), all things considered. The text I'm inserting is not overly long (no more than 1,000 characters per row), and each row is roughly equivalent in size to the JSON below (code samples and logs to follow):
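First, for context, the stack is wired together roughly like this (a minimal sketch, not the actual application code; the schema type, table shapes, and environment variable are placeholders):

import { Kysely } from 'kysely';
import { NeonDialect } from 'kysely-neon';

// Placeholder schema type standing in for the real nlp.embedding /
// nlp.file_content tables shown in the application code below.
interface Database {
  'nlp.embedding': { id: string; vector: string };
  'nlp.file_content': { embedding_id: string; text_as_html: string | null };
}

// kysely-neon routes Kysely's compiled queries through the Neon
// serverless driver instead of a regular TCP Postgres connection.
const db = new Kysely<Database>({
  dialect: new NeonDialect({
    connectionString: process.env.DATABASE_URL,
  }),
});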
6 Replies
rival-black (OP) · 2y ago
{"pageContent": "Line Assets: Amount 1 Checking and Savings Accounts $ 2 Stocks and Bonds $ 3 Accounts and Notes Receivable $ 4 Liquid Assets (Total Lines 1 through 3) $250,000 5 401K IRA and Pension Funds $ 6 Equity in Businesses Owned $ 7 Value of Primary Residence $1.7m 8 Value of Other Real Estate $1.1m 9 Other Assets $ 10 Total Assets (Total Lines 4 through 10) $2,900,000 est Liabilities and Net Worth Amount 11 Mortgage Balance on Primary Residence $486k 12 Mortgage Balance on Other Real Estate $270k 13 Other Liabilities $ 14 Total Liabilities (Total Lines 11 through 13) $800k 15 Net Worth (Line 10 less Line 14 $2,000,000 est Sources of Income (Annual) 16 Salary, Bonus, Commission – Self $currently not working 17 Salary, Bonus, Commissions – Spouse $250,000 est + 18 Income from Business Owned $ 19 Other Income $ 20 Total Income (Total Lines 16 through 19) $250,000",
"metadata": {
"type": "NLP.FileContentDocument",
"data": {
"type": "Table",
"fileUploadId": "456",
"documentIndex": { "index": 28, "total": 32 },
"fileVersionId": "abc",
"filename": "foo-bar.docx",
"filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"textAsHtml": "...roughly 1000 characters of html (removed due to posting limits)",
"languages": ["eng"],
"pageNumber": 1,
"orgId": "123"
}
}
}
{"pageContent": "Line Assets: Amount 1 Checking and Savings Accounts $ 2 Stocks and Bonds $ 3 Accounts and Notes Receivable $ 4 Liquid Assets (Total Lines 1 through 3) $250,000 5 401K IRA and Pension Funds $ 6 Equity in Businesses Owned $ 7 Value of Primary Residence $1.7m 8 Value of Other Real Estate $1.1m 9 Other Assets $ 10 Total Assets (Total Lines 4 through 10) $2,900,000 est Liabilities and Net Worth Amount 11 Mortgage Balance on Primary Residence $486k 12 Mortgage Balance on Other Real Estate $270k 13 Other Liabilities $ 14 Total Liabilities (Total Lines 11 through 13) $800k 15 Net Worth (Line 10 less Line 14 $2,000,000 est Sources of Income (Annual) 16 Salary, Bonus, Commission – Self $currently not working 17 Salary, Bonus, Commissions – Spouse $250,000 est + 18 Income from Business Owned $ 19 Other Income $ 20 Total Income (Total Lines 16 through 19) $250,000",
"metadata": {
"type": "NLP.FileContentDocument",
"data": {
"type": "Table",
"fileUploadId": "456",
"documentIndex": { "index": 28, "total": 32 },
"fileVersionId": "abc",
"filename": "foo-bar.docx",
"filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"textAsHtml": "...roughly 1000 characters of html (removed due to posting limits)",
"languages": ["eng"],
"pageNumber": 1,
"orgId": "123"
}
}
}
Relevant application code:
await db.transaction().execute(async (db) => {
  const chunkedEmbeddings = Arrays.chunk(embeddings, CHUNK_SIZE);
  const chunkedDocuments = Arrays.chunk(
    documents.map(({ metadata: { data } }, i) => ({
      embedding_id: embeddings[i].id,
      type: data.type,
      org_id: data.orgId.value,
      ordinal_index: data.documentIndex.index,
      total_documents: data.documentIndex.total,
      file_upload_id: data.fileUploadId.value,
      file_version_id: data.fileVersionId,
      filename: data.filename,
      filetype: data.filetype,
      text_as_html: data.textAsHtml,
      file_directory: data.fileDirectory,
      is_continuation: data.isContinuation,
      languages: data.languages ? array(data.languages) : data.languages,
      page_number: data.pageNumber,
      page_name: data.pageName,
      sent_from: data.sentFrom,
      sent_to: data.sentTo,
      subject: data.subject,
      section: data.section,
      attached_to_filename: data.attachedToFilename,
      header_footer_type: data.headerFooterType,
    })),
    CHUNK_SIZE
  );

  await Promise.all(
    chunkedEmbeddings.map(
      (embeddings) =>
        embeddings.length &&
        db.insertInto('nlp.embedding').values(embeddings).execute()
    )
  );

  await Promise.all(
    chunkedDocuments.map(
      (documents) =>
        documents.length &&
        db.insertInto('nlp.file_content').values(documents).execute()
    )
  );
});

// Works when CHUNK_SIZE = 10
// Fails when CHUNK_SIZE = 20
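A note for anyone reproducing this: Arrays.chunk is an internal helper. It's assumed to behave like a standard chunking function, along these lines:

// Hypothetical stand-in for the Arrays.chunk helper used above: splits an
// array into consecutive slices of at most `size` elements each.
function chunk<T>(items: readonly T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}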
unwilling-turquoise · 2y ago
I am using Neon to store scraped web pages as HTML in a text field. I only use the Neon serverless driver and don't have this problem. I just put one of my docs into a character counter; a single doc runs up to 11,000 characters.
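Roughly what that looks like on my end (a sketch; the table and column names are made up, and values in the tagged template are sent as bound parameters by the driver):

import { neon } from '@neondatabase/serverless';

const sql = neon(process.env.DATABASE_URL!);

// Hypothetical table/columns for illustration; the HTML string can run
// to ~11,000 characters and inserts fine as a single bound parameter.
const url = 'https://example.com/page';
const html = '<html>...</html>'; // in practice, the full scraped page

await sql`INSERT INTO pages (url, html) VALUES (${url}, ${html})`;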
rival-black (OP) · 2y ago
Interesting... I can try again without the kysely-neon dialect, though I'm loath to move away from Kysely entirely since we use it everywhere. Also, I never had this issue when using AWS RDS.
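If it helps, that test would just swap the dialect while keeping every query untouched (a sketch, assuming the stock Kysely PostgresDialect with the pg driver; Database is the same placeholder schema type as in the earlier sketch):

import { Kysely, PostgresDialect } from 'kysely';
import { Pool } from 'pg';

// Swapping only the dialect isolates whether the truncation happens in
// the Neon serverless driver / kysely-neon layer or in Kysely itself.
// (Database is the placeholder schema type from the earlier sketch.)
const db = new Kysely<Database>({
  dialect: new PostgresDialect({
    pool: new Pool({ connectionString: process.env.DATABASE_URL }),
  }),
});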
multiple-amethyst · 2y ago
@Willem any luck reproducing with a different dialect? It would be good to know whether this is on the Kysely side or the Neon side.
rival-black (OP) · 2y ago
@ShinyPokemon Hey, just finished looking through this – it's a Kysely issue (fixed in https://github.com/kysely-org/kysely/pull/202). Thanks for the help!
multiple-amethyst · 2y ago
Oooh, that's a nasty one. Thanks for letting us know!
