Limiting OpenAI’s token usage isn’t only desirable because it is cheaper and faster; it also frees up more of the limited context window. OpenAI caps the context at 8,000 tokens for most models, with up to 32,000 tokens possible (if you are lucky enough to get invited, presumably). I haven’t seen anybody offering 32K-token completions, nor showcasing them. So in the meantime, optimizing token usage makes even more sense.
I am bootstrapping a small startup that generates SQL from natural-language input. To improve the accuracy of the generated SQL, users can add their database schema, either by static import or by connecting their database. The database schema is then included in the OpenAI input prompt.
Additionally, users can run the generated SQL directly on the database, which gives them a powerful business intelligence tool. However, the limited context size does cause some issues, both currently and for future use:
huge database schemas exceed the maximum context size (this can currently be remedied by excluding tables manually, but that isn’t ideal)
it limits future use cases like running large business-relevant simulations
Optimizing the prompt
I have a large collection of prompt templates that are used to generate input prompts for OpenAI, such as generate SQL, explain SQL, or optimize SQL. These prompts include the database schema if one has been added. The prompt templates roughly follow this format:
<!-- Initial instructions -->
<!-- Database schema in the format: table_name: (column_1, column_2, column_3) -->
<!-- Final instructions -->
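As a concrete sketch, a template like this could be built with a template literal. The function and parameter names below are hypothetical, purely for illustration, not the actual implementation:

```typescript
// Minimal sketch of a prompt template built with a template literal.
// buildPrompt, schema and request are hypothetical names.
function buildPrompt(schema: string, request: string): string {
  return `You are an assistant that writes SQL.

The database schema, in the format table_name: (column_1, column_2, column_3):
${schema}

Write a single SQL query for the following request: ${request}`;
}

const prompt = buildPrompt(
  "inventory: (inventory_id, last_update, store_id, film_id)",
  "How many items are in inventory per store?"
);
```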
In this example I will use the dvdrental sample database, but it could be any database schema or other dataset with a similar structure. I have removed the surrounding input prompt instructions and only show a few tables, since the pattern is the same. The database schema is included like this:
"""
customer_list: (notes, city, zip code, id, address, sid, country, name, phone)
inventory: (inventory_id, last_update, store_id, film_id)
address: (postal_code, last_update, phone, address2, address_id, district, address, city_id)
staff_list: (country, address, id, zip code, city, sid, phone, name)
Even in its current state it looks rather trimmed down, but by removing excessive line indentation (caused by template-literal indentation for code readability), parentheses, and commas, it can be trimmed even further:
"""
customer_list: notes city zip code id address sid country name phone
inventory: inventory_id last_update store_id film_id
address: postal_code last_update phone address2 address_id district address city_id
staff_list: country address id zip code city sid phone name
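This trimming step can be automated. Here is a minimal sketch, assuming the schema arrives as one table per line (compactSchema is a hypothetical helper name, not part of the actual codebase):

```typescript
// Strip template-literal indentation, parentheses and commas from a
// schema listing so fewer tokens are spent on punctuation.
function compactSchema(schema: string): string {
  return schema
    .split("\n")
    .map((line) =>
      line
        .trim()                // drop leading indentation
        .replace(/[(),]/g, "") // drop parentheses and commas
        .replace(/\s+/g, " ")  // collapse leftover double spaces
    )
    .join("\n");
}

const compact = compactSchema(
  "  inventory: (inventory_id, last_update, store_id, film_id)"
);
// "inventory: inventory_id last_update store_id film_id"
```

Because the compacted form still keeps one table per line and the `table: columns` shape, the model can still recover the structure.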
This small change cuts the token usage for the included database schema from 551 tokens to 381, a roughly 30% reduction (you can verify this with OpenAI’s tokenizer tool). It increases the size of the database schema that can be included and saves 30% of the token cost. So avoid unnecessary punctuation when including large datasets in your OpenAI prompts; add just enough structure for OpenAI to understand the data.