Paul Brown · blog · 19 min read

AI for Drilling Engineers - How and When (and why you shouldn't)

Too much noise, not enough signal. What you should know.


There’s a lot of hype about AI (particularly the text-generation kind), and a lot of money in oil, so it stands to reason that there are a lot of people trying to sell AI solutions to the oil industry. This article goes over the general way in which the AI knowledge systems commonly touted to engineers actually work, so you can hopefully better judge for yourself what’s involved in some (probably expensive) solution.

How AI Works

First, you start with a model: A proper AI company (more on ‘proper’ in a minute) like OpenAI will spend huge sums of money to create these models - tens or hundreds of millions of dollars to run supercomputers which churn through unfathomable amounts of text from books, websites, articles, transcripts, etc. The models contain mathematical representations of how one word associates with a large number of other words. So there might be 1500 parameters associated with each word: one parameter is black/white, one is hot/cold, and after reading the word ‘shoe’ a trillion times the model says the word ‘shoe’ is best represented by being -0.1 black (a tiny bit black - maybe they’re mostly black?) and 3.0 cold (more cold than hot). At the same time as it’s working out where ‘shoe’ stands in the model, it’s also adjusting where black/white and hot/cold stand relative to everything else, because of the effect the current context of the word ‘shoe’ is having.

Now, I just made those numbers and parameters up - the actual parameters/words/values will be a lot more arbitrary, abstract, and complex, but you get the idea. For our purposes - AI in the oilfield - how these models are made is largely irrelevant, because here’s a secret the AI salesmen who are sending you all those emails don’t want you to know: They’re not really AI companies. They use AI as a tool, yes, and maybe it’s their main tool, but they’re using those models developed by the big boys, and those models are specifically designed to be easy to use. Here’s how the engineering AI software works 99% of the time:

OpenAI/Meta/Google/whoever spend lots of money making a model. They also build systems that allow you to interact with these models. When you interact, say by sending a question, their system converts your text input into the numerical format that the model understands, and then comes up with an answer.

So, you pick a model - let’s say we choose GPT-3.5 (each company will have various models - OpenAI has GPT-4, GPT-3.5, whatever; ChatGPT is the name of the service by which the public interacts with their various models), and you go to their website and read the instructions on how to interact with their models programmatically (via their interface - their ‘API’). You make a simple program that takes a user’s question and sends it to GPT-3.5, then GPT-3.5 sends your program the answer and you show the user the response. This is a simple process, but it is the basis of all these ‘AI’ companies’ systems.
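To make that concrete, here’s a minimal sketch in Python of the whole ‘send question, get answer’ loop, assuming the official OpenAI client library and an API key in your environment (the model name is just an example):

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def ask(question: str) -> str:
    # Send the user's question to the model and return the text of its reply.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(ask("What do I need to do before testing the pumps?"))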

So how to apply this for an engineering case? Let’s say you’re a drilling company. During operations, your team creates daily reports and lessons you’ve learned. You have many years worth of operations and you want to use AI to get insights into your data. How would you do this?

Using your operational data

First, you need to get your business data into a format that can be used. You can do this quickly and easily by getting the text from all those reporting systems/pdfs/spreadsheets/documents into a database: Make a simple program that sucks them in and extracts all the text, then filters out all the invalid characters. Maybe you keep things like the document name/author/page number as metadata, and you store each text chunk by page or paragraph or whatever you think best. So now instead of ten thousand pdfs, you have a database (which looks just like a spreadsheet), with each row containing a chunk of text (maybe from a single page) and some bits and bobs of other information. (By the way, it would take maybe ten minutes to suck in and filter about 50 GB of pdfs, and that 50 GB would become about 200 MB when reduced to text only).
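For illustration, here’s a rough sketch of that ingestion step, assuming the pypdf library and a SQLite database (the table layout and folder name are just examples):

import sqlite3
from pathlib import Path
from pypdf import PdfReader

db = sqlite3.connect("operations.db")
db.execute("CREATE TABLE IF NOT EXISTS chunks (doc TEXT, page INTEGER, body TEXT)")

for pdf_path in Path("reports").glob("*.pdf"):
    reader = PdfReader(pdf_path)
    for page_no, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        # Filter out invalid/non-printable characters before storing.
        text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
        if text.strip():
            # One row per page, with document name and page number as metadata.
            db.execute("INSERT INTO chunks VALUES (?, ?, ?)",
                       (pdf_path.name, page_no, text))
db.commit()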

Now, we have the data - how can we use AI with this?

Previously, I mentioned that the process of using these AI models was simply asking the AI system a question and waiting for the answer. So that’s what you do. But you can’t just ask “What do I need to do before testing the pumps?” - GPT-3.5 doesn’t know anything about what equipment you normally use, and it’s probably not been trained on a lot of drilling reports. You need to provide it with context. But you can’t provide it with your huge amount of text data - it wouldn’t be able to process that. So how do you get the appropriate context?

First, you need to do some formatting on that operational data you just put in your database: You go through every row, take every bit of text you’ve extracted from your documents, and send each one to a specific part of GPT-3.5’s system, which replies with a ‘vector embedding’ - a mathematical representation of how that block of text fits into its model (in roughly the same way as I mentioned ‘shoe’ might be represented in a model). You store that vector in the database alongside the chunk of text it refers to.

The embedding it returns looks like this, though I’ve removed most of the 1536 total numbers:

[-0.020048857,0.010397183,0.0019686087,-0.041455604,0.0017589345,-0.0058642244,-0.026359055,-0.01404485,0.051573224,-0.027264316,0.038260568...]
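Generating and storing these embeddings is, again, just one API call per text chunk. A sketch, carrying on from the ingestion example above (the embedding is stored as a JSON string for simplicity; a real system would use a proper vector database or column type):

import json
import sqlite3
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("operations.db")
db.execute("ALTER TABLE chunks ADD COLUMN embedding TEXT")  # run once

for rowid, body in db.execute("SELECT rowid, body FROM chunks").fetchall():
    result = client.embeddings.create(
        model="text-embedding-ada-002",  # returns 1536 numbers per input
        input=body,
    )
    vector = result.data[0].embedding  # a plain list of floats
    db.execute("UPDATE chunks SET embedding = ? WHERE rowid = ?",
               (json.dumps(vector), rowid))
db.commit()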

You know how I mentioned using these AI models is specifically designed to be as simple as possible? That’s because their business model is to have as many users as possible, since you have to pay to use these services (and also because some of them train their models on the data they receive, which makes the models more powerful). It can be costly when used in high volumes, but getting mathematical representations of all the text in our database (i.e. our 50 GB of operations reports) would cost around $1.

So now we have a database of all our business data, along with a mathematical representation of how each text blob fits in the context of the model.

Next step: We do the same ‘embedding’ thing, except this time we do it with the question. So we ask GPT-3.5 to tell us what mathematical representation it would use for our question “What do I need to do before testing the pumps?”.

Now that we have the embedding of the question, we have our program search our database for the embeddings that are closest. It does this using an equation that even your friendly AI snake oil salesman engineer doesn’t need to know much about, which basically works out how ‘close’ our question vector is to the vectors in our database (‘cosine similarity’). When it finds the ones that are closest, it retrieves the text - so it now has a chunk of your business data that relates to the question. You control this search, so you could return just the closest match, or the top X matches, or configure it to only return matches above a certain closeness threshold.
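Here’s a sketch of both steps - embedding the question, then finding the nearest chunks. The cosine similarity equation really is just a few lines of arithmetic, no AI involved (table and column names carry over from the earlier sketches):

import json
import math
import sqlite3
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list:
    # Same embedding call as before, this time for the question.
    result = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return result.data[0].embedding

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_matches(question_vector: list, k: int = 3) -> list:
    db = sqlite3.connect("operations.db")
    rows = db.execute("SELECT body, embedding FROM chunks").fetchall()
    scored = [(cosine_similarity(question_vector, json.loads(emb)), body)
              for body, emb in rows if emb]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [body for _, body in scored[:k]]

context = top_matches(embed("What do I need to do before testing the pumps?"))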

Asking a question about your data

So now you have your question, and you have some context from your business data. Now you can query the AI model. Again, we’re just using the simple “ask question, get answer” process mentioned before; we just need to clarify what we want. How about:

  • Part 1 - Tell it what it’s to do: “You’re an engineering knowledge base. Answer the question using the context provided.”
  • Part 2 - Provide it the question: “What do I need to do before testing the pumps?”
  • Part 3 - Give it the context: Maybe your database search returned “Configured pump test module. Charged flux capacitor. Commenced pump test.”

So your question literally looks like this:

You’re an engineering knowledge base. Answer the question using the context provided:
Question: What do I need to do before testing the pumps?
Context: Configured pump test module. Charged flux capacitor.

So now you have one big question with all three parts. You send it to the AI model. It comes back with an answer. I ran the exact question through our setup, and got this response:

Before testing the pumps, you should ensure that the pump test module is properly configured and ready for use. Additionally, make sure that the flux capacitor is fully charged as it may be necessary for the operation of the pumps. Once these preparations are complete, you can commence the pump test by following the appropriate procedures for your specific system.

So, the answer refers to the context provided - the context we extracted from our historic operational reports. It might seem simple, but composing that question is really the basis of everything to do with getting knowledge from your textual data. Want a more comprehensive response? Want it to automatically draw schematics? It’s just an iteration of this ‘ask the right question’ process. Here are a few examples:

Get a more detailed answer

The initial answer is relatively basic: I don’t know anything about a pump test module, or a flux capacitor, so how do I get more detail? In our initial question we could tell the AI system to ‘add as much detail as possible’, but there’s only so much detail provided by the context we gave, so that’s not going to work. We could try to get more pieces of context from our initial database search, but we were searching for testing pumps, not flux capacitors, so the context would likely be wrong. Instead, we can iterate on the answer by configuring our question-asking program to do something like this:

  1. Get the initial reply from the system, as above.

  2. Send the reply back to the system, with a new question added: “Divide this response into sub-questions I may have about the response.” Here’s a key part: Tell the AI to respond in a format your program can easily interpret (e.g. a format called “JSON”). So your new question is:

    Take the response below. Create several questions that I may have on the response, that will improve my understanding. Return the questions in JSON format. <<— Add in the initial response here —>>

  3. Send it to the AI service. It will respond with something like this, which is in JSON format:

    { "question 1": "What is a pump test module?", "question 2": "What is a flux capacitor?" }

  4. That ‘JSON’ format, with the curly brackets and the quotations, can be easily read by our program. So now, our program can take each question and search through our operational database for appropriate context. We could provide the AI system with any of this context from our database, and at the same time ask it to supplement our context with its own knowledge where appropriate (so those millions of engineering books it has read will be used for our answer). Then, when we get all the answers, we can join them all together and provide one big comprehensive answer.
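A sketch of that loop, reusing the hypothetical ask(), embed(), and top_matches() helpers from the earlier sketches (the prompt wording is illustrative, and a real program would also handle the model returning malformed JSON):

import json

initial_answer = "..."  # the pump test answer from before

sub_q_prompt = ("Take the response below. Create several questions that I may have "
                "on the response, that will improve my understanding. Return the "
                "questions in JSON format.\n\n" + initial_answer)
sub_questions = json.loads(ask(sub_q_prompt))

parts = [initial_answer]
for question in sub_questions.values():
    # Search our own data for context relevant to each sub-question.
    context = "\n".join(top_matches(embed(question)))
    parts.append(ask("Answer the question using the context provided, supplementing "
                     "it with your own knowledge where appropriate.\n"
                     f"Question: {question}\nContext: {context}"))

comprehensive_answer = "\n\n".join(parts)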

Having a chat - AI making decisions

If we were to reply to the first answer (about the pump test) with “can you elaborate?”, we wouldn’t get a good response from the AI service, because it wouldn’t know what it’s elaborating on. Instead, we can provide the initial question/answer as context alongside asking the new “can you elaborate?” question. This way we can have a chat with the AI service about a topic, in the same way that we can with ChatGPT.

Now, if we just input “can you elaborate” to our initial program, then it would do the whole routine of getting the question embeddings, searching our database for business data for context, etc., but that wouldn’t be appropriate - we don’t want to find operational data about the topic “can you elaborate”. So our program needs to decide when to follow that ‘search for operational data’ process, or when to just provide the previous response as context without searching for business information.

To make this ‘decision’, we can make our program do the following before it does anything else:

  • Send a prompt to the AI service. Tell it the question, and ask it to decide which of the following is more appropriate (and again respond in that JSON format that our program can read):
    • Option 1: Search our operational database for context
    • Option 2: Ask a question based on our previous conversation.
  • Based on the AI system’s response, we can have our program choose which process to follow.
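A sketch of that routing decision, again reusing the hypothetical helpers (the prompt wording and option numbers are illustrative):

import json

ROUTER_PROMPT = """Decide which option is more appropriate for handling the user's
message below, and reply in JSON like {{"option": 1}}.
Option 1: Search our operational database for context.
Option 2: Answer from the previous conversation alone.
Message: {message}"""

def answer(message: str, history: list) -> str:
    choice = json.loads(ask(ROUTER_PROMPT.format(message=message)))
    if choice.get("option") == 1:
        context = "\n".join(top_matches(embed(message)))  # fresh database search
    else:
        context = "\n".join(history)  # prior question/answer pairs as context
    reply = ask(f"Answer using the context provided.\nQuestion: {message}\nContext: {context}")
    history.append(f"Q: {message}\nA: {reply}")
    return reply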

This method of having the AI service choose which path to follow is often sold by those AI snake oil salesmen engineers as some ultra-advanced capability that only their brains could conceive, but as you can see it follows naturally from the AI system’s ability to generate text.

Providing References / Citations

AI models have a ‘quirk’ that you may be aware of - they hallucinate. Or, they often just make things up.

This is a by-product of the way they work: Just like a zip file, AI models compress huge amounts of data into a much smaller space, but unlike a zip file (which can reassemble the original files perfectly when unzipped, i.e. is ‘lossless’), an AI model is ‘lossy’ - it loses data when it compresses. Remember that it stores how ‘shoe’ relates to lots of other things - it doesn’t actually store an index of all the makes/colours/types of shoes.

The occasional fabrication might be tolerable if your system deals with something qualitative like customer service, but it’s just not acceptable when dealing with engineering data. You can’t have a report with data that is wrong, or mostly right but with a couple of surprises. And while AI systems can create tremendous value, their output needs to be verified before being relied upon.

Remember that our original system searches our operational data (in our database) for context. In the same rows as the operational data, we can store other information, like: Original document name, file location on a hard drive, page number, author, etc. - whatever we want. So we can retrieve this information when we retrieve the text blobs, and use it to link to sources, or whatever we want.

As a side note, this metadata can be very useful. Not only can you use it to cite sources, but you can engineer it to allow for more specific searches, or to better categorise your data.
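As a sketch, citation-aware retrieval is the same similarity search as before - we just also read the metadata columns stored next to each chunk (cosine_similarity() and the table layout come from the earlier sketches):

import json
import sqlite3

def top_matches_with_sources(question_vector: list, k: int = 3) -> list:
    db = sqlite3.connect("operations.db")
    rows = db.execute("SELECT doc, page, body, embedding FROM chunks").fetchall()
    scored = [(cosine_similarity(question_vector, json.loads(emb)), doc, page, body)
              for doc, page, body, emb in rows if emb]
    scored.sort(key=lambda row: row[0], reverse=True)
    # Each match now carries enough metadata to cite, e.g. "report_034.pdf, p. 12".
    return [{"text": body, "source": f"{doc}, p. {page}"}
            for _, doc, page, body in scored[:k]]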

Retrieve quantitative data

Let’s say we’re writing a report on previous operations, and we want the AI system to generate a table of data - maybe the time vs depth data for the well we’re drilling, so we can plot it. Again, there’s that issue of hallucinating - we want accurate data.

If we used our initial program and asked ‘get time and depth data for the Snowbird well so I can generate a time vs depth plot’, it would just try to find operational text similar to the question. Every example so far has been a text search, and that’s not going to work in this case. So we need to work out what to ask the AI service:

Let’s say we have a database of operational data. We need all the specific data (time, depth) from the operations where certain criteria are met (the well name), and in a certain order (by date).

So, as context let’s provide the AI service with the structure of our database which contains our operations reports. Ask it to generate a traditional database query we can use to extract the data we need, and return the query in JSON format. So the question to the AI service would look something like this:

I have a database, which has rows with these columns: WELL_NAME, DEPTH, DATE, RIG_NAME, etc…
Provide an SQL query in JSON format that will allow me to answer this question:
Question: “get time and depth data for the snowbird well so I can generate a time vs depth plot”

It will reply with something like this:

{
  "query": "SELECT DATE, DEPTH FROM wells WHERE WELL_NAME = 'Snowbird';"
}

That query in the response is just a standard way to get results from a database. We get our program to execute it, and there we have it - our AI system has converted a question into an old-fashioned database query, which we can use to get even more data from our systems.
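A sketch of that flow, with one guardrail worth spelling out: since the model is writing SQL that our program will execute, the connection is opened read-only so a bad query can’t change anything (the schema and prompt are illustrative):

import json
import sqlite3

SCHEMA = "a table called wells with columns WELL_NAME, DEPTH, DATE, RIG_NAME"

def rows_for_question(question: str) -> list:
    prompt = (f"I have a database with {SCHEMA}.\n"
              'Provide an SQL query in JSON format, like {"query": "..."}, '
              f"that will allow me to answer this question:\nQuestion: {question}")
    sql = json.loads(ask(prompt))["query"]
    db = sqlite3.connect("file:operations.db?mode=ro", uri=True)  # read-only
    return db.execute(sql).fetchall()

data = rows_for_question("get time and depth data for the snowbird well "
                         "so I can generate a time vs depth plot")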

Note that this assumes you actually have a database that you can search… too many oilfield companies still use spreadsheets or pdfs. One day (when you realise how much pointless work you have done, how much business you have lost, and how many mistakes you’ve made that could have been avoided) you will regret relying on spreadsheets and legacy reporting systems. That you have read this far would give me hope, if I only knew… but I don’t know, so I despair.

Plotting charts, making diagrams and calculations

Further to the example above, we can get the AI system to generate traditional database searches so we can get all the data we need to plot things (charts, tables, draw diagrams).

Let’s say we want to draw a well diagram from historic data.

We can quite easily get the system to draw the plot itself by getting the data (using the methods outlined above) and making a prompt like this:

Make an oil well schematic from the following data. Provide the response as an HTML5 canvas object in JSON format: Data: One casing string from surface to 300ft, another to 2900ft, another to 10000ft

It will respond in that JSON format our program can easily interpret, and we can then programmatically paste it into the user’s browser to show a well schematic.

The issue is that the schematic will vary a lot depending on the data and questions provided - maybe sometimes it will be green, sometimes very large, sometimes small, etc., and there would also be very little our program could do to sense-check it before showing the user.

Instead, we could add a function to our program that draws a well schematic, and just ask the AI system to prepare the data in the format we specify. So our prompt becomes:

Data is provided below. Format the data in a JSON object with this format for each casing string: {depth_from, depth_to, name} Data: One casing string from surface to 300ft, another to 2900ft, another to 10000ft

The response would be data that we can plug into our schematic drawing algorithm (or chart, or table, or calculator, or whatever), and our algorithm would have error checking and display configuration built-in, so we can be confident it would be correct.
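A sketch of that split, where draw_schematic() stands in for the hypothetical in-house drawing function - the model only prepares the data, and our own code does the checked rendering:

import json

EXTRACTION_PROMPT = (
    "Data is provided below. Format the data as a JSON list with this shape for "
    'each casing string: {"depth_from": 0, "depth_to": 300, "name": "..."}\n'
    "Data: One casing string from surface to 300ft, another to 2900ft, another to 10000ft")

casings = json.loads(ask(EXTRACTION_PROMPT))

# Sense-check before drawing - our code, not the model, enforces validity.
for casing in casings:
    assert 0 <= casing["depth_from"] < casing["depth_to"], "bad casing interval"

draw_schematic(casings)  # hypothetical routine with its own styling and error checks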

Going further

So I’ve covered querying your data, getting the AI to decide what to do, providing citations, and getting the AI to write traditional database queries, and then using that data in other programs.

These techniques are powerful in themselves, but in reality you’d use them in some sort of sequence to get the maximum value out of any system that you build - you’d work out what functions you wanted it to perform, and work out a flowchart of how to perform them. Then it’s just a matter of codifying your process, which is relatively straightforward.
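For instance, a single request handler might chain the earlier sketches together like this (hypothetical helpers as before; a real system would add error handling and verification at every step):

def handle(message: str, history: list) -> dict:
    matches = top_matches_with_sources(embed(message))  # retrieve context + citations
    context = "\n".join(m["text"] for m in matches)
    reply = ask(f"Answer using the context provided.\nQuestion: {message}\nContext: {context}")
    history.append(f"Q: {message}\nA: {reply}")  # keep the conversation going
    return {"answer": reply, "sources": [m["source"] for m in matches]}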

How much would it cost?

I probably shouldn’t reveal actual costs, but in the interest of honesty here’s a rundown of the https://oilfieldsuperstar.com system I made. (It’s an oilfield knowledge base I use for showing people live demos of how the process actually works, and for testing some base functionality):

  • Server with vector database, and web hosting: $4/month
  • 50 GB of pdfs/spreadsheets, organised for metadata extraction: 1 day
  • Create system to suck in PDFs and extract the text into the database: 1 day
  • Create embeddings of all text in database: 3 hours overnight, $0.30 on OpenAI fees
  • Backend system to handle questions (basic prompt, and breakdown into smaller questions): 2 days
  • Frontend design: 2 days

It’s definitely not the most sophisticated system, but the functionality is there and we’ve built on its base for commercial systems.

How to Implement AI, Step 1: Forget it.

I’m sure you see that AI systems are a powerful toolkit. But they’re just that - a toolkit.

The AI systems people are no doubt trying to sell you in engineering will most likely be implementations of the above - they rely on how you prompt the system, rather than actually creating specific AI models. It’s not that complicated - work out what question you want to ask and ask it.

They can do an awful lot of new things, and they can make it easier and quicker to do old things better, but they are not magic: They can’t extract value from data that doesn’t exist.

Time and time again I have worked with engineering companies which are presumably great at engineering their products, but have put zero thought into how their data and processes and business operations in general are engineered. They are taken in by claims that AI alone will solve their data problems and provide huge insights. But it won’t. It won’t.

Before you even start thinking about implementing AI you need to make sure you are collecting data appropriately. The way I mentioned above to get data into the system - sucking in pdfs and spreadsheets? It might be fine if you’re a law firm, but frankly it’s a crap way of doing things for an engineering company - you lose so much valuable data, like the connection between one value and the next, or between a lesson learned and the operation it was learned on. You should have started putting your data into a well designed database twenty years ago. Failing that, you should start today. Then in a couple of years you can chat to an AI system about your data. (But in the meantime, there are lots of traditional software techniques you can use to get huge value, so there’s a good chance you might not even need to.)

So:

How to improve your business with AI, Step 1: Forget about AI for now. Start engineering your data instead.
