Tuesday, May 9, 2023

Using AI to find software vulnerabilities in XNU

Note: This work took place in May-Aug of 2022. It just took me this long to finally finish writing this (Too busy playing with my SRD 😅)

Last year I found several vulnerabilities in XNU source code using AI. My actual stated goal was to better understand NLUs, but I ended up with a very nice double win! I had started working at an AI startup (Moveworks.com - it's pretty awesome! [I'm obviously not biased 😉]) and wanted a better understanding of how this all worked. And there is no better way to learn anything than doing the work yourself, to not only understand the how, but more importantly the why.

While understanding how NLUs worked was my main goal, I also wanted to gain insight and provide data for the following questions:

  • Can I understand NLPs & NLUs well enough to not look like a complete idiot at work? 
  • How good is AI at finding bugs?
  • How does it compare to Joern, CodeQL, ripgrep, and grep?
  • How likely am I to find bugs in well-audited open source code such as XNU?
Note: I didn't intend to have so much code here, but I think it makes it easier to follow along at home; I don't really explain any code so feel free to ask questions 😀!

Understanding NLPs & NLUs


In my totally unbiased opinion, Moveworks has a great explanation for how NLU and NLP work together to allow computers to understand human natural language. 
While there is a lot more complexity and a much deeper understanding to be had than can fit in one blog post, I do not possess such a deep understanding myself. So here's a big ole grain of salt before we dive in! 

My aim was to make sure I could explain this to my dad: If I can make him understand how it works, then I likely actually understand it myself! In my mind, NLP is effectively an incredibly large set of rules used to distill an utterance (A string of text) into mostly useful instructions and actions that the NLU will be able to more accurately understand and then act upon. 

The better you can transform away typos/adjectives/filler and distill out the actual need, the better your AI will be, as it can pull out what you or your end user actually want. (Figure 3 from the aforementioned blog post has a great explanation of how this works at a high level)
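To make that a bit more concrete, here's a toy sketch of the idea (purely my own illustration for this post - this is not how Moveworks' NLP actually works): strip the filler, keep the need.
#!/usr/bin/python3
# Toy illustration of "distilling" an utterance: drop filler tokens and keep
# the words that carry the actual need. Real NLP pipelines are vastly more
# sophisticated than this.
import re

FILLER = {"hey", "um", "like", "please", "just", "really"}

def distill(utterance):
    tokens = re.findall(r"[a-z']+", utterance.lower())  # crude tokenization
    return [t for t in tokens if t not in FILLER]       # drop the filler

print(distill("Hey, um, can you please just reset my VPN password??"))
# ['can', 'you', 'reset', 'my', 'vpn', 'password']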

Of course, in my use case there is less to structure than actual natural language, as the C/C++ programming language is somewhat restrictive compared to how we communicate human to human. 

Stage 1: Fight! (with OpenAI)

As with most things, I wanted to start with whatever had the fewest roadblocks. I already knew about OpenAI, so I signed up for an account and went to experiment with their API. Thankfully, the API was pretty straightforward and the following is what I used to start finding bugs:
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
  model="text-davinci-002",
  prompt="Is there a vulnerability in this code? If so write the line of vulnerable code out in your response and tell me why it's vulnerable\n\nCODESNIPPETHERE",
  temperature=0.7,
  max_tokens=1500,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0)
My prompt was "Is there a vulnerability in this code? If so write the line of vulnerable code out in your response and tell me why it's vulnerable", followed by a code snippet. But what was the best code snippet to use? Originally I was going to try and parse out files so there would be additional context, but the max_tokens parameter was limited to 4000, so this wasn't an option. Instead, I used sed to split files into functions, and fed each function as part of the prompt to see what the AI would say. Here's an example:
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
  model="text-davinci-002",
  prompt="Is there a vulnerability in this code? If so write the line of vulnerable code out in your response and tell me why it's vulnerable\n\nint\nspec_kqfilter(vnode_t vp, struct knote *kn, struct kevent_qos_s *kev)\n{\n\tdev_t dev;\n\tassert(vnode_ischr(vp));\n\tdev = vnode_specrdev(vp);\n\n#if NETWORKING\n\t/*\n\t * Try a bpf device, as defined in bsd/net/bpf.c\n\t * If it doesn't error out the attach, then it\n\t * claimed it. Otherwise, fall through and try\n\t * other attaches.\n\t */\n\tint32_t tmp_flags = kn->kn_flags;\n\tint64_t tmp_sdata = kn->kn_sdata;\n\tint res;\n\tres = bpfkqfilter(dev, kn);\n\tif ((kn->kn_flags & EV_ERROR) == 0) {\n\t\treturn res;\n\t}\n\tkn->kn_flags = tmp_flags;\n\tkn->kn_sdata = tmp_sdata;\n#endif\n\tif (major(dev) > nchrdev) {\n\t\tknote_set_error(kn, ENXIO);\n\t\treturn 0;\n\t}\n\tkn->kn_vnode_kqok = !!(cdevsw_flags[major(dev)] & CDEVSW_SELECT_KQUEUE);\n\tkn->kn_vnode_use_ofst = !!(cdevsw_flags[major(dev)] & CDEVSW_USE_OFFSET);\n\tif (cdevsw_flags[major(dev)] & CDEVSW_IS_PTS) {\n\t\tkn->kn_filtid = EVFILTID_PTSD;\n\t\treturn ptsd_kqfilter(dev, kn);\n\t} else if (cdevsw_flags[major(dev)] & CDEVSW_IS_PTC) {\n\t\tkn->kn_filtid = EVFILTID_PTMX;\n\t\treturn ptmx_kqfilter(dev, kn);\n\t} else if (cdevsw[major(dev)].d_type == D_TTY && kn->kn_vnode_kqok) {\n\t\t/*\n\t\t * TTYs from drivers that use struct ttys use their own filter\n\t\t * routines.  The PTC driver doesn't use the tty for character\n\t\t * counts, so it must go through the select fallback.\n\t\t */\n\t\tkn->kn_filtid = EVFILTID_TTY;\n\t\treturn knote_fops(kn)->f_attach(kn, kev);\n\t}\n\t/* Try to attach to other char special devices */\n\treturn filt_specattach(kn, kev);\n}",
  temperature=0.7,
  max_tokens=1500,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0)
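For the curious, the driving loop over those split-out functions looked roughly like this (a reconstruction from memory rather than my exact script - the xnu_functions/ directory layout and the logging are assumptions):
#!/usr/bin/python3
# Rough sketch of the loop: feed each split-out function to the prompt and log
# everything for later review. Assumes the tree was already split into
# one-function-per-file chunks under ./xnu_functions/ (e.g. via sed).
import os
import glob
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

PROMPT = ("Is there a vulnerability in this code? If so write the line of "
          "vulnerable code out in your response and tell me why it's vulnerable\n\n")

for path in glob.glob("xnu_functions/**/*.c", recursive=True):
    with open(path, errors="ignore") as f:
        snippet = f.read()

    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=PROMPT + snippet,
        temperature=0.7,
        max_tokens=1500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0)

    # dump everything to stdout; interesting hits get reviewed by hand
    print(path, response["choices"][0]["text"])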
During my first run through, I mostly got responses that looked like this:
However, I got a single hit that was, well, I'll let Davinci explain it to you:

Who could compete with this!?!? 

Now obviously this isn't accurate, but this was my aha/lightbulb moment: I had seen a bug like this before in XNU, from the evasi0n7 jailbreak (ptsd_open was the vulnerable function, more info can be found in the jailbreak wiki in the 'Write-up by p0sixninja' section). The TL;DR - Apple didn't properly perform the check to make sure the device passed in would be in the proper minor range, resulting in code execution.

Equipped with this knowledge (After a few hours googling 😅 to refresh my memory as to why my gut started screaming at me that this was on to something), I found the above article and it all clicked into place. This led to me running the split functions through the model for about 8 hours, and it almost landed on what was actually wrong:


Now is this a buffer overflow? No. Could you have a negative major number? No (At least, I'm pretty confident that you can't). As you can see by reading the surrounding code, this is an OOB read/write due to an off-by-one, as the line should be major(dev) >= nchrdev instead of major(dev) > nchrdev.
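If the off-by-one isn't obvious, here's a toy Python analogy (my own illustration, not XNU code): when a table has nchrdev entries, the valid indices are 0 through nchrdev - 1, so a '>' check lets the one-past-the-end index slip through.
#!/usr/bin/python3
# Toy analogy of the bounds check: '>' lets index == nchrdev through, which is
# one past the end of the table. Python throws IndexError; C silently reads OOB.
nchrdev = 4                    # pretend there are 4 char-device entries
cdevsw_flags = [0] * nchrdev   # valid indices: 0..3

major = nchrdev                # the one-past-the-end case
if major > nchrdev:            # 4 > 4 is False, so the ENXIO bail-out is skipped
    raise SystemExit("ENXIO")

print(cdevsw_flags[major])     # IndexError here; with '>=' we'd never reach it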
I assumed you would likely exploit it similarly to how p0sixninja did way back in the iOS 7 days, but I did not attempt to actually exploit it as that wasn't my goal. This became CVE-2022-32926, and also led to a few other discoveries; however, these are still being worked on by Apple so I won't discuss them here.

Now that I had some results, it was time to step up my game: How do I do this locally without relying solely on text-davinci-002?

OpenAI? We have OpenAI at home

As you're probably aware, we all stand on the shoulders of giants - and I continue the trend here by using neulab's code-bert-score model for C++ (hosted on Hugging Face). I knew the code-bert-score models that neulab continued to train would be what I wanted, based on my attempts to grok various conversations on Twitter and from searching Hugging Face, as there weren't a lot of good C++ masked language models (MLMs) available that I could find quickly. (If you know of one, let me know!) The original paper can be found here, which was, and frankly still is, way over my head. The TL;DR - a better way to compare code snippets to each other! Getting their example working was easy: install Python 3.9, pip install their requirements (And NLTK!), and you're good to go! 



Now why did I want to use their model? My initial reaction after seeing the previous results was to ditch this entire idea, but then I thought: how could I turn this into a fuzzing-esque machine? Using fill-mask seemed like a good approach for this, so I started there. While not identical to fuzzing, it shares some basic fundamentals: by replacing arbitrary tokens with <mask>, you can have your model return the top N best matches and evaluate the output. Of course I would have to be picky about what I masked, as if I just masked ANYTHING it would overwhelm me with sheer noise. I took a page from GLFuzz (The old WebGL fuzzer that someone wrote, probably Lcamtuf/Halvar?) and decided to only mask these specific operators: <, >, !=, <=, and >=. Now obviously this reeks of survivorship bias, but that's the point! Thankfully with this in play, we were able to find the same CVE again, though we did have to manually remove the option that was originally in the code, but that was trivial enough 😀!
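To give a feel for the mutation step, here's a minimal sketch of it (the names and structure are mine, not the actual fuzzer): every comparison operator in a snippet gets swapped, one at a time, for <mask> so the fill-mask pipeline can weigh in on what "should" be there.
#!/usr/bin/python3
# Minimal sketch of the operator-masking step: generate one variant per
# comparison operator found, each with that operator replaced by <mask>.
import re

OPERATORS = ["<=", ">=", "!=", "<", ">"]  # multi-char operators first

def mask_variants(func_src):
    pattern = "|".join(re.escape(op) for op in OPERATORS)
    for m in re.finditer(pattern, func_src):
        yield func_src[:m.start()] + "<mask>" + func_src[m.end():]

snippet = "if (major(dev) > nchrdev) {\n    knote_set_error(kn, ENXIO);\n    return 0;\n}"
for variant in mask_variants(snippet):
    print(variant)
    print("---")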

Sure, it wasn't solely relying on AI to find the bug, but it's still doing the heavy lifting, and more importantly I can automate this to run without having to pay attention unless something interesting is found, much like a fuzzer. (Huge fuzzing fan, as we all should be.) With this working, I knew I wanted to improve it, as it was simply too noisy. 

Here's a heavily revised example so you can play around with it locally:
#!/usr/bin/python3

from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained('neulab/codebert-cpp')
tokenizer = RobertaTokenizer.from_pretrained('neulab/codebert-cpp')

code_example ="""
    if (major(dev) <mask> nchrdev) {
        knote_set_error(kn, ENXIO);
        return 0;
    }
"""

code_ref="""
    if (major(dev) > nchrdev) {
        knote_set_error(kn, ENXIO);
        return 0;
    }
"""

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(code_example,top_k=5)
for output in outputs:
	if (output['sequence'] != code_ref):
		print(output)

The output:
{'score': 0.7711489200592041, 'token': 49333, 'token_str': '!=', 'sequence': '\n    if (major(dev)!= nchrdev) {\n        knote_set_error(kn, ENXIO);\n        return 0;\n    }\n'}
{'score': 0.11615779250860214, 'token': 28696, 'token_str': ' <', 'sequence': '\n    if (major(dev) < nchrdev) {\n        knote_set_error(kn, ENXIO);\n        return 0;\n    }\n'}
{'score': 0.043150343000888824, 'token': 49095, 'token_str': ' >=', 'sequence': '\n    if (major(dev) >= nchrdev) {\n        knote_set_error(kn, ENXIO);\n        return 0;\n    }\n'}
{'score': 0.030456306412816048, 'token': 45994, 'token_str': ' ==', 'sequence': '\n    if (major(dev) == nchrdev) {\n        knote_set_error(kn, ENXIO);\n        return 0;\n    }\n'}


Obviously the "correct" answer here is the third option, which received a score of 0.04; not great! I know I would want something better to fully finish out this experiment, so I decided to do what I told myself I wouldn't do at the start: train my own model!

Fine, I'll do it myself!

Code is below, which is hopefully pretty straightforward. I used neulab's codebert-c model as my base, and codeparrot/github-code-clean as my dataset. Nothing super fancy here, as I only did 100K training steps (They originally did an additional 1 MILLION training steps). While this is only a ~10% move in the right direction, does it give me more than a 10% gain? Let's find out 😀!

Training code:
#!/usr/bin/python3.9

#The MAJORITY of this code is from the neulab code-bert-score repo
from transformers import AutoTokenizer,  AutoModelForMaskedLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
import numpy as np
import evaluate
import torch

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # preds have the same shape as the labels, after the argmax(-1) has been calculated
    # by preprocess_logits_for_metrics
    labels = labels.reshape(-1)
    preds = preds.reshape(-1)
    mask = labels != -100
    labels = labels[mask]
    preds = preds[mask]
    return metric.compute(predictions=preds, references=labels)

def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):
        # Depending on the model and config, logits may contain extra tensors,
        # like past_key_values, but logits always come first
        logits = logits[0]
    return logits.argmax(dim=-1)

def tokenize_function(examples):
    examples["code"] = [line for line in examples["code"] if len(line) > 0 and not line.isspace()]
    return tokenizer(examples["code"], padding="max_length", truncation=True, max_length=512,return_special_tokens_mask=True)

tokenizer = AutoTokenizer.from_pretrained("neulab/codebert-c")

model = AutoModelForMaskedLM.from_pretrained("neulab/codebert-c")
model.resize_token_embeddings(len(tokenizer))

device = torch.device("cuda")
model.to(device)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=.15,
)

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", max_steps=100000)

#w/o streaming you need much larger than 1 TB of space for all the data
#Likely some bias introduced due to validation + training data overlapping
train_dataset = load_dataset("codeparrot/github-code-clean", streaming=True, split='train', languages=['C','C++'])

with training_args.main_process_first(desc="dataset map tokenization"):
    token_train_dataset = train_dataset.map(
        function=tokenize_function,
        batched=True,
        remove_columns="code",
    )

#need 2 of these since IterableDataset doesn't support train_test_split - at least not yet!
#Likely some bias introduced due to validation + training data overlapping
eval_dataset = load_dataset("codeparrot/github-code-clean", streaming=True, split='train', languages=['C','C++'])
with training_args.main_process_first(desc="dataset map tokenization"):
    token_eval_dataset = eval_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns="code",
    )

metric = evaluate.load("accuracy")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=token_train_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    eval_dataset=token_eval_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
)
#insert checkpoints here if you want to use checkpoints
trainer.train()
trainer.save_model("test_trainer/newModel")
This ran for ~9 days straight on a 3090 at pretty close to max capacity the entire time,
(Image: an accurate representation of how I felt as my room 'warmed' up)
but it finally finished! Thankfully (And much to my family's joy), I didn't have to run it again. Now that we've built it, let's try the new model!
#!/usr/bin/python3 

from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained('test_trainer/newModel')
tokenizer = RobertaTokenizer.from_pretrained('test_trainer/newModel')

code_example ="""
    if (major(dev) <mask> nchrdev) {
        knote_set_error(kn, ENXIO);
        return 0;
    }
"""
code_ref="""
    if (major(dev) > nchrdev) {
        knote_set_error(kn, ENXIO);
        return 0;
    }
"""
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(code_example,top_k=5)
for output in outputs:
	if (output['sequence'] != code_ref):
		print(output)
Which now gives us:
{'score': 0.6952677965164185, 'token': 49333, 'token_str': '!=', 'sequence': '\n    if (major(dev)!= nchrdev) {\n        knote_set_error(kn, ENXIO);\n        return 0;\n    }\n'}
{'score': 0.08862753957509995, 'token': 49095, 'token_str': ' >=', 'sequence': '\n    if (major(dev) >= nchrdev) {\n        knote_set_error(kn, ENXIO);\n        return 0;\n    }\n'}
{'score': 0.07690384238958359, 'token': 28696, 'token_str': ' <', 'sequence': '\n    if (major(dev) < nchrdev) {\n        knote_set_error(kn, ENXIO);\n        return 0;\n    }\n'}
{'score': 0.016858655959367752, 'token': 45994, 'token_str': ' ==', 'sequence': '\n    if (major(dev) == nchrdev) {\n        knote_set_error(kn, ENXIO);\n        return 0;\n    }\n'}

So I am less precise, but more accurate (The correct answer is second now, and its score roughly doubled from 0.043 to 0.089) - progress 😀! Of note, this is also significantly better than a 10% gain for only doing another 100K steps (Which is ~10% more training done post the neulab work!). But what new bugs could I find?

But does it work?



So now I needed a new method for finding bugs that wasn't a single fill-mask for a single token. What about having my AI write an entire line of code for me? Would something like a multi-mask work (i.e., multiple <mask>s being filled one after the other)? And where would I find the code? Thankfully, I was able to find a Medium post that described how the author created multi-mask filling with RoBERTa, which is what I was using! A very slight tweak ensued and I was on my way! 

Of course, there was a problem: A lot of the output I was getting looked like this:

['', '{', '(', '_']

Or this:
['', 'if', '>', '(']

Which isn't very useful, though it could've been correct! I decided to do what I normally do when I hit a wall:

Stuck? Just give it another shot 😀

I decided to feed masks in two separate loops, as based on "testing" (i.e., throwing code at the model and observing results) I got A TON of if statements that were blank inside, and if I manually added that code the model would be pretty accurate in filling it out. I again took a fuzzing approach to this, and below is a snippet of code from said fuzzer, modified so it can be run locally 😀!
#!/usr/bin/python3
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
import torch
import random

#numMasksToInsert=random.randrange(0,25)
numMasksToInsert=11
model = RobertaForMaskedLM.from_pretrained('test_trainer/newModel')
tokenizer = RobertaTokenizer.from_pretrained('test_trainer/newModel')
maskStringConstant="<mask>"
maskReplacementString="MASKREPLACEME"

#based on code from https://ramsrigoutham.medium.com/sized-fill-in-the-blank-or-multi-mask-filling-with-roberta-and-huggingface-transformers-58eb9e7fb0c
def get_prediction (sent):
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
    masked_pos = [mask.item() for mask in masked_position ]

    with torch.no_grad():
        output = model(token_ids)

    last_hidden_state = output[0].squeeze()
    list_of_list =[]
    for index,mask_index in enumerate(masked_pos):
        mask_hidden_state = last_hidden_state[mask_index]
        idx = torch.topk(mask_hidden_state, k=1, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip() for i in idx]
        list_of_list.append(words)
        #print ("Mask ",index+1,"Guesses : ",words)

    best_guess = ""
    for j in list_of_list:
        best_guess = best_guess+" "+j[0]

    return best_guess

code_example ="""
static int
mt_cdev_open(dev_t devnum, __unused int flags, __unused int devtype,
    __unused proc_t p)
{
	int error = 0;
MASKREPLACEME
	mt_device_t dev = mt_get_device(devnum);
	mt_device_lock(dev);
	if (dev->mtd_inuse) {
		error = EBUSY;
	} else {
		dev->mtd_inuse = true;
	}
	mt_device_unlock(dev);

	return error;
}
"""
newCode=code_example.replace(maskReplacementString,maskStringConstant*numMasksToInsert)

predicted_mask = get_prediction(newCode)
predicted_maskList = predicted_mask.split(" ")
print("predicted_maskList is %s" %(predicted_maskList))
newCode=newCode.replace(maskStringConstant,predicted_mask,1)

if "if" in predicted_mask and "(" in predicted_mask and ")" in predicted_mask and "{" in predicted_mask:
    #fix up if statement if we find one, include a few masks in the event the AI gives us "blanks"
    #TODO: Are these actual masks or am I smokin something?
    updateCodeSnippet=predicted_mask.replace("_",maskStringConstant).replace(" ","")
    #Use our original code to insert the newly updated snippet into
    newerCode=code_example.replace(maskReplacementString,updateCodeSnippet)
    second_predicted_mask=get_prediction(newerCode)
    second_predicted_maskList = second_predicted_mask.split(" ")
    print("predicted_maskList is %s" %(second_predicted_maskList))
    newestCode=newerCode.replace(maskStringConstant,second_predicted_mask,1)
    #TODO: Probably shouldn't do this?
    #any masks leftover? Ignore for easier grepping
    newestCode=newestCode.replace(maskStringConstant,"")
    print("GREPFORME: ",newestCode)
else:
    print("no if statement, only perform one round of predicting")
    print("GREPFORME: ",newCode)

Which had the following output:
predicted_maskList is ['', '', '', 'if', '(', '_', '_', '_', '_', ')', '', '{']
predicted_maskList is ['', 'dev', 'num', '==', '0']
GREPFORME:
static int
mt_cdev_open(dev_t devnum, __unused int flags, __unused int devtype,
    __unused proc_t p)
{
        int error = 0;
if( dev num == 0){
        mt_device_t dev = mt_get_device(devnum);
        mt_device_lock(dev);
        if (dev->mtd_inuse) {
                error = EBUSY;
        } else {
                dev->mtd_inuse = true;
        }
        mt_device_unlock(dev);

        return error;
}

And would you look at that! We have an almost legitimate if statement. This one caught my eye very late in the process (I was reading through gigs of output), as it was very reminiscent of the previous output from the pretrained model! After a quick read of mt_get_device I confirmed it was indeed vulnerable. This became CVE-2022-32944, and was the most substantial bug to fall out of this adventure.

Just how good is it?


In the current state of the AIs used (Which could VERY well be due to my misuse of them), I did not find this a compelling use case. Perhaps if I understood things better and put more time/patience/effort into this, it could've been better (And likely would be!). 

Ripgrep/grep was able to reproduce this with minimal effort, but substantially more noise (As is typically expected of grep). I wouldn't personally recommend immediately jumping to grep for bug hunting, but it's a useful tool to have in your belt to quickly validate a gut feeling you have. 
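To show the kind of thing I mean, here's a hedged sketch of that grep-style approach, written in Python for this post rather than the actual rg one-liner I ran (the pattern and the "xnu" directory name are just illustrative): it flags "index > limit" style bounds checks, which is noisy but does surface the nchrdev line.
#!/usr/bin/python3
# Crude grep-style scan: flag "if (something(x) > limit)" bounds checks across
# the tree. Plenty of false positives, but the nchrdev check shows up.
import os
import re

pattern = re.compile(r"if\s*\(\s*\w+\([^)]*\)\s*>\s*\w+\s*\)")

for root, _, files in os.walk("xnu"):
    for name in files:
        if not name.endswith((".c", ".h")):
            continue
        path = os.path.join(root, name)
        with open(path, errors="ignore") as f:
            for lineno, line in enumerate(f, 1):
                if pattern.search(line):
                    print("%s:%d: %s" % (path, lineno, line.strip()))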

Joern would've required setup and writing a query after ingesting the XNU code base - and in my opinion would've been my fastest route to finding these types of vulns, instead of using the AIs. Unlike grep there are a lot fewer false positives, as you can be pretty exacting in your query; however, if Joern fails to parse your files you are SOOL, and will have to fix them up as needed, which can be time consuming or flat out annoying. 

CodeQL, much like Joern, also would've found the majority of these vulnerabilities after building XNU - and likely would offer fewer false positives, as it works on your recent build. However, there is a good chance I would've missed CVE-2022-32944, as I didn't have an ARM machine to build on, and I wouldn't have taken the time to do so since this was more about learning than attempting to find all the bugs! 

Making it better?

Trying with text-davinci-003, GPT-3.5 Turbo, or GPT-4 could all be very fruitful, as they didn't exist at the time of my testing. Retraining on top of the existing models is likely the best next step if one wanted to take this further. If I were to take this further myself, I would start in that direction while looking for a better set of training data (Specifically: annotated data for the bug types I'm interested in).

The other option that I think would be more immediately useful would be to use/teach an AI to write Joern and/or CodeQL queries for you, and use those to find bugs. I would also encourage the reader to actually play around with AI (in general) and integrate it into your workflows for non-sensitive questions and data: It has greatly sped up some of my for-fun dev work, and has also been great for getting IDA plugins to like 60% feature completion (Though it does use the old APIs 😞).
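As a rough sketch of that query-writing idea (untested - the model choice and prompt here are just illustrative, not something I actually ran), you could bootstrap it the same way I prompted for bugs earlier:
#!/usr/bin/python3
# Illustrative only: ask a completion model to draft a CodeQL query for the
# off-by-one bounds-check pattern, then refine and run the result by hand.
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

prompt = ("Write a CodeQL query for C/C++ that finds bounds checks of the form "
          "'if (index > limit)' where 'index' is later used to subscript an array "
          "of 'limit' elements, i.e. the check should have been '>='.")

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.2,
    max_tokens=800)

print(response["choices"][0]["text"])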

Conclusion

This was a lot of fun and a HUGE learning experience for me. While I didn't expect to actually find any bugs, it was definitely cool to have my ideas actually work! I would encourage more security-minded folks to get involved and play around with AIs, as it's an interesting space regardless of which area of security you're in, and it has implications for us all. 

