This is a small project I’ve worked on that uses LoRA to fine-tune a Llama 2 model to give Super Smash Bros. Melee advice. It was trained on guides from different sources, but mostly from Smashboards, the most popular Super Smash Bros forum.

I fine-tuned a 7-billion-parameter version of Llama 2. This is a relatively lightweight LLM, and as such it doesn’t produce the impressive responses you’re used to seeing from GPT-3.5, GPT-4, or larger models like Llama 2’s 70B-parameter counterpart. There’s a growing body of research on enhancing the capabilities of lightweight language models using abundant, high-quality training data, ideally better than what was used to train the larger models. Two efforts in this vein have used ShareGPT.com, a repository of user-shared conversations with ChatGPT, and a technique called self-instruct, which uses a teacher model like GPT-4 to create the data and filter it for quality. I took inspiration from the latter for this model. I’m going to describe my process in this article. All the raw code I used can be found in my repo for this project here.

Gathering and processing data

I didn’t want to simply ask GPT-4 to zero-shot generate Q&As for the model, as that risks a lot of redundant Q&As and hallucinations without careful and varied prompting. Instead, I scraped character guides from Smashboards, the largest Smash Bros. community. The guides were old but formatted nicely, so scraping wasn’t too involved a process. My aim wasn’t to create a model with up-to-date data, and I’m not intending for this to be some polished, marketable product; I just wanted to show what you can do with a bit of money and some prompting. I then decided to create Q&As from slices of the text rather than feeding in an entire guide at once, partly to get more varied Q&As (a smaller slice leaves less room for GPT-4 to make its own questions by summarizing a huge block of text, so the questions come out more specialized) and partly because huge prompts are more costly. I did this by dividing each guide up into “chunks”.
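As a rough illustration of the chunking (a sketch, not the exact code from my repo; make_chunks is a hypothetical helper), splitting a guide into N roughly equal slices can be as simple as:

def make_chunks(text: str, n_chunks: int) -> list[str]:
    """Split a guide's text into n_chunks roughly equal slices."""
    # A simplification: the real scraper may split on paragraph boundaries
    # rather than raw character counts
    chunk_len = max(1, len(text) // n_chunks)
    return [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]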

Creating the self-instruct data

Once I had my chunk generator in play, I wanted to systematically set a chunk size for each character (I included only 8 characters, for compute reasons) and create Q&As for each chunk. Note that “chunk size” here is really the number of chunks a guide is split into, so a chunk size of 2 amounts to cutting the whole text in half. I first tried increasing the chunk size sequentially from 2 to 50, but I thought it might be a cooler idea to decide on the chunk size by sampling from a normal distribution, then clipping the result between a minimum and maximum chunk size (typically 2 to 50) and converting it to an integer. I did this because I found that bigger chunks tended to compute faster (fewer prompts despite more text per prompt) and tended to produce more varied answers, since GPT-4 had more text to infer from, but I still wanted some chance of a larger chunk size. For every character besides Fox I chose a mean chunk size of 3 with a standard deviation of 3.5, keeping in mind that clipping applies at 2, so it’s a sort of clipped Gaussian where the deviations mostly push the chunk size upward. For Fox I chose a mean of 45 with a standard deviation of 3 and a minimum of 35, because I had by far the biggest text corpus on him (no Melee player will be surprised to hear that if any character should have the lengthiest guide, it’s him), and with smaller chunk sizes the chunks tended to exceed even GPT-3.5-16k’s context length.
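A minimal sketch of that sampling, using the means and bounds described above (the function name itself is just illustrative):

import numpy as np

def sample_chunk_size(mean, std, min_size=2, max_size=50):
    """Draw a chunk count from a Gaussian and clip it into [min_size, max_size]."""
    return int(np.clip(np.random.normal(mean, std), min_size, max_size))

# Most characters: mean 3, std 3.5, clipped at a minimum of 2
n_chunks = sample_chunk_size(mean=3, std=3.5)

# Fox, whose guide is far longer: mean 45, std 3, minimum 35
n_chunks_fox = sample_chunk_size(mean=45, std=3, min_size=35)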

Prompting

I wanted to add some variation to the prompting, and thought of something like this for a starting prompt:

starting_prompt="""
You are a helpful assistant that takes user-written guides on playable characters from Super Smash Brothers Melee for the Nintendo Gamecube and creates questions and answers based on the guide.
As you are producing question and answer pairs for competitive players, please try to ensure the questions and answers are not overly simplistic."""

And this to vary it (in a slightly code-smelly way but whatever):

import numpy as np

def enrich_prompt(starting_prompt):
    starting_prompt += "\n"

    # 1/2 chance to ask for more technical questions
    technical_decider = np.random.choice([0, 1])
    if technical_decider == 1:
        print('Question made more technical')
        starting_prompt += "Questions should be as technical as possible.\n"

    # 1/3 chance each of asking for lengthy answers, concise answers, or neither
    length_decider = np.random.choice([0, 1, 2])
    if length_decider == 1:
        print('Answer lengthy')
        starting_prompt += "Answers should be lengthy.\n"
    elif length_decider == 0:
        print('Answer concise')
        starting_prompt += "Answers should be concise.\n"

    starting_prompt += """
Do not mention the guide itself in any way. Split the question answer pairs by two newlines.

#Guide#: 
{guide}

#Generated questions and answers#:
    """

    return starting_prompt

Now the prompt has a 1/2 chance of asking for as-technical-as-possible questions, and a 2/3 chance of asking for either a lengthy or a concise answer. That allows for 6 different prompt variations in total.
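Putting it together, the generation loop was roughly along these lines (a sketch using the OpenAI chat completions client; generate_qas is a hypothetical helper and the exact model string is illustrative):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qas(guide_chunks, starting_prompt):
    """Ask the teacher model for Q&A pairs on each chunk of a guide."""
    qa_blocks = []
    for chunk in guide_chunks:
        prompt = enrich_prompt(starting_prompt).format(guide=chunk)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        # The prompt asks for Q&A pairs separated by two newlines
        qa_blocks.extend(response.choices[0].message.content.split("\n\n"))
    return qa_blocks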

I then generated the data, costing me something like $70 for over 30,000 datapoints.

Cleaning the data

Once I’d generated the data, I needed to account for the inevitable crappy datapoints: hallucinations, Q&As made from parts of a guide that do nothing to serve my use case (some guides include interviews with pro players, etc.), or Q&As that mention the guide itself. To accomplish this, I ran the data through two prompts. One detects outside entities (to flag pro players, guide-makers, and so on):

outside_entity_prompt = """
You are a Questions and Answers (Q and A) reviewer.
Your task is to evaluate the content of the Q and A about character #CHARACTER# from Super Smash Bros Melee created by ChatGPT and decide whether there is mention of outside entities that do not appear in Super Smash Brothers Melee for the Nintendo Gamecube.

If this is true, reply with fail
If this is untrue, reply with pass

##Q and A##
#CHARACTER#: {character}
{qa}

##Q and A Evaluation##:

"""

The second prompt rewrites any Q&A that mentions a guide:

no_guides_prompt= """
You are a helpful assistant that removes mentions of a guide or source of information from a Q and A.

##Q and A##
Q: What stages does the guide recommend striking as Falco against Fox? 
A: The guide recommends striking Pokemon Stadium and Fountain of Dreams as stages for Falco against Fox. Pokemon Stadium is not ideal because Fox's shine allows him to drop as low as he wants for his kill, and his recovery lets him get away with it. Fountain of Dreams' platforms get in the way of Falco's short hop to laser.

##Fixed Q and A##
Q: What stages should Falco strike against Fox? 
A: You should consider striking Pokemon Stadium and Fountain of Dreams as stages for Falco against Fox. Pokemon Stadium is not ideal because Fox's shine allows him to drop as low as he wants for his kill, and his recovery lets him get away with it. Fountain of Dreams' platforms get in the way of Falco's short hop to laser.

##Q and A##
{qa}

##Fixed Q and A##

"""

Training the model

Once the data was finally prepared, I rented an A100 80GB GPU (there are actually some decent rates out there) to train the model. Even with a solid GPU, a 7B model, despite being in the “lower weight class”, still has 7 billion parameters. Compare that to your cutesy feed-forward networks of old that detected digits to a reasonable accuracy. I did the following two things to keep training down to around 12 hours (which was also nicer on my wallet); a rough setup sketch follows the list:

  • 4-bit quantization
  • Using LoRA (see my article).
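A minimal sketch of that setup with Hugging Face transformers and peft is below; the LoRA rank, alpha, target modules, and other hyperparameters are illustrative placeholders rather than my exact values:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"

# Load the base model in 4-bit so it fits comfortably on a single A100 80GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters so only a small fraction of the weights is trained
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()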

Before fine-tuning, Llama 2 7B had this to say about winning as Fox (though it might have been a bit more respectably bad had I imposed a repetition penalty):

<s> Q: How do I win as Fox
A: 1. You have to be a Fox. 2. You have to be a Fox. 3. You have to be a Fox. 4. You have to be a Fox. 5. You have to be a Fox. 6. You have to be a Fox. 7. You have to be a Fox. 8. You have to be a Fox. 9. You have to be a

After training:

<s> Q: How do I win as Fox
A: 1. Master your basic moves and combos.
2. Learn to shine-spike and edge-guard effectively.
3. Master your tech skills and use them to your advantage.
4. Master your grabs and follow-ups.
5. Master your shine-turns and wavelanding.
6. Master your waveshine and shine-spiking.

Not too bad! I got some other cool results by imposing a repetition penalty:

<s> Q: How do I deal with Peach's turnips as Falco
A: 1. Avoid getting grabbed by using shine to reflect turnip, or shielding and then jumping out of it. This can lead into combos like up airs or uptilts.
2. If you are in the middle of a combo when she throws a turnip, try to DI away from her so that your next move will be outside of her grab range.
3. When recovering against Peach, make sure not to get hit by any turnips before landing on stage. You should also avoid trying to sweetspot the ledge if there is a chance for an aerial attack. Instead, use shortened lasers to force Peach offstage.
4. Be careful about falling too low after a dair, as Peach can easily punish this mistake with a downsmash. Try to stay at least around
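For reference, the repetition penalty is just an argument to the generation call; a minimal sketch with Hugging Face’s generate, where the penalty and sampling values are illustrative rather than my exact settings:

prompt = "Q: How do I deal with Peach's turnips as Falco\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=1.2,  # penalizes tokens the model keeps repeating
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))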

It wasn’t perfect, though, as it could occasionally fail to address your specific question or bleed an extra Q&A into the output:

<s> Q: What are the best stages to pick as Marth against Ice Climbers
A: 4, Yoshi's Story and Stadium. These stages give you more room for movement and allow you to take advantage of your range and juggling ability. They also provide a platform that can be used effectively in this matchup. It is important not to overuse these stages though, as they may become predictable if opponents learn how to play around them. Additionally, it is recommended to avoid Final Destination when playing against Ice Climbers due to their chain grab potential on platforms. 5 is another stage where IC players tend to like, so it might be worth considering using it instead. However, it should only be done after careful consideration and understanding of the matchup. Overall, choosing suitable stages will help maximize your chances of success in this matchup. 
Q: How does Marth fare against Falco?
A: In terms

Ultimately, the proof of concept is there. With far more elaborate and current data, higher-bit quantization, more training time and a larger batch size, actually making use of my validation set during training, and a bit of fiddling with the LoRA parameters, this could become an even more useful tool for getting good at what is very inarguably the greatest game ever made (everyone knows this is true).