Local models mildly demystified

I didn't expect LLMs to be "just" binary files. Microsoft's AI Toolkit for VS Code helped me better understand how LLMs work by letting me download and run AI models locally.

I tried jan.ai and other apps that let users run models locally, but I kept selecting models too large for my machine, maxing out my CPU. Microsoft's extension offers a limited selection of models that work on most computers, preventing these issues.

Running models on my machine and watching my RAM and CPU usage spike made me curious about how it all worked. I asked claude.ai to explain it using SQL Server, .NET and PowerShell comparisons. For .NET, Claude explained that loading a model is similar to deserializing an object from storage, and that many machine learning libraries in .NET, like ML.NET, provide methods to load pre-trained models from files or streams.

For PowerShell, Claude compared it to loading a module into memory. This didn't click for me, so I asked for code examples which made it clearer:

# Add the necessary assembly references for ML.NET
Add-Type -Path ./Microsoft.ML.dll

# Create an MLContext
$mlContext = New-Object Microsoft.ML.MLContext

# Load the trained model from a file; the input schema comes back through the [ref] out parameter
$modelInputSchema = $null
$trainedModel = $mlContext.Model.Load("./model.zip", [ref]$modelInputSchema)

Oh, it's like loading a text file or DLL! That makes sense.

What's Inside AI Model Files?

As an infrastructure person, I probably won't be creating production models anytime soon, though maybe I'm wrong.

These binary files don't contain human-readable content like words or code. Instead, they hold serialized data structures representing the neural network's parameters and architecture. They typically include:

  • Weights and Biases: learned parameter values stored as binary floating-point numbers
  • Model Architecture: neural network structure details (layers, types, activation functions)
  • Metadata: training configuration, optimizer state, and versioning information
  • Serialization Format: framework-specific data storage methods (TensorFlow: Protocol Buffers; PyTorch: Pickle; ONNX: its own protobuf format)
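
To make "binary floating-point numbers" concrete, here's a toy Python sketch (the filename and sizes are made up for illustration) that writes a small weight matrix to disk as raw bytes and reads it back - a miniature version of what a model file stores at massive scale:

import numpy as np

# A tiny "weight matrix" - real models hold billions of these values
weights = np.random.rand(4, 3).astype(np.float32)

# Serialize the raw floating-point values to a binary file
weights.tofile('toy_weights.bin')

# Reading it back requires knowing the dtype and shape (the "architecture"),
# which is why real model files also carry structural metadata
restored = np.fromfile('toy_weights.bin', dtype=np.float32).reshape(4, 3)
print(np.array_equal(weights, restored))  # True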

Larger models with more parameters generally work better but need more RAM and processing power - explaining why some models crashed my machine.
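
A rough back-of-the-envelope calculation (the parameter count below is illustrative, not tied to any particular model) shows why: just holding the weights in memory takes the parameter count times the bytes per parameter.

# Lower bound on the RAM needed just to hold a model's weights
def weight_memory_gb(parameters, bytes_per_parameter):
    return parameters * bytes_per_parameter / 1024**3

# A 7-billion-parameter model at 16-bit (2-byte) precision
print(round(weight_memory_gb(7e9, 2), 1))   # ~13.0 GB

# The same model at 32-bit (4-byte) precision
print(round(weight_memory_gb(7e9, 4), 1))   # ~26.1 GB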

Viewing the Contents

Using cat on these files won't show readable text since they're binary. To inspect or use them, you need the appropriate machine learning framework:

  • TensorFlow: tf.keras.models.load_model() or tf.saved_model.load()
  • PyTorch: torch.load()
  • ONNX: onnx.load()

Here's how you might load a PyTorch model and inspect its parameters:

import torch

# Load the model (this assumes the file contains a full pickled model object)
model = torch.load('model_file.bin')

# Print the model architecture
print(model)

# Inspect specific parameters
for name, param in model.named_parameters():
    print(name, param.shape)
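
One caveat: many downloaded weight files (for example, the pytorch_model.bin files Hugging Face distributes) contain only a state_dict - a plain dictionary of tensors - rather than a full pickled model object. Inspecting one of those looks slightly different; the filename here is a placeholder:

import torch

# If the file holds only a state_dict (tensors without the architecture),
# torch.load returns a dictionary mapping parameter names to tensors
state_dict = torch.load('pytorch_model.bin', map_location='cpu')

for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))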

And for TensorFlow:

import tensorflow as tf

# Load the model
model = tf.keras.models.load_model('model_file')

# Print the model summary
model.summary()

# Inspect specific layers or parameters
for layer in model.layers:
    print(layer.name, layer.output_shape)
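
The list above also mentions ONNX; an equivalent peek at an ONNX file might look like this (the filename is a placeholder, and the onnx package needs to be installed with pip install onnx):

import onnx

# Load the model and sanity-check the file
model = onnx.load('model.onnx')
onnx.checker.check_model(model)

# List the stored weight tensors (initializers) and their shapes
for initializer in model.graph.initializer:
    print(initializer.name, list(initializer.dims))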

For architecture or weight inspection without running the model, tools like Netron can visualize neural networks from various formats. This isn't super interesting to me beyond seeing it once, but it's handy to know about.
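
If you want to try it yourself, Netron also ships as a Python package; here's a minimal sketch, assuming you've installed it with pip install netron and have a model file on hand:

import netron

# Open the Netron viewer in a browser for a local model file
# (the path is a placeholder; ONNX and many other formats are supported)
netron.start('model.onnx')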

While I'm not yet building my own models, understanding that they're essentially binary files with structured data helps make AI feel more approachable. Microsoft's toolkit gives us a practical way to experiment with these concepts without deep machine learning expertise.