Notebook 3#

  • Navigating this notebook on Google Colab: There are text blocks and code blocks throughout the notebook. The text blocks, such as this one, contain instructions and questions for you to consider. The code blocks, such as the one below, contain executable code. Sometimes you will have to modify the code blocks following the instructions in the text blocks. You can run a code block either by pressing control/cmd + enter or by clicking the arrow on its left-hand side.


  • Saving Work: If you wish to save your work in this .ipynb, we recommend downloading the compressed repository from GitHub, unzipping it, uploading it to Google Drive, opening this notebook from within Google Drive, and setting WITHIN_GDRIVE to True.

  • This notebook doesn't require any modifications inside the code_and_data directory for a base walkthrough, but if you want to work on the optional exercises throughout the notebook, you will need to modify dsl.ipynb, which requires saving your work.

  • Finally, this notebook requires an NVIDIA GPU for training. Google Colab allows access to one GPU instance for free (at a given time). Go to Runtime > Manage Sessions > Accelerator: GPU.

# (Optional) Are you loading data from within Google Drive?
WITHIN_GDRIVE = False # otherwise: True
# Setup repository and download toy CalMS21 data
if not WITHIN_GDRIVE:
  !git clone /content/Neurosymbolic_Tutorial
  %cd /content/Neurosymbolic_Tutorial/code_and_data
  !gdown 1XPUF4n5iWhQw8v1ujAqDpFefJUVEoT4L && (unzip -o; rm -rf )
else:
  from google.colab import drive
  # Mount Google Drive so that the path below is visible to the notebook
  drive.mount('/content/drive')
  # Change this path to match the correct destination
  %cd /content/drive/MyDrive/Neurosymbolic_Tutorial/code_and_data/
  import os; assert os.path.exists(""), f"Couldn't find `` at this location {os.getcwd()}. HINT: Are you within `code_and_data`"
  !gdown 1XPUF4n5iWhQw8v1ujAqDpFefJUVEoT4L && (unzip -o; rm -rf )

import os
import matplotlib.pyplot as plt
import numpy as np
(Important!) Convert Notebooks to Python Files#

If you update the DSL in dsl.ipynb or the search algorithm in near.ipynb, you need to run this cell again to update the code in this notebook.

!jupyter nbconvert --to python dsl.ipynb
!jupyter nbconvert --to python near.ipynb
Neural Relaxation Exercise#

Admissible heuristics never overestimate the cost of reaching a goal, which is what allows an informed search algorithm such as A* search to find a lowest-cost goal. Here, we assume that sufficiently large neural networks have greater representational power than neurosymbolic or symbolic models, so a neural network trained in place of the unfinished part of a program should perform at least as well as any symbolic completion of it. We use this neural relaxation as an admissible heuristic over the program graph search space.
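To make the role of admissibility concrete, here is a minimal A* sketch on a toy graph. The graph, the heuristic values, and the names a_star, toy_graph, and toy_heuristic are all made up for illustration and are not part of the tutorial code. Every heuristic value is at most the true remaining cost, so the search returns the cheapest path.

```python
import heapq

def a_star(graph, heuristic, start, goal):
    """Return (cost, path) for a cheapest path from start to goal, or None."""
    # Each frontier entry is (f, g, node, path) with f = g + heuristic[node].
    frontier = [(heuristic[start], 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # a cheaper route to this node was already found
        for neighbor, edge_cost in graph[node]:
            new_g = g + edge_cost
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                entry = (new_g + heuristic[neighbor], new_g, neighbor, path + [neighbor])
                heapq.heappush(frontier, entry)
    return None

# Toy graph: node -> list of (neighbor, edge cost).
toy_graph = {
    "S": [("A", 1), ("B", 4)],
    "A": [("B", 2), ("G", 6)],
    "B": [("G", 2)],
    "G": [],
}
# True remaining costs are S: 5, A: 4, B: 2, G: 0; the heuristic never exceeds them.
toy_heuristic = {"S": 4, "A": 3, "B": 2, "G": 0}

result = a_star(toy_graph, toy_heuristic, "S", "G")
print(result)  # (5, ['S', 'A', 'B', 'G'])
```

In NEAR, the nodes would be partial programs and the heuristic would come from training a neural network, rather than from a lookup table as in this toy.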


Run the utility code below to set up the training.

!pip install pytorch-lightning # PyTorch Lightning is a wrapper around PyTorch.
import os
import torch, numpy as np
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from sklearn.metrics import f1_score, precision_score, recall_score
# Utility Functions from Notebook 1

class TrainConfig:
    epochs: int = 20
    batch_size: int = 32
    lr: float = 3e-3
    weight_decay: float = 0.0
    train_size: int = 2000 # out of 5000
    val_size: int = 1000 # out of 5000
    test_size: int = 3000 # out of 3000
    num_classes: int = 2

config = TrainConfig()

# Dataloader for the CalMS21 dataset
class Calms21Task1Dataset(torch.utils.data.Dataset):
    def __init__(self, data_path, investigations_path, transform=None, target_transform=None):
        # Trajectory features and per-sample behavior labels, stored as .npy arrays
        self.data = np.load(data_path)
        self.investigations = np.load(investigations_path)
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.investigations[idx]
        if self.transform:
            x = self.transform(x)
        if self.target_transform:
            y = self.target_transform(y)
        return x, y

class Calms21Task1DataModule(pl.LightningDataModule):
    def __init__(self, data_dir, batch_size, transform=None, target_transform=None):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.train_data_path = os.path.join(data_dir, "train_data.npy")
        self.train_investigations_path = os.path.join(data_dir, "train_investigation_labels.npy")
        self.test_data_path = os.path.join(data_dir, "test_data.npy")
        self.test_investigations_path = os.path.join(data_dir, "test_investigation_labels.npy")
        self.val_data_path = os.path.join(data_dir, "val_data.npy")
        self.val_investigations_path = os.path.join(data_dir, "val_investigation_labels.npy")
        self.transform = transform
        self.target_transform = target_transform

    def setup(self, stage=None):
        self.train_dataset = Calms21Task1Dataset(self.train_data_path, self.train_investigations_path, self.transform, self.target_transform)
        self.val_dataset = Calms21Task1Dataset(self.val_data_path, self.val_investigations_path, self.transform, self.target_transform)
        self.test_dataset = Calms21Task1Dataset(self.test_data_path, self.test_investigations_path, self.transform, self.target_transform)

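    def train_dataloader(self):
        return torch.utils.data.DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)
    def val_dataloader(self):
        return torch.utils.data.DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=False)
    def test_dataloader(self):
        return torch.utils.data.DataLoader(self.test_dataset, batch_size=self.batch_size, shuffle=False)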

dm = Calms21Task1DataModule(data_dir="data/calms21_task1/", batch_size=32, transform=None, target_transform=None)

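def train(model, datamodule, config):
    # gpus=0 trains on the CPU. pytorch-lightning 2.0 removed the `gpus` argument;
    # on those versions use accelerator="cpu" instead.
    trainer = pl.Trainer(gpus=0, max_epochs=config.epochs)
    trainer.fit(model, datamodule)
    return model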

# Evaluate using F1 score.
test_labels = np.load("data/calms21_task1/test_investigation_labels.npy")

def evaluate(model, data_loader, gt_labels):
  # Collect the predicted class (argmax over the class dimension) for every batch.
  predictions = []
  with torch.no_grad():
    for x, _ in data_loader:
      predictions.append(torch.argmax(model(x), dim=-1))

  predictions = torch.cat(predictions, dim=0).cpu().numpy()

  # Score the predictions against the ground-truth labels passed in.
  f1 = f1_score(gt_labels, predictions, average="binary")
  precision = precision_score(gt_labels, predictions, average="binary")
  recall = recall_score(gt_labels, predictions, average="binary")

  print("F1 score on test set: " + str(f1))
  print("Precision on test set: " + str(precision))
  print("Recall on test set: " + str(recall))

  return predictions, f1, precision, recall

Exercise: The cost of a program is its structural cost plus its model error (the error in F1 score, i.e. 1 - F1). Compare performance between:

  • Program 1: Window13Avg( Or(MinResNoseKeypointDistSelect,AccelerationSelect) )

  • Program 2: Window13Avg( Or(AtomToAtomModule,AtomToAtomModule) )

Which program has the lower cost in terms of error in F1 score? The AtomToAtomModule instances in program 2 are neural networks, while MinResNoseKeypointDistSelect and AccelerationSelect in program 1 are feature selects with only a small set of learned weights and a bias. Can you find features in DSL_DICT in dsl.ipynb that enable program 1 to perform better than program 2?
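As a rough illustration of how the two cost terms combine, the sketch below adds a per-node structural penalty to the F1 error. The function program_cost, the penalty of 0.01 per node, and the F1 values are placeholders chosen for the example, not numbers produced by this notebook or constants used by NEAR.

```python
def program_cost(num_nodes, f1, structural_penalty=0.01):
    """Structural cost (penalty per program node) plus F1 error (1 - F1)."""
    return structural_penalty * num_nodes + (1.0 - f1)

# Both programs in the exercise have four nodes: Window13Avg, Or, and two leaves,
# so with this cost any difference comes from the F1 term alone.
placeholder_f1 = {"program_1": 0.50, "program_2": 0.60}  # made-up values
costs = {name: program_cost(4, f1) for name, f1 in placeholder_f1.items()}

for name, cost in costs.items():
    print(name, round(cost, 3))
```

Replace the placeholder F1 values with the ones printed by evaluate to compare the two programs on your own runs.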

Example of a symbolic program#

# Complete program
from dsl_compiler import ExpertProgram

program = "Window13Avg( Or(MinResNoseKeypointDistSelect,AccelerationSelect) )"
config.lr = 1e-2
sample_model = ExpertProgram(program, config=config)

# Use gradient descent to find parameters of multi-variable linear regression.
sample_model = train(sample_model, dm, config=config)

predictions_symbolic, _, _, _ = evaluate(sample_model, dm.test_dataloader(), test_labels)
Example of a Neurosymbolic Program#

How does the performance compare to the symbolic program above?

# Neurosymbolic
program = "Window13Avg( Or(AtomToAtomModule,AtomToAtomModule) )"
config.lr = 1e-2
sample_model = ExpertProgram(program, config=config)

# Use gradient descent to train the neural modules inside the program.
sample_model = train(sample_model, dm, config=config)

predictions_neurosymbolic, _, _, _ = evaluate(sample_model, dm.test_dataloader(), test_labels)
Example of a Neural Module#

How does the performance compare to the neurosymbolic program?

# This is an RNN, so the program is essentially fully neural. How does the error compare?
program = "ListToAtomModule"
config.lr = 1e-2
sample_model = ExpertProgram(program, config=config)

# Use gradient descent to train the parameters of the RNN.
sample_model = train(sample_model, dm, config=config)

predictions_neural, _, _, _ = evaluate(sample_model, dm.test_dataloader(), test_labels)
Visualizing Runtime vs. Accuracy#

To evaluate your NEAR runs, we can take the saved log files in code_and_data/results and plot the total runtime against accuracy.

Optional Exercise: Some hyperparameters that affect runtime and accuracy are listed below. Try changing a few of these and save a performance plot of your results.

  • neural_epochs: number of epochs to train the neural network approximator

  • min_num_units and max_num_units: minimum and maximum number of units in the neural network. The network is smaller as the search gets deeper.

  • max_num_children: max number of children for a node

  • frontier_capacity: capacity of frontier maintained by the search algorithm

How do the runtime and performance of NEAR compare to enumeration?
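The min_num_units and max_num_units bullets above say only that the network is smaller as the search gets deeper. One simple schedule consistent with that description is linear interpolation between the two bounds. The helper units_at_depth below is hypothetical and is an illustrative assumption, not necessarily the schedule implemented in near.ipynb.

```python
def units_at_depth(depth, max_depth, min_num_units, max_num_units):
    """Linearly shrink the hidden size from max_num_units at depth 0
    to min_num_units at max_depth (hypothetical schedule)."""
    if max_depth <= 0:
        return max_num_units
    fraction = min(max(depth / max_depth, 0.0), 1.0)  # clamp to [0, 1]
    return round(max_num_units - fraction * (max_num_units - min_num_units))

# Settings taken from the A* command in this notebook: 16 down to 8 units, depth 5.
schedule = [units_at_depth(d, 5, 8, 16) for d in range(6)]
print(schedule)  # [16, 14, 13, 11, 10, 8]
```

Under a schedule like this, raising max_num_units mostly affects the networks trained near the root of the search, where the neural approximator stands in for the largest part of the program.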

# Plotting utility functions

def parse_runtime_f1_from_logs(log_files):

  runtime = []
  f1 = []

  runtime_key = 'Total time elapsed is'
  f1_key = 'F1 score achieved is'

  def parse_log(run_name):
    # Assumption: each matching log line ends with its numeric value, e.g.
    # "<runtime_key>: 123.4" and "<f1_key> 0.56". The last match in the file is used.
    curr_runtimes = []
    curr_f1 = []
    with open(os.path.join('results', run_name, 'log.txt')) as f:
      for line in f:
        if runtime_key in line:
          rest = line.split(runtime_key)[-1]
          if rest.startswith(':'):
            curr_runtimes.append(float(rest[1:].strip()))
        if f1_key in line:
          curr_f1.append(float(line.split(f1_key)[-1].strip(' :\n')))
    return curr_runtimes[-1], curr_f1[-1]

  for item in log_files:
    # If there's a list of list of files corresponding to different random seeds,
    # we take the average
    if len(item[0]) > 1:
      seed_results = [parse_log(seed) for seed in item]
      runtime.append(np.mean([r for r, _ in seed_results]))
      f1.append(np.mean([s for _, s in seed_results]))
    else:
      # There's only 1 seed per run
      run_runtime, run_f1 = parse_log(item)
      runtime.append(run_runtime)
      f1.append(run_f1)

  return runtime, f1

def plot_runtime_f1(runtime, f1, labels):
  assert(len(runtime) == len(f1) == len(labels))

  fig = plt.figure()
  for i, item in enumerate(labels):
    if len(item[0]) > 1:
      item = item[0]
    plt.scatter(runtime[i], f1[i], label = item.split('_sd')[0])

  plt.xlim([10, 400])
  plt.ylim([0.3, 0.75])  
  plt.xlabel("Runtime (s)")
  plt.ylabel("F1 score")    
  plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1 + 0.1*len(labels)))
# Directory names to plot inside code_and_data/results
run_names_to_plot = ['investigation_base_enumeration_sd_1_001']  # add the directory names of your other runs here

runtime, f1 = parse_runtime_f1_from_logs(run_names_to_plot)

plot_runtime_f1(runtime, f1, run_names_to_plot)

[Optional] Open-Ended Exploration#

Can you improve the performance of neurally-guided program search? Submit your runs on seed 1 (default) to Feel free to make any changes to the code! Below are some suggestions:

  • Modify the neural heuristic: looking at the neural modules in dsl.ipynb, such as ListToAtomModule or AtomToAtomModule, and the hyperparameters for training the neural approximators, can you improve the neural heuristic we currently use? The ideal neural heuristic trains quickly, approximates program performance closely, and is admissible.

  • Modify the search space: are there any modifications you can make to the search space, for example by changing the DSL or the min/max program depth, that lead to better runtime and performance?

[Optional] Modify Architecture of AtomToAtom Heuristic#

In dsl.ipynb, the architectures of the neural heuristics are defined in the section called Neural Functions. There, AtomToAtomModule uses a 2-layer network, FeedForwardModule, as the neural approximator. Modify the network architecture to add a third layer and run NEAR below. Do you observe any changes in performance? Why or why not?

Important: after modifying dsl.ipynb, you will need to re-convert your new DSL to Python files. The conversion commands are included in the cell below.

!jupyter nbconvert --to python dsl.ipynb
!jupyter nbconvert --to python near.ipynb
!yes| python \
--algorithm astar-near \
--exp_name investigation_morlet_3layer \
--trial 1 \
--seed 1 \
--dsl_str "morlet" \
--train_data "data/calms21_task1/train_data.npy" \
--test_data "data/calms21_task1/test_data.npy" \
--valid_data "data/calms21_task1/val_data.npy" \
--train_labels "data/calms21_task1/train_investigation_labels.npy" \
--test_labels "data/calms21_task1/test_investigation_labels.npy" \
--valid_labels "data/calms21_task1/val_investigation_labels.npy" \
--input_type "list" \
--output_type "atom" \
--input_size 18 \
--output_size 1 \
--num_labels 1 \
--lossfxn "bcelogits" \
--frontier_capacity 8 \
--max_num_children 10 \
--max_depth 5 \
--max_num_units 16 \
--min_num_units 8 \
--learning_rate 0.0001 \
--neural_epochs 4 \
--symbolic_epochs 12 \
--class_weights "2.0"
[Optional] NEAR: IDDFS Search - Morlet DSL#

The admissible heuristic used by NEAR is compatible with different search strategies. Here, we use iterative deepening depth-first search (IDDFS) instead of A* search through the program space. IDDFS is a search strategy in which a depth-limited version of depth-first search is run repeatedly with increasing depth limits.

How does the performance of IDDFS compare to A* on this dataset?
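For reference, the plain IDDFS strategy described above can be sketched on a toy tree as follows. This is the textbook algorithm without any neural heuristic, so it is not the iddfs-near implementation. The tree and the names depth_limited_dfs, iddfs, and toy_tree are made up for illustration.

```python
def depth_limited_dfs(tree, node, is_goal, limit, path):
    """Depth-first search that gives up below `limit` levels under `node`."""
    if is_goal(node):
        return path
    if limit == 0:
        return None
    for child in tree.get(node, []):
        found = depth_limited_dfs(tree, child, is_goal, limit - 1, path + [child])
        if found is not None:
            return found
    return None

def iddfs(tree, root, is_goal, max_depth):
    """Rerun depth-limited DFS with limits 0, 1, ..., max_depth."""
    for limit in range(max_depth + 1):
        found = depth_limited_dfs(tree, root, is_goal, limit, [root])
        if found is not None:
            return limit, found
    return None

# Toy tree: node -> children. Leaves have no entry.
toy_tree = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"], "e": ["f"]}

result = iddfs(toy_tree, "root", lambda n: n == "d", max_depth=5)
print(result)  # (2, ['root', 'a', 'd'])
```

Each iteration revisits the shallow nodes, but only the current path has to be kept in memory, which is the usual trade-off against keeping a frontier as A* does.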

!yes| python \
--algorithm iddfs-near \
--exp_name investigation_morlet \
--trial 1 \
--seed 1 \
--dsl_str "morlet" \
--train_data "data/calms21_task1/train_data.npy" \
--test_data "data/calms21_task1/test_data.npy" \
--valid_data "data/calms21_task1/val_data.npy" \
--train_labels "data/calms21_task1/train_investigation_labels.npy" \
--test_labels "data/calms21_task1/test_investigation_labels.npy" \
--valid_labels "data/calms21_task1/val_investigation_labels.npy" \
--input_type "list" \
--output_type "atom" \
--input_size 18 \
--output_size 1 \
--num_labels 1 \
--lossfxn "bcelogits" \
--frontier_capacity 5 \
--max_num_children 10 \
--max_depth 5 \
--max_num_units 16 \
--min_num_units 8 \
--learning_rate 0.0001 \
--neural_epochs 4 \
--symbolic_epochs 12 \
--class_weights "2.0"
[Optional] Additional Experiments: Test on Other Behavior Classes#

In behavior analysis, animals exhibit a wide range of behaviors, and the goal of behavioral neuroscience is to learn the neural basis of these behaviors. Investigation vs. no investigation is one example of a human-defined behavior, but there are also behaviors such as mount, attack, rearing, approach, groom, …

Here, we provide an additional set of behavior annotations for mount. How do the DSL and algorithm you developed compare to using enumeration as a baseline for program search? Mount is a relatively rare class compared to investigation. How do the performances compare?
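When comparing results on a rare class, it helps to keep in mind why F1 is reported rather than accuracy. The sketch below uses made-up labels with 5% positives (not the mount annotations) and a hypothetical helper precision_recall_f1 to show that a classifier predicting "no behavior" everywhere reaches high accuracy but an F1 of zero.

```python
def precision_recall_f1(labels, predictions):
    """Binary precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(labels, predictions):
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)

# Made-up data: 5 positive frames followed by 95 negative frames.
labels = [1] * 5 + [0] * 95
always_negative = [0] * 100
# Finds 3 of the 5 positives and raises 2 false alarms.
imperfect = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

print(accuracy(labels, always_negative), precision_recall_f1(labels, always_negative))
print(accuracy(labels, imperfect), precision_recall_f1(labels, imperfect))
```

The two classifiers differ by only one point of accuracy (0.95 vs. 0.96) while their F1 scores are 0 and roughly 0.6, so differences between search methods on mount are easier to see in F1.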

!yes | python \
--algorithm enumeration \
--exp_name investigation_morlet \
--trial 1 \
--seed 1 \
--dsl_str "morlet" \
--train_data "data/calms21_task1/train_data.npy" \
--test_data "data/calms21_task1/test_data.npy" \
--valid_data "data/calms21_task1/val_data.npy" \
--train_labels "data/calms21_task1/train_mount_labels.npy" \
--test_labels "data/calms21_task1/test_mount_labels.npy" \
--valid_labels "data/calms21_task1/val_mount_labels.npy" \
--input_type "list" \
--output_type "atom" \
--input_size 18 \
--output_size 1 \
--num_labels 1 \
--lossfxn "bcelogits" \
--learning_rate 0.0001 \
--symbolic_epochs 12 \
--max_num_programs 25 \
--class_weights "2.0"
!yes| python \
--algorithm astar-near \
--exp_name investigation_morlet \
--trial 1 \
--seed 1 \
--dsl_str "morlet" \
--train_data "data/calms21_task1/train_data.npy" \
--test_data "data/calms21_task1/test_data.npy" \
--valid_data "data/calms21_task1/val_data.npy" \
--train_labels "data/calms21_task1/train_mount_labels.npy" \
--test_labels "data/calms21_task1/test_mount_labels.npy" \
--valid_labels "data/calms21_task1/val_mount_labels.npy" \
--input_type "list" \
--output_type "atom" \
--input_size 18 \
--output_size 1 \
--num_labels 1 \
--lossfxn "bcelogits" \
--frontier_capacity 8 \
--max_num_children 10 \
--max_depth 5 \
--max_num_units 32 \
--min_num_units 16 \
--learning_rate 0.0001 \
--neural_epochs 4 \
--symbolic_epochs 12 \
--class_weights "2.0"
Acknowledgements: This notebook was developed by Jennifer J. Sun (Caltech) and Atharva Sehgal (UT Austin) for the neurosymbolic summer school. The data subset is processed from CalMS21 and the DSL is developed by Megan Tjandrasuwita (MIT) from her work on Interpreting Expert Annotation Differences in Animal Behavior. Megan’s work is partly based on NEAR by Ameesh Shah (Berkeley) and Eric Zhan (Argo).