Graph Neural Networks (GNNs) and particularly Graph Transformers (GTs) excel at capturing complex dependencies in graph-structured data. However, their standard formulations are largely agnostic to the explicit symbolic constraints that domain knowledge often provides. This limitation can be problematic: models might propose scientifically invalid hypotheses, violate safety protocols in critical systems, or simply waste computational effort exploring impossible interactions. For instance, in bioinformatics, a rule might strictly forbid interactions between certain protein families, yet a standard GT might still assign non-zero attention, lacking an architectural mechanism to incorporate such hard constraints.
This post introduces Symbolic Masked Graph Transformers (SM-GT), an architectural solution that embeds hard, type-based exclusion rules directly into the GT attention mechanism, thereby guaranteeing constraint compliance. Complementing this, we propose lightweight, flexible "Logic Adapters"—inspired by Low-Rank Adaptation (LoRA)—offering a practical means to manage and deploy these constraints.
The core of GTs lies in the scaled dot-product attention mechanism, which computes attention weights $\alpha_{ij}$ between nodes $i$ and $j$ based on learned query ($Q$) and key ($K$) projections:
$$
e_{ij} = \frac{Q_i K_j^{\top}}{\sqrt{d_k}}, \quad \alpha_{ij} = \mathrm{softmax}_j(e_{ij})
$$
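For concreteness, here is a minimal PyTorch sketch of these two equations over a toy set of $N$ nodes. It is dense, single-head attention only; a real GT layer would add learned projections, multiple heads, and structural biases:

```python
import torch
import torch.nn.functional as F

def attention_scores(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Raw scores e_ij = Q_i . K_j / sqrt(d_k)."""
    d_k = Q.size(-1)
    return Q @ K.transpose(-2, -1) / d_k ** 0.5

# Toy example: N = 4 nodes with d_k = 8 dimensional projections
N, d_k = 4, 8
Q, K = torch.randn(N, d_k), torch.randn(N, d_k)
e = attention_scores(Q, K)        # (N, N) matrix of raw scores
alpha = F.softmax(e, dim=-1)      # row-wise softmax: alpha[i] sums to 1 over all targets j
```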
While heterogeneous architectures (e.g., HGT) can learn type-aware projections, they typically develop preferences rather than enforcing prohibitions. The model might learn which interactions are likely, but it is not architecturally prevented from assigning attention to pairs deemed impossible by experts (e.g., a specified Drug $\nrightarrow$ Gene rule). Relying on implicit learning from data offers no guarantees, and alternative approaches like semantic loss penalties enforce constraints only softly.
SM-GT integrates constraints directly. Given a set $\mathcal{F} = \{(T_A, T_B)\}$ defining forbidden source-target node type pairs, we construct a symbolic mask $M \in \mathbb{R}^{N \times N}$:
$$
M_{ij} = \begin{cases}
-\infty & \text{if } (\mathrm{type}(i), \mathrm{type}(j)) \in \mathcal{F} \\
0 & \text{otherwise.}
\end{cases}
$$
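A minimal sketch of constructing this mask in PyTorch, assuming node types are given as a plain list and $\mathcal{F}$ as a set of (source type, target type) tuples; the helper name and the quadratic loop are purely illustrative:

```python
import torch

def build_symbolic_mask(node_types, forbidden_pairs):
    """Dense mask M with M[i, j] = -inf if (type(i), type(j)) is forbidden, else 0."""
    N = len(node_types)
    M = torch.zeros(N, N)
    for i, t_i in enumerate(node_types):
        for j, t_j in enumerate(node_types):
            if (t_i, t_j) in forbidden_pairs:
                M[i, j] = float("-inf")
    return M

# Toy example: forbid Drug -> Gene attention
node_types = ["Drug", "Gene", "Drug", "Protein"]
forbidden = {("Drug", "Gene")}
M = build_symbolic_mask(node_types, forbidden)
# M[0, 1] and M[2, 1] are -inf (Drug sources attending to the Gene node); everything else is 0
```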
The critical step is adding this mask to the attention scores before applying the softmax function:
$$
e'_{ij} = e_{ij} + M_{ij}
$$
The subsequent application of the softmax, $\alpha'_{ij} = \mathrm{softmax}_j(e'_{ij})$, leverages the property that $\exp(-\infty) \rightarrow 0$. Consequently, the attention weight $\alpha'_{ij}$ for any forbidden pair $(i, j)$ is guaranteed to be exactly zero (within numerical precision), irrespective of the learned parameters $Q$ and $K$. This provides a mathematically sound, architectural guarantee of rule compliance during every forward pass.
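Continuing the toy example, the following sketch applies the mask pre-softmax and checks the guarantee directly; the node types and forbidden pair are the same illustrative choices as above:

```python
import torch
import torch.nn.functional as F

# Same toy graph as above: Drug -> Gene attention is forbidden
node_types = ["Drug", "Gene", "Drug", "Protein"]
forbidden = {("Drug", "Gene")}
N, d_k = len(node_types), 8

M = torch.tensor([[float("-inf") if (ti, tj) in forbidden else 0.0
                   for tj in node_types] for ti in node_types])

Q, K = torch.randn(N, d_k), torch.randn(N, d_k)
alpha = F.softmax(Q @ K.T / d_k ** 0.5 + M, dim=-1)   # e'_ij = e_ij + M_ij, then row-wise softmax

# exp(-inf) underflows to exactly zero, so forbidden pairs receive no attention mass...
assert alpha[0, 1].item() == 0.0 and alpha[2, 1].item() == 0.0
# ...while each row still normalises to 1 over the remaining, allowed targets
assert torch.allclose(alpha.sum(dim=-1), torch.ones(N))
```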
While effective, a dense $N \times N$ mask $M$ presents scalability and flexibility challenges: prohibitive memory cost for large $N$ and inability to modify rules post-deployment without altering the model structure or retraining.
We address this with Logic Adapters, inspired by the parameter-efficient fine-tuning technique LoRA (Hu et al., 2022). We approximate the constraint mask (or a dynamic portion thereof) using a low-rank decomposition $\Delta M = AB^{\top}$, where $A \in \mathbb{R}^{N \times r}$, $B \in \mathbb{R}^{N \times r}$, and the rank $r \ll N$. The attention mechanism incorporates this additive, low-rank mask:
$$
\alpha = \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d_k}} + M_{\text{hard}} + \Delta M \right)
$$
Here, $M_{\text{hard}}$ represents an optional static mask for immutable rules, while $\Delta M = AB^{\top}$ is the Logic Adapter. This approach yields several benefits:
1. Parameter Efficiency: Storing the factors $A$ and $B$ requires only $2 \times N \times r$ parameters, a significant reduction compared to $N^2$ for a dense mask. This makes the approach viable even for large graphs.
2. Modularity & Flexibility: The adapter's parameters ($A, B$) can be treated as a separate module. Different rule sets can be represented by different adapters, which can be "hot-swapped" at inference time simply by loading the corresponding $A, B$ factors. This enables post-deployment updates to constraints (e.g., reflecting new regulations or knowledge) without touching the base GT model weights.
3. Derivation & Potential Training: Factors $A, B$ can be derived efficiently from a known full mask $M$ using Singular Value Decomposition (SVD), which finds the optimal rank-$r$ approximation. Alternatively, the adapter parameters could potentially be trained (as explored in LoRA for weights), allowing for "soft" constraints or even learning rules implicitly, though our primary focus here is enforcing known hard constraints.
Crucially, these Logic Adapters modify attention scores based on symbolic logic, fundamentally differing from standard LoRA, which adapts model weights ($W + UV^{\top}$) based on task gradients.
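The sketch below illustrates both points on the toy graph: deriving the factors $A, B$ by truncated SVD and adding $\Delta M = AB^{\top}$ to the attention scores. One practical detail, which is an assumption on my part rather than something specified above: the factors must stay finite, so this sketch encodes forbidden pairs with a large negative constant instead of $-\infty$ (any truly immutable rules can still live in the static $M_{\text{hard}}$). The helper name `logic_adapter_from_mask` is hypothetical.

```python
import torch
import torch.nn.functional as F

NEG = -1e4  # finite surrogate for -inf so that the factors can be computed (assumption, see above)

def logic_adapter_from_mask(M_dense: torch.Tensor, r: int):
    """Rank-r factors A, B (each N x r) with A @ B.T approximating M_dense, via truncated SVD."""
    U, S, Vh = torch.linalg.svd(M_dense)
    A = U[:, :r] * S[:r]        # absorb the top-r singular values into A
    B = Vh[:r, :].T
    return A, B

# Toy graph: Drug -> Gene is forbidden, encoded with NEG instead of -inf
node_types = ["Drug", "Gene", "Drug", "Protein"]
forbidden = {("Drug", "Gene")}
N, d_k, r = len(node_types), 8, 2
M_dense = torch.tensor([[NEG if (ti, tj) in forbidden else 0.0
                         for tj in node_types] for ti in node_types])

A, B = logic_adapter_from_mask(M_dense, r)    # store 2 * N * r numbers instead of N * N
delta_M = A @ B.T                             # low-rank reconstruction of the logic mask

Q, K = torch.randn(N, d_k), torch.randn(N, d_k)
alpha = F.softmax(Q @ K.T / d_k ** 0.5 + delta_M, dim=-1)   # add a static M_hard here too if needed
# Forbidden pairs get weight ~exp(NEG), i.e. numerically zero; swapping rule sets = loading a new (A, B)
```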
To clearly illustrate the mechanism and differentiate it from standard weight adaptation, we conducted an experiment on a minimal synthetic graph.
Setup: A minimal synthetic graph with typed nodes and a forbidden source-target type pair. We compare four configurations (a plain GT baseline, the Logic Adapter, standard LoRA weight adaptation, and LoRA combined with the Logic Adapter) and report a violation score quantifying the attention assigned to forbidden pairs.
Procedure and Results:
- Baseline: Violation = 1.0 (significant attention assigned to forbidden pairs).
- Logic: Violation = 0.0 (attention on forbidden pairs forced to zero).
- LoRA: Violation = 1.0. This confirms that standard LoRA, which adapts weights based on the task loss, does not inherently enforce symbolic constraints that are present only in the domain knowledge, not in the loss function.
- LoRA+Logic: Violation = 0.0. This demonstrates that the Logic Adapter mechanism is compatible with standard weight adaptation techniques like LoRA; the Logic Adapter successfully enforces the constraint on top of the LoRA-modified weights.
Interpretation: This demonstration shows that the proposed pre-softmax Logic Adapter is necessary and sufficient to guarantee adherence to symbolic type-exclusion rules architecturally. Standard weight adaptation methods like LoRA do not provide this guarantee, though they can be used concurrently if desired for task-specific fine-tuning.
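The violation metric is not defined in detail here; a simple binary indicator over forbidden pairs reproduces the reported 1.0/0.0 pattern and is what the sketch below assumes (the actual metric in the linked repo may differ):

```python
import torch

def violation(alpha: torch.Tensor, node_types, forbidden_pairs, tol: float = 1e-8) -> float:
    """Return 1.0 if any forbidden (source, target) pair receives attention above tol, else 0.0."""
    N = alpha.size(0)
    for i in range(N):
        for j in range(N):
            if (node_types[i], node_types[j]) in forbidden_pairs and alpha[i, j].item() > tol:
                return 1.0
    return 0.0

# e.g. violation(alpha, node_types, forbidden) returns 0.0 on the masked attention computed earlier
```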
Symbolic Masked Graph Transformers (SM-GT) offer a method for embedding hard symbolic type-exclusion rules directly into the GT attention mechanism. This approach guarantees adherence to specified domain constraints. The proposed low-rank Logic Adapters provide a flexible and efficient mechanism for managing these rules, allowing for post-deployment updates without retraining the base model. This framework facilitates the development of more reliable and adaptable neurosymbolic models for graph-structured data where domain knowledge is critical.
You can find a toy example implementation in my GitHub repo: https://github.com/marcelbtec/smgt/