Probing LLMs for Logical Reasoning

  • Conference paper
  • Neural-Symbolic Learning and Reasoning (NeSy 2024)

Abstract

The question of what types of computation and cognition large language models (LLMs) are capable of has recently received increasing attention. With models clearly capable of convincingly faking true reasoning behavior, the question of whether they are also capable of real reasoning, and of how the difference should be defined, becomes increasingly vexed. Here we introduce a new tool, Logic Tensor Probes (LTPs), that may help to shed light on the problem. Logic Tensor Networks (LTNs) are a neural-symbolic framework designed for differentiable fuzzy logics. Using a pretrained LLM with frozen weights, an LTP employs the LTN framework as a diagnostic tool. This allows logical deductions to be detected and localized within LLMs, enabling the use of first-order logic as a versatile modeling language for investigating the internal mechanisms of LLMs. The LTP makes deductions from basic assertions and tracks whether the model makes the same deductions from the natural-language equivalent and, if so, where in the model this happens. We validate our approach through proof-of-concept experiments on hand-crafted knowledge bases derived from WordNet and on smaller samples from FrameNet.

F. Manigrasso and S. Schouten—Equal contribution.


Notes

  1. The phrase probe is also used to refer to general inspection methods for neural networks. Here, we use it specifically to refer to shallow classifiers that take hidden activations as features.

  2. Throughout the text we use the word embedding to refer to the LLM’s representations of its input tokens at any stage of the LLM’s execution, not just at the initial embedding stage.

  3. Possible alternative explanations include a model recalling a stored association rather than reasoning from scratch. LTN probes are tools that can be used to investigate such possibilities and, through careful use, eliminate them. We do not claim that a successfully trained LTN probe is always proof that an LLM shows the modelled reasoning.

  4. A frame in FrameNet is a structured representation of a situation, including participants, props, and conceptual roles. Each frame contains a textual description (frame definition), associated elements, lexical units, example sentences, and relations with other frames.

  5. hf.co/TheBloke/open-llama-7b-open-instruct-GPTQ.

References

  1. Azaria, A., Mitchell, T.: The internal state of an LLM knows when it’s lying. In: The 2023 Conference on Empirical Methods in Natural Language Processing (2023)

  2. Badreddine, S., Garcez, A.d., Serafini, L., Spranger, M.: Logic tensor networks. Artif. Intell. 303, 103649 (2022)

  3. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, vol. 1, pp. 86–90. Association for Computational Linguistics, Montreal (1998). https://doi.org/10.3115/980845.980860

  4. Bang, Y., et al.: A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. In: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 675–718 (2023)

  5. Belinkov, Y.: Probing classifiers: promises, shortcomings, and advances. Comput. Linguist. 48(1), 207–219 (2022)

  6. Bronzini, M., Nicolini, C., Lepri, B., Staiano, J., Passerini, A.: Unveiling LLMs: the evolution of latent representations in a temporal knowledge graph. arXiv preprint arXiv:2404.03623 (2024)

  7. Bubeck, S., et al.: Sparks of artificial general intelligence: early experiments with GPT-4. arXiv preprint arXiv:2303.12712 (2023)

  8. Carraro, T.: LTNtorch: PyTorch Implementation of Logic Tensor Networks (2022). https://doi.org/10.5281/zenodo.6394282

  9. Conia, S., Navigli, R.: Probing for predicate argument structures in pretrained language models. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4622–4632. Association for Computational Linguistics, Dublin (2022). https://doi.org/10.18653/v1/2022.acl-long.316

  10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics (2019). https://api.semanticscholar.org/CorpusID:52967399

  11. Ettinger, A., Elgohary, A., Resnik, P.: Probing for semantic evidence of composition by means of simple classification tasks. In: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 134–139 (2016)

  12. Fellbaum, C.: WordNet: An Electronic Lexical Database. Bradford Books (1998). https://mitpress.mit.edu/9780262561167/

  13. Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: OPTQ: accurate quantization for generative pre-trained transformers. In: The Eleventh International Conference on Learning Representations (2022)

  14. Galassi, A., Lippi, M., Torroni, P.: Investigating Logic Tensor Networks for Neural-Symbolic Argument Mining (2021)

  15. Ganesh, P., et al.: Compressing large-scale transformer-based models: a case study on BERT. Trans. Assoc. Comput. Linguist. 9, 1061–1080 (2020). https://api.semanticscholar.org/CorpusID:211532645

  16. Geng, X., Liu, H.: OpenLLaMA: An Open Reproduction of LLaMA (2023). https://github.com/openlm-research/open_llama

  17. Huang, J., Chang, K.C.C.: Towards reasoning in large language models: a survey. arXiv preprint arXiv:2212.10403 (2022)

  18. Hupkes, D., Veldhoen, S., Zuidema, W.: Visualisation and “diagnostic classifiers” reveal how recurrent and recursive neural networks process hierarchical structure. J. Artif. Intell. Res. 61, 907–926 (2018)

  19. Jin, M., et al.: Exploring concept depth: how large language models acquire knowledge at different layers? arXiv preprint arXiv:2404.07066 (2024)

  20. Kosinski, M.: Theory of mind may have spontaneously emerged in large language models. Tech. rep., Stanford University Graduate School of Business (2023)

  21. Kuznetsov, I., Gurevych, I.: A matter of framing: the impact of linguistic formalism on probing results. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 171–182. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.13

  22. Kyriakopoulos, S., d’Avila Garcez, A.S.: Continual reasoning: non-monotonic reasoning in neurosymbolic AI using continual learning. In: International Workshop on Neural-Symbolic Learning and Reasoning (2023). https://api.semanticscholar.org/CorpusID:258461140

  23. Liu, Y.H., et al.: Understanding LLMs: a comprehensive overview from training to inference. arXiv preprint arXiv:2401.02038 (2024)

  24. Manigrasso, F., Miro, F.D., Morra, L., Lamberti, F.: Faster-LTN: a neuro-symbolic, end-to-end object detection architecture. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds.) ICANN 2021. LNCS, vol. 12892, pp. 40–52. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86340-1_4

  25. Manigrasso, F. Morra, L., Lamberti, F.: Fuzzy Logic Visual Network (FLVN): a neuro-symbolic approach for visual features matching. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds.) ICIAP 2023, Part II, pp. 456–467. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-43153-1_38

  26. Morra, L., et al.: Designing logic tensor networks for visual sudoku puzzle classification. In: International Workshop on Neural-Symbolic Learning and Reasoning (2023)

  27. OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  28. Padalkar, P., Wang, H., Gupta, G.: NeSyFOLD: extracting logic programs from convolutional neural networks. In: ICLP Workshops (2023). https://api.semanticscholar.org/CorpusID:263875519

  29. Serafini, L., Garcez, A.D.: Logic tensor networks: deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422 (2016)

  30. Shapira, N., et al.: Clever Hans or neural theory of mind? Stress testing social reasoning in large language models. arXiv preprint arXiv:2305.14763 (2023)

  31. Tenney, I., Das, D., Pavlick, E.: BERT rediscovers the classical NLP pipeline. In: Annual Meeting of the Association for Computational Linguistics (2019). https://api.semanticscholar.org/CorpusID:155092004

  32. Tenney, I., et al.: What do you learn from context? Probing for sentence structure in contextualized word representations (2018). https://openreview.net/forum?id=SJzSgnRcKX

  33. Thawani, A., Ghanekar, S., Zhu, X., Pujara, J.: Learn your tokens: word-pooled tokenization for language modeling. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9883–9893 (2023)

  34. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

  35. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

  36. Zhang, B., et al.: Sentiment interpretable logic tensor network for aspect-term sentiment analysis. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 6705–6714. International Committee on Computational Linguistics, Gyeongju (2022). https://aclanthology.org/2022.coling-1.582

  37. Zhang, Y., et al.: Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023)

Author information

Correspondence to Francesco Manigrasso.
Appendices

A Hyperparameters

For all experiments, the aggregation function parameter in the knowledge base defined in Eq. 27 is set to \(p_\forall =2\). We implemented the probes in PyTorch using the LTNtorch library [8], and training was performed with the Adam optimizer.

1.1 A.1 WordNet

To train the WordNet probes, we used a learning rate of 1e-5 and trained our architecture for 100 epochs. During training, the knowledge base included batches of 128 randomly chosen sentences, both positive and negative. Experiments were run on a single NVIDIA 2080 Ti GPU.
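The probe training objective, maximizing the satisfiability of the knowledge base by gradient ascent, can be illustrated with a toy example. This sketch uses a single-parameter sigmoid predicate and plain gradient ascent in place of the paper's Adam optimizer and MLP groundings; all names and values here are illustrative assumptions.

```python
import math

def sat(w: float, x: float) -> float:
    """Fuzzy truth value of a toy predicate P(x) = sigmoid(w * x)."""
    return 1.0 / (1.0 + math.exp(-w * x))

# One positive example: training should push P(x) toward true (sat -> 1).
x = 1.0
w = 0.0          # initial probe weight; sat(w, x) starts at 0.5
lr = 0.1         # the paper uses Adam; plain gradient ascent suffices here

for _ in range(200):
    s = sat(w, x)
    # d sat / d w = s * (1 - s) * x; step uphill to increase satisfiability
    w += lr * s * (1.0 - s) * x
```

After training, `sat(w, x)` is strictly larger than its initial value of 0.5, mirroring how the full probe increases the aggregated truth of the knowledge base.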

1.2 A.2 FrameNet

To train the FrameNet probes, we used a learning rate of 1e-3 and trained our architecture for 15 epochs with a batch size of 128. The experiments were performed on a single NVIDIA 3090 GPU.

B WordNet LTN

1.1 B.1 Variables, Predicates and Axioms

In this subsection, we present the groundings of the variables and predicates, along with the definition of the knowledge base \(\mathcal {K}\).

Grounding. Variables and their corresponding domains are grounded as follows:

$$\begin{aligned} &\mathcal {G}(n) = \mathbb {N}^{N}, \,\, \mathcal {G}(v) = \mathbb {N}^{V}, \,\, \mathcal {G}(g) = \mathbb {N}^{G}, \,\, \mathcal {G}(q) = \mathbb {N}^{Q}, \,\, \nonumber \\ &\mathcal {G}(h) = \mathbb {N}^{H}, \,\, \mathcal {G}(l) = \mathbb {N}^{L}, \,\, \mathcal {G}(s) = \mathbb {R}^{B \times D}, \,\, \nonumber \\ &\mathcal {G}(sub),\, \mathcal {G}(act),\, \mathcal {G}(obj)= \mathbb {R}^{T \times m} \,\, \nonumber \\ \end{aligned}$$
(11)

where g, q, n, v, h represent class labels belonging to the sets of classes G, macroclasses Q, names N, actions V, and habitats H, respectively. Additionally, l represents the label used to distinguish different sections of the sentence, such as subject and object. In the definition of s, B is the batch size and D is the dimensionality of the features.

The variables sub, act, and obj, retrieved with the attention model from an entire sentence s, are grounded into a feature space. obj can refer to a macroclass when the sentence has the form isOfName isOfAction isOfMacroclass instead of isOfName isOfAction isOfObject. Within the FOL language, several predicates are defined: isOfName(sub, n), isOfAction(act, v), and isOfClass(obj, g) categorize the sentence components; livesInHabitat(obj, h) and isOfMacroclass(obj, q) classify objects with respect to habitat and macroclass; finally, isSubject(sub, l) and isObject(obj, l) predict whether a sentence is logically well-formed by checking whether the subject and its complement appear in the correct order.

The predicate groundings \(\mathcal {G}(\texttt {isOfName})\), \(\mathcal {G}(\texttt {isOfAction})\), \(\mathcal {G}(\texttt {isOfClass})\) are computed as the similarity between the input features and the corresponding trainable class vectors, using an MLP \(p_1\) with a softmax activation function, since the classes are mutually exclusive.

$$\begin{aligned} \mathcal {G}(\texttt {isOfName}): & \text {sub},n \rightarrow {n}^T p_1(\mathcal {G}(sub),n) \end{aligned}$$
(12)
$$\begin{aligned} \mathcal {G}(\texttt {isOfAction}): & \text {act},v \rightarrow {v}^T p_1(\mathcal {G}(act),v) \end{aligned}$$
(13)
$$\begin{aligned} \mathcal {G}(\texttt {isOfClass}): & \text {obj},g \rightarrow {g}^T p_1(\mathcal {G}(obj),g) \end{aligned}$$
(14)
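As a sketch, the groundings in Eqs. 12–14 can be implemented as a small MLP with a softmax head, whose output distribution is dotted with a one-hot label vector to yield a fuzzy truth value. The layer sizes, tanh hidden activation, and random inputs below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class MLPGrounding:
    """One-hidden-layer MLP with a softmax head, returning a class distribution.

    The truth value of e.g. isOfName(sub, n) is the probability mass the
    softmax assigns to label n (the n^T p1(...) inner product in Eq. 12).
    """
    def __init__(self, in_dim, hidden, n_classes):
        self.W1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return softmax(h @ self.W2 + self.b2)

def truth(probs, label):
    # one-hot(label)^T softmax(...): the predicate's fuzzy truth value
    return probs[..., label]

p1 = MLPGrounding(in_dim=16, hidden=32, n_classes=5)
sub = rng.normal(size=(4, 16))   # stand-in for extracted LLM embeddings
probs = p1(sub)                  # one distribution over labels per example
```

Because the head is a softmax, each truth value lies in [0, 1] and the values for mutually exclusive labels sum to one, which is exactly the property the text motivates.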

Likewise, the \(\mathcal {G}(\texttt {isOfMacroclass}), \mathcal {G}(\texttt {livesInHabitat})\) predicates are grounded with two simple MLP layers with a softmax activation function \(p_2, p_3\):

$$\begin{aligned} \mathcal {G}(\texttt {isOfMacroclass}): & \text {obj},q \rightarrow {q}^T p_2(\mathcal {G}(obj),q) \end{aligned}$$
(15)
$$\begin{aligned} \mathcal {G}(\texttt {livesInHabitat}): & \text {obj},h \rightarrow {h}^T p_3(\mathcal {G}(obj),h) \end{aligned}$$
(17)

Finally, isSubject and isObject are grounded with two parametric similarity functions, based on two simple MLP layers with a softmax activation function, \(p_4\) and \(p_5\):

$$\begin{aligned} \mathcal {G}(\texttt {isSubject}): & \text {sub},l \rightarrow {l}^Tp_4(\mathcal {G}(sub),l) \end{aligned}$$
(18)
$$\begin{aligned} \mathcal {G}(\texttt {isObject}): & \text {obj},l \rightarrow {l}^Tp_5(\mathcal {G}(obj),l) \end{aligned}$$
(19)

where l allows us to distinguish the different sections (subject, action, object) of the sentence.

1.2 B.2 Learning from Labeled Examples

The following axioms represent the labeled examples used during training:

$$\begin{aligned} \phi _{1} &= \forall \text {Diag}(sub,n) (\texttt {isOfName}(sub,n)) \end{aligned}$$
(20)
$$\begin{aligned} \phi _{2} &= \forall \text {Diag}(act,v) (\texttt {isOfAction}(act,v)) \end{aligned}$$
(21)
$$\begin{aligned} \phi _{3} &= \forall \text {Diag}(obj,g) (\texttt {isOfClass}(obj,g)) \end{aligned}$$
(22)
$$\begin{aligned} \phi _{4} &= \forall \text {Diag}(obj,q) (\texttt {isOfMacroclass}(obj,q)) \end{aligned}$$
(23)
$$\begin{aligned} \phi _{5} &= \forall \text {Diag}(obj,h)(\texttt {livesInHabitat}(obj,h)) \end{aligned}$$
(24)
$$\begin{aligned} \phi _{6} &= \forall \text {Diag}(sub,l) (\texttt {isSubject}(sub,l)) \end{aligned}$$
(25)
$$\begin{aligned} \phi _{7} &= \forall \text {Diag}(obj,l)(\texttt {isObject}(obj,l)) \end{aligned}$$
(26)
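The \(\forall \text {Diag}\) quantifier in these axioms pairs each example with its own label rather than quantifying over the full cross product of examples and labels. With batched truth values, this amounts to selecting the diagonal (matched) entries of a truth matrix; the numbers below are illustrative.

```python
import numpy as np

# Fuzzy truth values of isOfName(sub_i, n_j): 3 examples x 4 candidate names
truths = np.array([[0.9, 0.1, 0.0, 0.0],
                   [0.2, 0.7, 0.1, 0.0],
                   [0.0, 0.1, 0.1, 0.8]])
labels = np.array([0, 1, 3])  # gold name label of each example

# Diag(sub, n): evaluate the predicate only on matched (example, label) pairs
paired = truths[np.arange(len(labels)), labels]
# the universal quantifier (Eq. 27) then aggregates `paired`,
# not all 12 (example, label) combinations
```

This is why the axioms can be read as "for every labeled example, the predicate holds of that example and its own label".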

1.3 B.3 Grounding Logical Connectives and Aggregators

The knowledge base \(\mathcal {K}\) comprises a collection of axioms, whose logical connectives and aggregators must be grounded in Real Logic. Gradient descent is used to train the probe to maximize the satisfiability of the knowledge base. In this configuration, we use the standard negation \(\lnot \), defined as \(N_S(a) = 1 - a\), and the Reichenbach implication \(\rightarrow \), defined as \(I_R(a, b) = 1 - a + ab\), where a and b are truth values in the range [0, 1]. To approximate the universal quantifier \(\forall \), we use the generalized mean with respect to the error, denoted \(A_{pME}\), as described in [2, 34]. Given a set of n truth values \(a_1, \ldots , a_n \in [0, 1]\):

$$\begin{aligned} \forall : A_{pME}\left( a_1, \ldots , a_n\right) = 1 - \left( \frac{1}{n} \sum _{i=1}^n \left( 1-a_i\right) ^{p_{\forall }}\right) ^{\frac{1}{p_{\forall }}}, \qquad p_{\forall } \geqslant 1 \end{aligned}$$
(27)

\(A_{p M E}\) is a measure of how much, on average, truth values \(a_i\) deviate from the true value of 1. Further details on the role of \(p_{\forall }\) can be found in [2].
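The three operators above translate directly into code; the following is a minimal sketch of the definitions in the text.

```python
def neg(a: float) -> float:
    """Standard negation N_S(a) = 1 - a."""
    return 1.0 - a

def implies(a: float, b: float) -> float:
    """Reichenbach implication I_R(a, b) = 1 - a + a*b."""
    return 1.0 - a + a * b

def forall(truths, p=2.0):
    """A_pME: generalized mean with respect to the error (Eq. 27), p >= 1."""
    n = len(truths)
    return 1.0 - (sum((1.0 - a) ** p for a in truths) / n) ** (1.0 / p)
```

As a sanity check, `forall` over all-true inputs returns 1, and with p=1 it reduces to the arithmetic mean of the truth values, consistent with its reading as an average deviation from 1.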

1.4 B.4 Knowledge Base for Inference Time

The knowledge base used at testing time is shown below:

$$\begin{aligned} \phi _{8} &= \forall \text {Diag}(sub,n,act,v,obj,g,l) \nonumber \\ & \quad (\texttt {isOfName}(sub,n) \wedge \texttt {isSubject}(sub,l)) \wedge \texttt {isOfAction}(act,v) \nonumber \\ &\wedge (\texttt {isOfClass}(obj,g) \wedge \texttt {isObject}(obj,l)) & \end{aligned}$$
(28)
$$\begin{aligned} \phi _{9} &= \forall \text {Diag}(sub_{1},n,act_{1},v,obj_{1},g,l,sub_{2},act_{2},obj_{2}) \nonumber \\ &\quad (\texttt {isOfName}(sub_{1},n) \wedge \texttt {isSubject}(sub_{1},l)) \wedge \texttt {isOfAction}(act_{1},v) \nonumber \\ &\wedge (\texttt {isOfClass}(obj_{1},g) \wedge \texttt {isObject}(obj_{1},l)) \nonumber \\ &\wedge (\texttt {isOfName}(sub_{2},n) \wedge \texttt {isSubject}(sub_{2},l)) \wedge \nonumber \\ &(\texttt {isOfMacroclass}(obj_{2},q) \wedge \texttt {isObject}(obj_{2},l)) & \end{aligned}$$
(29)
$$\begin{aligned} \phi _{10} &= \forall \text {Diag}(sub_{1},n,act_{1},v,obj_{1},g,l,sub_{2},obj_{2}) \nonumber \\ &\quad (\texttt {isOfName}(sub_{1},n) \wedge \texttt {isSubject}(sub_{1},l)) \wedge \texttt {isOfAction}(act_{1},v) \nonumber \\ &\wedge (\texttt {isOfClass}(obj_{1},g) \wedge \texttt {isObject}(obj_{1},l)) \nonumber \\ &\wedge ((\texttt {isOfName}(sub_{2},n) \wedge \texttt {isSubject}(sub_{2},l))) \nonumber \\ &\wedge (\texttt {livesInHabitat}(obj_{2},h)) \wedge \texttt {isObject}(obj_{2},l)) & \end{aligned}$$
(30)

The subscripts introduced in the notation allow us to identify whether the extracted representations come from the previous sentence or the following one. Axiom \(\phi _{8}\) represents the degree of truth of a sentence like “Gordon is a German Shepherd”, axiom \(\phi _{9}\) expresses “Gordon is a German Shepherd and Gordon is carnivorous”, and axiom \(\phi _{10}\) expresses “Gordon is a German Shepherd and Gordon lives in a domestic environment”.
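Evaluating such conjunctive axioms requires combining the predicates' fuzzy truth values with a t-norm. The appendix does not spell out the conjunction grounding, so the product t-norm used below is an assumption, and the truth values are illustrative.

```python
from functools import reduce

def and_prod(*vals: float) -> float:
    """Product t-norm, assumed here as the grounding of the conjunction."""
    return reduce(lambda a, b: a * b, vals, 1.0)

# Illustrative truth values for the conjuncts of axiom phi_8 on one sentence
# ("Gordon is a German Shepherd"):
is_name, is_subject, is_action, is_class, is_object = 0.9, 0.95, 0.9, 0.8, 0.95
phi8 = and_prod(is_name, is_subject, is_action, is_class, is_object)
```

Since every conjunct lies in [0, 1], the product also lies in [0, 1] and drops whenever any single predicate is doubtful, which is the intended behavior for a truth degree of the whole sentence.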

C FrameNet Data Details

1.1 C.1 Frame-Frame Implications

The following is the complete list of all implications included with the FrameNet experiments. For more information on the meaning of each of the frames, please see https://framenet.icsi.berkeley.edu/frameIndex.

(Figures a, b, and c in the original publication list these implications as images.)


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Manigrasso, F., Schouten, S., Morra, L., Bloem, P. (2024). Probing LLMs for Logical Reasoning. In: Besold, T.R., d’Avila Garcez, A., Jimenez-Ruiz, E., Confalonieri, R., Madhyastha, P., Wagner, B. (eds) Neural-Symbolic Learning and Reasoning. NeSy 2024. Lecture Notes in Computer Science, vol. 14979. Springer, Cham. https://doi.org/10.1007/978-3-031-71167-1_14

  • DOI: https://doi.org/10.1007/978-3-031-71167-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71166-4

  • Online ISBN: 978-3-031-71167-1
