Using temporal-difference learning for multi-agent bargaining
Introduction
Bargaining and negotiation are terms generally used interchangeably to describe the interaction between two or more parties attempting to agree on a mutually acceptable outcome despite their negatively correlated preferences [1]. The application of agent technologies to automated bargaining stems from the growing interest in using software agents to remove obstacles hindering the success of electronic commerce [2]. Considerable effort has been devoted to designing agents that can adopt dynamic strategies for automated negotiation, using rule-based [3], [4], [5], [6], machine learning [7], [8], [9], [10], [11], [12], [13], [14], or hybrid approaches [15], [16]. However, these approaches may not be sufficiently robust and convenient for most sellers and buyers in e-commerce transactions. For example, users must define bargaining rules when using rule-based approaches, or must encode each proposal as a bit string (chromosome) and model a fitness function for proposal evaluation when using a genetic algorithm [7], [8], [9], [10]. With Bayesian probability approaches [11], [12] or case-based reasoning approaches [13], [14], bargaining performance degrades when few prior bargaining experiences or similar cases are available.
This research aims to design agents that act on behalf of a seller or a buyer in online bilateral bargaining over price. An agent can observe the prices offered by its opponent but cannot know the opponent's reservation price. In game-theoretic terms, this bargaining game is a dynamic game of incomplete information. The seller agent wants the price to be high, whereas the buyer agent wants it to be low. The seller has a reservation price srp and the buyer has a reservation price brp. If the final-contract price p∗ is greater than srp, p∗ − srp is the seller's surplus; if p∗ is less than brp, brp − p∗ is the buyer's surplus. An agreement can be reached only if brp > srp. Each agent seeks to maximize its principal's surplus in the bargaining process.
This research treats the bargaining process as a Markov decision process in which an agent perceives distinct states and chooses actions in response to them. At each discrete time step, the agent senses the current state and selects an action to perform. The environment then responds with a transition to the next state and a reward that indicates the desirability of the succeeding state. The agent's goal is to learn the optimal policy that maximizes its total rewards. Reinforcement learning is a machine learning technique that fits this process well when the numbers of states and actions are finite [17], [18]. Its advantage is that an agent can learn from its own experience rather than from examples provided by a knowledgeable supervisor. Moreover, an agent without real-world experience can learn from simulation games. For example, Gerald Tesauro's TD-Gammon program learned by playing backgammon against itself, and from this experience it came to play as well as the best human players [19]. Temporal-difference (TD) learning is a central and novel method of reinforcement learning: agents using TD methods can learn directly from raw experience without a model of the environment's dynamics, and these methods update estimates partly on the basis of other learned estimates, without waiting for a final outcome [20]. This research adopts TD-based reinforcement learning to design bargaining agents that learn how to offer and counteroffer on their own, and conducts several simulation games to test their bargaining performance. We expect TD-based reinforcement learning to be not only a robust and convenient approach to online bargaining but also one that achieves high bargaining performance in terms of average payoff and settlement rate.
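The learning loop described above can be sketched as a tabular Q-learning update, one of the standard TD methods. The states, actions, and constants below are illustrative stand-ins, not the paper's actual formulation:

```python
import random

ALPHA = 0.1    # learning rate (illustrative)
GAMMA = 1.0    # no discounting within a short bargaining episode
EPSILON = 0.1  # exploration rate (illustrative)

def q_update(Q, s, a, r, s_next, actions):
    """One temporal-difference (Q-learning) backup:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best_next - old)
    return Q[(s, a)]

def choose_action(Q, s, actions):
    """Epsilon-greedy action selection over the learned estimates."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

Note that the update uses the learned estimate of the next state's value rather than waiting for the episode's final outcome, which is the defining feature of TD methods mentioned above.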
Section snippets
Markov decision process
A Markov decision process (MDP) is a stochastic decision process over a discrete-time Markov chain, in which the decisions of each epoch and the returns are associated with the state the decision maker has observed [21]. In an MDP, a decision maker perceives a set S of distinct states of the environment and has a set A of actions it can perform. At each discrete time step t, the decision maker senses the current state st, chooses an action at from its set of actions, and performs it. The
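When an MDP's transition and reward structure is fully known (unlike the model-free TD setting used later), optimal state values can be computed directly by value iteration. A minimal sketch on a made-up toy MDP, using the Bellman backup V(s) = max_a Σ P(s′|s,a)[R + γV(s′)]:

```python
GAMMA = 0.9  # illustrative discount factor

# transitions[s][a] = list of (probability, next_state, reward); toy example
transitions = {
    "s0":   {"go": [(1.0, "s1", 0.0)],   "stay": [(1.0, "s0", 0.0)]},
    "s1":   {"go": [(1.0, "done", 1.0)], "stay": [(1.0, "s1", 0.0)]},
    "done": {},  # terminal state: no actions
}

def value_iteration(transitions, gamma=GAMMA, tol=1e-8):
    """Iterate the Bellman optimality backup until values stop changing."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, acts in transitions.items():
            if not acts:
                continue  # terminal state keeps value 0
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in acts.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

Here V("s1") converges to 1.0 (the immediate reward for reaching the terminal state) and V("s0") to 0.9 (that reward discounted one step).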
TD-based bargaining agent
An agent should calculate its own utility and perceive its opponent's bargaining power throughout the bargaining process to determine an effective bargaining strategy. An agent's utility can be calculated from its reservation price, its current offered price (the agent's position), and the opponent's current offered price (the opponent's position) [25]. The opponent's perceived bargaining power can be estimated by analyzing the opponent's concession behavior [26], e.g. analyzing an opponent's average
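The exact utility and concession formulas from [25] and [26] are not reproduced in this snippet, so the helpers below are plausible stand-ins (shown for a buyer agent) that merely illustrate the two signals the agent tracks:

```python
def buyer_utility(brp, opp_offer):
    """Hypothetical utility: fraction of the buyer's reservation price saved
    at the seller's current offer. brp is the buyer's reservation price.
    This is an assumed form, not the formula from [25]."""
    if brp <= 0:
        raise ValueError("reservation price must be positive")
    return max(0.0, (brp - opp_offer) / brp)

def avg_concession(opponent_offers):
    """Average per-round concession in the opponent's price sequence,
    one rough proxy for its bargaining power (an assumed form, not [26])."""
    if len(opponent_offers) < 2:
        return 0.0
    steps = [abs(b - a) for a, b in zip(opponent_offers, opponent_offers[1:])]
    return sum(steps) / len(steps)
```

For example, a seller whose offers fall 100 → 95 → 92 has an average concession of 4.0 per round; larger values suggest a weaker bargaining position.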
Experimental design
In the electronic commerce environment, a seller may delegate a software agent on a Web site to bargain with numerous human buyers. We would like to understand how a TD-based bargaining agent acts for a seller when bargaining with buyers who have different risk attitudes. A seller agent formulates its price position as (spt − srp)/(sp1 − srp), where srp is the seller agent's reservation price, spt is its offer at time step t, and sp1 is its initial offer. A seller agent
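The seller's price-position formula above translates directly to code; the parameter names follow the text:

```python
def seller_position(srp, sp1, spt):
    """Seller agent's normalized price position (sp_t - srp) / (sp_1 - srp):
    1.0 at the initial offer sp_1, 0.0 at the reservation price srp."""
    if sp1 == srp:
        raise ValueError("initial offer must differ from the reservation price")
    return (spt - srp) / (sp1 - srp)
```

For instance, with srp = 50 and sp1 = 100, an offer of 75 gives a position of 0.5, i.e. the seller has conceded half of its feasible range.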
Experimentation
The seller and buyer agents are built on the Java Agent Development Framework (JADE; http://jade.cselt.it) platform, which complies with the FIPA (Foundation for Intelligent Physical Agents; http://www.fipa.org) specifications. The TD-Bargain mechanism fixes the discount factor γ = 1 because the interaction intervals in a bargaining session are very short. The neural network's parameters are set to a learning rate of 0.25 and a momentum of 0.05, and its initial weights are generated randomly. The
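A rough sketch of a one-hidden-layer back-propagation network used as a Q-value approximator with the stated settings (learning rate 0.25, momentum 0.05, randomly generated initial weights). The layer sizes, activation function, and initialization range are assumptions, since the snippet does not specify them:

```python
import math
import random

LEARNING_RATE = 0.25  # as stated in the experiments
MOMENTUM = 0.05       # as stated in the experiments

class QNet:
    """One-hidden-layer network with tanh hidden units and a linear output,
    trained by back-propagation with momentum. Sizes are illustrative."""

    def __init__(self, n_in, n_hidden, seed=0):
        rnd = random.Random(seed)
        init = lambda: rnd.uniform(-0.5, 0.5)  # assumed init range
        self.w1 = [[init() for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [init() for _ in range(n_hidden)]
        self.w2 = [init() for _ in range(n_hidden)]
        self.b2 = init()
        # momentum buffers, one per weight
        self.vw1 = [[0.0] * n_in for _ in range(n_hidden)]
        self.vb1 = [0.0] * n_hidden
        self.vw2 = [0.0] * n_hidden
        self.vb2 = 0.0

    def forward(self, x):
        self.h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, self.h)) + self.b2

    def train(self, x, target):
        """One back-propagation step toward a TD target; returns the error."""
        y = self.forward(x)
        err = target - y
        w2_old = list(self.w2)  # use pre-update weights for hidden gradients
        # output layer: gradient of squared error w.r.t. w2 is -err * h
        for j, h in enumerate(self.h):
            self.vw2[j] = LEARNING_RATE * err * h + MOMENTUM * self.vw2[j]
            self.w2[j] += self.vw2[j]
        self.vb2 = LEARNING_RATE * err + MOMENTUM * self.vb2
        self.b2 += self.vb2
        # hidden layer: tanh'(z) = 1 - h^2
        for j, (h, row) in enumerate(zip(self.h, self.w1)):
            dh = err * w2_old[j] * (1 - h * h)
            for i, xi in enumerate(x):
                self.vw1[j][i] = LEARNING_RATE * dh * xi + MOMENTUM * self.vw1[j][i]
                row[i] += self.vw1[j][i]
            self.vb1[j] = LEARNING_RATE * dh + MOMENTUM * self.vb1[j]
            self.b1[j] += self.vb1[j]
        return err
```

In TD training, `target` would be the bootstrapped estimate r + γ·maxQ(s′,·) rather than a supervised label.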
Conclusions and future research
This study proposes bargaining agents equipped with a TD-based reinforcement learning mechanism to perform bilateral price bargaining under incomplete information. We use back-propagation neural networks to implement the Q-functions for learning dynamic strategies. The λ parameter controls temporal credit assignment by determining how an error detected at a given time step feeds back to correct previous predictions. Four sets of bargaining games are designed to measure the agents'
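The role of λ can be illustrated with a tabular TD(λ) value update using eligibility traces; this is a generic sketch of the credit-assignment idea, not the paper's neural implementation:

```python
ALPHA = 0.1  # learning rate (illustrative)
GAMMA = 1.0  # matches the fixed discount factor in the experiments

def td_lambda_episode(V, trajectory, lam):
    """Run one episode of tabular TD(lambda).
    trajectory: list of (state, reward, next_state) transitions."""
    e = {}  # eligibility trace per state
    for s, r, s_next in trajectory:
        # one-step TD error at this time step
        delta = r + GAMMA * V.get(s_next, 0.0) - V.get(s, 0.0)
        e[s] = e.get(s, 0.0) + 1.0           # accumulate trace on visit
        for state, trace in e.items():       # every traced state shares the error
            V[state] = V.get(state, 0.0) + ALPHA * delta * trace
        for state in e:                      # traces decay by gamma * lambda
            e[state] *= GAMMA * lam
    return V
```

With λ = 0 only the current state is corrected; as λ approaches 1, an error detected late in the episode feeds back to all earlier predictions, which is exactly the trade-off the parameter controls.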
References (33)
- et al., A factory-based approach to support e-commerce agent fabrication, Electronic Commerce Research and Applications (2004)
- et al., Reaching agreements through argumentation: A logical model and implementation, Artificial Intelligence Journal (1998)
- et al., A formal approach to negotiating agents development, Electronic Commerce Research and Applications (2002)
- et al., The Economics of Bargaining (1969)
- F. Sadri, F. Toni, P. Torroni, Dialogues for Negotiation: Agent Varieties and Dialogue Sequences, in: Pre-proceedings...
- S.-l. Huang, Y. Yuan, F.-r. Lin, Adding Persuasion into On-line Bargaining Process, in: Proceedings of the 6th Pacific...
- et al., Genetic algorithm approach to a negotiation support system, IEEE Transactions on Systems, Man, and Cybernetics (1991)
- A machine learning approach to automated negotiation and prospects for electronic commerce, Journal of Management Information Systems (1997)
- G. Dworman, S.O. Kimbrough, J.D. Laing, On Automated Discovery of Models Using Genetic Programming in Game Theoretic...
- G. Dworman, S.O. Kimbrough, J.D. Laing, Bargaining by Artificial Agents in Two Coalition Games: A Study in Genetic...
- A multi-agent framework for automated online bargaining, IEEE Intelligent Systems