Seller profile
pastorshrimp70
  • Full name: pastorshrimp70
  • Location: Isiala-Ngwa North, Osun, Nigeria
  • Website: https://controlc.com/ef94a14f
  • User Description:

Pre-training Reinforcement Learning agents in a task-agnostic manner has shown promising results. However, previous works still struggle to learn and discover meaningful skills in high-dimensional state spaces, such as pixel spaces. We approach the problem by leveraging unsupervised skill discovery and self-supervised learning of state representations. In our work, we learn a compact latent representation by making use of variational and contrastive techniques. We demonstrate that both enable RL agents to learn a set of basic navigation skills by maximizing an information-theoretic objective. We assess our method on Minecraft 3D pixel maps of different complexities. Our results show that representations and conditioned policies learned from pixels are enough for toy examples, but do not scale to realistic and complex maps. To overcome these limitations, we explore alternative input observations, such as the relative position of the agent along with the raw pixels.

Keywords: Reinforcement Learning, ICML, Unsupervised Learning, Skill Discovery, Self-Supervised Learning, Intrinsic Motivation, Empowerment

Reinforcement Learning (RL) [29] has witnessed a wave of outstanding works in the last decade, with a special focus on games (Schrittwieser et al. [27], Vinyals et al. [32], Berner et al. [4]) but also on robotics (Akkaya et al. [2], Hwangbo et al. [18]). In general, these works follow the classic RL paradigm, in which an agent interacts with an environment by performing actions and receives a reward in response. These agents are optimized to maximize the expected sum of future rewards.

Rewards are usually handcrafted or over-parameterized, and this becomes a bottleneck that prevents RL from scaling. For this reason, there has been increasing interest in recent years in training agents in a task-agnostic manner, making use of intrinsic motivations and unsupervised techniques. Recent works have explored the unsupervised learning paradigm (Campos et al. [7], Gregor et al. [16], Eysenbach et al. [13], Warde-Farley et al. [33], Burda et al. [6], Pathak et al. [24]), but RL is still far from the remarkable results obtained in other domains. For instance, in computer vision, Chen et al. [10] achieve 81% accuracy on ImageNet when training in a self-supervised manner, and Caron et al. [9] achieve state-of-the-art results in image and video object segmentation using Vision Transformers [12] and no labels at all. Likewise, in natural language processing, pre-trained language models such as GPT-3 [5] have become the basis for other downstream tasks.

Humans and animals are sometimes guided through the process of learning. We have good priors that allow us to properly explore our surroundings, which leads to discovering new skills. For machines, learning skills in a task-agnostic manner has proved to be challenging [33, 21]. These works state that training pixel-based RL agents end-to-end is not efficient, because learning a good state representation is unfeasible due to the high dimensionality of the observations. Moreover, most of the successes in RL come from training agents for thousands of simulated years (Berner et al. [4]) or millions of games (Vinyals et al. [32]). This learning approach is very sample-inefficient, and the high computational budget it implies sometimes limits research. As a response, benchmarks such as MineRL (Guss et al. [17]) and the ProcGen Benchmark (Cobbe et al. [11]) have been proposed to promote the development of algorithms that reduce the number of samples needed to solve complex tasks.

Our work is inspired by Campos et al. [7] and their Explore, Discover and Learn (EDL) paradigm. EDL relies on empowerment [25] for motivating an agent intrinsically. Empowerment aims to maximize the influence of the agent over the environment while discovering novel skills. As stated by Salge et al. [25], this can be achieved by maximizing the mutual information between sequences of actions and final states. Gregor et al. [16] introduce an approach in which, instead of blindly committing to a sequence of actions, each action depends on the observation from the environment; this is achieved by maximizing the mutual information between inputs and some latent variables. Campos et al. [7] embrace this approach, as we do. However, their implementation makes assumptions that are not realistic for pixel observations: due to the Gaussian assumption at the output of variational approaches, the intrinsic reward is computed as the reconstruction error, and in the pixel domain this metric does not necessarily match the distance in the environment space. Therefore, we look for alternatives that suit our requirements: we derive a different reward from the mutual information, and we study alternatives to the variational approach.
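As a point of reference for the mutual-information objective discussed above, the snippet below is a minimal PyTorch sketch of a variational skill-discovery reward in the spirit of Gregor et al. [16] and Eysenbach et al. [13]: a discriminator approximating q(z|s) is trained to recover the latent skill from the visited state, and the intrinsic reward is the variational lower bound log q(z|s) - log p(z) under a uniform categorical prior. Class names, layer sizes and the use of PyTorch are our own illustrative assumptions, not the implementation evaluated in this work.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SkillDiscriminator(nn.Module):
        """Approximates q(z|s): which latent skill z produced the state embedding s."""
        def __init__(self, state_dim: int, n_skills: int, hidden: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_skills),
            )

        def forward(self, state_emb: torch.Tensor) -> torch.Tensor:
            return self.net(state_emb)  # unnormalized logits over the skill set

    def discriminator_loss(disc, state_emb, skill):
        # The discriminator is trained to classify which skill generated each state.
        return F.cross_entropy(disc(state_emb), skill)

    def intrinsic_reward(disc, state_emb, skill, n_skills):
        # Variational lower bound on I(s; z): r = log q(z|s) - log p(z),
        # with a uniform categorical prior p(z) = 1 / n_skills.
        with torch.no_grad():
            log_q = F.log_softmax(disc(state_emb), dim=-1)            # (B, n_skills)
            log_q_z = log_q.gather(1, skill.unsqueeze(1)).squeeze(1)  # log q(z_i | s_i)
        log_p_z = -torch.log(torch.tensor(float(n_skills)))
        return log_q_z - log_p_z

Under this formulation the reward depends only on how identifiable the skill is from the state, rather than on a pixel-space reconstruction error.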
This work focuses on learning meaningful representations, discovering skills, and training latent-conditioned policies. In all cases, our methodology does not require any supervision and works directly from pixel observations. Additionally, we also study the impact of extra input information in the form of position coordinates. Our proposal is tested on the MineRL [17] environment, which is based on the popular Minecraft videogame. Even though the game proposes a final goal, Minecraft is well known for the freedom it gives to players, and most human players actually use this freedom to explore the virtual world following their intrinsic motivations. Similarly, we aim to discover skills in Minecraft without any extrinsic reward.

We generate random trajectories in Minecraft maps that pose few exploratory challenges, and we also study contrastive alternatives that exploit the temporal information within a trajectory. The contrastive approach aims to learn an embedding space where observations that are close in time are also close in the embedding space. A similar result can be achieved by leveraging the agent's relative position in the form of coordinates. In the latter case, the objective is to infer skills that do not fully rely on pixel resemblance, but also take into account temporal and spatial relationships.
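To make the contrastive alternative more concrete, the following is a minimal PyTorch sketch of an InfoNCE-style temporal loss, assuming that two observations sampled a few steps apart in the same trajectory form a positive pair and that the other pairings in the batch act as negatives. The function name and temperature value are illustrative assumptions rather than details of our implementation.

    import torch
    import torch.nn.functional as F

    def temporal_info_nce(anchor_emb, positive_emb, temperature=0.1):
        """InfoNCE over a batch of (anchor, positive) frame embeddings of shape (B, D).

        Positives are frames sampled a few steps apart in the same trajectory;
        every other pairing in the batch is treated as a negative.
        """
        anchor = F.normalize(anchor_emb, dim=-1)
        positive = F.normalize(positive_emb, dim=-1)
        logits = anchor @ positive.t() / temperature               # (B, B) cosine similarities
        labels = torch.arange(anchor.size(0), device=anchor.device)
        return F.cross_entropy(logits, labels)                     # match anchor i with positive i

Minimizing this loss pulls temporally neighbouring frames together in the embedding space while pushing frames from other parts of the batch apart.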
Our final goal is to discover and learn skills that can potentially be used in broader and more complex tasks, either by transferring the policy knowledge or by using hierarchical approaches. Some works have already assessed this idea, especially in robotics [14] and 2D games [8]. Once the pre-training stage is completed and the agent has learned some basic behaviours or skills, the agent is exposed to an extrinsic reward. These works show how agents leverage the skill knowledge to learn much faster and to properly explore the environment in unrelated downstream tasks. However, transferring the policy knowledge is not as straightforward as in other deep learning tasks. If one wants to transfer behaviours (policies), the change in the task might lead to catastrophic forgetting.

Our contributions are the following:

  • We demonstrate, by leveraging contrastive techniques, that variational techniques are not the only ones capable of maximizing the mutual information between inputs and latent variables.
  • We provide alternatives for discovering and learning skills in procedurally generated maps by leveraging the agent's coordinate information.
  • We successfully implement the reverse form of the mutual information for optimizing pixel-based agents in a complex 3D environment.

Intrinsic Motivations (IM) are very helpful mechanisms for dealing with sparse rewards. In some environments the extrinsic rewards are very difficult to obtain and, therefore, the agent does not receive any feedback with which to progress. In order to drive the learning process without supervision, we can derive intrinsic motivations as proxy rewards that guide the agent towards the extrinsic reward, or simply towards better exploration.

Skill Discovery. We relate Intrinsic Motivations to the concept of empowerment [25], an RL paradigm in which the agent looks for the states where it has more control over the environment. Mohamed and Rezende [22] derived a variational lower bound on the mutual information that allows empowerment to be maximized. Skill discovery extends this idea from single actions to temporally extended actions. Florensa et al. [14] merge skill discovery and hierarchical architectures: they learn a high-level policy on top of basic skills learned in a task-agnostic way, and show how this set-up improves exploration and enables faster training in downstream tasks. Similarly, Achiam et al. [1] emphasize learning the skills dynamically with a curriculum learning approach, allowing the method to learn up to a hundred skills; instead of maximizing the mutual information between states and skills, they use skills and whole trajectories. Eysenbach et al. [13] demonstrate that learned skills can serve as an effective pre-training mechanism for robotics; our work follows their approach regarding the use of a categorical and uniform prior over the latent variables. Campos et al. [7] expose the lack of coverage of previous works and propose Explore, Discover and Learn (EDL), a method for skill discovery that breaks the dependency on the distributions induced by the policy. Warde-Farley et al. [33] provide an algorithm for learning goal-conditioned policies using an imitator and a teacher, and demonstrate its effectiveness in pixel-based environments such as Atari, the DeepMind Control Suite and DeepMind Lab.

Intrinsic curiosity. In a broader spectrum we find methods that leverage intrinsic rewards to encourage exploratory behaviours. Pathak et al. [24] present an Intrinsic Curiosity Module that defines curiosity, i.e. the reward, as the error in predicting the consequences of the agent's own actions in a visual feature space. Similarly, Burda et al. [6] use a Siamese network in which one encoder tries to predict the output of the other, randomly initialized one; the bonus reward is computed as the error between the prediction and the random network's output.

Goal-oriented RL. Many of the works dealing with skill discovery end up parameterizing a policy. This policy is usually conditioned on some goal or latent variable z ∼ Z.
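As a generic illustration of such latent-conditioned policies, the sketch below concatenates a one-hot encoding of the sampled skill with the state embedding before the policy head; the architecture, names and layer sizes are hypothetical and not taken from this work.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentConditionedPolicy(nn.Module):
        """pi(a | s, z): the discrete skill z is one-hot encoded and concatenated
        with the state embedding before the policy head."""
        def __init__(self, state_dim: int, n_skills: int, n_actions: int, hidden: int = 256):
            super().__init__()
            self.n_skills = n_skills
            self.net = nn.Sequential(
                nn.Linear(state_dim + n_skills, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state_emb: torch.Tensor, skill: torch.Tensor) -> torch.Tensor:
            z = F.one_hot(skill, num_classes=self.n_skills).float()
            return self.net(torch.cat([state_emb, z], dim=-1))  # action logits

    # A skill is typically sampled from the uniform prior at the start of an episode
    # and kept fixed while the policy acts on the per-step state embeddings.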
