The Extended Discrete Element Method (XDEM) is a novel numerical simulation technique that extends the classical Discrete Element Method (DEM), which simulates the motion of granular material, with additional properties such as the chemical composition, thermodynamic state, and stress/strain of each particle. It has been applied successfully in numerous industries involving the processing of granular materials such as sand, rock, wood or coke. In this context, computational simulation with (X)DEM has become an increasingly essential tool for researchers and scientific engineers to set up and explore their experimental processes. However, increasing the size or the accuracy of a model requires running a parallelized implementation on High Performance Computing (HPC) platforms to accommodate the growing needs in terms of memory and computation time. In practice, such a parallelization is traditionally obtained using either MPI (distributed memory computing), OpenMP (shared memory computing) or hybrid approaches combining both. In this paper, we present the results of our effort to implement an OpenMP version of XDEM allowing hybrid MPI+OpenMP simulations (XDEM being already parallelized with MPI). Far from the basic OpenMP paradigm and recommendations (which essentially amount to decorating the main computation loops with a set of OpenMP pragmas), the OpenMP parallelization of XDEM required a fundamental code refactoring and careful tuning in order to reach good performance. There are two main reasons for these difficulties. Firstly, XDEM is a legacy code that has been under development for more than 10 years and was initially focused on accuracy rather than performance. Secondly, the particles in a DEM simulation are highly dynamic: particles can be added or deleted, and their interaction relations can change, at any timestep of the simulation.
This article therefore details the multiple layers of optimization applied, such as in-depth profiling and reorganization of the data structures, the use of fast multithreaded memory allocators, and advanced process/thread-to-core pinning techniques. Experimental results evaluate the benefit of each optimization individually and validate the implementation using a real-world application executed on the HPC platform of the University of Luxembourg. Finally, we present our hybrid MPI+OpenMP results, which show a 15% to 20% performance gain, and explain how the hybrid approach overcomes the scalability limits of pure MPI XDEM simulations by allowing the number of compute cores to increase without a drop in performance.