Some cool open-source Python packages for Machine Learning Ep 2


Python has a very rich ecosystem of ML libraries. Here is a list of some “active”, open-source packages that may be useful for day-to-day ML work.

This post follows the previous episode in this series.


Database connectivity

  • Turbodbc - a module to access relational databases via the Open Database Connectivity (ODBC) interface.
  • ibis - a toolbox to bridge the gap between local Python environments, remote storage, execution systems like Hadoop components (HDFS, Impala, Hive, Spark) and SQL databases.

Data description

Data preparation

  • Snorkel - a system for quickly generating training data with weak supervision.
  • imbalanced-learn - a package for tackling the curse of imbalanced datasets in machine learning.
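To make the imbalanced-dataset problem concrete, here is a minimal plain-Python sketch of random oversampling, one of the simplest rebalancing strategies. It is not imbalanced-learn's API: the function name `random_oversample` and the toy data are assumptions made up for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All rows that belong to the current class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw (with replacement) as many extra rows as the class is missing.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy dataset: four "neg" samples and a single "pos" sample.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["neg", "neg", "neg", "neg", "pos"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Duplicating rows is the crudest option; it only shows what "rebalancing" means before reaching for a dedicated package.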

Feature engineering

  • dirty_cat - helps with machine learning on non-curated categories by providing encoders that are robust to morphological variants, such as typos, in the category strings.
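The idea of an encoder that tolerates typos can be sketched in plain Python by comparing character n-grams. This is a simplified illustration (Jaccard similarity over 3-grams), not dirty_cat's implementation or API: `char_ngrams`, `ngram_similarity`, `similarity_encode` and the example categories are all hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lower-cased string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (1.0 = same set)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarity to every prototype category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Assumed "clean" categories and some dirty observed values.
prototypes = ["police officer", "teacher"]
values = ["police oficer", "Teacher", "nurse"]
encoded = similarity_encode(values, prototypes)
for value, row in zip(values, encoded):
    print(value, [round(x, 2) for x in row])
```

Unlike one-hot encoding, a misspelled category still lands close to its intended prototype instead of becoming a brand-new column.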

Dimension reduction

  • ivis - a machine learning algorithm for reducing dimensionality of very large datasets.

Automated machine learning

  • auto-sklearn - an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
  • Auto-Keras - an open source software library for automated machine learning.
  • Keras Tuner - a hyperparameter tuner for Keras.

Model analysis

  • Skater - a unified framework to enable model interpretation for all forms of models.

Workflow management

  • prefect - a workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine.
  • papermill - a tool for parameterizing, executing, and analyzing Jupyter Notebooks.

Model management

  • Studio - a model management framework written in Python to help simplify and expedite your model building experience.

Data visualization

  • - a powerful open-source geospatial analysis tool for large-scale datasets, with a Jupyter widget that renders large-scale interactive maps in Jupyter Notebook.
  • glue - a library to explore relationships within and among related datasets.
  • KeplerMapper - an implementation of the TDA Mapper algorithm for visualization of high-dimensional data.

Natural language processing

  • pytorch-transformers - a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).
  • spacy-pytorch-transformers - provides spaCy model pipelines that wrap Hugging Face's pytorch-transformers package, so you can use them in spaCy.

Time series

  • STUMPY - a powerful and scalable library that can be used for a variety of time series data mining tasks.
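STUMPY is built around the matrix profile: for every window of a time series, the distance to its most similar window elsewhere in the series. The brute-force sketch below only illustrates that idea in plain Python. It is not STUMPY's API or algorithm; the function names, the exclusion-zone size and the toy series are assumptions.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def naive_matrix_profile(series, m):
    """For each length-m window, return the z-normalized Euclidean distance
    to its nearest neighbour outside a small exclusion zone (O(n^2) sketch)."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    exclusion = max(1, m // 4)  # assumed size; skips trivial self-matches
    profile = []
    for i, w in enumerate(windows):
        best = math.inf
        for j, v in enumerate(windows):
            if abs(i - j) <= exclusion:
                continue
            best = min(best, math.dist(w, v))
        profile.append(best)
    return profile


# A perfectly periodic series: every window repeats four steps later.
periodic = [0, 1, 3, 1] * 3
periodic_profile = naive_matrix_profile(periodic, m=4)
print([round(d, 3) for d in periodic_profile])

# The same pattern with one period disturbed (0, 3, 1, 1 instead of 0, 1, 3, 1).
disturbed = [0, 1, 3, 1, 0, 1, 3, 1, 0, 3, 1, 1, 0, 1, 3, 1]
disturbed_profile = naive_matrix_profile(disturbed, m=4)
print([round(d, 3) for d in disturbed_profile])
```

Windows that repeat elsewhere score close to zero, while windows with no close match score higher, which is what makes the profile useful for spotting repeated patterns and unusual segments. The quadratic loop here is only practical for tiny inputs.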