A tutorial on deep signal processing


In signal processing, it is sometimes necessary to measure the horizontal distance between features of a signal, for example, its peaks. A good example is interpreting an electrocardiogram (ECG), which relies on measuring such distances for much of its interpretation. We will consider a toy example of a smooth signal with only two peaks, shown in the picture below.
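The basic peak-to-peak measurement can be sketched in plain NumPy. The signal below is a made-up sum of two Gaussians (centers 3.0 and 7.0 are illustrative choices, not ECG data), and the peaks are found as simple local maxima:

```python
import numpy as np

# Toy signal: two Gaussian bumps, mimicking the two-peak example in the text.
x = np.linspace(0, 10, 1001)
y = np.exp(-(x - 3.0) ** 2 / 0.5) + 0.8 * np.exp(-(x - 7.0) ** 2 / 0.5)

# Local maxima: interior points higher than both neighbors.
interior = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
peak_idx = np.where(interior)[0] + 1

# Horizontal distance between the two peaks.
peak_distance = x[peak_idx[1]] - x[peak_idx[0]]
print(round(peak_distance, 2))  # → 4.0
```

For real, noisy signals you would typically use a dedicated peak finder such as `scipy.signal.find_peaks`, which lets you filter peaks by prominence and minimum spacing.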

Getting Started

A small synthetic data set can be enough for training a model.


Data-efficient learning is an important topic in data science and an active area of research. Training large models on big data can take a lot of time and resources, so the question is: can we replace a large data set with a smaller one that nevertheless contains all the useful information of the original, so that a model trained on the small data set still generalizes well to real data? …

Hands-on Tutorials

When all you have are categorical variables


I have spent some time studying data with categorical variables, exploring many ways to encode them into numeric features. But what if all your variables are categorical? One of the mechanisms for describing this scenario is known as a contingency table.

Contingency tables in their essence are (potentially multidimensional) tables where rows, columns and other dimensions represent categorical variables, and the cells contain counts of the occurrences of the combinations. As an example, consider a simple contingency table that represents salary (rows) vs. years of experience (columns). …
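The salary-vs-experience example can be built in one line with pandas. The categories and counts below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical observations; each row is one person.
df = pd.DataFrame({
    "experience": ["0-2", "0-2", "3-5", "3-5", "3-5", "6+"],
    "salary": ["low", "low", "medium", "medium", "high", "high"],
})

# A contingency table: each cell counts co-occurrences of the two categoricals.
table = pd.crosstab(df["salary"], df["experience"])
print(table)
```

For more than two categorical variables, `pd.crosstab` accepts lists of columns, producing the multidimensional tables mentioned above.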

ICML 2020

A deep learning algorithm for real world time series data


As a data scientist working primarily with business data (sometimes also called “tabular data”), I’m always looking for the latest developments in data science that help with more realistic data. One such area addresses the fact that business data are rarely “tabular”, but are usually relational in nature. I already discussed working with relational data in another blog post. Deep Sets algorithms help you learn from data that do not have a rectangular shape, but can be represented as a collection of tables or as a graph. …
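As a rough illustration of the core idea (not the trained networks from the paper), here is a minimal NumPy sketch of a Deep Sets-style function: an embedding applied to each set element, followed by permutation-invariant sum pooling and a readout. The weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# f(X) = rho(sum_i phi(x_i)); phi and rho are stand-in linear maps here.
W_phi = rng.normal(size=(4, 8))   # per-element embedding
W_rho = rng.normal(size=(8, 1))   # readout applied after pooling

def deep_set(X):
    # X: (n_elements, 4) -- a set represented as a matrix of rows.
    embedded = np.tanh(X @ W_phi)   # phi applied to every element
    pooled = embedded.sum(axis=0)   # permutation-invariant pooling
    return float(pooled @ W_rho)    # rho

X = rng.normal(size=(5, 4))
# Reordering the set elements leaves the output unchanged.
assert np.isclose(deep_set(X), deep_set(X[::-1]))
```

The sum pooling is what makes the function indifferent to element order, which is exactly the property you want when the input is a set rather than a fixed-shape table.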

ICML 2020

Predict a distribution of the target variable, not just point estimate


While looking through the ICML 2020 accepted papers, I found an interesting paper:

You may ask: how many more papers do we need on gradient boosting? But in fact, the GBT family of algorithms works really well on tabular data, consistently taking first place on Kaggle leaderboards. The technique is so successful that it is now being extended to areas beyond tabular data, for example, to NLP.

Problem statement

The problem we are tackling here is that almost all regression algorithms do not return the distribution of the target variable given the predictors, P(y|X), but only an expectation of the target variable…
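To make the distinction concrete, here is a small NumPy sketch (with made-up heteroscedastic data, not the paper's method) of why P(y|X) carries more information than the point estimate E[y|X] alone:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data where the noise level grows with x: the conditional mean
# E[y|x] = 2x is the same kind of line everywhere, but P(y|x) widens.
x = rng.uniform(0, 1, size=5000)
y = 2 * x + rng.normal(scale=0.1 + 0.9 * x)

# A distributional view of two regions: report (mu, sigma), not just mu.
lo = y[x < 0.1]   # near x = 0: tight distribution
hi = y[x > 0.9]   # near x = 1: much wider distribution
print(f"x≈0: mu={lo.mean():.2f}, sigma={lo.std():.2f}")
print(f"x≈1: mu={hi.mean():.2f}, sigma={hi.std():.2f}")
```

A model that outputs only the expectation would treat both regions identically, while the predicted sigma differs by several times; that spread is what distribution-predicting regressors aim to capture.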

Categorical encoding techniques often used in data science competitions


Encoding categorical variables is a very important step in feature engineering. Unfortunately, there is no single solution that works for all cases. Multiple techniques have been invented, and I cover some of them here, here and here.

In this post I will discuss Target Encoding (a.k.a. Mean Encoding) and its improved version, Bayesian Target Encoding, as well as its latest refinement, the Sampling Bayesian Encoder.
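As a quick reminder of the baseline, here is a minimal pandas sketch of plain target encoding with additive smoothing. The column names, the toy data, and the smoothing weight `m` are illustrative choices:

```python
import pandas as pd

# Made-up training data: a categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "target": [1, 1, 0, 0, 0, 1],
})

m = 2.0                                  # smoothing strength (prior pseudo-count)
global_mean = df["target"].mean()        # fallback for rare categories
stats = df.groupby("city")["target"].agg(["sum", "count"])

# Smoothed per-category target mean: blends the category mean with the
# global mean, pulling rare categories toward the global rate.
encoding = (stats["sum"] + m * global_mean) / (stats["count"] + m)
df["city_encoded"] = df["city"].map(encoding)
print(df)
```

Category "C" appears only once, so its encoding (2/3) sits much closer to the global mean (0.5) than its raw mean (1.0) would; this shrinkage is the part that the Bayesian variants put on a more principled footing.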

Why encode the categories?

Given an infinite amount of clean data, you would not need to encode categories at all: you could just train a separate model for each category. For example, for the Titanic problem you would train separate models…

A useful variation of softmax


In machine learning, there are several very useful functions, for example, sigmoid, ReLU, and softmax. The latter is widely used in multi-class classification problems as the output layer of neural networks:
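As a reference point before looking at the variation, the standard softmax can be written in a few lines of NumPy (subtracting the maximum is a common numerical-stability trick, not a change to the function):

```python
import numpy as np

def softmax(z):
    # Softmax is shift-invariant, so subtracting max(z) changes nothing
    # mathematically but keeps exp() from overflowing for large logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p)          # probabilities, largest for the largest logit
print(p.sum())    # → 1.0
```

The output is a valid probability distribution over the classes, which is why softmax is the conventional final layer for multi-class classifiers.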

The second day of workshops and the last day of the conference

The last day of the conference, and the second day of workshops. I actually wanted to attend two workshops, “Sets and Partitions” and “Fairness in Machine Learning for Health”, and I ended up attending sessions from both. The most exciting one, however, was one I did not attend: “Tackling Climate Change with ML”. Even though the room was pretty big, not everybody was able to fit in. This workshop also attracted some famous ML researchers, such as Yoshua Bengio and Andrew Ng. Other workshops had some entertainment value; for example, one featured an ELBO song. …

The first workshop day

A workshop is like a mini-conference on a specific topic. There were 27 workshops today, and some of them were very exciting, but that’s not what I was up to. For the benefit of my company, which, by the way, sent me to this conference (I am infinitely grateful for that!), I went to the Graph Representation Learning workshop. Somewhat to my disappointment, I did not hear anything revolutionary during this workshop. The major breakthroughs were made in the last couple of years, including the paper by Battaglia et al., which summarized everything in one algorithm: GN. By the way…

Michael Larionov, PhD

Data Scientist
