In signal processing, it is sometimes necessary to measure the horizontal distance between certain features of a signal, for example, its peaks. A good example is interpreting an electrocardiogram (ECG), which relies on measuring distances for much of its interpretation. We will consider a toy example: a smooth signal with only two peaks, shown in the picture below.
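The measurement can be sketched with SciPy's `find_peaks` on a made-up two-peak signal (illustrative only; the Gaussian bumps below stand in for the signal in the picture):

```python
import numpy as np
from scipy.signal import find_peaks

# Toy signal: the sum of two Gaussian bumps centered at x = 2 and x = 7
x = np.linspace(0, 10, 1000)
y = np.exp(-(x - 2) ** 2) + np.exp(-(x - 7) ** 2)

# Locate the peaks, then measure the horizontal distance between them
peaks, _ = find_peaks(y)
distance = x[peaks[-1]] - x[peaks[0]]
print(distance)  # close to 5.0, the spacing of the two bumps
```

For noisy real signals, `find_peaks` accepts parameters such as `height` and `prominence` to filter out spurious local maxima.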

*Data-efficient learning* is an important topic in Data Science and an active area of research. Training large models on big data can take a lot of time and resources, so the question arises: can we replace a large data set with a smaller one that nevertheless contains all of the useful information from the large one, so that a model trained on the small data set will generalize well to the real data? …

I have spent some time studying data with categorical variables, exploring many ways to encode them as numeric features. But what if all your variables are categorical? One of the mechanisms for describing this scenario is known as **contingency tables**.

Contingency tables are, in essence, (potentially multidimensional) tables where the rows, columns and other dimensions represent categorical variables, and the cells contain counts of the occurrences of each combination. As an example, consider a simple contingency table that represents salary (rows) vs. years of experience (columns). …
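A table like this can be built with `pandas.crosstab`; the data below is made up purely for illustration:

```python
import pandas as pd

# Hypothetical observations: a salary band and an experience band per person
df = pd.DataFrame({
    "salary": ["low", "low", "high", "high", "low", "high"],
    "experience": ["0-2", "0-2", "5+", "5+", "3-5", "3-5"],
})

# Rows are salary bands, columns are experience bands, cells are counts
table = pd.crosstab(df["salary"], df["experience"])
print(table)
```

Adding more columns to the `index` or `columns` arguments of `crosstab` produces the multidimensional case.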

As a data scientist working primarily with business data (sometimes also called “tabular data”), I’m always looking for the latest developments in data science that help with more realistic data. One such area addresses the fact that business data is rarely “tabular”, but is usually relational in nature. I already discussed working with relational data in another blog post. Deep Sets algorithms help you learn from data that does not have a rectangular shape but can be represented as a collection of tables or as a graph. …

While looking through the ICML 2020 accepted papers, I found an interesting paper:

You may ask: how many more papers do we need on Gradient Boosting? But in fact, the GBT family of algorithms works really well on tabular data, consistently taking first places on Kaggle leaderboards. The technique is so successful that it is now being extended to areas beyond tabular data, for example, to NLP.

The problem we are tackling here is that almost all regression algorithms do not return the distribution of the target variable given the predictors, P(y|X), but only the expectation of the target variable…
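Why the expectation alone can be misleading is easy to see on a skewed target; a purely illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical: many observations of y for the *same* predictor values X,
# drawn from a right-skewed (lognormal) distribution
y_samples = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# A standard regressor would return only the point estimate E[y|X] ...
point_estimate = y_samples.mean()

# ... while the full distribution P(y|X) also carries quantiles, spread, etc.
q10, q50, q90 = np.quantile(y_samples, [0.1, 0.5, 0.9])
print(point_estimate, (q10, q50, q90))
```

For a skewed distribution like this, the mean sits well above the median, so reporting only E[y|X] hides most of the picture.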

Encoding categorical variables is a very important step in feature engineering. Unfortunately, there is no solution that works in all cases. Multiple techniques have been invented, and I cover some of them here, here and here.

In this post I will discuss **Target Encoding** (a.k.a. **Mean Encoding**) and its improved version, **Bayesian Target Encoding**, as well as its latest refinement, the **Sampling Bayesian Encoder**.

Given an infinite amount of clean data, you would not need to encode categories at all: you could just train a separate model for each category. For example, for the Titanic problem you would train separate models…
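Since real data is finite, plain target (mean) encoding and a smoothed, Bayesian-flavored variant can be sketched as follows; the data and the prior weight `m` are made up for illustration:

```python
import pandas as pd

# Toy data: a categorical feature and a binary target
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "target": [1, 1, 0, 0, 0, 1],
})

# Plain target (mean) encoding: replace each category with its mean target
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)

# Smoothed variant: shrink each category mean toward the global mean,
# with a prior weight m acting like m pseudo-observations
m = 2.0
global_mean = df["target"].mean()
counts = df.groupby("city")["target"].count()
smoothed = (means * counts + global_mean * m) / (counts + m)
df["city_encoded_smooth"] = df["city"].map(smoothed)
print(df)
```

The shrinkage matters most for rare categories: the single-row city "C" is pulled strongly toward the global mean, while the better-represented "A" moves only slightly.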

In machine learning, there are several very useful functions, for example sigmoid, ReLU and softmax. The latter is widely used in multi-class classification problems as the output layer of neural networks:
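The softmax of a vector z is defined componentwise as softmax(z)_i = exp(z_i) / Σ_j exp(z_j). A minimal NumPy sketch (the `softmax` helper below is written for illustration):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax([1.0, 2.0, 3.0])
print(probs)  # non-negative, sums to 1, largest entry for the largest logit
```

Subtracting `z.max()` does not change the result (the factor cancels in the ratio) but prevents overflow for large logits.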

The last day of the conference, and the second day of workshops. I actually wanted to attend two workshops, “Sets and Partitions” and “Fairness in Machine Learning for Health”, and I ended up attending sessions from both. The most exciting one was the one I did not attend: “Tackling Climate Change with ML”. Even though the room was pretty big, not everybody was able to fit in. That workshop also attracted some famous ML researchers, such as Yoshua Bengio and Andrew Ng. Other workshops had some entertainment value as well; for example, one featured an ELBO song. …

A workshop is like a mini-conference on a specific topic. There were 27 workshops today, and some of them were very exciting, but that’s not what I was up to. For the benefit of my company, which, by the way, sent me to this conference (I am infinitely grateful for that!), I went to the Graph Representation Learning workshop. Somewhat to my disappointment, I did not hear anything revolutionary there: the major breakthroughs were made in the last couple of years, including the paper by Battaglia et al. that summarized everything in one algorithm: GN. By the way…

Data Scientist