Making food recommendations by analyzing cuisines and ingredients

Food, along with music, movies and sports, is one of the few universal things that connect people. It is a fundamental building block of life and something people thoroughly enjoy. And just like music, the varieties are endless.

From the simplicity and subtlety of Italian cuisine to the bold colors and flavors of Indian food, there is a plethora of tastes to be discovered and re-discovered.

Fig 1. Vietnamese food is among some of the most vibrant cuisines in the world in terms of both color and flavor

The sensations of food are closely tied to their place of origin, be it the sight, the smell or the taste. These feelings remind people of places they may have visited on their travels or even of their native home.

In this post, we explore various cuisines and the ingredients associated with them. From a dataset of dishes from around the world, we learn to predict the cuisine of each dish from the ingredients used in them.

Data description

The datasets were obtained from Yummly and came in two separate files: the training data consisted of 39774 rows, whereas the test data contained 9944 rows.

Each row in the training dataset is a separate dish, and the columns represent cuisine, id and ingredients, where the ingredients variable holds a list of ingredients for that dish. Hence, each instance in the data can be interpreted as a separate recipe.

Fig 2 shows part of the dataset with the aforementioned three variables and their contents.

Fig 2. Snippet of initial dataset showing the three variables

Note that the test dataset does not contain the cuisine column, as this data is used to predict the cuisines from the ingredients (and other variables) as predictors. The train dataset was used to build the cuisine classification model, which was then used to predict the cuisines in the test data.

Sampling and Bias

Assuming that the instances in the dataset were sampled randomly across the United States, the final predictions can be generalized to a larger population of real-world data.

Since the dishes and cuisines in the dataset are mostly centered on the United States, there could be some convenience bias involved if the data were to be generalized to the entire world. However, since Yummly mostly caters to the US market, the population in question is cuisines prevalent in this country.

Methodology

Exploratory Data Analysis

As mentioned earlier, the train dataset consisted of 39774 instances which were distributed across a total of 20 cuisines. Fig 3 shows the frequency of each cuisine in the data.

Fig 3. Histogram showing the distribution of cuisines in the training dataset

From the histogram above, the highest occurring cuisine is Italian, followed by Mexican and Southern US, all of which have more than 4000 instances in the dataset. Indian, Chinese and French cuisines are also relatively frequent, with more than 2000 occurrences each.

The lowest occurring cuisines seem to be Brazilian, Jamaican and Russian. Of course, there can be hundreds of other cuisines, but these 20 are the most common ones found in the United States.

Fig 4. Italian restaurants are widely prevalent in the United States with spaghetti being a common staple in several households

There are a whopping 428275 ingredients listed across all recipes in this data. The distribution of the highest occurring ingredients in the dataset is visualized in Fig 5.

The most frequent ingredient by far was salt, followed by olive oil, onions, water, garlic, sugar and garlic cloves.

Fig 5. Distribution of highest occurring ingredients in the dataset

Both the frequency of cuisines and the frequency of ingredients were determined using the FreqDist function from the NLTK (Natural Language Toolkit) library in Python.
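As a rough sketch of how this can be done (the post does not show its code, so the file name, loading step and column access below are assumptions), the two frequency distributions can be computed as follows:

```python
import pandas as pd
from nltk import FreqDist

# Hypothetical file name/format; the post only says the data came from Yummly
train = pd.read_json("train.json")

# Frequency of each cuisine label
cuisine_freq = FreqDist(train["cuisine"])
print(cuisine_freq.most_common(5))

# Frequency of each ingredient across all recipes
ingredient_freq = FreqDist(ing for recipe in train["ingredients"] for ing in recipe)
print(ingredient_freq.most_common(10))
```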

Pre-processing

Before diving into the modeling part, the train dataset needed to be cleaned and manipulated for building the model. The first step in pre-processing the data was to remove the commas in the ingredient lists and present the ingredients as a string, simply as words separated by spaces.

Another formatting method that was applied to each of the ingredients in the dataset was to normalize them using the Porter Stemmer function from the NLTK library in Python.

Since several ingredients in the dataset might be slight variations of one another, those variations would otherwise be listed as separate ingredients when they should really be treated as the same word.

If a certain ingredient is listed in singular and plural forms, ideally they should be represented as the same ingredient. The Porter Stemmer function converts both of these versions into the same normalized word.

For example, “olive” and “olives” are both converted to “oliv”. This will clean the data and provide a more accurate prediction model by removing noise. The resulting dataset is shown in Fig 6 below.
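A minimal sketch of this pre-processing step, continuing with the same assumed DataFrame and adding a hypothetical ingredients_clean column to hold the result:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def clean_ingredients(ingredient_list):
    """Flatten a recipe's ingredient list into one space-separated, stemmed
    string, e.g. ['olives', 'olive oil'] -> 'oliv oliv oil'."""
    words = " ".join(ingredient_list).split()
    return " ".join(stemmer.stem(word) for word in words)

train["ingredients_clean"] = train["ingredients"].apply(clean_ingredients)
```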

Fig 6. Dataset showing reformatted ingredients lists

As can be seen in the above table, the lists are converted into strings with each word separated by spaces. This allows for an easier conversion into a document-term matrix format, which will be discussed in the next section.

Feature Engineering

The variables or features in the training data are manipulated further by using vectorization. The two most common types of vectorization algorithms in Python are Count Vectorizer and TFIDF (Term Frequency Inverse Document Frequency) Vectorizer.

Count Vectorizer converts a collection of documents into features. Each unique word across the documents becomes a feature, and each instance is assigned the number of times that word occurs in it, which for ingredient lists is usually 1 if the word is present and 0 if not. Equal weight is given to all words.

TFIDF Vectorizer performs the same function as Count Vectorizer but also adds the TFIDF transformation.

In mathematical terms, the total number of documents (instances) is divided by the number of documents containing that ingredient, and the log of that ratio provides the weighting factor (the inverse document frequency). This factor is then multiplied by the frequency of the word within each document (the term frequency), which gives the final weighted value of that ingredient for that recipe.

The simplified equation of the algorithm is shown below.

TFIDF(word, document) = (frequency of word in the document) × log(total # of documents / # of documents containing the word)

This kind of vectorization gives more importance to distinctive words and down-weights words that appear in almost every document; words that are too scarce can be filtered out separately with the min_df parameter described below.
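For illustration with made-up counts: if "salt" appeared in 30000 of the 39774 training recipes, its weighting factor from the formula above would be log(39774 / 30000) ≈ 0.28, whereas an ingredient appearing in only 400 recipes would be weighted log(39774 / 400) ≈ 4.6, so the rarer (and usually more cuisine-specific) ingredient contributes far more to the feature vector.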

For this project, a min_df parameter of 2 was selected for the TFIDF vectorizer, which was then fitted on the ingredients from the train data. This parameter means that an ingredient has to be present in at least two documents to be considered by the algorithm.
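A minimal sketch of this step, assuming the cleaned, space-separated ingredient strings from the pre-processing sketch are stored in the hypothetical ingredients_clean column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=2: ignore any ingredient word that appears in fewer than two recipes
vectorizer = TfidfVectorizer(min_df=2)

# Fit on the training recipes only, producing a sparse document-term matrix
X_train_tfidf = vectorizer.fit_transform(train["ingredients_clean"])
print(X_train_tfidf.shape)  # (39774, 5025) in the post's run
```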

The resulting dataset contained 39774 rows (same as before) and 5025 columns. As expected, the number of columns increased following the conversion of the ingredients to a document-term sparse matrix.

Fig 7. Document-term sparse matrix from the ingredients

As mentioned earlier, the document-term matrix contains each ingredient as a feature (column), where its presence in a recipe is indicated by a numerical value and its absence by a 0. Since a single recipe contains only a small fraction of the entire ingredients list, most of the entries are 0, as shown in Fig 7, hence the term sparse matrix.

Note: To keep the prediction meaningful, the test data was also transformed with the same vectorizer that was fitted on the train data.

In addition to the vectorization, an additional feature was added to the dataset. This variable is the number of ingredients contained in each recipe.
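A sketch of these two steps, assuming the test file is loaded into an analogous DataFrame and reusing the vectorizer and cleaning function from the earlier snippets:

```python
import numpy as np
import pandas as pd
from scipy.sparse import hstack, csr_matrix

test = pd.read_json("test.json")  # hypothetical file name
test["ingredients_clean"] = test["ingredients"].apply(clean_ingredients)

# Transform (not re-fit) the test recipes with the vectorizer fitted on train
X_test_tfidf = vectorizer.transform(test["ingredients_clean"])

# Extra feature: the number of ingredients in each recipe
n_train = np.array([len(r) for r in train["ingredients"]]).reshape(-1, 1)
n_test = np.array([len(r) for r in test["ingredients"]]).reshape(-1, 1)

# Append the count as one more column of the sparse matrices
X_train_full = hstack([X_train_tfidf, csr_matrix(n_train)])
X_test_full = hstack([X_test_tfidf, csr_matrix(n_test)])
```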

Creating training set

There are several ways to approach the modeling part. Sometimes it is useful to select a particular Machine Learning (ML) algorithm and take a cursory look by implementing it on the entire training dataset and viewing the result. Depending on the findings, further modifications can be made to the model through evaluation and iteration.

In this project, this initial step is skipped and we start directly from the model optimization point.

The train-test split function from the sklearn library is used to split the dataset into training and test sets.

Note: Even though we were referring to our dataset as the training set, the more appropriate term would have been simply “entire dataset” as we use the entire data for building the model. This is important to note when using the train-test split method.

The initial data is first separated into the target variable containing just the cuisines and the explanatory variables containing the features that were created from the ingredients.

Fig 8. Diagram showing stages of splitting data into training, testing and validation groups

Splitting the dataset yields the train and test sets as shown in Fig 8. Note that this is called the “Holdout Method” in the diagram above. The validation set is only created at a later stage while using evaluation methods such as cross-validation and grid search.

The train set generally contains the majority of the data, as it is used to train the model. The test set contains a smaller fraction of the data, which is used to evaluate and optimize the model. Four resulting datasets are created:

  • X_train: Explanatory variables for the train set
  • X_test: Explanatory variables for the test set
  • Y_train: Target variable for the train set
  • Y_test: Target variable for the test set

For our purposes, the train data size was selected as being 90% of the initial dataset which would result in the test data being 10%. The choice for this parameter is up to the modeler.
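A minimal sketch of the split, reusing the feature matrix from the earlier snippets (the random_state value is an arbitrary assumption, not something reported in the post):

```python
from sklearn.model_selection import train_test_split

# Target variable: the cuisine labels; explanatory variables: the engineered features
y = train["cuisine"]

# Hold out 10% of the data for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(
    X_train_full, y, test_size=0.10, random_state=42
)
```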

Logistic Regression Classifier

For modeling, a logistic regression classifier was used with a regularization parameter (C) of 6. Note that this value was selected following the grid search evaluation technique.
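A sketch of the model fit (the max_iter setting is an assumption added here only to avoid convergence warnings; the post does not mention it):

```python
from sklearn.linear_model import LogisticRegression

# C=6 follows the grid search described in the Evaluation section below
clf = LogisticRegression(C=6, max_iter=1000)
clf.fit(X_train, Y_train)
```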

Evaluation

Grid search loops through the various parameters for a certain ML algorithm and provides the results for each combination.

The type of score selected for evaluating the model was the ROC (Receiver Operating Characteristic) AUC (Area Under the Curve) metric. This measure indicates how well a model is able to classify between different labels.

Fig 9. Plot showing ROC curve and AUC

The ROC curve shows the trade-off between the True Positive Rate (sensitivity) and the False Positive Rate, as shown in Fig 9, and the AUC summarizes it into a single number. The higher the ROC AUC score, the better.

The grid search algorithm yielded a C value of 6 as the parameter that provided the highest ROC score. This parameter controls the amount of regularization applied to the logistic regression algorithm: in scikit-learn, C is the inverse of the regularization strength, so higher C values mean weaker regularization.

Note that during grid search, the training data from the train-test split is further divided into training and validation sets, as shown in the second stage in Fig 8. The optimal parameter value (C in this case) is reported along with the average score across the different cross-validation folds.
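A sketch of how such a search can be set up (the candidate C values, the number of folds and the exact multiclass ROC AUC variant are assumptions; the post only reports that C=6 won):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Hypothetical candidate values for C
param_grid = {"C": [0.1, 1, 3, 6, 10]}

# 'roc_auc_ovr' scores a one-vs-rest ROC AUC averaged over the 20 cuisines;
# cross-validation carves the validation folds out of the training split
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc_ovr",
    cv=5,
)
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)
```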

Following the model optimization, a C value of 6 with a min_df value of 2 yielded the best ROC AUC score of 0.8006 when the model was evaluated on the held-out test split.
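One way such a score can be computed on the held-out split (a sketch reusing the fitted model from the earlier snippets, with a one-vs-rest averaging assumed):

```python
from sklearn.metrics import roc_auc_score

# Class probabilities for the held-out 10% split
probs = clf.predict_proba(X_test)

# One-vs-rest ROC AUC across the 20 cuisine labels (the post reports 0.8006)
score = roc_auc_score(Y_test, probs, multi_class="ovr", labels=clf.classes_)
print(round(score, 4))
```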

Results

When analyzing the model's predictions on the held-out test split, 3186 cuisines were predicted correctly, while 792 were predicted incorrectly.
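These counts can be reproduced with a short check on the held-out split (a sketch continuing from the earlier snippets):

```python
# Compare predicted and true cuisines on the held-out split
predicted = clf.predict(X_test)
correct = (predicted == Y_test).sum()
print(correct, len(Y_test) - correct)  # the post reports 3186 correct, 792 incorrect
```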

Fig 10 shows the frequency of correctly predicted cuisines. Italian was predicted correctly the most, over 700 times. However, it is also important to note that it is the highest occurring cuisine in the dataset, so it cannot be concluded that Italian cuisine is easier to predict.

The same can be said for Mexican food, as it is both the second highest occurring cuisine and the second most accurately predicted one. Similarly, Russian and Brazilian cuisines are two of the least accurately predicted cuisines and are also two of the least frequent cuisines in this data.

Fig 10. Histogram showing the frequency of correct predictions for each cuisine

The histogram in Fig 11 shows the frequency of incorrectly predicted cuisines where the most mistaken cuisine is Southern US, followed by French, Italian, Spanish and Mexican.

Fig 11. Histogram showing the frequency of incorrectly predicted cuisines

Looking at the results from an ingredients point of view, the most important features in the above predictions were soy sauce, fish sauce, chili powder, fresh basil, corn tortilla and evaporated milk. Both the most important and the least important features are shown in the table below.

| Most important features | Least important features |
| --- | --- |
| Soysauce | Blackbean |
| Fishsauce | Driedblackbean |
| Chilipowder | Tapiocastarch |
| Freshbasil | Sweetenedcondensedmilk |
| Corntortilla | Palmoil |
| Evaporatedmilk | Chocolatesprinkles |
| Flourtortilla | Maniocflour |
| Cuminseed | Tapiocaflour |
| Currypowder | Heartsofpalm |
| Heavycream | Cachaca |
Note that in the table above the ingredients are indicated as single joined words rather than separate words. For example, curry powder is denoted as currypowder.
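The post does not show how this ranking was produced. One way to approximate it (a sketch, assuming a recent scikit-learn with get_feature_names_out and the variable names from the earlier snippets; note that the post's features appear to be whole ingredient terms joined into single tokens, whereas the earlier sketches tokenized individual words, so the exact names would differ) is to rank features by the average magnitude of their fitted logistic regression coefficients:

```python
import numpy as np

# Feature names from the vectorizer, plus the appended ingredient-count column
feature_names = np.append(vectorizer.get_feature_names_out(), "num_ingredients")

# Average absolute coefficient of each feature across the 20 cuisine classes
importance = np.abs(clf.coef_).mean(axis=0)
order = importance.argsort()

print("Least important:", feature_names[order[:10]])
print("Most important:", feature_names[order[-10:]])
```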

A portion of the final prediction using the test set data is shown in Fig 12.

Fig 12. Snippet of final predictions table

Discussion

Several techniques were compared before arriving at the optimal algorithm. One of the main considerations was whether to separate each ingredient into individual words or to keep entire terms (containing one or more words) together.

Another major consideration was whether to use Count Vectorizer or TFIDF Vectorizer. Lastly, the text mining technique known as stemming was used, which normalized various forms of a word into a single form and helped remove some noise from the data.

A logistic regression model was selected as the algorithm of choice following comparisons with Bernoulli Naive Bayes, Multinomial Naive Bayes and Support Vector Machine models.

The model was evaluated by using the train test split method and the C parameter was optimized by using the grid search algorithm.

Conclusion

A final ROC AUC score of 0.8006 was obtained when the trained model was evaluated on the held-out test split, indicating that the model generalizes well to that split.

Despite the relatively high ROC AUC score, there are open questions about the fit of the model. It was not fully verified during evaluation and analysis whether the model would hold up on real-world data.

Since the score is high, it is possible that the model fits this particular test set well but would score lower on external data. That would be an example of over-fitting.