Background

Wine has been produced for thousands of years; the earliest known winery is the 6,100-year-old Areni-1 winery in Armenia. Throughout history, wine has been consumed for its intoxicating effects. Wine tasting is the sensory examination and evaluation of wine by experts. A wine sommelier, also known as a "wine steward", is a specialist in charge of developing a restaurant's wine list and assisting customers with their selections (especially food-wine pairings), while a wine critic is an expert and/or journalist who tastes and reviews wines for books and magazines. Wine producers have traditionally worked closely with both professions to craft appealing wine reviews that entice customers to purchase their wines. I would like to help these wine producers analyse a dataset of wine reviews written by such experts: specifically, to find out whether the reviews accurately reflect a wine's quality, price and/or variety, what the typical descriptors for highly priced wines are, and how to extract creative value from this huge library of wine reviews by building a review generator of our own (i.e. generating sample reviews in an automated manner). This would be of value to wine producers, who could use these sample text descriptions for various non-critical business purposes such as labeling, educational materials, flyers, etc.

The data set

This data set was obtained from Kaggle. It has only 11 columns to start with, many of which contain location-related information in text format, and only 2 columns with numeric data: price and points. Altogether there are more than 150,000 rows in this data set, each with a wine review; within the review text there are more than 6 million words. Lastly, more than 13,000 rows are missing the price.

Samples of the review text corpus

As you can see from the wine reviews on the right, the text is generally quite clean, with simple sentence structures, little to no connection between sentences within each document, and hence few long-term dependencies. Only the words containing diacritical marks need to be cleaned up. There are quite a number of them, because many of the wineries are in Europe, and some French, German and Italian words use diacritical marks. To make my life easier, I used the Unidecode package to convert these special characters to ASCII characters, since Python 2 strings default to ASCII.
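As a minimal sketch of this fold-to-ASCII step, Python's standard unicodedata module can approximate what Unidecode does for accented Latin characters (Unidecode itself handles many more scripts and edge cases):

```python
import unicodedata

def to_ascii(text):
    # Decompose accented characters (e.g. 'a' with circumflex -> 'a' + combining mark),
    # then drop the combining marks that cannot be encoded as ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(to_ascii("Châteauneuf-du-Pape, Gewürztraminer"))
# -> Chateauneuf-du-Pape, Gewurztraminer
```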

Workflow

As shown on the left, the workflow I adopted to process the text corpus started with Unidecode to convert the special characters to ASCII, followed by the NLTK (Natural Language Toolkit) package to find the distribution of the most frequent words, and then the VADER package to perform sentiment analysis on the text corpus.
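The word-frequency step can be sketched in plain Python (NLTK's FreqDist does essentially this counting, with a more careful tokeniser; the two toy reviews below are made up):

```python
import re
from collections import Counter

reviews = [
    "Ripe fruit flavors and a smooth finish make this wine a pleasure.",
    "This wine shows bright fruit, firm tannins and a long finish.",
]

# Lowercase and tokenise on letter runs; NLTK's word_tokenize is more
# careful, but this is enough to illustrate the counting step.
tokens = [w for text in reviews for w in re.findall(r"[a-z']+", text.lower())]
freq = Counter(tokens)
print(freq.most_common(3))
```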

Only after I had a better understanding of the word distribution and the sentiment did I proceed to fit models: regression on wine prices, and classification of high vs. low prices as well as variety. Next, I used Word2Vec to extract contextual information and study the semantics of the words. Having assessed the quality of this text corpus through the above-mentioned models, we can then think about how to build a deep learning model to generate similar text reviews.

Word frequency distribution and sentiment analysis

We can see that the top 3 most frequently appearing words are "wine", "flavor" and "fruit", which makes a lot of sense as these are very generic, wine-related words.

The large majority of the reviews have either positive or neutral sentiment scores; very few have negative scores. This is because only reviews with points between 80 and 100 are included in this dataset. Hence we have a text corpus that is quite narrow in sentiment and biased towards highly rated wines.

Regression results

Now that we have a better understanding of the review text, let's put together all the features and try to predict the price of the wine. For the Country, Province, Variety, Winery and Region features, we can use LabelEncoder and OneHotEncoder to convert them into dummy columns.
As for the reviews column, we will use TF-IDF to filter the words, retaining significant words (based on their frequency across the reviews) as features and converting them to word vectors.
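The TF-IDF weighting can be sketched in a few lines of plain Python. Scikit-learn's TfidfVectorizer adds smoothing and normalisation on top of this basic idea; the toy reviews here are made up:

```python
import math
from collections import Counter

docs = [
    "ripe cherry fruit with firm tannins",
    "crisp green apple and citrus fruit",
    "rich cherry and blackberry flavors",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: in how many reviews does each word appear?
df = Counter(word for doc in tokenised for word in set(doc))

def tfidf(word, doc):
    tf = doc.count(word) / len(doc)    # term frequency within this review
    idf = math.log(n_docs / df[word])  # rarer words get a larger weight
    return tf * idf

# "fruit" appears in two reviews, "tannins" in only one,
# so "tannins" gets the higher score in the first review.
print(tfidf("fruit", tokenised[0]), tfidf("tannins", tokenised[0]))
```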
For the choice of regressors, we will use 4: Lasso, Ridge, Random Forest and Gradient Boosting Regressor. Lasso and Ridge are powerful techniques generally used to create parsimonious models in the presence of a 'large' number of features. Random Forest is an ensemble method that applies bagging to train multiple decision trees on different parts of the same training set, averaging them with the aim of reducing variance. Finally, Gradient Boosting is an ensemble method that sequentially fits each new model to the residuals of the previous one in order to reduce bias. These 4 regressors provide good coverage of regularization, bagging and boosting techniques to help us better understand our results.
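To make the boosting idea concrete, here is a toy sketch that fits one-split "stumps" to the residuals of the running ensemble on made-up price data. A real Gradient Boosting Regressor uses full decision trees and many refinements; this only illustrates the fit-to-residuals loop:

```python
def fit_stump(x, residuals):
    # Pick the threshold split that minimises squared error of the residuals.
    best = None
    for t in x:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 12, 11, 13, 40, 42, 41, 43]   # toy "prices" with a jump in the middle

pred = [sum(y) / len(y)] * len(x)      # round 0: predict the mean price
for _ in range(20):
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)
    # Add a damped copy of the stump's predictions (learning rate 0.5).
    pred = [pi + 0.5 * stump(xi) for pi, xi in zip(pred, x)]

mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
print(mse)
```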

Among the 4 models, Gradient Boosting gave the best score (0.884). The regression models show that specific wineries (typically French and Italian), regions, a couple of varieties, and points are positively correlated with wine prices and hence are good predictors of the higher price ranges. On the other hand, specific words like 'value', 'inexpensive', 'best buy' and 'bargain' are negatively correlated, and hence they are associated with the lower price ranges. If we think about it, what really sets the price of a wine should be the reputation of the winery, its wine-making skill, the species of grapes, climatic conditions, fermentation techniques, the aging process, and bottling. So our finding makes a lot of sense.

Classification results

Above, we mixed all the words in the wine descriptions together with the other categorical features (country, variety, region) in order to predict wine prices. Now we focus only on the description words, to see which words help distinguish high points from low points, or high prices from low prices. This also lets us understand more about this wine text corpus.
For this classification task, I chose a price threshold of >$200 to separate high prices from low, and a points threshold of >90 to separate high points from low. As for variety, given the large number of variety classes (632), we shall first try to classify the top 3 varieties to see whether we get any sensible result. Besides being the top 3 by count, these varieties are also distinctly different in colour and taste: Chardonnay is white, Pinot Noir light red, and Cabernet Sauvignon dark red. If our classifiers cannot even tell these apart, we can forget about classifying the rest of the varieties.

The classifiers have clearly identified words such as "years", "age", "rich", "long" and "powerful" as distinctive words associated with both high points and high prices. This makes a lot of sense: it is common knowledge that wines cost more, taste better, and are valued more as they age.

Variety classification

As for wine variety, Random Forest returned a list of mostly fruit-related words associated with the 3 varieties of wine: "cherry", "pineapple", "cassis", "tannins", "blackberry", "pear", "cola", "apple", "silky", "peach". Logistic Regression, on the other hand, gave a list of names of other wine varieties: "pinot blanc", "barbera", "barolo", "brunello", "prosecco", "pinotage", "malvasia", "chenin", "amarone", "petite", "nebbiolo", "garnacha", "semillon", "verdejo".
A very ripe Chardonnay will have flavors more towards tropical fruits like pineapple, guava and mango. A barely ripe Chardonnay will have green apple, pear or lemon flavors.
Pinot Noir derives its lighter color from red-fruits such as cherry, cranberry and strawberry while Cabernet Sauvignon derives its dark color from dark fruits such as blackcurrant and blackberry.
Although Pinot Blanc tops the list returned by Logistic Regression, it should not be confused with Pinot Noir: they are different grapes. Pinot Noir is a black wine grape with green flesh, while Pinot Blanc is a white grape that is often confused with Chardonnay. Pinot Blanc is very similar to Chardonnay in that it has a medium to full body and light flavor; its lighter flavors often include citrus, melon, pear, apricot, and perhaps smoky or mineral undertones.
Although the classification scores are not bad, I still feel that the data in this dataset are not sufficient for the models to classify wine varieties with high accuracy and precision. Based on review text alone, the models can at best draw vague relationships to fruits and similarities with other varieties. If we consider all the factors affecting wine varieties, we can see why this dataset lacks a lot of information:

1. Many of the varieties are close relations of one another in terms of genealogy (e.g. Pinot Noir, Pinot Grigio, Pinot Blanc), yet they have very different colors and properties. They may be grown in the same region, but may also be cultivated in geographically very different regions.
2. Nearly all the varieties can take on different fruit flavors as a result of different degrees or processes of aging. Genealogically different species of grapes like Chardonnay and Pinot Blanc can thus show similar citrus or pear flavors depending on their age.
3. For historical and geographical reasons, the names of grapes or their corresponding wines can differ by country of origin. For example, the grape called Pinot Gris in France is called Pinot Grigio in Italy; they are in fact the same grape, just grown in different places. Likewise, Pinot Blanc in France is called Pinot Bianco in Italy. Without explicitly declaring such naming differences in the dataset, models will always mix them up.
4. Some varieties can be so close in taste, scent and color that even domain experts cannot tell them apart; in such cases, only minute differences in chemical properties, or even in isotopic distributions, can differentiate them. Only if we could somehow obtain such chemical and physical property data to supplement the review and geo-location data could we be more confident of the model classifying varieties accurately.
5. Due to subtle differences in climate, the harvest of a particular grape variety in one year can differ slightly from another year's, which may be reflected in the flavor or undertones of the final wine. A domain expert may detect such subtle differences in a blind tasting session, and only after consulting the wine producer about the harvests understand that the wines actually come from the same grape. Such harvest data are not collected in this dataset, but they are still an important factor in classifying wine variety.
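As a small illustration of point 3 above, a synonym map could collapse country-specific grape names onto one canonical label before training. The mapping entries here are hypothetical examples, not a complete list:

```python
# Hypothetical normalisation step: collapse country-specific grape names
# onto a single canonical label before fitting the classifiers.
VARIETY_SYNONYMS = {
    "pinot grigio": "pinot gris",
    "pinot bianco": "pinot blanc",
    "garnacha": "grenache",
}

def normalise_variety(name):
    key = name.strip().lower()
    # Fall back to the cleaned name when no synonym is declared.
    return VARIETY_SYNONYMS.get(key, key)

print(normalise_variety("Pinot Grigio"))  # -> pinot gris
```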

Word2Vec results

Word2Vec is a class of neural network models that produces a vector for each word in the corpus, encoding its semantic information in an unsupervised manner. We used TF-IDF for word representation above; its scores give us some idea of a word's relative importance in a document, but no insight into its semantic meaning. Word2Vec models let us measure the semantic similarity between words by computing the cosine similarity between their corresponding vectors.
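The cosine similarity used to compare word vectors is straightforward to compute. Here is a sketch with made-up 3-dimensional vectors; real Word2Vec vectors typically have 100+ dimensions:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: "aroma" and "scent" point in similar directions,
# "dollar" points elsewhere.
aroma = [0.9, 0.1, 0.0]
scent = [0.8, 0.2, 0.1]
dollar = [0.0, 0.1, 0.9]
print(cosine_similarity(aroma, scent) > cosine_similarity(aroma, dollar))  # -> True
```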

After fitting a Word2Vec model on the text corpus, I gave it a list of independent words such as "aroma", "taste", "price", "promise" and "steak" and asked it to predict the word most similar to each of them. I was pleased to find that it associated "aroma" -> "scent", "taste" -> "sugary", "price" -> "dollar", "promise" -> "potential", and "steak" -> "chop".
I also gave it two distinct wine variety names: "chardonnay", a white wine, and "cabernet", a red wine. The model returned a list of white wine variety names such as "albarino", "grigio", "riesling", "semillon", "gris" and "blanc" as related to "chardonnay", and a list of red wine variety names such as "merlot", "syrah", "claret", "malbec" and "sauvignon" as related to "cabernet". This shows that the model was able to learn the colour associations of wine names from context!

Configuring our neural network

Now that we have characterized the text corpus and found it to be a rich source of information, we should start to think about how to extract more creative value out of it, for example by using it to build a text-generative model. Recurrent neural nets are particularly designed to work with sequence-dependent data (like our text). The backpropagation training of RNNs is notoriously hard, mainly due to exploding or vanishing gradients. To minimise the chance of these occurring, we shall use a type of RNN cell called the Gated Recurrent Unit (GRU), with ReLU as the activation function instead of the default tanh, since both are known to help with these gradient issues. We will use TensorFlow to create such a neural network with the above-mentioned elements. To train the network, we will use softmax with cross-entropy as our loss function and measure accuracy as the mean of the cell-wise logical comparison between the output and label tensors.
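As a sketch of what a single GRU step computes, here is a scalar version with ReLU in place of tanh, following one common formulation of the gate equations. The weight names are hypothetical, a real cell uses weight matrices, and TensorFlow handles all of this internally:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def gru_step(x, h, W):
    """One GRU step on a scalar input x and scalar state h.
    W is a dict of hypothetical scalar parameters."""
    z = sigmoid(W["wz"] * x + W["uz"] * h + W["bz"])           # update gate
    r = sigmoid(W["wr"] * x + W["ur"] * h + W["br"])           # reset gate
    h_cand = relu(W["wh"] * x + W["uh"] * (r * h) + W["bh"])   # candidate state
    return (1 - z) * h + z * h_cand                            # blended new state

# With all-zero weights, both gates sit at sigmoid(0) = 0.5 and the
# candidate is relu(0) = 0, so the new state is simply half the old one.
W0 = {k: 0.0 for k in ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
print(gru_step(x=1.0, h=0.8, W=W0))  # -> 0.4
```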

Training with characters

Next, we have to think about how to provide input to it and how to interpret its output. There are two types of input to consider: characters and word embeddings. While a Word2Vec model does provide a rich source of contextual information, character-level training is much easier to understand and visualise than word embeddings. And since we have noted that the text in this corpus is quite clean, with simple sentence structures, little to no connection between sentences in each document, few long-term dependencies, and a narrow sentiment range, character-level input makes a lot of sense.

We first break up each document's text into individual characters, encode each character as its ASCII code, and then one-hot vectorise each encoded character. These one-hot vectors are the input to the neural network, and its output is in the same one-hot form. Hence, to decode it, we reverse the process: convert the vector back to an ASCII code, then identify the character matching that code.
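The character round-trip described above can be sketched as:

```python
ALPHABET_SIZE = 128  # one slot per ASCII code point

def char_to_onehot(ch):
    vec = [0] * ALPHABET_SIZE
    vec[ord(ch)] = 1          # the character's ASCII code marks the hot slot
    return vec

def onehot_to_char(vec):
    return chr(vec.index(1))  # reverse: the hot slot's index is the ASCII code

encoded = [char_to_onehot(c) for c in "wine"]
decoded = "".join(onehot_to_char(v) for v in encoded)
print(decoded)  # -> wine
```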

Recurrent Neural Net Model Architecture

Each GRU unit takes in a one-hot vector representing a character, concatenates it with the previous state H(t−1), passes the result through the reset and update gate computations, and generates an output state. This RNN unfolds over a sequence of 30 characters, taking in one character at a time; at the end of each sequence, it predicts the character that follows.

To train the network, I used AdamOptimizer to perform backpropagation, taking the loss as input and updating the weights and biases of each GRU unit. The loss is computed as the softmax cross-entropy of the predicted output.
As a measure of performance, I defined accuracy as the mean of the cell-wise comparison between the predicted character vector and the actual label vector.
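The loss and accuracy definitions can be sketched in plain Python; TensorFlow computes these on whole tensors at once, and one common way to realise the "cell-wise comparison" is to compare predicted and true class indices:

```python
import math

def softmax(logits):
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label_onehot):
    probs = softmax(logits)
    # -log(probability the model assigned to the true character)
    return -math.log(probs[label_onehot.index(1)])

def accuracy(pred_logits_batch, label_batch):
    # Mean of element-wise comparison of predicted vs. true class indices.
    hits = [p.index(max(p)) == y.index(1)
            for p, y in zip(pred_logits_batch, label_batch)]
    return sum(hits) / len(hits)

logits = [2.0, 0.5, 0.1]   # toy predictions over a 3-character alphabet
label = [1, 0, 0]
print(cross_entropy(logits, label))
print(accuracy([logits], [label]))  # -> 1.0
```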

Results of the training

After 100 batches:
The neural net can't predict anything yet; it only generates gibberish text.

After 2500 batches:
Now it starts to predict some characters and is learning the positions of the spaces. However, it still cannot spell any English words.

After 10000 batches:
It can spell many English words correctly and has also learned title case. There are still many spelling mistakes, though, and it's hard to figure out what each sentence is trying to express.

After 50000 batches:
It can now spell most words correctly; even long words like "concentrated" are consistently spelled right.

Conclusion and future work

Overall, I'm pleased to see that my model can spell English words properly and understands punctuation positions; even the sentence structure is fairly complete. While the content of the text may be a little limited, this largely depends on the quality of the text corpus, and I think it's not bad for an initial attempt. Further processing and shaping of the text reviews would certainly improve the quality of the text generation.

Future work:

  • Increase the input sequence length and/or the number of hidden layers to improve accuracy.
  • Use the Word2Vec model to add a word-embedding layer to the RNN, so it generates word sequences instead of character sequences.
  • Keep only the reviews with positive sentiment scores, so the neural net learns only 'biased' but 'attractive' text reviews.
  • Use the RNN with the word-embedding layer to try to classify the 632 varieties of wine.