As for wine variety, Random Forest returned mostly fruit and mouthfeel descriptors as the words most associated with the 3 varieties of wine ("cherry", "pineapple", "cassis", "tannins", "blackberry", "pear", "cola", "apple", "silky", "peach"), while
Logistic Regression returned names of other wines and grape varieties, such as "pinot blanc", "barbera", "barolo", "brunello", "prosecco", "pinotage", "malvasia", "chenin", "amarone", "petite", "nebbiolo", "garnacha", "semillon", "verdejo".
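The text does not say how these word lists were extracted, so the following is only a minimal sketch of one common way to do it. It assumes the models were fitted with scikit-learn on a bag-of-words matrix, in which case a fitted `RandomForestClassifier` exposes `feature_importances_` (one non-negative score per word, not tied to any class) and a fitted `LogisticRegression` exposes `coef_` (signed weights, one row per class). The helper `top_terms` and the toy numbers are hypothetical and are not results from this dataset.

```python
from typing import List, Sequence


def top_terms(feature_names: Sequence[str], weights: Sequence[float], k: int = 10) -> List[str]:
    """Return the k feature names with the largest weights, highest first."""
    if len(feature_names) != len(weights):
        raise ValueError("feature_names and weights must have the same length")
    ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy, made-up values for illustration only.
vocab = ["cherry", "pineapple", "barolo", "the"]
rf_importances = [0.40, 0.35, 0.05, 0.20]   # stand-in for forest.feature_importances_
lr_coefficients = [0.3, -1.2, 2.1, 0.0]     # stand-in for one row of logreg.coef_

print(top_terms(vocab, rf_importances, k=2))   # ['cherry', 'pineapple']
print(top_terms(vocab, lr_coefficients, k=2))  # ['barolo', 'cherry']
```

With real models, `vocab` would typically come from the vectorizer's `get_feature_names_out()`. Because the forest's scores are class-agnostic while the regression weights are per class and signed, the two rankings are not measuring quite the same thing, which is worth keeping in mind when comparing the lists.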
A very ripe Chardonnay leans towards tropical-fruit flavors such as pineapple, guava and mango, while a barely ripe Chardonnay shows green apple, pear or lemon flavors.
Pinot Noir, which is lighter in color, tends to show red-fruit flavors such as cherry, cranberry and strawberry, while the darker Cabernet Sauvignon tends to show dark-fruit flavors such as blackcurrant (cassis) and blackberry.
Although Pinot Blanc tops the list returned by Logistic Regression, it should not be confused with Pinot Noir: they are different grapes. Pinot Noir is a black wine grape with green flesh, while Pinot Blanc is a white grape
that is often confused with Chardonnay. Like Chardonnay, Pinot Blanc has a medium to full body and a light flavor, often with notes of citrus, melon, pear and apricot, and perhaps smoky or mineral undertones.
Although the classification scores are not bad, I still feel that this dataset does not contain enough information for the models to classify wine
varieties with high accuracy and precision. Based on review text alone, the models can at best draw vague associations with fruit flavors and with other, similar varieties.
If we consider all the factors that affect wine variety, it becomes clear why this dataset is missing a lot of information:
1. Many of the varieties are closely related genealogically (e.g. Pinot Noir, Pinot Grigio, Pinot Blanc), yet they have very different colors and properties. They may be grown in the same region, but may also be cultivated in geographically very different regions.
2. Nearly all the varieties can show different fruit flavors depending on the degree or process of aging. Genetically distinct grape varieties such as Chardonnay and Pinot Blanc can thus have similar citrus or pear flavors depending on their age.
3. For historical and geographical reasons, the name of a grape or its wine can differ by country of origin. For example, the grape called Pinot Gris in France is called Pinot Grigio in Italy; they are in fact the same grape, just grown in different places. Likewise, Pinot Blanc in France is called Pinot Bianco in Italy. Unless such naming differences are explicitly declared in the dataset, models are likely to mix them up.
4. Some varieties are so close in taste, scent and color that even domain experts cannot tell them apart. In such cases, only minute differences in chemical properties, or even in isotopic distribution, can differentiate them. Only by supplementing the review and geo-location data with such chemical and physical properties for the wine varieties in this dataset can we be more confident that the model will classify varieties more accurately.
5. Due to subtle differences in climate conditions, the quality of the harvest of a particular grape variety can differ slightly from one year to the next, and this may be reflected in the flavor or undertone of the final wine. A domain expert may detect such differences in a blind tasting session, and only after consulting the wine producer and learning more about the harvests understand that the wines are actually from the same grape. Such harvest data is not collected in this dataset, but it is still an important factor in classifying wine variety.
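The naming issue in point 3 is the one factor here that can be partly addressed without collecting new data. As a minimal sketch, assuming the variety labels are plain strings, regional names could be mapped to a single canonical label before training. The table `CANONICAL_ALIASES` and the helper `normalize_variety` are hypothetical names, and the table contains only the two pairs mentioned above; a real one would need to be compiled by someone with domain knowledge.

```python
# Hypothetical alias table: canonical label -> other names for the same grape.
# Only the two pairs named in the text are included.
CANONICAL_ALIASES = {
    "Pinot Gris": ["Pinot Grigio"],
    "Pinot Blanc": ["Pinot Bianco"],
}

# Build a case-insensitive lookup that also recognizes the canonical names.
_LOOKUP = {}
for canonical, aliases in CANONICAL_ALIASES.items():
    for alias in [canonical, *aliases]:
        _LOOKUP[alias.lower()] = canonical


def normalize_variety(name: str) -> str:
    """Map a regional variety name to its canonical label; leave unknown names as-is."""
    cleaned = " ".join(name.split())  # trim and collapse repeated whitespace
    return _LOOKUP.get(cleaned.lower(), cleaned)


labels = ["Pinot Grigio", "pinot gris", " Pinot  Bianco", "Chardonnay"]
print([normalize_variety(v) for v in labels])
# ['Pinot Gris', 'Pinot Gris', 'Pinot Blanc', 'Chardonnay']
```

Whether merging such labels is desirable depends on the goal: it removes a distinction the models cannot learn from the grape itself, but it also discards any regional style differences that the separate names might carry.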