# 11 Must-Have Features To Consider When Building A VOD Platform


If you are not Martin Scorsese or Woody Allen or Steven Spielberg, chances are that your content is gathering dust while waiting for the right medium to showcase it. Instead of waiting for things to happen, why not shake things up yourself?

An estimated 40.2 million households are going to shift away from paid TV subscriptions as they explore fresh and quality content on VOD platforms.

A video-on-demand (VOD) platform is an internet-enabled distribution system free of the static broadcasting schedule of pay TV. Many technopreneurs are following in the footsteps of pioneers like Netflix, Amazon Prime, and Hulu by building their own VOD platforms.

Features that your VOD solution cannot do without:

1. Personalizing your platform

The benefit of creating your platform through a SaaP model is the extent of customization that can be incorporated into your video-on-demand business model. This helps in building your brand and realizing an innovative entertainment solution.

2. Choosing the server


3. Inbuilt Content Discovery Functionality

A content management system (CMS) is key to an engaged subscriber base. A recommendation engine can personalize content in real time through methods like collaborative filtering, content-based filtering, and deep learning. This ensures that your VOD streaming platform gains longer watch times, which convert to revenue under your monetization model.
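To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. The viewing data, user names, and function names are invented for illustration; a production recommendation engine would use far richer signals and a dedicated library.

```python
from math import sqrt

# Hypothetical viewing history: user -> {title: rating}
ratings = {
    "alice": {"Drama A": 5, "Thriller B": 3, "Comedy C": 1},
    "bob":   {"Drama A": 4, "Thriller B": 4, "Comedy C": 1},
    "carol": {"Comedy C": 5, "Thriller B": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity over the titles two users have both rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, k=1):
    """Score unseen titles, weighted by similarity to other users."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], their)
        for title, rating in their.items():
            if title not in ratings[user]:
                scores[title] = scores.get(title, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Here `recommend("carol")` would surface "Drama A", the one title carol has not watched but similar users rated highly.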

4. Multiple Options for Building Income Flows

Video-on-demand platforms generate high ROI when their services are flexible. Revenue streams can be channeled better with subscription, transactional, and ad-based revenue models, or a combination of them all. Ads can be inserted server-side or through third-party integration.

5. Uninterrupted Global Streaming

Content Distribution Networks are a system to distribute HD content in your video on demand platform globally with a large number of networked servers. High-density CDNs ensure that videos are streamed faster and over shorter distances even in high-stress situations such as potential server outages.

6. Immersive full-screen experience


7. Widen your reach

Streaming protocols like RTMP and HLS, combined with transcoding, convert your ingest streams into multiple formats compatible across devices like smartphones, TVs, and web applications. Capitalize on viewership across platforms by streaming video even over low-bandwidth networks.
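As a sketch of what such a pipeline looks like in practice, the snippet below assembles an ffmpeg command that packages an ingest file into an HLS VOD playlist. The file paths, bitrate, and segment length are hypothetical; in practice you would hand the list to `subprocess.run`.

```python
def hls_command(source, playlist, video_bitrate="800k", segment_seconds=6):
    """Assemble an ffmpeg invocation that transcodes an ingest file
    into an HLS VOD playlist with fixed-length segments."""
    return [
        "ffmpeg", "-i", source,
        "-c:v", "h264", "-b:v", video_bitrate,   # H.264 video at a target bitrate
        "-c:a", "aac",                            # AAC audio for broad device support
        "-hls_time", str(segment_seconds),        # seconds per .ts segment
        "-hls_playlist_type", "vod",              # complete, non-live playlist
        playlist,
    ]

cmd = hls_command("ingest/show.mp4", "out/show.m3u8")
```

A real adaptive-bitrate setup would generate several renditions (e.g. 480p, 720p, 1080p) and a master playlist, so players can downshift on low-bandwidth networks.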

8. Secure your platform even for offline streaming

Security functionalities prevent unauthorized access and modification to your video-on-demand services. Watermarking ensures that the videos cannot be illegally redistributed. If you are building a platform that supports offline viewing, DRM becomes an unavoidable feature.

9. Promote your content

Marketing your video-on-demand service is pivotal to generating engagement and revenue. Marketing is not limited to integrating your platform with social media: metadata management, social publishing, lead-capture forms, and email notifications should be key elements of a marketing strategy for your VOD platform.

10. Faster and customized player


11. Dynamic insights into trends

In-depth analysis and review of user engagement, audience timeline, and refresh rate are pivotal to understanding trends in your content. By tailoring your content to the audience on your VOD streaming platform, you deliver a personalized experience that they will come back to!
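As a toy illustration of such insights, the sketch below aggregates watch time per title from raw playback events. The event tuples and function names are invented for the example; a real pipeline would read from an analytics store.

```python
from collections import defaultdict

# Hypothetical playback events: (user, title, seconds watched)
events = [
    ("u1", "Drama A", 1200),
    ("u2", "Drama A", 900),
    ("u1", "Comedy C", 300),
]

def watch_time_by_title(events):
    """Total seconds watched per title."""
    totals = defaultdict(int)
    for _user, title, seconds in events:
        totals[title] += seconds
    return dict(totals)

def top_titles(events, n=1):
    """The n titles with the highest total watch time."""
    totals = watch_time_by_title(events)
    return sorted(totals, key=totals.get, reverse=True)[:n]
```

From even this crude aggregation you can see which content earns the watch time that converts to revenue.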



15 Must Have Linux Applications

An operating system is of no value whatsoever without needed applications to get things done on a day to day basis. And even though this sounds obvious, it’s something that is on the minds of many new Linux converts.

Will they be able to relinquish their tired, older legacy apps on the Windows desktop while finding usable, Linux-compatible alternatives?

In this article, I'll share fifteen software titles I use frequently, often every day. These are applications that quite literally make using the Linux desktop a real pleasure.

1) LibreOffice – Long before LibreOffice even existed, I was a big fan of OpenOffice. So my history with the software suite predates many folks', as I've been a full-time user of OpenOffice since day one.

Today, however, LibreOffice is the preferred option for distributions looking to offer a cutting-edge, dependable office suite to their users. My most commonly used applications within the LibreOffice suite are Writer and Calc. I use Writer because it's stable and provides me with strong control over my word-processing documents, and the option of installing extensions only further increases the software's functionality in my eyes.

2) Evince – An application that doesn't always make it into everyone's list, Evince is a PDF viewer that is fast and stable. In my humble opinion, Evince is a preferred alternative to Adobe's PDF viewer for Linux. It may have fewer options, but Evince makes up for it with speed and stability. Best of all, it comes pre-installed with many desktop Linux distributions these days.

3) gscan2pdf – Thanks to the SANE backend that comes with modern Linux distributions, scanning a document is usually as simple as connecting a scanner and selecting Simple Scan. And while it’s a good application for scanning images, even supporting export to PDF, it’s a pretty basic tool.

By contrast, gscan2pdf offers greater functionality, a better UI, and even support for network document scanners. In an enterprise environment, you're going to want access to gscan2pdf's capabilities. Another benefit to using gscan2pdf is that it generally performs better and works with greater stability for higher-resolution scans.

4) Self Control – When you're on your PC, distractions are something you have to contend with. Usually, I am able to make the most of my work time. But every once in a while, like during big events that I might be tracking, I can get distracted. This is why I use an application called Self Control. It allows me to easily block specific websites for a set amount of time. Best of all, once activated, it's very difficult to undo. So you won't be tempted to simply turn off the app, should you wish to stop working and visit those time-wasting websites.

5) Kazam – Perhaps not an application that is going to be used by everyone out there reading this, but for me, it's a must-have. I have used a variety of screen-capturing programs over the years on the Linux desktop. Nearly all of them were unusable. Worse, they offered poor results and left my recorded video looking over-compressed and grainy. Kazam is fantastic! It works well with most Linux audio connections, plus the video can be saved as WebM or MP4. Coming full circle, back to the audio connections, I love that it can actually record from two separate audio devices at the same time.

6) VLC – Whether I need a video player to view my own screen captures or I'm catching up on my favorite video podcasts, VLC is always my first video player of choice. This cross-platform player plays practically anything, without needing to worry about which codecs are installed on your Linux distro. Everything that's needed is already included with the VLC application. It's also worth mentioning that VLC will play DVDs without any extra configuration. With this functionality, VLC saves me time and is hassle-free. I know that any media file I throw at it will likely be played without missing a beat.

7) guvcview – Cheese, the photobooth app provided in many distributions, is garbage in its current state. The concept, layout, and filters are pretty neat. But sadly, the application is a buggy, crashing software mishap. Thankfully, there is still a solid solution for those of us using UVC (Universal Video Class) powered webcams on the Linux desktop. Appropriately called guvcview, this software will provide you with much of the same photobooth functionality found in Cheese. The difference is that it won't crash or over-tax your CPU in the process. This software is capable of capturing still images and video recordings. You can even save your captures in a wide range of file formats and codecs.

8) Pithos – Regardless of where you happen to work from, I believe that music can often help to avoid distractions. But one of the problems of managing an MP3 playlist, is that it can become a distraction in and of itself. Therefore, the next logical step might be to look to services like Pandora. The best way to enjoy Pandora on the Linux desktop in my opinion is through an application called Pithos.

Top 11 IoT Security Measures You Must Have For Your Smart Devices

When speaking about IoT security, smart homes are the hot new trend that is completely changing the home-security landscape.

It's a revolution in people's lifestyle. The Internet of Things has made life much simpler and easier, and the world is currently in a gold rush of connected devices.

Webcams, digital assistants, motion detectors, and more play a significant role in making your life simpler. Listed below are eleven IoT security measures that are must-haves for your smart devices.

Our inter-connected world

Though internet-connected devices make our lives a cakewalk, many don't realize that this connectivity is a double-edged sword.

The security tradeoffs in IoT receive too little attention. Let's look at the darker side of the IoT, and at how to deal with it.

What is the Internet of Things (IoT)?

IoT refers to physical objects embedded with sensors, software, and other technologies that connect and exchange data with other systems over the internet. These objects range from everyday household items to industrial tools.

The importance of IoT now extends across multiple sectors, including:

Consumer applications:

This comprises consumer items such as smartphones, smartwatches, and smart homes. These can be used to control everything from air conditioning to door locks.

Business sector:

The internet of things used by companies ranges from smart security cameras to trackers for ships, vehicles, and goods to sensors that capture industrial machine data.

Government sector: 

You may wonder where the government uses IoT, but IoT makes many government tasks trouble-free. A few areas where IoT plays a great role are wildlife monitoring, traffic monitoring, and disaster alerts.

The number of IoT devices is soaring into the billions, and this number won't stop there. With the growth of internet-connected devices, one of the great concerns surfacing among consumers is security.

Since the devices are connected to the internet, they are open to threats from anywhere in the world, increasing scrutiny of the underlying security problems.

How does IoT make you vulnerable?

Hackers may enter your network via seemingly benign devices connected to it. Your smart devices, such as smart TVs, smart locks, gaming consoles, and smart thermostats, can all be gateways into your network.

Through them, attackers can gather a great deal of information, such as your daily routine and lifestyle, or sensitive data such as passwords and financial details.

This makes you more vulnerable to cyber-attacks and other issues. Attackers may install malicious software, such as malware that compromises your router and gathers data from every device on it. Smart home devices are especially vulnerable since they have little or no built-in security.

Anecdote of an IoT attack

In 2016, the Mirai botnet compromised a massive number of devices, an attack orchestrated by teenagers. A botnet is used to run large-scale cyber-attacks by combining the processing power of many small devices.

Devices running out-of-date firmware with simple default credentials fell victim to the malware. To protect your devices from cyber-attack, practice the measures below to make them more secure.

Another huge attack, uncovered in 2010, used the Stuxnet worm, a sophisticated piece of malware that hunts down specific machinery used in the nuclear sector.

The attack reportedly began around 2006 but was executed at scale in 2009.

The worm targeted the control and data-acquisition systems and corrupted the instructions sent to the machines. It is therefore vital to see that the internet of things is open to attack at any level.

Vulnerabilities that put you at risk

Though we can't stop hackers and cybercriminals from attempting attacks, the best thing you can do is take a few precautions.

By establishing the right security measures, we can stay safe and protected from such attackers. To do this, you first need to understand the security vulnerabilities that invite breaches into your home or organization:

- Weak, guessable, or hardcoded passwords
- Insecure network services
- Insecure ecosystem interfaces
- Lack of a secure update mechanism to keep devices on the latest software
- Use of outdated or insecure components
- Insufficient privacy protection
- Insecure storage and transfer of data
- Default configurations that grant unnecessary permissions
- Lack of physical hardening

IoT Security Measures You Must Have

1. Make sure your device is secure by design

Before buying an IoT device or solution, make certain it is secure by design. If the supplier can't provide sufficient details, reconsider opting for that particular device or solution.

It's also wise to ensure that the manufacturer offers timely patches and updates for your device throughout its life. Timely patches and updates keep the device protected against the latest threats.

2. Name your router


3. Know your system and related devices

With an increasing number of devices on the network, it becomes hard to keep track of them all.

To be protected, you need to know your network, the devices connected to it, and the kind of information those devices can access.

If the devices have apps with social-sharing features, choose their permissions carefully.

4. Use powerful encryption

Your router should use a strong encryption method. Use the latest encryption standards, such as WPA2 (or, better, WPA3), rather than WEP or WPA. Installing updates and timely patches helps keep the level of risk to a minimum.

5. Use a powerful password

The first major thing to do while setting up a device is to change the default passwords. Cyber-attackers may already know the default passwords and usernames of your IoT device. If the device does not let you change the password, consider a different one.


6. Check the settings of your devices

Usually, smart devices come with default settings that may be insecure. Worse, some devices won't allow you to change those settings.

The things to check in the settings are weak credentials, intrusive features, permissions, and open ports.

7. Install firewalls and other security options

Security gateways stand between your IoT devices and your network. They have more processing power, memory, and capabilities than the IoT devices themselves.

On them you can set up more powerful protections, such as a firewall, to stop hackers from reaching your IoT devices.

Firewalls block unauthorized traffic, and you can additionally run an IDS or IPS (intrusion detection or intrusion prevention system) to inspect network traffic.

To make your task easier, you can use vulnerability scanners to uncover security flaws within the network, and a port scanner to identify which ports are open.
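As an illustration of what a port scanner does at its core, a few lines of Python's standard socket module can check whether a TCP port accepts connections. The host and port below are placeholders; real scanners such as nmap are far more capable.

```python
import socket

def port_is_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds
    within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check a handful of common ports on a hypothetical device
open_ports = [p for p in (22, 80, 443) if port_is_open("192.168.1.50", p)]
```

Only scan devices and networks you own or are authorized to test.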

8. Use a separate network

If you're operating a significant venture, this tip is for you. Using a separate network for smart devices, apart from the main company network, is one of the most strategic approaches to IoT security.

When segmentation is in place, even if hackers find their way into an IoT device, they cannot reach your company data or sniff bank transfers.

9. Make sure that universal Plug and Play (UPnP) is off

Universal Plug and Play is a set of network protocols that lets network devices seamlessly discover each other's presence.

However, the same feature makes it easier for hackers to find a way in. UPnP comes enabled by default on many routers today.


10. Implement physical security

If you have the convenience of controlling a smart device from your phone, be doubly cautious not to lose that phone. Protect it with a PIN, password, or biometric lock. In addition, make sure you can erase your phone remotely. Set up automatic backups, or at least selective backups of the data that matters.

11. Increasing consumer awareness

Many customers overlook security while buying an IoT device. Users need to be aware of the latest security measures that should be enabled for their protection.


Bottom line

Despite the risks, it's a no-brainer that the Internet of Things has massive potential. It has made day-to-day chores simple, right down to the smart kettle.

10 Things To Consider When Choosing Your Cell Phone Provider

There are a few things one might want to keep in mind when selecting a cell phone provider. Some notable considerations are the cost of calls, text messages and data plans. You should also consider network coverage and a few other factors. Below are ten things you should consider before agreeing to a cell phone provider.

Reputation of the Company

Consider the number of years the cell phone service provider has been in the industry. Think of the results they have garnered over that time — either good or bad. For instance, if a company has been in the business for decades, it means it is intent on longevity and probably has a good reputation (how else would it have lasted that long?).

Regional Coverage

On the other hand, there is a large group of regional carriers delivering genuinely great service to people in specific parts of the country. Look into those companies and decide whether regional coverage matters to you.

Contract Length

It is somewhat easy to get a quality cell phone via an attractive deal from a service provider. Unfortunately, that may leave you with a contract which could span two years or more. These contracts usually come with harsh penalties when you try to get out of said contracts earlier than the plan intended. A good cellphone provider will look to sell you even the world’s top phones at their most affordable prices, but without restrictive contracts. They look to offer you the best plans without much of an ulterior motive.

Good Customer Service

At one point or another, you are going to need assistance with your service. Something is bound to irk you or confuse you and you are going to have to call customer service to get some help. Look for a cell phone service provider that has good policies and service that is friendly to customers. Look for a cell phone service provider that prioritizes customer satisfaction.

Costs/Pricing Structure

Since many of us live our day-to-day lives on a budget (some times a tight one), you have to consider the overall amount of money you are going to spend on your cell phone bill. Always take a long and hard look at the pricing structure of the service provider. Ensure you know for sure just how much of your money is going into the service provider’s bank account.

Although most cell phone users are contract subscribers, there is also the option to go with a prepaid service. Keep in mind that if you are looking for a prepaid service, your choices may be limited.

Network Coverage

You should always know the extent of the service provider's network coverage. With adequate coverage, you won't be restricted to the locations where you can get a signal. Ensure the coverage is as wide as you need it to be, especially if you travel a lot. Ideally, roaming fees shouldn't apply.

Look at the 4G LTE network that the cellphone provider offers. 4G LTE networks are a big deal nowadays and are something to be taken into serious consideration when choosing a cellphone service provider.


The mobile phones the company sells in its deals may not really be a big thing, but they are worth considering. You probably want the cell phone service provider that can sell you the very best phones from the biggest brands in the world. Also, consider any other types of devices (such as routers) that the company offers.

A Friendly Introduction To Knime Analytics Platform


In recent years, data science has become omnipresent in our daily lives, causing many data analysis tools to sprout and evolve for the everyday data scientist to use. Python, R, and KNIME Analytics Platform are some of the most common tools. The innovative character of the KNIME Analytics Platform lies in its visual programming environment, with an intuitive interface that embraces a wide variety of technologies.

In this blog post, we would like to give you a friendly introduction to the KNIME Analytics Platform, by showing you the user interface, explaining the most crucial functionalities, and demonstrating how to create a codeless data science process. In order to do that, we will use a specific example of implementing a workflow for customer segmentation based on the k-Means clustering procedure.

General Concepts: Nodes, Workflows, & Components

KNIME Analytics Platform is a free, open-source software for the entire data science life cycle. KNIME’s visual programming environment provides the tools to not only access, transform, and clean data but also train algorithms, perform deep learning, create interactive visualizations, and more.

The KNIME Analytics Platform user interface, also referred to as the workbench, is typically organized as shown in Fig. 1.

Figure 1. Overview of the KNIME workbench.

Nodes perform tasks in your data science process.

When you are assembling a visual workflow, you will be using “nodes”. A node is displayed as a colored box and performs an individual task. A collection of interconnected nodes is your assembled workflow and represents some part – or all – of your data analysis project.

Each node can perform all kinds of tasks, e.g., reading and writing files, transforming data, training models, or creating visualizations. All the different types of nodes are found in the Node Repository (in the lower-left corner). The data are routed through the node via input and output ports. A node can have data as input or output as well as other objects, such as connections or machine learning models, SQL queries, or data properties. Each object inputs or outputs the node via a dedicated port. Only ports of the same type can be connected. Nodes are color-coded, according to their category, e.g., all yellow nodes are for data wrangling. Depending on their task, nodes also have specific settings, which can be adjusted in their configuration dialog.

A simple traffic light system underneath each node shows you whether the node is already configured, executed, or whether an error has occurred.

Figure 2. The different states of a node

Workflows are assembled with nodes, metanodes, and components.

A workflow in KNIME Analytics Platform consists of several, combined nodes. The data flows through the workflow from left to right through the node connections.

You can use a so-called annotation – a colored frame that can be placed freely inside a workflow – to document the steps in your workflow.

Figure 3. A simple workflow performing customer segmentation through a k-Means clustering procedure. The workflow's task, as well as each step (read the data, preprocess, apply k-Means, visualize), is documented inside annotation boxes.

You can also identify isolated blocks of logical operations in your workflows and include these nodes into so-called metanodes or components. Components are like metanodes but instead of just grouping some nodes for the sake of transparency, they encapsulate and abstract the functionalities of the logical block. Components serve a similar purpose as nodes, whereas a metanode is more of a visual appearance improvement.

Enrich functionality with extensions and integrations and utilize resources

Aside from all the above-mentioned functions, the KNIME Analytics Platform has two further important elements – extensions and integrations. The variety of extensions and integrations provide additional functionalities to the KNIME core functions. For example, the KNIME Deep Learning – Keras Integration or the Text Processing extension are only two of many exciting possibilities.

Finally, just a few words on where to get help and find resources.

The KNIME Hub is a public repository where you can find vast numbers of nodes, components, workflows, and extensions, and it provides a space to collaborate with other KNIME users. On KNIME Hub, you can find example workflows and pre-packaged components to get started.

The KNIME Community Forum serves as an environment to exchange experiences with other KNIME users, to seek help, or to offer your skills to help others.

If you need help getting started, on our Learning webpages you will find additional courses and material.

The Example: Customer Segmentation

Let’s now put our knowledge into practice and assemble a visual workflow in which a k-Means clustering is applied to segment customer data.

The dataset we use for this example can be downloaded from Kaggle and contains some basic customer data: “Customer ID”, “Gender”, “Age”, “Annual Income”, and “Spending Score”.

There are many ways to perform customer segmentation. Most of them include some previous knowledge on at least one of the input attributes. When nothing is known, in terms of knowledge or goals, customer segmentation is often performed via a clustering technique.

In general, clustering is used to detect underlying patterns in the data. Similar traits – or data points – are grouped together based on similarity and assigned into clusters. Amongst all clustering techniques, k-Means is a very simple one, yet effective enough.

a) Read the Dataset with CSV Reader

First, you need to read your data into the workflow. KNIME Analytics Platform offers a variety of data reading options for different file types, for example, Excel files with the Excel Reader node, text files with the File Reader node, or CSV files with the CSV Reader node. It is also possible to connect to and read from a database using the dedicated DB nodes. To read your data into KNIME Analytics Platform there are two different options.

You can drag and drop your file into the workflow. In this case, if it is a known file extension, the KNIME Analytics Platform automatically creates the correct reader node and automatically feeds the file location into the node configuration settings.

The other option is to search for the reader node in the Node Repository (Fig. 4), drag it into the workflow, and enter the file location in its configuration dialog (Fig. 5).

Figure 4. Searching the CSV Reader node via Node Repository

Figure 5. The configuration window of the CSV Reader node.

b) Apply Normalization to Attributes with Normalizer Node

The k-Means algorithm requires normalized numerical attributes. In general, normalization is necessary when the attributes are in non-comparable units (e.g., cm and kg) or when the variance between attributes differs greatly. By normalizing the values, we make sure that no single input feature dominates the others in the distance calculation. Additionally, nominal attributes should be converted to numbers before being fed into the algorithm, otherwise it cannot handle them properly; this can be done with the Category to Number node.

Our dataset consists of five attributes, of which one is a categorical variable (“Gender”), and the other four are numerical variables. Usually, the best practice is to use all input attributes, but in this case, we decided to limit ourselves to just two: Annual Income and Spending Score. We did this to obtain a clearer visualization of the results in a 2D scatter plot and thus an easier comparison of different k-Means runs.

In the configuration window of the Normalizer node (Fig. 6), you need to select the attributes for which normalization is required. We decided to use min-max normalization to simply map the values into the [0,1] interval, where the smallest value maps to 0 and the largest to 1. But of course, there are other options, like z-score normalization (suitable when there are many outliers) or normalization by decimal scaling.

Figure 6. The configuration window of the Normalizer node. 

We then configure the Normalizer node appropriately (Fig. 6).
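As an aside, the min-max transformation the node applies is easy to sketch in plain Python. The income values below are hypothetical, and this is an illustration of the formula, not KNIME's implementation:

```python
def min_max_normalize(values):
    """Rescale a list of numbers into the [0, 1] interval:
    the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

incomes = [15, 54, 78, 137]  # hypothetical annual incomes (k$)
normalized = min_max_normalize(incomes)
```

After this transformation, Annual Income and Spending Score contribute on equal footing to the Euclidean distances that k-Means computes.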

Nodes can either be executed step by step to examine intermediate results or the entire workflow at once. The two green arrowed buttons in the toolbar are responsible for these two types of execution.

c) Apply k-Means Algorithm and Configure Cluster Number

Now, connect the normalized dataset to the k-Means node and configure the number of clusters k. The correct choice of the number of clusters cannot be known beforehand. Usually, a few numbers of clusters are tried, and then the final cluster sets are visually compared and evaluated. Alternatively, the quality of the different cluster sets can be measured and compared, for example via the Silhouette Coefficient. The optimal k could also be obtained by running an optimization method, like the elbow method, Silhouette optimization, or the Gap Statistic.

Let’s start with k=3. In the node configuration window of the k-Means node (Fig. 7), we can decide whether to initialize the algorithm with the first k rows or with k random data points of the dataset. Also, we can include or exclude attributes for the distance calculation.

You might now wonder why there is no option for the distance measure. In this case, let me point out that the k-Means node uses the Euclidean distance by default. Notice that the Euclidean distance only applies to numerical attributes and therefore only numerical columns are available to move from the Include to the Exclude panel and vice versa. We use the two normalized input attributes Annual Income and Spending Score. Would we obtain the same results when adding attribute Age? You can try…

The configuration window also includes an emergency stop criterion, to avoid infinite running without convergence to a stable solution. This is the setting named Max. Number of Iterations.

Figure 7. The configuration window of the k-Means node.

After executing this node successfully, it outputs the k cluster centers for the k=3 clusters (Fig. 8). You can try to run the algorithm again with a different number of clusters and see if and how the cluster centers change.

Figure 8. The three cluster centers.

d) Determine Clustering Quality with Silhouette Coefficient

Now we have successfully identified three clusters. But how good is our clustering? More precisely, how good is our choice of k? The quality of the cluster set can be measured via the Silhouette Coefficient, as calculated by the Silhouette Coefficient node.

As in the k-Means node, an Include/Exclude panel allows you to select the attributes used for the calculation of the Silhouette Coefficient. In addition, the cluster column must be selected; in our case it comes from the previous k-Means node, is called “Cluster”, and contains a string value indicating the cluster affiliation.
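What the Silhouette Coefficient node computes can be sketched as follows in scikit-learn (a hedged illustration with made-up data, not the node's implementation): a per-row silhouette value from the included attributes and the string-valued cluster column, averaged into one overall score.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight, well-separated toy clusters with string labels,
# mimicking the "Cluster" column produced by k-Means.
X = np.array([[0.0, 0.0], [0.1, 0.1],
              [1.0, 1.0], [0.9, 0.9]])
labels = np.array(["cluster_0", "cluster_0", "cluster_1", "cluster_1"])

per_row = silhouette_samples(X, labels)  # one coefficient per data point
overall = silhouette_score(X, labels)    # their mean, in [-1, 1]
print(overall)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping clusters and a questionable choice of k.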

e) Assign Color to Clusters with Visualization Nodes, Color Manager & Scatter Plot

The last step refers to visualizing the obtained clusters. To do so, we use the Color Manager node to assign a specific color to each cluster (Fig. 9) and after that the Scatter Plot node to visualize the resulting cluster set (Fig. 10). Indeed, a visual inspection can help us evaluate the quality of the cluster set.

Figure 9. The configuration window of the Color Manager node.

In Fig. 10, you see the configuration window for the Scatter Plot node and its most important setting: selecting attributes for the x and y axis. In the tab “General Plot Options”, chart title and subtitle can be specified.

Figure 10. The configuration window of the Scatter Plot node.

Now, let’s look at the visualization (Fig. 11). As you can see, the clusters are quite widespread, especially cluster 0 and cluster 1. This, together with the Silhouette Coefficient of 0.45, indicates that we might need to rethink our choice of k.

f) Document Workflow with Comments and Annotations

The final workflow is shown in Fig. 3 and can be downloaded from the KNIME Hub.

Try Out Different k’s

In case the clustering is not as satisfying as expected, simply rerun k-Means with different parameters and see if a better clustering can be achieved.

Below, we report the mean Silhouette Coefficient for k=3, k=5, and k=8 and the corresponding scatter plots.

Table 1. The mean overall Silhouette Coefficient for different k

                                 k=3    k=5    k=8

Overall Silhouette Coefficient   0.45   0.56   0.44

Figure 11. The visualization of the resulting cluster set for k=3.

Figure 12. Cluster set for k=5

Figure 13. Cluster set for k=8

By comparing the scatter plots visually and the Silhouette Coefficient values, k=5 seems to be the preferable choice so far. Indeed, the optimization procedure based on the Silhouette Coefficient and implemented in the component named “Optimized k-Means (Silhouette Coefficient)” confirms that k=5 is the best choice for setting k.

Now, we should explain how to create and use a component… but this is material for the next article.

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.


A Guide To Building An End-to-End Multi-Class Text Classification Model

This article was published as a part of the Data Science Blogathon.

Knock! Knock!

Who’s there?

It’s Natural Language Processing!

Today we will implement a multi-class text classification model on an open-source dataset and explore more about the steps and procedure. Let’s begin.

Table of Contents


Loading the data

Feature Engineering

Text processing

Exploring Multi-classification Models

Compare Model performance



Dataset for Text Classification

The dataset consists of real-world complaints received from customers about financial products and services. Each complaint is labeled with a specific product, so this is a supervised problem: we have both the input text and the target output. We will try different machine learning algorithms and check which one works best.

Our aim is to classify consumer complaints into predefined categories using a suitable classification algorithm. For now, we will use the following classification algorithms:

Linear Support Vector Machine (LinearSVM)

Random Forest

Multinomial Naive Bayes

Logistic Regression.

Loading the Data

Download the dataset from the link given in the above section. Since I am using Google Colab, you can also use the Google Drive link given here and import the dataset from your Google Drive. The code below mounts the drive and unzips the data to the current working directory in Colab.

from google.colab import drive
drive.mount('/content/drive')
!unzip /content/drive/MyDrive/

First, we will install the required modules.

pip install numpy
pip install pandas
pip install seaborn
pip install scikit-learn
pip install scipy

Once everything is successfully installed, we will import the required libraries.

import os
import pandas as pd
import numpy as np
from scipy.stats import randint
import seaborn as sns  # used for plotting interactive graphs
import matplotlib.pyplot as plt
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics

Now after this let us load the dataset and see the shape of the loaded dataset.

# loading data
df = pd.read_csv('/content/rows.csv')
print(df.shape)

From the output of the above code, we can see that the dataset is very large and has 18 columns. Let us see what the data looks like. Execute the below code.


Now, for our multi-class text classification task, we will be using only two of these 18 columns: ‘Product’ and ‘Consumer complaint narrative’. Let us create a new DataFrame to store only these two columns, and since we have enough rows, we will remove all the missing (NaN) values. To make it easier to work with, we will rename the second column of the new DataFrame to ‘Consumer_complaint’.

# Create a new dataframe with two columns
df1 = df[['Product', 'Consumer complaint narrative']].copy()

# Remove missing values (NaN)
df1 = df1[pd.notnull(df1['Consumer complaint narrative'])]

# Renaming second column for a simpler name
df1.columns = ['Product', 'Consumer_complaint']

print(df1.shape)
df1.head(3).T

We can see that after discarding all the missing values, we have around 383k rows and 2 columns; this will be our training data. Now let us check how many unique products there are.


There are 18 product categories. To make the training process easier, we will make some changes to the category names.

# Because the computation is time consuming (in terms of CPU), the data was sampled
df2 = df1.sample(10000, random_state=1).copy()

# Renaming categories
df2.replace({'Product':
    {'Credit reporting, credit repair services, or other personal consumer reports':
        'Credit reporting, repair, or other',
     'Credit reporting': 'Credit reporting, repair, or other',
     'Credit card': 'Credit card or prepaid card',
     'Prepaid card': 'Credit card or prepaid card',
     'Payday loan': 'Payday loan, title loan, or personal loan',
     'Money transfer': 'Money transfer, virtual currency, or money service',
     'Virtual currency': 'Money transfer, virtual currency, or money service'}},
    inplace=True)

pd.DataFrame(df2.Product.unique())

The 18 categories are now reduced to 13; for example, we have combined ‘Credit card’ and ‘Prepaid card’ into a single class.

Now, we will map each of these categories to a number, so that our model can understand it in a better way, and save the result in a new column named ‘category_id’, where each of the 13 categories is represented by a number.

# Create a new column 'category_id' with encoded categories
df2['category_id'] = df2['Product'].factorize()[0]
category_id_df = df2[['Product', 'category_id']].drop_duplicates()

# Dictionaries for future use
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)

# New dataframe
df2.head()

Let us visualize the data and see how many complaints there are per category, using a bar chart.

fig = plt.figure(figsize=(8,6))
colors = ['grey','grey','grey','grey','grey','grey','grey','grey','grey',
          'grey','darkblue','darkblue','darkblue']
df2.groupby('Product').Consumer_complaint.count().sort_values().plot.barh(
    ylim=0, color=colors,
    title='NUMBER OF COMPLAINTS IN EACH PRODUCT CATEGORY\n')
plt.xlabel('Number of occurrences', fontsize=10);

The graph above shows that most of the customers complained about:

Credit reporting, repair, or other

Debt collection


Text processing

The text needs to be preprocessed so that we can feed it to the classification algorithm. Here we will transform the texts into vectors using Term Frequency-Inverse Document Frequency (TF-IDF), which evaluates how important a particular word is in the collection of documents. For this we need to remove punctuation and lowercase the text; word importance is then determined in terms of frequency.

We will be using TfidfVectorizer function with the below parameters:

min_df: remove words that occur in fewer than ‘min_df’ documents.

sublinear_tf: if True, scale the term frequency logarithmically.

stop_words: remove the stop words predefined for ‘english’.

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2),
                        stop_words='english')

# We transform each complaint into a vector
features = tfidf.fit_transform(df2.Consumer_complaint).toarray()
labels = df2.category_id

print("Each of the %d complaints is represented by %d features "
      "(TF-IDF score of unigrams and bigrams)" % (features.shape))

Now, we will find the most correlated terms with each of the defined product categories, limiting ourselves to the three most correlated terms per category.

# Finding the three most correlated terms with each of the product categories
N = 3
for Product, category_id in sorted(category_to_id.items()):
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("  * Most Correlated Unigrams are: %s" % (', '.join(unigrams[-N:])))
    print("  * Most Correlated Bigrams are: %s" % (', '.join(bigrams[-N:])))

* Most Correlated Unigrams are: overdraft, bank, scottrade
* Most Correlated Bigrams are: citigold checking, debit card, checking account
* Most Correlated Unigrams are: checking, branch, overdraft
* Most Correlated Bigrams are: 00 bonus, overdraft fees, checking account
* Most Correlated Unigrams are: dealership, vehicle, car
* Most Correlated Bigrams are: car loan, vehicle loan, regional acceptance
* Most Correlated Unigrams are: express, citi, card
* Most Correlated Bigrams are: balance transfer, american express, credit card
* Most Correlated Unigrams are: report, experian, equifax
* Most Correlated Bigrams are: credit file, equifax xxxx, credit report
* Most Correlated Unigrams are: collect, collection, debt
* Most Correlated Bigrams are: debt collector, collect debt, collection agency
* Most Correlated Unigrams are: ethereum, bitcoin, coinbase
* Most Correlated Bigrams are: account coinbase, coinbase xxxx, coinbase account
* Most Correlated Unigrams are: paypal, moneygram, gram
* Most Correlated Bigrams are: sending money, western union, money gram
* Most Correlated Unigrams are: escrow, modification, mortgage
* Most Correlated Bigrams are: short sale, mortgage company, loan modification
* Most Correlated Unigrams are: meetings, productive, vast
* Most Correlated Bigrams are: insurance check, check payable, face face
* Most Correlated Unigrams are: astra, ace, payday
* Most Correlated Bigrams are: 00 loan, applied payday, payday loan
* Most Correlated Unigrams are: student, loans, navient
* Most Correlated Bigrams are: income based, student loan, student loans
* Most Correlated Unigrams are: honda, car, vehicle
* Most Correlated Bigrams are: used vehicle, total loss, honda financial

Exploring Multi-classification Models

The classification models which we are using:

Random Forest

Linear Support Vector Machine

Multinomial Naive Bayes

Logistic Regression.

For more information regarding each model, you can refer to their official guide.

Now, we will split the data into train and test sets. We will use 75% of the data for training and the rest for testing. Column ‘Consumer_complaint’ will be our X, the input, and ‘Product’ our y, the output.

X = df2['Consumer_complaint']  # Collection of documents
y = df2['Product']  # Target, i.e. the labels we want to predict (the 13 product categories)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=0)

We will keep all the models in a list and loop through it, computing a mean accuracy and standard deviation for each model. This lets us compare their performance and decide which model to move forward with.

models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]

# 5-fold cross-validation
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels,
                                 scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))

cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

The above code will take some time to complete its execution.

Compare Text Classification Model performance

Here, we will compare the ‘Mean Accuracy’ and ‘Standard Deviation’ for each of the four classification algorithms.

mean_accuracy = cv_df.groupby('model_name').accuracy.mean()
std_accuracy = cv_df.groupby('model_name').accuracy.std()

acc = pd.concat([mean_accuracy, std_accuracy], axis=1, ignore_index=True)
acc.columns = ['Mean Accuracy', 'Standard deviation']
acc

From the above table, we can clearly see that ‘Linear Support Vector Machine’ outperforms all the other classification algorithms. So, we will use LinearSVC for our multi-class text classification task.

plt.figure(figsize=(8,5))
sns.boxplot(x='model_name', y='accuracy', data=cv_df,
            color='lightblue', showmeans=True)
plt.title("MEAN ACCURACY (cv = 5)\n", size=14);

Evaluation of Text Classification Model

Now, let us train our model using ‘Linear Support Vector Machine’, so that we can evaluate and check its performance on unseen data.

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, labels, df2.index, test_size=0.25, random_state=1)

model = LinearSVC()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

We will generate a classification report to get more insights into the model performance.

# Classification report
print('\t\t\t\tCLASSIFICATION METRICS\n')
print(metrics.classification_report(y_test, y_pred,
                                    target_names=df2['Product'].unique()))

From the above classification report, we can observe that the classes with a greater number of occurrences tend to have a better f1-score than the others. The categories that yield the best classification results are ‘Student loan’, ‘Mortgage’ and ‘Credit reporting, repair, or other’. Classes like ‘Debt collection’ and ‘Credit card or prepaid card’ also give good results. Now let us plot the confusion matrix to check the misclassified predictions.

conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8,8))
sns.heatmap(conf_mat, annot=True, cmap="Blues", fmt='d',
            xticklabels=category_id_df.Product.values,
            yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title("CONFUSION MATRIX - LinearSVC\n", size=16);

From the above confusion matrix, we can say that the model is doing a pretty decent job. It has classified most of the categories accurately.


Let us make some predictions on unseen data and check the model performance.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2),
                        stop_words='english')

fitted_vectorizer = tfidf.fit(X_train)
tfidf_vectorizer_vectors = fitted_vectorizer.transform(X_train)

model = LinearSVC().fit(tfidf_vectorizer_vectors, y_train)

Now run the prediction.

complaint = """I have received over 27 emails from XXXX XXXX who is a representative from Midland Funding LLC. From XX/XX/XXXX I received approximately 6 emails. From XX/XX/XXXX I received approximately 6 emails. From XX/XX/XXXX I received approximately 9 emails. From XX/XX/XXXX I received approximately 6 emails. All emails came from the same individual, XXXX XXXX. It is becoming a nonstop issue of harassment.""" print(model.predict(fitted_vectorizer.transform([complaint]))) complaint = """Respected Sir/ Madam, I am exploring the possibilities for financing my daughter 's XXXX education with private loan from bank. I am in the XXXX on XXXX visa. My daughter is on XXXX dependent visa. As a result, she is considered as international student. I am waiting in the Green Card ( Permanent Residency ) line for last several years. I checked with Discover, XXXX XXXX websites. While they allow international students to apply for loan, they need cosigners who are either US citizens or Permanent Residents. I feel that this is unfair. I had been given mortgage and car loans in the past which I closed successfully. I have good financial history. print(model.predict(fitted_vectorizer.transform([complaint]))) complaint = """They make me look like if I was behind on my Mortgage on the month of XX/XX/2023 & XX/XX/XXXX when I was not and never was, when I was even giving extra money to the Principal. The Money Source Web site and the managers started a problem, when my wife was trying to increase the payment, so more money went to the Principal and two payments came out that month and because I reverse one of them thru my Bank as Fraud they took revenge and committed slander against me by reporting me late at the Credit Bureaus, for 45 and 60 days, when it was not thru. Told them to correct that and the accounting department or the company revert that letter from going to the Credit Bureaus to correct their injustice. 
The manager by the name XXXX requested this for the second time and nothing yet. I am a Senior of XXXX years old and a Retired XXXX Veteran and is a disgraced that Americans treat us that way and do not want to admit their injustice and lies to the Credit Bureau.""" print(model.predict(fitted_vectorizer.transform([complaint])))

The model is not perfect, yet it is performing very well.

The notebook is available here.


We have implemented a basic multi-class text classification model. You can play with other models like XGBoost, or compare the performance of multiple models on this dataset using an AutoML machine learning framework. This is not the end: there are still complex problems associated with multi-class text classification, and you can always explore more and acquire new concepts and ideas about this topic. That’s it!!

Thank you!

All images are created by the author.

My LinkedIn

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

