Table of Contents
- Introduction
- Data Source
- Setup
- Data Augmentation
- Feature Extraction
- Training Models
- Results
- Conclusion
1. INTRODUCTION
In recent years, advancements in machine learning and natural language processing (NLP) have led to the development of innovative applications across various domains. One such application is the Voice-to-Text-Based Music Recommendation System, which leverages voice input from users to
recommend music tailored to their preferences. This system combines speech recognition, text analysis, and machine learning to create a personalized music discovery experience.
Objective:
The primary objective of this project is to develop a Voice-to-Text-Based Music Recommendation System that takes voice input from users, converts it to text, analyzes the text for user preferences or emotional cues, and generates music recommendations accordingly. This system aims to enhance the music discovery process by providing real-time, context-aware recommendations.
Key Components:
1. Voice Input Processing: Utilize speech recognition technology (e.g., Google Speech-to-Text API, Mozilla DeepSpeech) to convert user voice input into text. This text serves as the basis for music recommendations.
2. Natural Language Processing (NLP): Apply NLP techniques to the converted text to extract relevant information, such as keywords, sentiment, or emotional cues. This analysis helps in understanding the user's music preferences.
3. Music Recommendation Engine: Develop a machine learning-based recommendation engine that considers both the textual input and the user's historical listening data (if available). Techniques such as collaborative filtering, content-based filtering, or hybrid approaches can be employed to generate personalized music recommendations.
4. Music Database: Maintain a database of music tracks with associated metadata, such as genre, artist, tempo, and mood. This database is essential for matching user preferences with suitable songs.
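The four components above can be sketched end to end in a few lines of plain Python. Everything here is illustrative: the function names, the keyword-based mood lexicon, and the tiny in-memory "database" are placeholders, not a real API.

```python
# Minimal end-to-end sketch of the four components (all names illustrative).
MUSIC_DB = [
    {"title": "Sunrise Run", "genre": "pop", "mood": "energetic"},
    {"title": "Night Rain", "genre": "ambient", "mood": "calm"},
    {"title": "Circuit Breaker", "genre": "electronic", "mood": "energetic"},
]

def voice_to_text(audio):
    # Placeholder for a real STT call (e.g., Google Speech-to-Text).
    return audio  # in this sketch the "audio" is already a transcript

def extract_mood(text):
    # Trivial keyword lookup standing in for real sentiment/emotion analysis.
    moods = {"workout": "energetic", "relax": "calm", "party": "energetic"}
    for keyword, mood in moods.items():
        if keyword in text.lower():
            return mood
    return "calm"

def recommend(mood, db=MUSIC_DB):
    # Match the inferred mood against track metadata.
    return [track["title"] for track in db if track["mood"] == mood]

text = voice_to_text("Play something for my workout")
print(recommend(extract_mood(text)))  # → ['Sunrise Run', 'Circuit Breaker']
```

In a real system each placeholder would be replaced by the corresponding component described in the sections below.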
2. DATA SOURCE
Building a voice-to-text-based music recommendation system involves several steps, including collecting data, preprocessing, training a machine learning model, and implementing the recommendation system. Here’s a general outline, along with code snippets for each step.
- Data Collection:
For a music recommendation system, you'll need a dataset that includes information about songs, such as artist names, genres, and lyrics. The Million Song Dataset is a good resource for this purpose.
- Data Preprocessing:
Prepare the dataset for training by cleaning and structuring the data. You might want to focus on features like artist, genre, and lyrics. Here’s a simple example using pandas:
import pandas as pd
# Load the dataset (assuming it's in CSV format)
dataset_path = "path/to/your/dataset.csv"
df = pd.read_csv(dataset_path)
# Drop unnecessary columns
df = df[['artist', 'genre', 'lyrics']]
# Drop rows with missing values
df = df.dropna()
# Sample: Display the first few rows
print(df.head())
- Text Representation:
Convert the text data (lyrics) into a format suitable for machine learning. One common approach is to use TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings. Here’s an example using TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
# Fit and transform the lyrics
tfidf_matrix = tfidf_vectorizer.fit_transform(df['lyrics'])
# Sample: Display the shape of the TF-IDF matrix
print(tfidf_matrix.shape)
- Train a Model:
Train a machine learning model using your preprocessed data. For simplicity, let’s use a basic example with scikit-learn’s KMeans clustering. In a real-world scenario, you’d likely use a more sophisticated model.
from sklearn.cluster import KMeans
# Specify the number of clusters (adjust as needed)
num_clusters = 10
# Fit KMeans model
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(tfidf_matrix)
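The number of clusters is a free parameter; one common (if rough) way to choose it is the silhouette score. A minimal sketch on toy lyrics follows — in practice you would loop over the real tfidf_matrix instead of the toy corpus used here:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy stand-in for the real lyrics column.
toy_lyrics = [
    "love you baby love", "dance all night dance", "broken heart tears",
    "party dance floor", "tears and rain heart", "baby hold me love",
]
X = TfidfVectorizer(stop_words='english').fit_transform(toy_lyrics)

best_k, best_score = None, -1.0
for k in range(2, 5):  # candidate cluster counts
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    score = silhouette_score(X, labels)  # in [-1, 1]; higher is better
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```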
- Recommendation System:
Implement a recommendation system based on the trained model. For simplicity, we’ll use the cluster assignments to recommend songs from the same cluster.
def recommend_music(input_lyrics, model, vectorizer, df):
    # Transform input lyrics using the same vectorizer
    input_vector = vectorizer.transform([input_lyrics])
    # Predict the cluster for the input
    cluster = model.predict(input_vector)[0]
    # Get recommended songs (here: artist/genre pairs, since those are the
    # columns kept during preprocessing) from the same cluster
    recommendations = df[df['cluster'] == cluster][['artist', 'genre']]
    return recommendations
- Putting It All Together:
def main(input_lyrics):
    # Assuming 'df' is the preprocessed DataFrame
    df['cluster'] = kmeans.labels_
    # Get recommendations
    recommendations = recommend_music(input_lyrics, kmeans, tfidf_vectorizer, df)
    print("Recommended Songs:")
    print(recommendations)

# Example usage
input_lyrics = "I'm feeling like a million bucks"
main(input_lyrics)
3. SETUP
Creating a voice-to-text-based music recommendation system involves several steps, including setting up your development environment, training or using a pre-trained model, and integrating the different components. Here’s a basic outline to get you started:
- Setup
Install Required Libraries
pip install SpeechRecognition
pip install nltk
Download NLTK Resources
import nltk
nltk.download('punkt')
- Speech-to-Text (STT)
import speech_recognition as sr
def voice_to_text(audio_path):
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    text = recognizer.recognize_google(audio)
    return text
- Natural Language Processing (NLP)
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
def process_text(text):
    tokens = word_tokenize(text)
    # Perform additional processing or analysis here
    return tokens
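The "additional processing" left open above could, for example, map the tokens to mood keywords and a crude sentiment score. A dependency-free sketch — the lexicon, weights, and labels are purely illustrative:

```python
# Tiny illustrative mood lexicon: word -> sentiment weight.
MOOD_WORDS = {
    'happy': 1, 'upbeat': 1, 'energetic': 1, 'love': 1,
    'sad': -1, 'tired': -1, 'lonely': -1, 'down': -1,
}

def analyze_tokens(tokens):
    # Keep only tokens found in the lexicon, then sum their weights.
    hits = [t.lower() for t in tokens if t.lower() in MOOD_WORDS]
    score = sum(MOOD_WORDS[t] for t in hits)
    mood = 'positive' if score > 0 else 'negative' if score < 0 else 'neutral'
    return {'keywords': hits, 'mood': mood}

print(analyze_tokens(['I', 'feel', 'sad', 'and', 'lonely', 'today']))
# → {'keywords': ['sad', 'lonely'], 'mood': 'negative'}
```

The resulting keywords and mood label are what the recommendation step below would consume.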
- Music Recommendation Model
Training a simple classifier (scikit-learn)
from sklearn.ensemble import RandomForestClassifier
def train_music_model(X_train, y_train):
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    return model
Collaborative filtering (scikit-surprise)
pip install scikit-surprise
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
def train_collaborative_filtering_model(data):
    # 'data' is a DataFrame with columns: user_id, item_id, rating
    reader = Reader(rating_scale=(0, 5))
    dataset = Dataset.load_from_df(data, reader)
    trainset, testset = train_test_split(dataset, test_size=0.2)
    model = SVD()
    model.fit(trainset)
    return model
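Behind scikit-surprise's SVD is a latent-factor model: approximate the user-item rating matrix by a low-rank product and read predicted ratings off the reconstruction. The library-free NumPy sketch below illustrates the idea on a toy rating matrix; mean-filling the unrated cells and using a plain truncated SVD are simplifying assumptions (surprise instead fits the factors by gradient descent on observed ratings only):

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); rows are users, columns are songs.
R = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Low-rank reconstruction via truncated SVD, keeping k latent factors.
k = 2
filled = np.where(R > 0, R, R[R > 0].mean())  # mean-fill unrated cells
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
R_hat = U[:, :k] * s[:k] @ Vt[:k, :]  # predicted ratings for every cell

# Recommend user 0 their highest-predicted unrated song.
user = 0
unrated = np.where(R[user] == 0)[0]
best = unrated[np.argmax(R_hat[user, unrated])]
print("recommend song index:", best)
```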
- Integration
def recommend_music(model, user_preferences):
    # Your recommendation logic here, e.g. scoring candidate items for
    # the user and returning the top-rated ones
    recommendations = model.predict(user_preferences)
    return recommendations
def main(audio_path):
    text_input = voice_to_text(audio_path)
    processed_text = process_text(text_input)
    # Assuming you have a dataset for collaborative filtering
    data = load_your_music_data()
    collaborative_model = train_collaborative_filtering_model(data)
    # Note: SVD.predict expects a (user_id, item_id) pair, so in practice
    # the processed text must first be mapped to a user or item identifier.
    recommendations = recommend_music(collaborative_model, processed_text)
    print("Recommended Music:")
    for music in recommendations:
        print(music)
4. DATA AUGMENTATION
Implementing a voice-to-text-based music recommendation system with machine learning and data augmentation involves several steps. Below is a high-level overview along with some code snippets:
- Data Collection:
Collect a dataset of audio samples along with corresponding text labels (e.g., song names, artist names, etc.).
Make sure to have a diverse dataset to enhance the model’s generalization.
- Speech-to-Text (STT) and Data Augmentation:
Use a library like 'pydub' for audio processing and augment the data with variations in pitch, speed, and background noise.
from pydub import AudioSegment
def pitch_shift(audio, semitones):
    # Shift pitch by resampling at a scaled frame rate, then restore the
    # original frame rate so playback duration stays comparable
    return audio._spawn(audio.raw_data, overrides={
        "frame_rate": int(audio.frame_rate * (2 ** (semitones / 12.0)))
    }).set_frame_rate(audio.frame_rate)
def speed_change(audio, speed_factor):
    return audio.speedup(playback_speed=speed_factor)
def apply_noise(audio, noise_level):
    # Add noise to the audio; the implementation depends on the noise source
    return audio  # placeholder
def augment_data(audio_path):
    audio = AudioSegment.from_file(audio_path)
    augmented_audio = pitch_shift(audio, semitones=2)
    augmented_audio = speed_change(augmented_audio, speed_factor=1.5)
    augmented_audio = apply_noise(augmented_audio, noise_level=0.005)
    return augmented_audio
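apply_noise is left open above because it depends on the noise source. One simple option is mixing in white Gaussian noise. The sketch below operates on a raw float waveform as a NumPy array rather than a pydub AudioSegment — a simplifying assumption; with pydub you would convert the segment's samples to an array first:

```python
import numpy as np

def add_gaussian_noise(samples, noise_level=0.005, seed=0):
    """Mix white Gaussian noise into a float waveform in [-1, 1].

    noise_level scales the noise standard deviation relative to the
    signal's peak amplitude.
    """
    rng = np.random.default_rng(seed)
    peak = np.max(np.abs(samples)) or 1.0
    noise = rng.normal(0.0, noise_level * peak, size=samples.shape)
    return np.clip(samples + noise, -1.0, 1.0)

# Toy "waveform": one second of a 440 Hz sine at a 16 kHz sample rate.
t = np.linspace(0, 1, 16000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = add_gaussian_noise(clean, noise_level=0.005)
print(noisy.shape)
```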
- Natural Language Processing (NLP):
Extract meaningful information from the transcribed text using an NLP library.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
def process_text(text):
    tokens = word_tokenize(text)
    # Additional processing as needed
    return tokens
- Music Recommendation Model with Data Augmentation:
Train or use a pre-trained recommendation model. Integrate the augmented data into the training pipeline.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
def train_model(X_train, y_train):
    # Train your model (you may use a different algorithm)
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    return model
def main(audio_path, original_text):
    augmented_audio = augment_data(audio_path)
    augmented_text = process_text(original_text)
    # Load your pre-trained model or train a new one
    # X_train, y_train = load_training_data()
    # model = train_model(X_train, y_train)
    # Make predictions using augmented data
    # prediction = model.predict(augmented_features)
    # Display or use the recommendations based on the prediction
    # print("Recommended Music:")
    # for music in recommendations:
    #     print(music)
5. FEATURE EXTRACTION
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
import joblib  # For saving the trained model (sklearn.externals.joblib was removed)
# Assume you have a dataset with labeled music data, where 'text' is the voice-to-text description and 'genre' is the music genre.
# You need to replace this with your dataset.
data = {
'text': ['upbeat and energetic', 'calm and soothing', 'fast-paced with strong beats', 'slow and melodic'],
'genre': ['pop', 'ambient', 'electronic', 'classical']
}
df = pd.DataFrame(data)
# Split the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(df['text'], df['genre'], test_size=0.2, random_state=42)
# Create a pipeline with TF-IDF vectorizer and a Random Forest classifier
model = Pipeline([
('tfidf', TfidfVectorizer()),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])
# Train the model
model.fit(train_data, train_labels)
# Make predictions on the test set
predictions = model.predict(test_data)
# Evaluate the model
accuracy = accuracy_score(test_labels, predictions)
print(f'Model Accuracy: {accuracy}')
# Save the trained model for later use
joblib.dump(model, 'music_recommendation_model.pkl')
# Now, you can use the trained model to make predictions on new voice-to-text descriptions
# For example, load the model and use it to recommend a music genre for a new description
loaded_model = joblib.load('music_recommendation_model.pkl')
new_description = ['upbeat and lively']
predicted_genre = loaded_model.predict(new_description)
print(f'Recommended Genre: {predicted_genre[0]}')
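Classifying a description into a single genre is one option; a content-based alternative is to rank tracks directly by TF-IDF cosine similarity between the transcribed description and per-track text. A sketch on a hypothetical catalog (the titles and descriptions are made up for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalog of tracks with short text descriptions.
catalog = pd.DataFrame({
    'title': ['Track A', 'Track B', 'Track C', 'Track D'],
    'description': ['upbeat and energetic dance', 'calm and soothing piano',
                    'fast-paced with strong beats', 'slow and melodic strings'],
})

vectorizer = TfidfVectorizer(stop_words='english')
catalog_matrix = vectorizer.fit_transform(catalog['description'])

def recommend_by_similarity(query, top_n=2):
    # Rank tracks by cosine similarity between the query and each description.
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, catalog_matrix).ravel()
    top = scores.argsort()[::-1][:top_n]
    return catalog['title'].iloc[top].tolist()

print(recommend_by_similarity('upbeat and lively dance music'))  # 'Track A' ranks first
```

Unlike the classifier, this returns a ranked list of tracks rather than one genre label, so it degrades gracefully when a description matches several styles.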
6. TRAINING MODELS
Implementation Code:
import os
import tkinter as tk
from tkinter import filedialog
from tkinter import scrolledtext
import speech_recognition as sr
from pydub import AudioSegment
import pandas as pd
import pygame
def recognize_speech():
    recognizer = sr.Recognizer()
    audio_file = filedialog.askopenfilename(title="Select Audio File")
    with sr.AudioFile(audio_file) as source:
        audio_data = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio_data)
        text_box.delete('1.0', tk.END)
        text_box.insert(tk.END, text)
    except sr.UnknownValueError:
        text_box.delete('1.0', tk.END)
        text_box.insert(tk.END, "Could not understand the audio")
    except sr.RequestError:
        text_box.delete('1.0', tk.END)
        text_box.insert(tk.END, "Could not request results")
def play_song():
    selected_song = song_listbox.get(tk.ACTIVE)
    if selected_song and dataset_folder:
        song_path = os.path.join(dataset_folder, selected_song)
        pygame.mixer.music.load(song_path)
        pygame.mixer.music.play()
def voice_to_text(audio_path):
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio_data = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio_data)
        return text
    except sr.UnknownValueError:
        return "Could not understand the audio"
    except sr.RequestError:
        return "Could not request results"
def load_music_files(folder):
    return [file for file in os.listdir(folder) if file.endswith(".mp3")]
def recommend_song():
    global dataset_folder  # remember the folder so play_song can find the file
    dataset_folder = filedialog.askdirectory(title="Select Music Dataset Folder")
    if not dataset_folder:
        song_listbox.insert(tk.END, "Please select a valid dataset folder.")
    else:
        song_files = [f for f in os.listdir(dataset_folder) if f.endswith(".mp3")]
        if not song_files:
            song_listbox.insert(tk.END, "No .mp3 files found in the dataset folder.")
        else:
            song_listbox.delete(0, tk.END)
            for song in song_files:
                song_listbox.insert(tk.END, song)
pygame.mixer.init()
root = tk.Tk()
root.title("VOICE-TO-TEXT-Based Music Recommendation System")
dataset_folder = None
# Create GUI elements
browse_button = tk.Button(root, text="Browse Audio File", command=recognize_speech)
browse_button.pack()
text_box = scrolledtext.ScrolledText(root, width=30, height=10)
text_box.pack()
music_listbox = tk.Listbox(root, selectmode=tk.SINGLE)
music_folder = filedialog.askdirectory(title="Select Music Folder")
music_files = [os.path.join(music_folder, file) for file in os.listdir(music_folder) if file.endswith(".mp3")]
for music_file in music_files:
    music_listbox.insert(tk.END, music_file)
music_listbox.pack()
recommend_button = tk.Button(root, text="Recommend Songs", command=recommend_song)
recommend_button.pack()
song_listbox = tk.Listbox(root, width=50, height=10)
song_listbox.pack()
play_button = tk.Button(root, text="Play Selected Song", command=play_song)
play_button.pack()
root.mainloop()
7. RESULTS
Developed a voice-to-text model using machine learning to convert user input into text. Integrated natural language processing to understand user preferences. Implemented a recommendation system to suggest music based on analyzed text inputs.
8. CONCLUSION
In conclusion, the voice-to-text-based music recommendation system leverages machine learning to seamlessly integrate user preferences with cutting-edge technology. By converting spoken input into personalized music suggestions, this innovative approach enhances user experience and exemplifies the potential of AI in tailoring music recommendations to individual tastes. As the system continuously refines its suggestions through user interactions, it showcases the evolving synergy between natural language processing and music discovery.