Video Summarization Using OpenAI Whisper & Hugging Chat API


Introduction

“Less is more,” as architect Ludwig Mies van der Rohe famously said, and that is what summarization is about. Summarization is a critical tool for reducing voluminous text into succinct, relevant morsels, suited to today’s fast-paced information consumption. In text applications, summarization aids information retrieval and supports decision-making. The integration of Generative AI, such as OpenAI’s GPT-3-based models, has revolutionized this process by not only extracting key elements from text but also generating coherent summaries that retain the source’s essence. Interestingly, Generative AI’s capabilities extend beyond text to video summarization. This involves extracting pivotal scenes, dialogues, and ideas from videos to create abridged representations of the content. You can achieve video summarization in many different ways, including generating a short summary video, performing video content analysis, highlighting key sections of the video, or creating a textual summary of the video using video transcription.

The OpenAI Whisper API leverages automatic speech recognition technology to convert spoken language into written text, thereby increasing the accuracy and efficiency of text summarization. The Hugging Face Chat API, on the other hand, provides access to state-of-the-art open-source large language models.

Learning Objectives

In this article, we will:

  • Learn about video summarization techniques
  • Understand the applications of video summarization
  • Explore the OpenAI Whisper model architecture
  • Learn to implement textual video summarization using the OpenAI Whisper and Hugging Chat APIs

This article was published as a part of the Data Science Blogathon.

Video Summarization Techniques

Video Analytics

Video analytics involves extracting meaningful information from a video. Deep learning is used to track and identify objects and actions in a video and to identify the scenes. Some of the popular techniques for video summarization are:

Keyframe Extraction and Shot Boundary Detection

This process involves converting the video into a limited number of still images. Video skim is another term for this shorter video of keyshots.

Video shots are uninterrupted, continuous sequences of frames. Shot boundary detection identifies transitions between shots, such as cuts, fades, or dissolves, and chooses frames from each shot to build a summary. Below are the main steps to extract a continuous short video summary from a longer video (a code sketch of the first two steps follows the list):

  • Frame Extraction – Snapshots are extracted from the video; for example, we can sample 1 fps from a 30 fps video.
  • Face and Emotion Detection – We then extract faces from the frames and score their emotions to obtain emotion scores. Face detection can be done using an SSD (Single Shot Multibox Detector).
  • Frame Scoring & Selection – Select the frames that have a high emotion score and then rank them.
  • Final Extraction – We extract subtitles from the video along with timestamps. We then extract the sentences corresponding to the frames selected above, along with their start and end times in the video. Finally, we merge the video segments corresponding to these intervals to generate the final summary video.
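
The sketch below illustrates the frame-extraction and face-detection steps under stated assumptions: it uses OpenCV with its bundled Haar cascade face detector as a simple stand-in for the SSD detector mentioned above, and the file name input_video.mp4 and the 1 fps sampling rate are only illustrative.

import cv2

# Open the (assumed) input video and sample roughly 1 frame per second
video = cv2.VideoCapture('input_video.mp4')
fps = int(video.get(cv2.CAP_PROP_FPS)) or 30
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

keyframes = []
frame_idx = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_idx % fps == 0:  # keep one frame per second
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, 1.1, 5)
        # A full pipeline would score facial emotions here; this sketch simply
        # keeps frames that contain at least one detected face.
        if len(faces) > 0:
            keyframes.append((frame_idx / fps, frame))
    frame_idx += 1
video.release()
print(f'Selected {len(keyframes)} candidate keyframes')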

Action Recognition and Temporal Subsampling

Here we try to identify the human actions performed in the video, which is a widely used application of video analytics. We break the video into small subsequences instead of individual frames and try to estimate the action performed in each segment using classification and pattern-recognition techniques such as HMC (Hidden Markov Chain analysis).
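
A rough illustration of temporal subsampling follows: the video is treated as a list of frames and split into fixed-length, non-overlapping segments that an action classifier could then label. The segment length of 16 frames is an assumed value, not one prescribed by any particular method.

def temporal_subsample(frames, segment_length=16):
    """Split a sequence of frames into non-overlapping fixed-length segments."""
    segments = []
    for start in range(0, len(frames) - segment_length + 1, segment_length):
        segments.append(frames[start:start + segment_length])
    return segments

# Example: 100 dummy frames -> 6 segments of 16 frames each
dummy_frames = list(range(100))
segments = temporal_subsample(dummy_frames)
print(len(segments), len(segments[0]))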

Single and Multi-modal Approaches

In this article, we use a single-modal approach, in which we use the audio of the video to create a textual summary. Here we use a single aspect of the video, its audio, convert it to text, and then generate a summary from that text.

In a multi-modal approach, we combine information from multiple modalities such as audio, visuals, and text, giving a holistic view of the video content for more accurate summarization.
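
A full multi-modal pipeline is beyond the scope of this article, but the hypothetical sketch below conveys the idea: per-segment importance scores from the audio, visual, and text modalities are fused with assumed weights, and the top-scoring segments are kept for the summary. All scores, weights, and the number of selected segments are illustrative.

# Hypothetical per-segment importance scores from three modalities
audio_scores  = [0.2, 0.9, 0.4, 0.7]   # e.g. speech emphasis / loudness
visual_scores = [0.5, 0.6, 0.8, 0.3]   # e.g. scene changes / facial emotion
text_scores   = [0.1, 0.8, 0.5, 0.9]   # e.g. keyword density in subtitles

weights = (0.3, 0.3, 0.4)  # assumed fusion weights

fused = [
    weights[0] * a + weights[1] * v + weights[2] * t
    for a, v, t in zip(audio_scores, visual_scores, text_scores)
]

# Keep the two highest-scoring segments for the summary
top_segments = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:2]
print(top_segments)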

Applications of Video Summarization

Before diving into the implementation of our video summarization, we should first look at its applications. Below are some examples of video summarization across a variety of fields and domains:

  • Security and Surveillance: Video summarization allows us to analyze large amounts of surveillance video and highlight important events without manually reviewing the footage.
  • Education and Training: One can deliver key notes and training videos so that students can revise the video contents without going through the whole video.
  • Content Searching: YouTube uses this to highlight the parts of a video relevant to a user’s search, allowing users to decide whether they want to watch that particular video based on their search requirements.
  • Disaster Management: In emergencies and crises, video summarization allows action to be taken based on the situations highlighted in the video summary.

OpenAI Whisper Model Overview

OpenAI’s Whisper model is an automatic speech recognition (ASR) model. It is used for transcribing speech audio into text.

Architecture of the OpenAI Whisper Model

It is based on the transformer architecture, which stacks encoder and decoder blocks with an attention mechanism that propagates information between them. It takes the audio recording, splits it into 30-second chunks, and processes each one individually. For each 30-second chunk, the encoder encodes the audio and preserves the position of each spoken word, and the decoder uses this encoded information to determine what was said.

The decoder predicts tokens from all of this information, which are basically the words pronounced. It then repeats this process for the following word, using all the same information to help it determine the next one that makes the most sense.
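
This chunk-wise encode/decode behaviour can be seen in Whisper’s lower-level Python API, sketched below: the audio is padded or trimmed to a single 30-second window, converted to the log-Mel spectrogram that the encoder consumes, and the decoder then predicts the text tokens. The file path is assumed.

import whisper

model = whisper.load_model("tiny")

# Load the audio and pad/trim it to one 30-second window
audio = whisper.load_audio("audio/sample.m4a")  # assumed file path
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram fed to the encoder
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The decoder predicts tokens (the spoken words) from the encoded audio
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)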

Whisper model task flowchart

Coding Example for Textual Video Summarization

Flowchart of Textual Video Summarization

1 – Install and Load Libraries

!pip install yt-dlp openai-whisper hugchat
import yt_dlp
import whisper
from hugchat import hugchat

2 – Download Audio from the YouTube Video

# Function for saving the audio from an input YouTube video id
def download(video_id: str) -> str:
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    return f'audio/{video_id}.m4a'


# Call the function with the video id
file_path = download('A_JQK_k4Kyc')

3 – Transcribe Audio to Text Using Whisper

# Load the Whisper model
whisper_model = whisper.load_model("tiny")

# Transcribe audio function
def transcribe(file_path: str) -> str:
    # `fp16` defaults to `True`, which tells the model to attempt to run on GPU.
    transcription = whisper_model.transcribe(file_path, fp16=False)
    return transcription['text']


# Call the transcriber function with the file path of the audio
transcript = transcribe('/content/audio/A_JQK_k4Kyc.m4a')
print(transcript)

4 – Summarize Transcribed Text Using Hugging Chat

Note: to use the Hugging Chat API, we need to log in or sign up on the Hugging Face platform. After that, in place of “username” and “password”, we need to pass in our Hugging Face credentials.

from hugchat.login import Login

# Log in to Hugging Face
sign = Login("username", "password")
cookies = sign.login()
sign.saveCookiesToDir("/content")

# Load cookies from the cookie directory
cookies = sign.loadCookiesFromDir("/content") # This detects whether the JSON file exists, returns the cookies if it does, and raises an Exception if it does not.

# Create a ChatBot
chatbot = hugchat.ChatBot(cookies=cookies.get_dict())  # or cookie_path="usercookies/<email>.json"
print(chatbot.chat("Hi!"))

# Summarize the transcript
print(chatbot.chat('''Summarize the following :-''' + transcript))

Conclusion

In conclusion, summarization is a transformative force in information management. It is a powerful tool that distills voluminous content into concise, meaningful forms, tailored to the fast-paced consumption of today’s world.

Through the integration of Generative AI models like OpenAI’s GPT-3, summarization has transcended its traditional boundaries, evolving into a process that not only extracts but also generates coherent and contextually accurate summaries.

The journey into video summarization reveals its relevance across numerous sectors. The implementation shows how audio extraction, transcription using Whisper, and summarization through Hugging Face Chat can be seamlessly integrated to create textual video summaries.

Key Takeaways

1. Generative AI: Video summarization can be achieved using generative AI technologies such as LLMs and ASR.

2. Applications in Fields: Video summarization is useful in many important fields where one has to analyze large amounts of video to mine essential information.

3. Basic Implementation: In this article, we explored a basic code implementation of video summarization based on the audio dimension.

4. Model Architecture: We also learned about the basic architecture of the OpenAI Whisper model and its process flow.

Frequently Asked Questions

Q1. What are the limits of the Whisper API?

A. The Whisper API call limit is 50 calls per minute. There is no audio length limit, but only files up to 25 MB can be uploaded. One can reduce the file size of the audio by reducing its bitrate, as shown below.
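
For example, assuming ffmpeg is available (it comes preinstalled on Colab), a command like the one below re-encodes the audio at a lower bitrate to bring the file under the 25 MB limit; the file names and the 32 kbps bitrate are only illustrative.

!ffmpeg -i audio/A_JQK_k4Kyc.m4a -b:a 32k audio/A_JQK_k4Kyc_small.m4a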

Q2. Which file formats does the Whisper API support?

A. The following file formats: m4a, mp3, webm, mp4, mpga, wav, and mpeg.

Q3. What are the alternatives to the Whisper API?

A. Some of the major alternatives for automatic speech recognition are Twilio Voice, Deepgram, Azure Speech-to-Text, and Google Cloud Speech-to-Text.

Q4. What are the limitations of automatic speech recognition (ASR) systems?

A. One of the challenges is comprehending diverse accents of the same language; another is the need for specialized training for applications in specialized fields.

Q5. What are the alternatives to automatic speech recognition (ASR)?

A. Advanced research is taking place in the field of speech recognition, such as decoding imagined speech from EEG signals using neural architectures. This allows people with speech disabilities to communicate their thoughts to the external world with the help of devices. One such interesting paper is here.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
