jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!

License: MIT License

Python 99.87% Shell 0.13%
youtube-api subtitles youtube transcripts youtube-subtitles youtube-transcripts python transcript subtitle cli

youtube-transcript-api's People

Contributors

crhowell, daflh, danielcliu, dannylagrouw, eseiver, esha71, jdepoix, jheasly, liamrs222, maja-lin, majamil16, nbonato, vandivier, xenova

youtube-transcript-api's Issues

Allow for Age Restricted Videos to be Accessed

If you try to access an age-restricted video, like GU7qYaYzd-0, without cookies, you get back the "The video is no longer available" error. There should be functionality that lets users load their YouTube cookies so they have the authorization required to access age-restricted videos.

I have almost finished a small PR to complete this, so if you're ok with it I will open up said PR.
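
A minimal sketch of how browser-exported cookies could be loaded, assuming they are stored in a Netscape-format cookies.txt file. It uses only the standard library's http.cookiejar. The helper name load_cookie_jar is hypothetical, and this is not necessarily how the proposed PR does it.

```python
from http.cookiejar import MozillaCookieJar


def load_cookie_jar(path):
    """Load a Netscape-format cookies.txt file (hypothetical helper)."""
    jar = MozillaCookieJar()
    # Raises http.cookiejar.LoadError if the file is malformed and
    # FileNotFoundError if the path does not exist.
    jar.load(path)
    return jar
```

The resulting jar could then be handed to whatever HTTP client performs the request, so that the cookies are sent along with it.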

Can't get the youtube transcript

Hello, admin. I have Python 2.7.5 and pip 20.0.2 (from /usr/lib/python2.7/site-packages/pip, Python 2.7) with the latest youtube_transcript_api installed on CentOS 7. I also tried Python 3.7, but had no luck getting it to work. Any help would be appreciated!

[root@lofidi ~]# youtube_transcript_api --list-transcripts PT2_F-1esPk

Could not retrieve a transcript for the video https://www.youtube.com/watch?v=PT2_F-1esPk! This is most likely caused by:

The video is no longer available

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

Error when using from a SSL secure server

Hey! I have been using the API from my local machine and it works smoothly, but when I try to use it on an Ubuntu server with SSL enabled I run into the following error:

requests.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')

If I override the request verification by changing lines 56 and 258 of _transcripts.py to
get( *URL REQUEST*, verify=False)
I don't get the error, but that is a big security issue.

Do you have any solution in mind? Thanks!! And great job, love the work done here :D
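
One option that avoids verify=False, sketched here under assumptions: the requests library reads the REQUESTS_CA_BUNDLE environment variable, so pointing it at a CA bundle the server trusts may let verification succeed. The helper name use_ca_bundle is hypothetical, and whether this resolves the handshake error on that particular server is not confirmed.

```python
import os


def use_ca_bundle(bundle_path):
    """Point requests at a CA bundle via REQUESTS_CA_BUNDLE (hypothetical helper)."""
    if not os.path.isfile(bundle_path):
        raise FileNotFoundError(bundle_path)
    # requests consults this variable as long as the session trusts
    # the environment, which is its default behaviour.
    os.environ["REQUESTS_CA_BUNDLE"] = bundle_path
    return bundle_path
```

This has to run before the first request is made, and it keeps certificate verification switched on.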

Unable to catch CouldNotRetrieveTranscript exception

As far as I can tell, there is no way to cleanly handle an exception from the get_transcript method.

from youtube_transcript_api import YouTubeTranscriptApi as ytcc

try:
    captions = ytcc.get_transcript("MHTizZ_XcUM", languages=["fr"])
except:
    captions = ytcc.get_transcript("MHTizZ_XcUM", languages=["en"])

The first call is expected to fail because the video does not have French captions, so I would like to handle that by getting the English captions instead. But this is what is printed out:

Could not get the transcript for the video https://www.youtube.com/watch?v=MHTizZ_XcUM! This usually happens if one of the following things is the case:
 - subtitles have been disabled by the uploader
 - none of the language codes you provided are valid
 - none of the languages you provided are supported by the video
 - the video is no longer available.

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues

Is this intentional or am I missing something? I would like to be able to handle the exception without anything being printed out.
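
A minimal sketch of a language fallback that catches the failure explicitly. The helper first_available is hypothetical; fetch stands for any callable with the (video_id, languages=[...]) shape used above, presumably YouTubeTranscriptApi.get_transcript in practice.

```python
def first_available(fetch, video_id, languages, expected_errors=(Exception,)):
    """Try each language in order and return (language, transcript).

    `fetch` is any callable taking (video_id, languages=[...]); passing
    YouTubeTranscriptApi.get_transcript here is an assumption.
    """
    last_error = None
    for language in languages:
        try:
            return language, fetch(video_id, languages=[language])
        except expected_errors as error:
            # Remember the failure and move on to the next language.
            last_error = error
    raise LookupError("no transcript in any of %r" % (languages,)) from last_error
```

Since get_transcript takes a list of languages, passing languages=["fr", "en"] in a single call may already cover this case without a wrapper.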

CLI support for preferred language

In v0.1.2 support for providing a list of preferred languages was added. However this is not accessible using the CLI as of now, although it should be.

Fetch video title

Hi! I was looking for a way to retrieve the video title but couldn't find one. Is there a way to achieve this that I'm missing? I know it's not the purpose of this API, but it would be cool if it could, so I don't also have to install youtube-dl or something like that just for the video title.
Thanks!!

works for some videos and not others

Hi, I wrote previously when this wasn't working earlier this week; the author pushed an update and that worked great. Thank you!

It now seems that I am getting the same error as last time, but only for certain videos. I cannot get transcripts for most WatchMojo videos on YouTube, while I can for most Last Week Tonight with John Oliver episodes.

Any idea why some might not be accessible? I can see their transcripts on the actual website.

Thanks.

Below is the error I get:
ParseError Traceback (most recent call last)
/anaconda3/envs/insight/lib/python3.7/site-packages/youtube_transcript_api/_api.py in get_transcript(cls, video_id, languages, proxies)
92 try:
---> 93 return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
94 except Exception:

/anaconda3/envs/insight/lib/python3.7/site-packages/youtube_transcript_api/_api.py in parse(self)
159 }
--> 160 for xml_element in ElementTree.fromstring(self.plain_data)
161 if xml_element.text is not None

/anaconda3/envs/insight/lib/python3.7/xml/etree/ElementTree.py in XML(text, parser)
1314 parser = XMLParser(target=TreeBuilder())
-> 1315 parser.feed(text)
1316 return parser.close()

ParseError: not well-formed (invalid token): line 2, column 972

During handling of the above exception, another exception occurred:

CouldNotRetrieveTranscript Traceback (most recent call last)
in
----> 1 Confed_transcript=YouTubeTranscriptApi.get_transcript('-VidQFzpW7M')

/anaconda3/envs/insight/lib/python3.7/site-packages/youtube_transcript_api/_api.py in get_transcript(cls, video_id, languages, proxies)
93 return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
94 except Exception:
---> 95 raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)
96
97

CouldNotRetrieveTranscript: Could not get the transcript for the video https://www.youtube.com/watch?v=-VidQFzpW7M! This usually happens if one of the following things is the case:

  • subtitles have been disabled by the uploader
  • none of the language codes you provided are valid
  • none of the languages you provided are supported by the video
  • the video is no longer available.

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues

Can't get the transcripts anymore

Having trouble getting transcripts, it was working fine yesterday.

I get the following error:

TypeError Traceback (most recent call last)
/anaconda3/envs/insight/lib/python3.7/site-packages/youtube_transcript_api/_api.py in get_transcript(cls, video_id, languages, proxies)
92 try:
---> 93 return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
94 except Exception:

/anaconda3/envs/insight/lib/python3.7/site-packages/youtube_transcript_api/_api.py in parse(self)
152 }
--> 153 for xml_element in ElementTree.fromstring(self.plain_data)
154 if xml_element.text is not None

/anaconda3/envs/insight/lib/python3.7/xml/etree/ElementTree.py in XML(text, parser)
1314 parser = XMLParser(target=TreeBuilder())
-> 1315 parser.feed(text)
1316 return parser.close()

TypeError: a bytes-like object is required, not 'NoneType'

During handling of the above exception, another exception occurred:

CouldNotRetrieveTranscript Traceback (most recent call last)
in
----> 1 comments('VjizMuzCltY')

in comments(video_id)
247 def comments(video_id): #request
248 #video_id=request.GET['video_id']
--> 249 tran= get_transcript_df(video_id)
250 comments= get_comments_df(video_id)
251

in get_transcript_df(video_id_input)
148 duration=[]
149
--> 150 for dic in YouTubeTranscriptApi.get_transcript(video_id_input):
151 text+=[dic['text']]
152 start+=[dic['start']]

/anaconda3/envs/insight/lib/python3.7/site-packages/youtube_transcript_api/_api.py in get_transcript(cls, video_id, languages, proxies)
93 return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
94 except Exception:
---> 95 raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)
96
97

CouldNotRetrieveTranscript: Could not get the transcript for the video https://www.youtube.com/watch?v=VjizMuzCltY! This usually happens if one of the following things is the case:

  • subtitles have been disabled by the uploader
  • none of the language codes you provided are valid
  • none of the languages you provided are supported by the video
  • the video is no longer available.

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues
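
The traceback above shows ElementTree.fromstring receiving None, which produces the TypeError. A minimal sketch of a parser with an explicit guard follows; the function name and the <text start="..." dur="..."> layout are assumptions made for illustration, not the library's confirmed format.

```python
from xml.etree import ElementTree


def parse_transcript_xml(plain_data):
    """Parse transcript XML into a list of dicts (illustrative sketch).

    The <text start="..." dur="..."> layout is an assumption for this example.
    """
    if plain_data is None:
        # Without this guard, ElementTree.fromstring(None) raises the
        # TypeError seen in the traceback.
        raise ValueError("no transcript data was fetched")
    return [
        {
            "text": element.text,
            "start": float(element.attrib["start"]),
            "duration": float(element.attrib["dur"]),
        }
        for element in ElementTree.fromstring(plain_data)
        if element.text is not None
    ]
```

A guard like this would not make the transcript retrievable, but it would separate "nothing was fetched" from "the data could not be parsed".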

Getting an "unavailable" error for a language code, while the list shows the code is available

I started with the example provided (which is actually missing the "import .." part). Code:

# retrieve the available transcripts
from youtube_transcript_api import YouTubeTranscriptApi

video_id='77rjqnNsP8Q'
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

# iterate over all available transcripts
for transcript in transcript_list:

    # the Transcript object provides metadata properties
    print(
        transcript.video_id,
        transcript.language,
        transcript.language_code,
        # whether it has been manually created or generated by YouTube
        transcript.is_generated,
        # whether this transcript can be translated or not
        transcript.is_translatable,
        # a list of languages the transcript can be translated to
        transcript.translation_languages,
    )

    # fetch the actual transcript data
    print(transcript.fetch())

    # translating the transcript will return another transcript object
    print(transcript.translate('en').fetch())

# you can also directly filter for the language you are looking for, using the transcript list
transcript = transcript_list.find_transcript(['ru', 'en'])

# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['ru', 'en'])

# or automatically generated ones
transcript = transcript_list.find_generated_transcript(['ru', 'en'])

and I got this error:

Could not retrieve a transcript for the video https://www.youtube.com/watch?v=77rjqnNsP8Q! This is most likely caused by:

No transcripts were found for any of the requested language codes: ['ru', 'en']

For this video (77rjqnNsP8Q) transcripts are available in the following languages:

(MANUALLY CREATED)
None

(GENERATED)
 - en ("English (auto-generated)")[TRANSLATABLE]

(TRANSLATION LANGUAGES)
 - af ("Afrikaans")
 - sq ("Albanian")
 - am ("Amharic")
 - ar ("Arabic")
...
 - pa ("Punjabi")
 - ro ("Romanian")
 - ru ("Russian")
 - sm ("Samoan")
 - gd ("Scottish Gaelic")
...
 - zu ("Zulu")

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

What I don't understand is that the output
a) states it cannot find EN or RU
b) lists EN and RU as available

Either one of these statements is false, or am I doing something wrong?

pip list|fgrep youtube && python --version
youtube-transcript-api 0.3.1
Python 3.8.1
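
One possible reading of the output, stated as an assumption: the video only has a generated English transcript, and Russian appears only as a translation target, so a lookup restricted to manually created transcripts finds nothing for ['ru', 'en']. The toy model below illustrates that distinction; pick_transcript and its (language_code, kind) data model are hypothetical and not the library's implementation.

```python
def pick_transcript(available, languages, allow_generated=True):
    """Return the first (code, kind) match, preferring manual per language.

    `available` is a list of (language_code, kind) tuples with kind either
    "manual" or "generated"; this data model is hypothetical.
    """
    kinds = ["manual", "generated"] if allow_generated else ["manual"]
    for code in languages:
        for kind in kinds:
            if (code, kind) in available:
                return code, kind
    return None


available = [("en", "generated")]  # what the error output lists for this video
print(pick_transcript(available, ["ru", "en"]))
print(pick_transcript(available, ["ru", "en"], allow_generated=False))
```

Under this reading, a Russian version would come from translating the English transcript, as in transcript.translate('en') in the example, rather than from a direct lookup.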

Only get videos by language

Hi, this is a wonderful API.

Just wondering, is there a way for me to only get videos in a specific language?

For example, today I only want to download random videos with manual English subtitles (hence the video is in English too).

Tomorrow I only want to download transcripts for videos with manual Mandarin subtitles (hence the video is in Chinese too).

Shows No Video Found even if video exists

I was creating transcripts for around 10k videos using a cron job. After creating around 260 transcripts, this library started returning a "video not found" error. However, I checked that the video existed and that its CC also existed. Is there a rate limit imposed by YouTube? If yes, can you describe it in detail and explain how I can work around it?
Also, shouldn't the library throw a 429 error in that case? Please reply to this as it is urgent.
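
Whether YouTube applies a rate limit here is not confirmed in this thread. If it does, a generic retry with exponential backoff is one common mitigation. The sketch below is library-agnostic: fetch_with_backoff is a hypothetical helper and fetch is any callable that takes a video ID.

```python
import time


def fetch_with_backoff(fetch, video_id, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(video_id), retrying with exponential backoff on any exception."""
    for attempt in range(retries + 1):
        try:
            return fetch(video_id)
        except Exception:
            if attempt == retries:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

The sleep parameter is injectable so the helper can be tried out without actually waiting.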

get_transcripts() doesn't work for me

Hi,
First off: great script! This is exactly what I was looking for!
The get_transcript() command also works wonderfully for me. However, if I try to use a list of ~200 videos, I get errors.
Let's say I try:
YouTubeTranscriptApi.get_transcripts("Video ID1","Video ID2")
Then I get error messages like this one:

Could not get the transcript for the video https://www.youtube.com/watch?v=K! Most likely subtitles have been disabled by the uploader or the video is no longer available.
The errors start with the first letter of the ID and continue, one character at a time, until the first ID is used up.
Am I doing something wrong?
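
The URL ending in watch?v=K suggests the first argument was iterated character by character. Assuming get_transcripts expects a list of IDs as its first argument, the call would be get_transcripts(["Video ID1", "Video ID2"]). The sketch below shows the pitfall and a hypothetical guard, normalize_video_ids, against it.

```python
def normalize_video_ids(video_ids):
    """Return a list of IDs, wrapping a lone string instead of iterating its characters."""
    if isinstance(video_ids, str):
        return [video_ids]
    return list(video_ids)


# Iterating a string yields single characters, which matches the
# one-letter "video IDs" in the error messages.
print(list("K1a"))
print(normalize_video_ids("K1a"))
```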

Getting transcripts in other language

Hi, thanks for this package. I'm observing that when I pass a list of video IDs, I get transcripts in other languages. Is there any way to restrict them to English?

For example:
For this video ID : GJLlxj_dtq8

I got the below transcript (sample of it):
'嘿,这里是Dave2D 这是微软的Surface Go,当他们发布这款产品的时候我就对它特别感兴趣。 在一段时间的体验后,感觉非常有吸引力的一款设备 我真的认为这是微软这么久以来发布的最好的产品,它的起售价为400美元(约等于2724RMB) 尽管我不认为你应该去买基础配置款,但他们有中等配置 550美元,稍微贵了一些,但是你可以得到两倍的运行内存,两倍的储存空间,而且值得注意的是更快的储存 如果你能付得起的话,那一款配置是值得大多数人购买的 这里这款中等配置的机型 550美元,我 真的喜欢它。好,让我们来看看它的外观。这款设备的制造质量非常好。它是一款SURFACE系列的产品 它有一个合身的镁制外壳,完成度非常高 这个 正面的屏幕的四周有圆角包边,这样确实能够让这款设备握着更加舒适 不像最早的

Not working for some videos

I used this library before and it worked for all of the videos in my list. But now it shows me this error for some of the videos:

 - subtitles have been disabled by the uploader
 - none of the language codes you provided are valid
 - none of the languages you provided are supported by the video
 - the video is no longer available.

E.g., it fails to retrieve the transcript for this video: https://www.youtube.com/watch?v=I5_rWPxJwk8

Not working?

I have absolutely loved working with this library for what I'm researching so thank you.

That stated, today while using 0.1.9, I'm getting the exception "raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)" for every video I've tried including videos that were collected as recently as yesterday.

I already checked whether a new version had come out (which I also verified here) and fully rebooted the server to rule out anything on my side that might be causing this.

Is anyone else also having this issue currently?

get_transcript returning 3 copies of each dictionary

Hello, and thank you for releasing this package, it is awesome.

I have a problem when trying to concatenate all of the text to create a full text copy of the transcript. When I loop through the transcript returned by get_transcript, I get 3 copies of each dictionary. Here is my code:

with open(filename, 'r') as event:
    parsed_event = json.load(event)
    url_data = parse_qs(parsed_event["url"])
    video_id = url_data[next(iter(url_data))][0] # gets video id from youtube url
    transcript = transcriptor.get_transcript(video_id, languages=['en'])

print(transcript)

with transcriptor being the YouTubeTranscriptApi.

My output looks like this:

...{'text': 'They kill babies too young to be vaccinated.', 'start': 604.545, 'duration': 3.101}, {'text': 'They kill babies too young to be vaccinated.', 'start': 604.545, 'duration': 3.101}, {'text': 'They kill babies too young to be vaccinated.', 'start': 604.545, 'duration': 3.101}, {'text': 'They kill healthy children that are just unlucky.',
'start': 607.981, 'duration': 3.102}, {'text': 'They kill healthy children that are just unlucky.', 'start': 607.981, 'duration': 3.102}, {'text': 'They kill healthy children that are just unlucky.', 'start': 607.981, 'duration': 3.102}, {'text': 'They bring serious diseases back\nfrom the verge of extinction.', 'start': 611.718, 'duration': 3.836}, {'text': 'They bring serious diseases back\nfrom the verge of extinction.', 'start': 611.718, 'duration': 3.836}, {'text': 'They bring serious diseases back\nfrom the verge of extinction.', 'start': 611.718, 'duration': 3.836}, {'text': 'And, the biggest side effect\nof vaccines is fewer dead children.', 'start': 615.923, 'duration': 4.336}, {'text': 'And, the biggest side effect\nof vaccines is fewer dead children.', 'start': 615.923, 'duration': 4.336}, {'text': 'And, the biggest side effect\nof vaccines is fewer dead children.', 'start': 615.923, 'duration': 4.336}, {'text': 'Vaccines are one of\nthe most powerful tools we have', 'start': 620.761, 'duration': 3.403}, {'text': 'Vaccines are one of\nthe most powerful tools we have', 'start': 620.761, 'duration': 3.403}, {'text': 'Vaccines are one of\nthe most powerful tools we have', 'start': 620.761, 'duration': 3.403},...

Do you know of this issue? Thank you!
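
As a workaround sketch that does not explain why the duplicates appear, consecutive identical entries can be dropped before joining the text. Both helper names are hypothetical; the entry shape is the one shown in the output above.

```python
def drop_consecutive_duplicates(transcript):
    """Remove entries identical to the one immediately before them."""
    result = []
    for entry in transcript:
        if not result or entry != result[-1]:
            result.append(entry)
    return result


def full_text(transcript):
    """Join the text of all entries into one string."""
    return " ".join(entry["text"].replace("\n", " ") for entry in transcript)
```

Only adjacent repeats are removed, so a line that legitimately occurs twice at different timestamps is kept.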

API stopped working?

The API stopped working for me today. For all videos, I get the error:
Could not get the transcript for the video https://www.youtube.com/watch?v=AfsnHVaScjg! Most likely subtitles have been disabled by the uploader or the video is no longer available.

I hope it's not YT trying to seal the access -_-

Transcription works on cli but it does not in code

Hi, I'm facing this exception when I try to get the transcript of a video:

Could not get the transcript for the video https://www.youtube.com/watch?v=1zCs7zYZK4g! Most likely subtitles have been disabled by the uploader or the video is no longer available.

But if I use the CLI version, it works perfectly.

Any ideas on what I can do?

My code:

from youtube_transcript_api import YouTubeTranscriptApi
x = YouTubeTranscriptApi.get_transcript(["1zCs7zYZK4g"], languages=["en"])
print(x)

Different video source

Hi,

Great work!
I was wondering if it also works on personal videos stored on my local machine, and not only on YouTube ones.

A little inaccuracy in the documentation

There is a line in the docs:

transcript_list = YouTubeTranscriptApi.list_transcripts(video_id, languages=['de', 'en'])

but the list_transcripts method has no languages parameter.

is there any way to probe/list available transcripts?

For a given video, could there be a way to list all the available transcripts (short of manually trying a series of language codes and seeing which ones fail)? Either to fetch them all at once or as a way to find out what can be fetched.

Some videos have transcripts in 20+ languages (say this one https://www.youtube.com/watch?v=GAgp7nXdkLU), or just have multiple English transcripts (auto/manually generated), and it'd be useful to collect and compare them.

Thanks for making this tool

VideoUnavailable problem for available video

I use:

from youtube_transcript_api import YouTubeTranscriptApi
YouTubeTranscriptApi.get_transcript(video_id)

Get this error:

VideoUnavailable:
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=ppWPuXsnf1Q! This is most likely caused by:

The video is no longer available

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

pip install youtube_transcript_api throws error

Collecting youtube_transcript_api
Downloading https://files.pythonhosted.org/packages/27/92/17ce1a35de1f3cf91e206869300c63f32a8c9042c468c7b70f456acbb8af/youtube_transcript_api-0.1.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "/private/tmp/pip-install-0nDnr_/youtube-transcript-api/setup.py", line 32, in
install_requires=get_requirements(),
File "/private/tmp/pip-install-0nDnr_/youtube-transcript-api/setup.py", line 12, in get_requirements
return list(filter(lambda line: line != '' and not line.startswith('#'), _get_file_content('requirements.txt').split('\n')))
File "/private/tmp/pip-install-0nDnr_/youtube-transcript-api/setup.py", line 5, in _get_file_content
with open(file_name, 'r') as file_handler:
IOError: [Errno 2] No such file or directory: 'requirements.txt'

Could not get available persian subtitle

I'm simply running the following command in the CLI:
youtube_transcript_api LKvjIsyYng8 --languages fa
However, I can't get the language I want, even though it is available in the subtitles.

No longer getting transcripts

Hello,

I was able to get transcripts for videos about a month ago, but it seems that all attempts now return an error that the transcripts aren't available. I have tried multiple videos. Thanks

Can not query multiple videos ( returns always false)

Hello, for a single video ID I am able to get the subtitles. However, if that same video ID is part of a list of video IDs and I run:
jsonobject=(YouTubeTranscriptApi.get_transcript(id_list[i]))
then, even though that particular video ID is in the list, it returns false for all the video IDs, as though none of them had subtitles.

Script is broken

Hi, I just tried to reinstall everything and it seems like there is something wrong with the latest version:
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/youtube_transcript_api/__init__.py", line 1, in <module>
from ._api import YouTubeTranscriptApi
File "/Library/Python/2.7/site-packages/youtube_transcript_api/_api.py", line 4, in <module>
CookieLoadError = (FileNotFoundError, cookiejar.LoadError)

I can't use it from the command line or in my Python script.
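
The traceback is cut off before the exception line, but it stops at a reference to FileNotFoundError, a built-in that exists only in Python 3 (since 3.3), while the paths point at Python 2.7. Assuming that is the cause, a compatibility shim along these lines is one way such code could be made to import on both versions. It is a sketch, not the fix the project necessarily adopted.

```python
try:
    FileNotFoundError  # built in since Python 3.3
except NameError:
    # Python 2 fallback: opening a missing file raises IOError there.
    FileNotFoundError = IOError
```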

Preferred language support

Hi, I noticed that someone opened up another post with regards to restricting the language to be English. I am also looking for this functionality. The video linked by Sundaresh, GJLlxj_dtq8 has transcripts in many languages, but it seems that the script pulls the text for the language that is listed first alphabetically, in this case it would be in Chinese. I am wondering if the API has changed since 6 months ago?

So I think it would help me and many other users if you could look into adding functionality where an array of preferred languages can be passed and the transcript is retrieved based on whether it is available in those languages, e.g. [english, korean, russian] --> not found in english, found in korean ==> korean output. If none of the preferred languages are found, output text for any language, otherwise throw the 'no transcript error'.

Thanks for the great repo!

Transcripts not available

I used this library yesterday and was able to access a multitude of transcripts. However, when I try to use the library today, it says that it could not retrieve the transcript, most likely because "the video is no longer available", even though it is. Did something change or am I doing something wrong?

Transcript retrieval failed.

Hi,
I just want to confirm with you whether the library has stopped working due to changes on YouTube.
Thanks.

All the videos (including those that I was able to retrieve successfully in the past) now fail with a message similar to the following:

ERR: Could not get the transcript for the video https://www.youtube.com/watch?v=AeJ9q45PfD0! This usually happens if one of the following things is the case:

  • subtitles have been disabled by the uploader
  • none of the language codes you provided are valid
  • none of the languages you provided are supported by the video
  • the video is no longer available.

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues

Requests limits?

Does this method have any limitations? I made a loop and after 250 requests it returns:

VideoUnavailable:
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=xxxxx! This is most likely caused by:

The video is no longer available

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

Google Colab issue

Hi jdepoix!

Thanks for a fantastic API. I've found it very useful.

I have noticed a strange bug though (and it may not be related to your code). I cannot get this API to work on Google Colab. It works fine in a local environment for this video:

https://www.youtube.com/watch?v=_S2lYXf-uu0 (video ID: _S2lYXf-uu0)

But the same video returns a VideoUnavailable error when run on Google Colab. Both environments are running youtube-transcript-api 0.3.1.

Do you have any idea why this might be? I am happy for this to be closed if you are aware of this as a separate issue unrelated to your codebase, think it is unrelated, or can't help.

Thanks again
Ben

Inconsistent get_transcript results

When I repeatedly call "get_transcript" with the same video id, sometimes I get back the transcript, while other times I get back a "VideoUnavailable" error. As far as I can tell, this can happen with any youtube video. I am using version 0.3.1.

subtitles file saving

I think it would be great if we had a feature that saves the subtitles as .srt, .vtt or another subtitle format. I think I can implement this feature; I have already done some work on it and believe I can get it to work. Should I go ahead and open a pull request? I don't see anything about contributing in the documentation, so I thought I'd ask in a separate issue.
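
A minimal sketch of what an SRT export could look like, assuming transcript entries shaped like {'text', 'start', 'duration'} with times in seconds. The function names are hypothetical and this is not the feature as eventually implemented.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, millis)


def to_srt(transcript):
    """Render a list of {'text', 'start', 'duration'} dicts as SRT text."""
    blocks = []
    for index, entry in enumerate(transcript, start=1):
        start = format_timestamp(entry["start"])
        end = format_timestamp(entry["start"] + entry["duration"])
        blocks.append("%d\n%s --> %s\n%s\n" % (index, start, end, entry["text"]))
    return "\n".join(blocks)
```

Each cue is a counter line, a time range line and the text, with a blank line between cues.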

Error downloading the transcript

I'm getting an error saying that the video or the subtitles might be no longer available but this isn't true.

I'm simply passing a .txt file with the IDs of the videos I want transcripts for.

Here's the error I'm getting,

C:\youtube_transcripts_mp3>python get_transcript.py id.txt
Traceback (most recent call last):
File "C:\youtube_transcripts_mp3\youtube_downloads\lib\site-packages\youtube_transcript_api\_api.py", line 93, in get_transcript
return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
File "C:\youtube_transcripts_mp3\youtube_downloads\lib\site-packages\youtube_transcript_api\_api.py", line 153, in parse
for xml_element in ElementTree.fromstring(self.plain_data)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\xml\etree\ElementTree.py", line 1314, in XML
parser.feed(text)
TypeError: a bytes-like object is required, not 'NoneType'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "get_transcript.py", line 15, in
Transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['pt'])
File "C:\youtube_transcripts_mp3\youtube_downloads\lib\site-packages\youtube_transcript_api\_api.py", line 95, in get_transcript
raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)
youtube_transcript_api._api.CouldNotRetrieveTranscript: Could not get the transcript for the video https://www.youtube.com/watch?v=MzRoOMF8XEQ
! This usually happens if one of the following things is the case:

  • subtitles have been disabled by the uploader
  • none of the language codes you provided are valid
  • none of the languages you provided are supported by the video
  • the video is no longer available.

Here are the video IDs,

MzRoOMF8XEQ
J3Yo1Tz1YXg

You can check yourself that both videos are still on youtube and they have auto generated subtitles available.

What should I do?
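
In the error output above, the URL is followed by a line break before the "!", which may indicate that the video ID read from the file still carries its trailing newline. This is an assumption, since get_transcript.py itself is not shown. A sketch of reading IDs with whitespace stripped, using a hypothetical helper name:

```python
def read_video_ids(path):
    """Read one video ID per line, dropping surrounding whitespace and blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```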

Inaccurate Caption Availability

The project has a bug where, depending on the captions available for a given video, results may not be accurate. YouTube seems to return the list of available captions in alphabetical order (for example, English (en) links are ordered after Dutch (nl) but before German (de)). This means that when a language happens to be listed before the user's desired language and is of a different "kind" (i.e. asr), the project returns poor results.

Example: if "z3Mvd1VEFyw" is the desired video id and languages=['en'], the api will return the CouldNotRetrieveTranscript error despite there being English captions available (just asr).

I am interested in opening a PR to fix this issue and helping maintain the project in general. Let me know if this is ok.

The video is no longer available WITH ANY VIDEO

I had used youtube_transcript_api many times without any problem.
A couple of days ago I upgraded my Python to 3.8. The library still worked for a few hours, but after that it started to crash and now doesn't work with any video ID.

The error shows "The video is no longer available" for any available video.

youtube-transcript-api 0.3.1

File "", line 1, in
transcript_list = YouTubeTranscriptApi.list_transcripts(v_id)

File "C:\Users\SnooPI\Anaconda3\lib\site-packages\youtube_transcript_api\_api.py", line 70, in list_transcripts
return TranscriptListFetcher(http_client).fetch(video_id)

File "C:\Users\SnooPI\Anaconda3\lib\site-packages\youtube_transcript_api\_transcripts.py", line 34, in fetch
self._extract_captions_json(self._fetch_html(video_id), video_id)

File "C:\Users\SnooPI\Anaconda3\lib\site-packages\youtube_transcript_api\_transcripts.py", line 42, in _extract_captions_json
raise VideoUnavailable(video_id)

Subtitles including a u' or u" in the response

Hey,

First, I wanted to thank you for this library. It's awesome.

I'm seeing that the library (at least the CLI version) returns a u' or u" before the text on many rows.
Example:

youtube_transcript_api 73nZwUENB9E --languages fr --exclude-generated

It returns:

[[{'duration': 1.36,
   'start': 10.12,
   'text': u'(Une femme fredonne\n"Au clair de la lune")'},
  {'duration': 1.6, 'start': 11.8, 'text': '-La, la, la, la, la, la...'},
  {'duration': 3.24,
   'start': 17.04,
   'text': '-Gontran avait toujours entendu\nde la musique par hasard.'},
  {'duration': 1.64, 'start': 20.6, 'text': 'Au parc,'},
  {'duration': 1.84, 'start': 24.88, 'text': 'dans la cuisine,'},
  {'duration': 1.36, 'start': 27.8, 'text': 'en voyage.'},
  {'duration': 2.2, 'start': 33.44, 'text': 'La plupart du temps,'},
  {'duration': 2.6,
   'start': 35.96,
   'text': u"il s'en fichait compl\xe8tement."},
  {'duration': 2.2, 'start': 39.16, 'text': u"Jusqu'au jour o\xf9..."},
  {'duration': 3.12,
   'start': 44.92,
   'text': u'Jamais la musique\nne lui avait fait \xe7a.'},
  {'duration': 2.64,
   'start': 51.4,
   'text': u"Gontran \xe9tait proche\nde l'extase quand..."},
  {'duration': 1.96,
   'start': 54.36,
   'text': u'-Le propri\xe9taire du bateau\ngonflable jaune'},
  {'duration': 2.84,
   'start': 56.64,
   'text': u'appel\xe9 "Mimi la truite"\nstationnant devant le magasin'},
  {'duration': 4.439,
   'start': 59.8,
   'text': u'est pri\xe9 de venir le r\xe9cup\xe9rer\nau plus vite. Merci.'},
  {'duration': 1.92,
   'start': 71.0,
   'text': u"-Plusieurs fois,\nil fut tout proche d'entendre"},
  {'duration': 3.0,
   'start': 73.24,
   'text': u'\xe0 nouveau son morceau pr\xe9f\xe9r\xe9.'},
  {'duration': 1.2, 'start': 81.84, 'text': u'Sans succ\xe8s.'},
  {'duration': 3.08,
   'start': 88.08,
   'text': u'Et puis un jour,\nil eut une r\xe9v\xe9lation.'},
  {'duration': 1.8, 'start': 92.72, 'text': 'Oui, il allait agir.'},
  {'duration': 0.68, 'start': 96.12, 'text': 'Il connaissait'},
  {'duration': 3.88,
   'start': 97.12,
   'text': u'le nom du groupe : M\xe9tal M\xe9tal.\nEt le nom du morceau :'},
  {'duration': 2.68,
   'start': 101.24,
   'text': u'"Gronk". Plus rien\nne pouvait l\'arr\xeater.'},
  {'duration': 1.56, 'start': 104.68, 'text': 'En quelques secondes,'},
  {'duration': 2.64,
   'start': 106.56,
   'text': u'il retrouva son morceau pr\xe9f\xe9r\xe9\nsur un site de streaming.'},
  {'duration': 3.24,
   'start': 109.52,
   'text': u"Le morceau \xe9tait l\xe0 ! Il pouvait\nenfin l'\xe9couter jusqu'au bout"},
  {'duration': 3.92,
   'start': 113.08,
   'text': u"autant de fois qu'il le voulait.\nEt quand il le voulait."},
  {'duration': 1.12, 'start': 120.68, 'text': 'Le lendemain,'},
  {'duration': 2.28,
   'start': 122.56,
   'text': u"Gontran acheta\nl'album du m\xeame groupe"},
  {'duration': 3.36,
   'start': 125.16,
   'text': 'chez le disquaire\net une place pour leur concert.'},
  {'duration': 2.641,
   'start': 128.919,
   'text': u"Maintenant qu'il pouvait retrouver\nfacilement sa musique pr\xe9f\xe9r\xe9e,"},
  {'duration': 3.24,
   'start': 131.88,
   'text': u'il allait d\xe9cider,\npour lui et ses oreilles.'}]]

Also, the returned text is not valid JSON (it uses single quotes). In both cases I'm using some parsing to "fix" the text before storing it; I just wanted you to be aware of this.

Cheers and again, thanks for the library.
Mikel
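
For context on the two symptoms above: a u'...' prefix is how Python 2 displays the repr of a unicode string, and the single quotes come from printing a Python list rather than serializing it. A minimal sketch, assuming the transcript is a list of dicts shaped like the output above (the two sample rows are copied from it, and `to_json` is a hypothetical helper, not part of the library), of producing valid JSON with the standard json module:

```python
import json

# Two rows copied from the output above, written as Python literals.
transcript = [
    {"duration": 2.6, "start": 35.96, "text": "il s'en fichait compl\xe8tement."},
    {"duration": 2.2, "start": 39.16, "text": "Jusqu'au jour o\xf9..."},
]


def to_json(rows):
    """Serialize transcript rows to valid JSON (double-quoted keys and strings)."""
    # ensure_ascii=False keeps accented characters readable instead of \uXXXX escapes.
    return json.dumps(rows, ensure_ascii=False, indent=2)


serialized = to_json(transcript)
# Round-trip to confirm the output really is parseable JSON.
restored = json.loads(serialized)
```

Serializing this way avoids any post-hoc "fixing" of quotes, since json.dumps never emits the Python repr.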

Mismatch in Start Time and Duration

Hi,

I extracted the subtitles of a video (video ID: Li9R5RI5kdQ). For some phrases, start time + duration is greater than the start time of the next phrase.

Is that a bug, or am I missing something?

Thanks in advance!
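
The overlap described above can be measured directly. A minimal sketch, assuming the transcript is a list of dicts with `start` and `duration` keys as in the library's output; `find_overlaps` is a hypothetical helper and the sample rows are made up for illustration, not taken from the video in question:

```python
def find_overlaps(rows):
    """Return (index, overlap_seconds) for each row that ends after the next row starts."""
    overlaps = []
    for i in range(len(rows) - 1):
        end = rows[i]["start"] + rows[i]["duration"]
        next_start = rows[i + 1]["start"]
        if end > next_start:
            overlaps.append((i, round(end - next_start, 3)))
    return overlaps


# Made-up rows: the first entry ends at 3.5 s, but the second starts at 2.0 s.
sample = [
    {"text": "first", "start": 0.0, "duration": 3.5},
    {"text": "second", "start": 2.0, "duration": 2.0},
    {"text": "third", "start": 4.0, "duration": 1.0},
]
overlaps = find_overlaps(sample)
```

Running this over a real transcript shows which rows overlap and by how much, which helps tell an occasional glitch from a consistent pattern.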

video no longer found

The API keeps throwing exceptions saying that the video is no longer available, although I checked with this video, for example, "SPuS9UJF1lo", and it was available and had transcripts.

Failing to get transcription for videos that used to work

So as the title says, the module is failing when trying to get the transcript for programs that used to work completely fine before.

Any help would be appreciated.

Edit: I am running Python 3.7.3 on Mac OSX Mojave 10.14.6 but I am also running as a Cloud Function (worked fine until a few days ago).

Here is an example with one such video:

>>> transcript = YouTubeTranscriptApi.get_transcript(video_id)
Traceback (most recent call last):
  File "/anaconda3/envs/viziotagnlpenv1/lib/python3.7/site-packages/youtube_transcript_api/_api.py", line 93, in get_transcript
    return _TranscriptParser(_TranscriptFetcher(video_id, languages, proxies).fetch()).parse()
  File "/anaconda3/envs/viziotagnlpenv1/lib/python3.7/site-packages/youtube_transcript_api/_api.py", line 160, in parse
    for xml_element in ElementTree.fromstring(self.plain_data)
  File "/anaconda3/envs/viziotagnlpenv1/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 972

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/viziotagnlpenv1/lib/python3.7/site-packages/youtube_transcript_api/_api.py", line 95, in get_transcript
    raise YouTubeTranscriptApi.CouldNotRetrieveTranscript(video_id)
youtube_transcript_api._api.CouldNotRetrieveTranscript: Could not get the transcript for the video https://www.youtube.com/watch?v=wFdSFVzWeKk! This usually happens if one of the following things is the case:
 - subtitles have been disabled by the uploader
 - none of the language codes you provided are valid
 - none of the languages you provided are supported by the video
 - the video is no longer available.

If none of these things is the case, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues


CLI tool is referred to but is missing

Hello!
The docs refer to a CLI tool, but the tool itself is not provided. For example, after pip install youtube_transcript_api, the directory contents are:

coolcold@LAZ-VN-L-W-2202:~$ tree ./.asdf/installs/python/3.8.1/lib/python3.8/site-packages/youtube_transcript_api
./.asdf/installs/python/3.8.1/lib/python3.8/site-packages/youtube_transcript_api
├── __init__.py
├── __main__.py
├── __pycache__
│   ├── __init__.cpython-38.pyc
│   ├── __main__.cpython-38.pyc
│   ├── _api.cpython-38.pyc
│   ├── _cli.cpython-38.pyc
│   ├── _errors.cpython-38.pyc
│   ├── _html_unescaping.cpython-38.pyc
│   ├── _settings.cpython-38.pyc
│   └── _transcripts.cpython-38.pyc
├── _api.py
├── _cli.py
├── _errors.py
├── _html_unescaping.py
├── _settings.py
├── _transcripts.py
└── test
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-38.pyc
    │   ├── test_api.cpython-38.pyc
    │   └── test_cli.cpython-38.pyc
    ├── assets
    │   ├── __init__.py
    │   └── __pycache__
    │       └── __init__.cpython-38.pyc
    ├── test_api.py
    └── test_cli.py

5 directories, 24 files
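
The listing above does include `__main__.py` and `_cli.py`, and a package that ships a `__main__.py` can generally be run with `python -m <package>`. A minimal sketch for checking this from Python using only the standard library; `has_main_module` is a hypothetical helper, and whether `python -m youtube_transcript_api` behaves exactly like the documented CLI command is an assumption:

```python
import importlib.util


def has_main_module(package_name):
    """Return True if the package has a __main__ module that `python -m` could run."""
    try:
        return importlib.util.find_spec(package_name + ".__main__") is not None
    except ModuleNotFoundError:
        # The package itself is not installed.
        return False


# Standard-library and dummy names used purely for illustration.
unittest_runnable = has_main_module("unittest")  # ships unittest/__main__.py
missing_runnable = has_main_module("no_such_package_xyz")
```

Given the listing above, `has_main_module("youtube_transcript_api")` would be expected to return True in that environment.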
