Mirror of https://github.com/openai/whisper.git, synced 2025-03-30 14:28:27 +00:00

evaluated transcripts with base model

parent 67e68af114
commit 36e49d920f
50 main.py
@@ -1,5 +1,51 @@
import os

import whisper

# Baseline ("before") model used to produce the reference-point transcripts.
model = whisper.load_model("tiny.en")

TEST_DATA_BASE = "test_data/"
FIVE_SEC_BASE = os.path.join(TEST_DATA_BASE, "5s/")
THIRTY_SEC_BASE = os.path.join(TEST_DATA_BASE, "30s/")
TRANSCRIPTS_BASE = "test_transcripts_before/"


def transcribe_baseline(file_name):
    """Transcribe a single audio clip with the baseline model and return its text."""
    return model.transcribe(file_name).get('text', '')


def get_all_files(base_path, count=1000):
    """Return the expected clip paths out000.wav .. out{count-1}.wav under base_path."""
    return [os.path.join(base_path, f"out{i:03d}.wav") for i in range(count)]


def write_to_file(file_name, text):
    """Write text to file_name, creating the output directory if needed."""
    os.makedirs(os.path.dirname(file_name), exist_ok=True)
    with open(file_name, "w") as f:
        f.write(text)


def calculate_wer(hypothesis, actual):
    """Word error rate: word-level edit distance (substitutions, insertions,
    deletions) divided by the number of reference words."""
    hyp_words = hypothesis.strip().lower().split()
    act_words = actual.strip().lower().split()

    # dp[i][j] = edit distance between the first i reference words and the
    # first j hypothesis words.
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(act_words) + 1)]

    for i in range(len(act_words) + 1):
        dp[i][0] = i
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j

    for i in range(1, len(act_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if act_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + 1)

    total_words = len(act_words)
    if total_words:
        return dp[total_words][len(hyp_words)] / total_words
    # Empty reference: any hypothesis words are pure insertions.
    return float("inf") if hyp_words else 0.0


def process_files(files, output_base):
    """Transcribe each audio file and save the hypothesis as <sample>.txt under output_base."""
    for file_name in files:
        hypothesis = transcribe_baseline(file_name)
        sample_name = os.path.splitext(os.path.basename(file_name))[0]
        write_to_file(os.path.join(output_base, f"{sample_name}.txt"), hypothesis)


if __name__ == "__main__":
    # process_files(get_all_files(FIVE_SEC_BASE), os.path.join(TRANSCRIPTS_BASE, "5s"))
    process_files(get_all_files(THIRTY_SEC_BASE), os.path.join(TRANSCRIPTS_BASE, "30s"))
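A minimal sketch (not part of this commit) of how calculate_wer and TRANSCRIPTS_BASE above might be used to score the saved baseline transcripts once reference texts are available; the test_transcripts_actual/30s directory name and the mean-WER summary are assumptions for illustration only.

# Hypothetical scoring pass -- assumes ground-truth transcripts exist under
# test_transcripts_actual/30s with the same out###.txt names (not part of this commit).
import os

from main import TRANSCRIPTS_BASE, calculate_wer  # importing main also loads the whisper model

REFERENCE_BASE = "test_transcripts_actual/30s"          # assumed ground-truth location
BASELINE_BASE = os.path.join(TRANSCRIPTS_BASE, "30s")   # output written by process_files


def read_text(path):
    with open(path) as f:
        return f.read()


scores = []
for name in sorted(os.listdir(BASELINE_BASE)):
    hypothesis = read_text(os.path.join(BASELINE_BASE, name))
    reference = read_text(os.path.join(REFERENCE_BASE, name))
    scores.append(calculate_wer(hypothesis, reference))

if scores:
    print(f"mean WER over {len(scores)} clips: {sum(scores) / len(scores):.3f}")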
1 test_transcripts_before/30s/out000.txt Normal file
@@ -0,0 +1 @@
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. So welcome to 6172. My name is Charles Lyerson, and I am
1 test_transcripts_before/30s/out001.txt Normal file
@@ -0,0 +1 @@
One of the two lecturers this term, the other is Professor Julian Shun. We're both in EECS and in C-Sale on the seventh floor of the Gates Building. If you don't know it, you are in performance engineering of software systems. So if this is the wrong, if you found yourself in the wrong place, now's the time to exit. I want to start today by
1 test_transcripts_before/30s/out002.txt Normal file
@@ -0,0 +1 @@
talking a little bit about why we do performance engineering. And then I'll do a little bit of administration, and then sort of dive into sort of a case study that'll give you a good sense of some of the things that we're gonna do during the term. I put the administration in the middle, because it's like if you don't, from me telling you about the course, you don't wanna do the course, then it's like why should you listen to the administration, right?
1 test_transcripts_before/30s/out003.txt Normal file
@@ -0,0 +1 @@
It's like, so let's just dive right in, okay? So the first thing to always understand whenever you're doing something is a perspective on what matters and what you're doing. So we're going to study the whole term we're going to do software performance engineering. And so this is kind of interesting because it turns out that performance is usually not at the top of what people are interested in when they're building.
1 test_transcripts_before/30s/out004.txt Normal file
@@ -0,0 +1 @@
software. What are some of the things that are more important than software? That's probably then performance. Yeah. Deadlines. Deadlines. Good. Cost. Correctness. Extensibility. Yeah, I'm going to go on and on. I think that you folks could probably make a pretty long list. I made a short list of all the kinds of things that are more important than performance.
1 test_transcripts_before/30s/out005.txt Normal file
@@ -0,0 +1 @@
So then, if programmers are so willing to sacrifice performance for these properties, why do we study performance? Okay, so this is kind of a bit of a paradox and a bit of a puzzle. Why do you study something that clearly isn't at the top of the list of what most people care about when they're developing software? I think the answer to that is that...
1 test_transcripts_before/30s/out006.txt Normal file
@@ -0,0 +1 @@
performance is the currency of computing. You use performance to buy these other properties. So you'll say something like, gee, I want to make it easy to program, and so therefore I'm willing to sacrifice some performance to make something easy to program. I'm willing to sacrifice some performance to make sure that my system is secure. And all those things come out of your performance budget. And clearly if performance.
1 test_transcripts_before/30s/out007.txt Normal file
@@ -0,0 +1 @@
to grades too far, your stuff becomes unusable. When I talk with people with programmers and I say, people are fond of saying, oh, performance, you do performance, performance doesn't matter, I never think about it. Then I talk with people who use computers and I ask, what's your main complaint about the computing systems you use? Answer too slow. Okay, so.
1 test_transcripts_before/30s/out008.txt Normal file
@@ -0,0 +1 @@
So it's interesting whether you're the producer or whatever. But the real answer is that performance is like currency. It's something you spend. I would rather have, if you look, would I rather have $100 or a gallon of water? Well, water is indispensable to life. There are circumstances certainly where I would prefer to have the water, okay, than $100. But in our modern society.
1 test_transcripts_before/30s/out009.txt Normal file
@@ -0,0 +1 @@
I can buy water for much less than $100. Okay, so even though water is essential to life and far more important than money, money is a currency. And so I prefer to have the money because I can just buy the things I need. And that's the same kind of analogy of performance. It has no intrinsic value, but it contributes to things. You can use it to buy things that you can.
1 test_transcripts_before/30s/out010.txt Normal file
@@ -0,0 +1 @@
care about, like usability or testability or what have you. Now, in the early days of computing, software performance engineering was common because machine resources were limited. If you look at these machines from 1964 to 1977, look at how many bytes they have on them. In 1964, there is a computer with 524 kilobytes.
1 test_transcripts_before/30s/out011.txt Normal file
@@ -0,0 +1 @@
Okay? That was a big machine. Back then. That's kilobytes. That's not megabytes. That's not gigabytes. That's kilobytes. Okay? And many programs would strain the machine resources. Okay? The clock rate for that machine was 33 kilohertz. What's a typical clock rate today? About what? Four gigahertz, three gigahertz, two gigahertz somewhere up there.
1 test_transcripts_before/30s/out012.txt Normal file
@@ -0,0 +1 @@
Yeah, somewhere in that range, okay? And here they were operating with kilohertz. So many programs would not fit without intense performance engineering. And one of the things also that there's a lot of, a lot of sayings that came out of that area. Donald Knuth, who's one of the Turing Award winner, absolutely fabulous computer scientists in all respects, wrote premature optimization as the root of all evil.
1 test_transcripts_before/30s/out013.txt Normal file
@@ -0,0 +1 @@
And I invite you, by the way, to look that quote up and because this is actually taken out of context. So trying to optimize stuff too early, he was worried about. Bill Wolf, who built the design, the bliss language, and worked on the PDP-11 and such, said, more computing sins are committed in the name of efficiency without necessarily achieving it than for any other single reason, including blind stupidity. And Michael Jackson.
1 test_transcripts_before/30s/out014.txt Normal file
@@ -0,0 +1 @@
said the first rule of program optimization, don't do it. Second rule of program optimization for experts only. Don't do it yet. So everybody warning away, because when you start trying to make things fast, your code becomes unreadable. Making code that is readable and fast. Now that's where the art is. And hopefully we'll learn a little bit about doing that. OK. And indeed, there was no real point in.
1 test_transcripts_before/30s/out015.txt Normal file
@@ -0,0 +1 @@
in working too hard on performance engineering for many years. If you look at technology scaling and you look at how many transistors are on various processor designs up until about 2004, we had Moore's Law in full throttle, with chip densities doubling every two years and really quite amazing.
1 test_transcripts_before/30s/out016.txt Normal file
@@ -0,0 +1 @@
And along with that, as they shrunk the dimensions of chips, because by miniaturization, the clock speed would go up correspondingly as well. And so if you found something was too slow, wait a couple of years, okay? Wait a couple of years, it'll be faster. And so, if you're gonna do something with software and make your software ugly, that really wasn't a real...
1 test_transcripts_before/30s/out017.txt Normal file
@@ -0,0 +1 @@
you know, wasn't a real good payoff compared to just simply waiting around. And in that era, there was something called Denard Scaling, where which allowed things to, as things shrunk, allowed the clock speeds to get larger, basically by reducing power. You could reach...
1 test_transcripts_before/30s/out018.txt Normal file
@@ -0,0 +1 @@
reduce power and still keep everything fast. And we'll talk about that in a minute. So if you look at what happened to, from 1977 to 2004, here are Apple computers with similar price tags. And you can see the clock rate really just skyrocketed. One megahertz, 400 megahertz, 1.8 gigahertz. And the data pass went from 8 bits to 30 to 6.
1 test_transcripts_before/30s/out019.txt Normal file
@@ -0,0 +1 @@
64, the memory correspondingly grow, cost approximately the same. And that was, that's the legacy from Moore's Law and the tremendous advances in semiconductor technology. And so until 2004, Moore's Law and the scaling of clock frequency, so-called denard scaling, was essentially a printing press for the currency of performance. Okay, you didn't have to do anything. You just made the hardware go faster. Very, very, uh, and.
1 test_transcripts_before/30s/out020.txt Normal file
@@ -0,0 +1 @@
And all that came to an end, well, some of it came to an end in 2004, when clock speeds plateaued. OK, so if you look at this around 2005, you can see all the speeds we hit, you know, two to four gigahertz, and we have not been able to make chips go faster than that in any practical way since then. But the densities have kept going great. Now the reason that the clock speed flat.
1 test_transcripts_before/30s/out021.txt Normal file
@@ -0,0 +1 @@
was because of power density. And this is a slide from Intel from that era, looking at the growth of power density. And what they were projecting was that the junction temperatures of the transistors on the chip, if they just keep scaling the way they had been scaling, would start to approach, first of all, the temperature of a nuclear reactor, then the temperature of a rocket nozzle, and then the sun's surface.
1 test_transcripts_before/30s/out022.txt Normal file
@@ -0,0 +1 @@
Okay? So we're not going to build little technology that cools that very well. And even if you could solve it for a little bit, the writing was in the wall. We cannot scale clock frequencies anymore. The reason for that is that originally, clock frequency was scaled assuming that the, most of the power was dynamic power, which was going when you switched the circuit. And what happened as we kept reducing that and reducing that is something that used to be in the noise, namely the leakage currents. Okay? Start.
1 test_transcripts_before/30s/out023.txt Normal file
@@ -0,0 +1 @@
to become significant. To the point where now today the dynamic power is far less of a concern than the static power from just the circuit sitting there leaking. And when you miniaturize, you can't stop that effect from happening. So what do the vendors do in 2004 and 2005 and since? They said, oh gosh, we've got all these transistors.
1 test_transcripts_before/30s/out024.txt Normal file
@@ -0,0 +1 @@
to use, but we can't use the transistors to make stuff run faster. So what they did is they introduced parallelism in the form of multi-core processors. They put more than one processing core in a chip. And to scale performance, they would have multiple cores. And each generation of Moore's law now was potentially doubling the number of cores. So if you look at what happened,
1 test_transcripts_before/30s/out025.txt Normal file
@@ -0,0 +1 @@
For a processor call, as you see, that around 2005, 2004, 2005, we started to get multiple processing cores per chip. To the extent that today, it's basically impossible to find a single core chip for a laptop or a workstation or whatever. Everything is multi-core. You can't buy just one. You have to buy a parallel processor. And so
1 test_transcripts_before/30s/out026.txt Normal file
@@ -0,0 +1 @@
The impact of that was that performance was no longer free. You couldn't just speed up the hardware. Now if you wanted to use that potential, you had to do parallel programming. And that's not something that anybody in the industry really had done. So today, there are a lot of other things that happened in that intervening time. We got vector units as common parts of our machines. We got GPUs. We got steeper cache hierarchies. We have a computer.
1 test_transcripts_before/30s/out027.txt Normal file
@@ -0,0 +1 @@
your logic on some machines and so forth. And now it's up to the software to adapt to it. And so although we don't want to have to deal with performance, today you have to deal with performance and in your lifetimes you will have to deal with performance, okay, in software if you're going to have effective software. Okay. You can see what happened also. So this is a study that we did looking at software bugs in a variety of...
1 test_transcripts_before/30s/out028.txt Normal file
@@ -0,0 +1 @@
of open source projects where they're mentioning the word performance. And you can see that in 2004 the numbers start going up. Some of them it's not as convincing for some things as others. But generally there's a trend of after 2004 people started worrying more about performance. If you look at software developer jobs, as of early mid 2000s.
1 test_transcripts_before/30s/out029.txt Normal file
@@ -0,0 +1 @@
The 2000 OOs, I guess. Okay, you see once again, the mention of performance and jobs is going up. And anecdotally, I can tell you, I had one student who came to me after the spring after he'd taken 6172. And he said, I went and I applied for five jobs. And every job asked me, at every job interview, they asked me
1 test_transcripts_before/30s/out030.txt Normal file
@@ -0,0 +1 @@
I couldn't have answered if I hadn't taken 6172. And I got five offers. And when I compared those offers, they tended to be 20% to 30% larger than people are just web monkeys. So anyway, that's not to say that you should necessarily take this class. But I just want to point out that what we're going to learn is going to be interesting from a practical point of view, i.e. your future.
1 test_transcripts_before/30s/out031.txt Normal file
@@ -0,0 +1 @@
as well as theoretical points of view and technical points of view. So modern processors are really complicated, and the big question is how do we write software to use that modern hardware efficiently? I want to give you an example of performance engineering of a very well-studied problem, namely matrix multiplication, who has never seen this problem.
1 test_transcripts_before/30s/out032.txt Normal file
@@ -0,0 +1 @@
Okay, so we got some jokers in the class I can say. Okay, so this is, you know, it takes N cubed operations because you're basically computing N squared dot products. Okay, so essentially if you add up the total number of operations it's about 2 N cubed because there is essentially a multiply and an add for every pair of terms that need.
1 test_transcripts_before/30s/out033.txt Normal file
@@ -0,0 +1 @@
to be accumulated. So it's basically 2N cube. We're going to look at it assuming for simplicity that our N is an exact power of 2. Now, the machine that we're going to look at is going to be one of the ones that you'll have access to an AWS. It's a compute-optimized machine, which has a Haswell microarchitecture.
1 test_transcripts_before/30s/out034.txt Normal file
@@ -0,0 +1 @@
running at 2.9 gigahertz. There are two processor chips for each of these machines. And nine processing cores per chip, so a total of 18 cores. So that's the amount of parallel processing. It does two-way hyperthreading, which we're actually going to not deal a lot with. Hyperthreading gives you a little bit more performance, but it also makes it really hard to measure.
1 test_transcripts_before/30s/out035.txt Normal file
@@ -0,0 +1 @@
So generally we will turn off hyperthreading, but the performances that you get tends to be correlated with what you get when you are hyperthread. For floating point unit, it is capable of doing eight double precision operations, that 64-bit floating point operations, including a fused multiply add per core per cycle. So that is a vector unit. So basically each of these 18 cores can do eight-
1 test_transcripts_before/30s/out036.txt Normal file
@@ -0,0 +1 @@
double precision operations and so including a fused multiply add which is actually two operations. Okay, the way that they count these things. Okay, it has a cache line size of 64 bytes. The i cache is 32 kilobytes which is 8 ways set associative. We'll talk about some of these things. If you don't know all the terms it's okay, we're going to cover most of these terms later on. It's got a decash of the same size. It's got an L2 cache of...
1 test_transcripts_before/30s/out037.txt Normal file
@@ -0,0 +1 @@
256 kilobytes and it's got an L3 cache or what's sometimes called an LLC last level cache of 25 megabytes and then it's got 60 gigabytes of DRAM. So this is a honking big machine. This is like you can get things to sing on this. If you look at the peak performance, it's the clock speed times two processor chips, times nine.
1 test_transcripts_before/30s/out038.txt Normal file
@@ -0,0 +1 @@
processing cores per chip, each capable of, if you can use both the multiply and the add 16 floating point operations, and that goes out to just short of a tariff lops, 836 giga flops. So that's a lot of power. That's a lot of power. These are fun machines, actually. Especially when we get into things like the
1 test_transcripts_before/30s/out039.txt Normal file
@@ -0,0 +1 @@
the game playing AI and stuff that we do for the fourth project. You'll see they're really fun to have a lot of compute. Okay. Now, the base, here's the basic code. This is the full code for Python for doing matrix multiplication. Now, generally in Python, you wouldn't use this code because you just call a library subroutine that does matrix multiplication. But sometimes you have a problem, I'm going to illustrate with matrix multiplication, that sometimes you have a problem that is...
1 test_transcripts_before/30s/out040.txt Normal file
@@ -0,0 +1 @@
for what you have to write the code. And I want to give you an idea of what kind of performance you get out of Python. In addition, somebody has to write if there is a library routine, somebody had to write it. And that person was a performance engineer, because they wrote it to be as fast as possible. And so this will give you an idea of what you can do to make code run fast. So when you run this code, so you can see that the start time before the tripling nested loop.
1 test_transcripts_before/30s/out041.txt Normal file
@@ -0,0 +1 @@
right here, before the tripling nested loop, we take a time measurement and then we take another time measurement at the end and then we print the difference. And then that's just this classic tripling nested loop for matrix multiplication. And so when you run this, how long is this run for, you think? Any guesses? Let's see. Now about, let's do the.
1 test_transcripts_before/30s/out042.txt Normal file
@@ -0,0 +1 @@
that was runs for six microseconds, who thinks six microseconds. How about six milliseconds? How about six milliseconds? How about six seconds? How about six minutes? Okay, how about six hours? How about six days? Okay. This was really disappointing person
1 test_transcripts_before/30s/out043.txt Normal file
@@ -0,0 +1 @@
know what size it is, is 4,096 by 4,096 as it shows in the code. OK? So, and those of you didn't vote, can I wake up? Let's get active. This is active learning. Put yourself out there. OK, it doesn't matter whether you're right or wrong. There'll be a bunch of people who got the right answer, and there's no idea why. OK? So it turns out it takes about 21,000 seconds, which is about six hours. OK? Amazing.
1 test_transcripts_before/30s/out044.txt Normal file
@@ -0,0 +1 @@
Is this fast? Yeah, right, duh, right? Is this fast? No, you know, how do we tell whether this is fast or not? OK? What should we expect from our machine? So let's do a back of the envelope calculation of how many operations there are and how fast we ought to be able to do it. We just went through and said we're at all the parameters.
1 test_transcripts_before/30s/out045.txt Normal file
@@ -0,0 +1 @@
the machine. So there are two N cubed operations that need to be performed. We're not doing stress and saligar or anything like that. We're just doing straight tripling nested loop. So that's two to the 37 floating point operations. The running time is 21,000 seconds. So that says that we're getting about 6.25 megaflops out of our machine when we run that code.
1 test_transcripts_before/30s/out046.txt Normal file
@@ -0,0 +1 @@
Okay? Just by dividing it out. How many floating point operations per second do we get? Let me take the number of operations divided by the time. Okay? The peak, as you recall, was about 836 giga-flops. Okay? And we're getting 6.25 mega-flops. Okay? So we're getting about 0.0075% of peak.
1 test_transcripts_before/30s/out047.txt Normal file
@@ -0,0 +1 @@
Okay. This is not fast. Okay. This is not fast. So let's do something really simple. Let's code it in Java rather than Python. Okay. So we take just that loop. The code is almost the same. Okay. It's the tripling nested loop. We run it in Java. Okay. And the running time now it turns.
1 test_transcripts_before/30s/out048.txt Normal file
@@ -0,0 +1 @@
out is about just under 3,000 seconds, which is about 46 minutes. The same code, Python Java. We got almost a nine times speed up over just simply coding it in a different language. Well, let's try C. That's the language we're going to be using here. What happens when you code it in C?
1 test_transcripts_before/30s/out049.txt Normal file
@@ -0,0 +1 @@
It's exactly the same thing. OK. We're going to use the Clang LLVM 5.0 compiler. I believe we're using 6.0. This term is that right? Yeah. OK. I should have rerun these numbers for 6, but I didn't. So now it's basically 1,100 seconds, which is about 19 minutes. So we got then about its twice as fast as Java and about 18 times faster than Python. So here's where we stand so far.
1 test_transcripts_before/30s/out050.txt Normal file
@@ -0,0 +1 @@
Okay, we have the running time of these various things. Okay, and the relative speed up is how much faster it is than the previous row. And the absolute speed up is how it is compared to the first row. And now we're managing to get now 0.014% of peak. So we're still slow, but before we go and try to optimize it further, startingess for
1 test_transcripts_before/30s/out051.txt Normal file
@@ -0,0 +1 @@
Like, why is Python so slow and see so fast? Does anybody know? The two platforms in coverage are so it has two services in C program and that basically causes Python 5.3 structing, which takes up the source of the product. OK, that's kind of on the right track. Anybody else have any eyes? Articulate a little bit. Why Python is so slow? Yeah. You write like multiply and add those. Not the only instructions.
1 test_transcripts_before/30s/out052.txt Normal file
@@ -0,0 +1 @@
It's doing lots of code, you're like, doing Python objects like integers and the problem block. Yeah, yeah. OK, good. So the big reason why Python is slow and CSO fast is that Python is interpreted. And CS compile directly to machine code. And Java is somewhere in the middle, because Java is compiled to byte code, which is then interpreted, and then just in time compiled into machine.
1 test_transcripts_before/30s/out053.txt Normal file
@@ -0,0 +1 @@
Let me talk a little bit about these things. So interpreters, such as in Python or versatile, but slow. It's one of these things where they said we're going to take some of our performance and use it to make a more flexible, easier to program environment. The interpreter basically reads, interprets, and performs each program statement and then updates the machine state. So it's not just, it's actually going through an...
1 test_transcripts_before/30s/out054.txt Normal file
@@ -0,0 +1 @@
each time reading your code, figuring out what it does, and then implementing it. So there's like all this overhead compared to just doing its operations. So interpreters can easily support high level programming features and they can do things like dynamic code alteration and so forth at the cost of performance. So typically the cycle for an interpreter is you read the next statement, you interpret the statement, you then perform the statement, and then you up.
1 test_transcripts_before/30s/out055.txt Normal file
@@ -0,0 +1 @@
update the state of the machine, and then you would fetch the next instruction. And you're going through that each time, and that's done in software. Okay? When you have things compiled to machine code, it goes through a similar thing, but it's highly optimized just for the things that machines are done. Okay? And so when you compile, you're able to take advantage of the hardware interpreter of machine instructions, and that's much, much lower overhead than that.
1 test_transcripts_before/30s/out056.txt Normal file
@@ -0,0 +1 @@
big software overhead you get with Python. Now JIT is somewhere in the middle, what's used in Java. JIT compilers can recover some of the performance. In fact, it did a pretty good job in this case. The idea is when the code is first interpreted, it's executed, it's interpreted. And then the runtime sees system keeps track of how often the various pieces of code are executed. And whatever it finds that there's some piece of code that it's executing frequently, it then calls the compiler to compile that piece of code.
1 test_transcripts_before/30s/out057.txt Normal file
@@ -0,0 +1 @@
code. And then subsequent to that, it runs the compiled code. So try to get the big advantage of the performance by only compiling the things that are necessary for which it's actually going to pay off to invoke the compiler to do. So anyway, so that's the big difference with those kinds of things. One of the reasons we don't use Python in this class is
1 test_transcripts_before/30s/out058.txt Normal file
@@ -0,0 +1 @@
is because the performance model is hard to figure out. See, it's much closer to the metal, much closer to the silicon. And so it's much easier to figure out what's going on in that context. OK? But we will have a guest lecture that we're going to talk about performance in managed languages like Python. So it's not that we're going to ignore the topic, but we will.
1 test_transcripts_before/30s/out059.txt Normal file
@@ -0,0 +1 @@
learn how to do performance engineering a place where it's easier to do it. Okay. Now one of the things that good compiler will do is, you know, once you get to, let's say we have the C version, which is where we're going to move from this point because that's the fastest we got so far, is it turns out that you can change the order of loops in this program without affecting the correctness. Okay. So here we went, you know, for i for j for t-
1 test_transcripts_before/30s/out060.txt Normal file
@@ -0,0 +1 @@
OK, do the update. We could otherwise do. We could do for i for k for j, do the update. And it computes exactly the same thing. Or we could do for k for j for i, do the updates. So we can change the order without affecting the correctness. And so do you think the order of loops matters for performance?
1 test_transcripts_before/30s/out061.txt Normal file
@@ -0,0 +1 @@
And I believe this is like this leading question. Yeah, question. The 84 or the top? The localities. Yeah, so, and you're exactly right. Cash locality is what it is. So when we do that, we get the loop order affects the running time by a factor of 18. Whoa, just by switching the order. OK, what's going on there? OK, what's going on? So we're going to talk about this in more depth. I'm going to say.
1 test_transcripts_before/30s/out062.txt Normal file
@@ -0,0 +1 @@
It's going to fly through this because this is just showing you the kinds of considerations that you do. So, in hardware, there are each processor region writes main memory in contiguous blocks called cache lines. Previously, access cache lines are stored in a small memory called cache that sits near the processor. When it access, when the processor access is something, if it's in the cache, you get a hit, that's very cheap, okay, and fast. If you miss, you have to go out to either a D.
1 test_transcripts_before/30s/out063.txt Normal file
@@ -0,0 +1 @@
a deeper level cache or all the way out to main memory, that is much, much slower. And we'll talk about that kind of thing. So what happens in this matrix problem is the matrices are laid out in memory and row major order. That means you take, you know, you have a two-dimensional matrix, it's laid out in the linear order of the addresses of memory by essentially taking row one and then after row one two and after that stick row three and so forth and unfolding.
1 test_transcripts_before/30s/out064.txt Normal file
@@ -0,0 +1 @@
There's another order that things could have been laid out. In fact, they are in Fortran, which is called column major order. So it turns out C in Fortran operate in different orders. And it turns out it affects performance, which way it does it. So let's just take a look at the access pattern for order ijk. So what we're doing is once we figure out what i and what j is, we're going to go through and cycle through k. And as we cycle through k,
1 test_transcripts_before/30s/out065.txt Normal file
@@ -0,0 +1 @@
OK, Cij stays the same for everything. We get for that excellent spatial locality, because we're just accessing the same location. Every single time it's going to be in cache, it's always going to be there. It's going to be fast to access C. For A, what happens is we go through an linear order and we get good spatial locality, but for B, it's going through columns. And those points are distributed far away in memory. So the processor's going to be bringing in 64.
1 test_transcripts_before/30s/out066.txt Normal file
@@ -0,0 +1 @@
to operate on a particular datum. And then it's ignoring seven of the eight floating point words on that cache line and going to the next one. So it's wasting an awful lot. So this one has good spatial locality and that it's all adjacent. And you would use the cache lines effectively. This one you're going 4,096 elements apart. It's got poor spatial locality. And that's why.
1 test_transcripts_before/30s/out067.txt Normal file
@@ -0,0 +1 @@
And that's for this one. So then if we look at the different other ones, this one, the order IKJ, it turns out you get good spatial locality for both C and B and excellent for A. And if you look at even another one, you don't get nearly as good as the other one. So there's a whole range of things. This one you're doing optimally, badly, and both. And so you can just measure the different ones. And it turns out that.
1 test_transcripts_before/30s/out068.txt Normal file
@@ -0,0 +1 @@
that you can use a tool to figure this out. And the tool that we'll be using is called cache grind. And it's one of the valgrind suites of caches. And what it'll do is it'll tell you what the miss rates are for the various pieces of code. And you'll learn how to use that tool and figure out, oh, look at that. We have a high miss rate for some and not for others. So that may be why my code is running slowly. So when you pick the best one of the.
1 test_transcripts_before/30s/out069.txt Normal file
@@ -0,0 +1 @@
Okay, we then got a relative speed up from about six and a half. So what other simple changes can we try? There's actually a collection of things that we could do that don't even have us touching the code. What else could we do for people who have played with compilers and such? Hint hint. Yeah. Yeah. Yeah, change the compiler flags, okay? So, click.
1 test_transcripts_before/30s/out070.txt Normal file
@@ -0,0 +1 @@
which is the compiler we'll be using, provides a collection of optimization switches, and you can specify, switch to the compiler to ask it to optimize. So you do minus O and then a number, and zero, if you look at the documentation, it says do not optimize. One says optimize. Two says optimize even more. Three says optimize yet more, okay? In this case, it turns out that even though it optimized more in O3,
1 test_transcripts_before/30s/out071.txt Normal file
@@ -0,0 +1 @@
It turns out O2 was a better setting. Okay? This is one of these cases. It doesn't happen all the time. Usually O3 does better than O2, but in this case, O2 actually optimized better than O3, because the optimizations are to some extent heuristic. Okay? And there are also other kinds of optimization. You can have it do a profile guided optimization where you look at what the performance was and feed that back into the code and then the
1 test_transcripts_before/30s/out072.txt Normal file
@@ -0,0 +1 @@
compiler can be smarter about how it optimizes. And there are a variety of other things. So with this simple technology, we now choosing a good optimization flag in this case O2, we got for free basically a factor of 3.25 without having to do much work at all. And now we're actually starting to approach 1% of peak performance. We got point.
1 test_transcripts_before/30s/out073.txt Normal file
@@ -0,0 +1 @@
3% of peak performance. OK? So what's causing the low performance? Why aren't we getting most of the performance out of this machine? Why do you think? Yeah. You know, we're not using all the cars so far. We're using just one core and how many cores we have? 18. 18 cores. 18 cores. Just sitting there, 17 sitting idle while we are trying to opt.
1 test_transcripts_before/30s/out074.txt Normal file
@@ -0,0 +1 @@
optimize one. So, multi-core. So, we have nine cores per chip, and there are two of these chips in our test machine. So, we're running on just one of them. So, let's use them all. To do that, we're going to use the Silk infrastructure. And in particular, we can use what's called a parallel loop, which in Silk you call Silk 4. And so, you just relay that outer
1 test_transcripts_before/30s/out075.txt Normal file
@@ -0,0 +1 @@
loop, for example, in this case, you say silk for it, says do all those iterations in parallel. Compiling runtime system are free to schedule them, and so forth. And we could also do it for the inner loop. And it turns out you can't also do it for the middle loop, if you think about it. So I'll let you do that as a little bit of a homework problem. Why can't I just do a silk?
1 test_transcripts_before/30s/out076.txt Normal file
@@ -0,0 +1 @@
for of the inner loop. So the question is, which parallel version works best? So we can do parallel the i loop. We can parallel the j loop, and we can do i and j together. You can't do k just with a parallel loop and expect to get the right thing. So, and that's this thing. So if you look, wow, what a spread of running times, right? So if I parallelize the just the i loop, it's 3.18 seconds.
1 test_transcripts_before/30s/out077.txt Normal file
@@ -0,0 +1 @@
And if I paralyze the J loop, it actually slows down, I think, right? And then if I do both INJ, it's still bad. I just want to do the out loop there. This has to do, it turns out with scheduling overhead. And we'll learn about scheduling overhead and how you predict that and such. So the rule of thumb here is paralyze outer loops rather than inter loops. And so when we do parallel loops, we get an almost 18x speed up on 18 cores.
1 test_transcripts_before/30s/out078.txt Normal file
@@ -0,0 +1 @@
So let me assure you not all code is that easy to paralyze. But this one happens to be. So now we're up to what? Just over 5% of peak. So where are we losing time here? Why are we getting just 5%. Yeah?
1 test_transcripts_before/30s/out079.txt Normal file
@@ -0,0 +1 @@
vectorize the e. Yep, good. So that's one. And there's one other that we're not using very effectively. OK, that's one. And those are the two optimizations we're going to do to get a really good code here. So what's the other one? Yeah? Just to go to the most line add from the operation. That's actually related to the same question. But there's another completely different source of, um,
1 test_transcripts_before/30s/out080.txt Normal file
@@ -0,0 +1 @@
of opportunity here. Yeah. We could also do a lot better on our handling of cache misses. Yeah, OK. We can actually manage the cache misses better. So let's go back to hardware caches and let's restructure the computation to reuse data in the cache as much as possible, because cache misses are slow and hits are fast. And try to make the most of the cache by reusing the data that's already there. So let's just take a look. Suppose that we're.
1 test_transcripts_before/30s/out081.txt Normal file
@@ -0,0 +1 @@
We're going to just compute one row of C. So we go through one row of C. That's going to take us since a 4,096 long vector there. That's going to basically be 4,096 rights that we're going to do. And we're going to get some spatial locality there, which is good. But we're basically doing the processors doing 4,096 rights. Now to compute that row, I need to access
1 test_transcripts_before/30s/out082.txt Normal file
@@ -0,0 +1 @@
4,096 reads from A. And I need all of B, okay? Because I go each column of B, okay, as I'm going through, to fully compute C. Do people see that? Okay? So I need to just compute one row of C. I'm going to compute, I need to access one row of A.
1 test_transcripts_before/30s/out083.txt Normal file
@@ -0,0 +1 @@
and all of B. Because the first element of C needs the whole first column of B. The second element of C needs the whole second column of B. Once again, don't worry if you don't fully understand this, because right now I'm just ripping through this at high speed. We're going to go into this in much more depth in the class, and there'll be plenty of time to master this stuff. But the main thing to understand is, you're going through all of B. Then I want to compute another row of C. I'm going to do the same thing. I'm going to go through one row.
1 test_transcripts_before/30s/out084.txt Normal file
@@ -0,0 +1 @@
of A and all of B again, so that when I'm done, we do about 16 million, 17 million memory accesses total. Okay, that's a lot of memory accesses. So what if instead of doing that, I do things in blocks, okay? So what if I want to compute a 64 by 64 block of C rather than a row of C? So let's take a look at what happens there. So remember, by the way, this number, 16.
1 test_transcripts_before/30s/out085.txt Normal file
@@ -0,0 +1 @@
17 million because we're going to compare with it. So what about the computer block? So if I look at a block, that is going to take me 64 by 64. Also takes 4,096 writes to see. Same number. But now I have to do about 200,000 reads from A, because I need to access all those rows. And then for B, I need to access 64 columns of B. And that's another 2,000.
1 test_transcripts_before/30s/out086.txt Normal file
@@ -0,0 +1 @@
26,000, 262,000 reads from B, which ends up being half a million memory accesses total. So I end up doing way fewer accesses if those blocks will fit in my cache. So I do much less to compute the same size footprint if I compute a block rather than computing a row. A much more efficient.
1 test_transcripts_before/30s/out087.txt Normal file
@@ -0,0 +1 @@
And that's a scheme called tiling. And so if you do tiled matrix multiplication, what you do is you bust your matrices into, let's say, 64 by 64 sub-matrices. And then you do two levels of matrix multiply. You do an outer level of multiplying of the blocks using the same algorithm. And then when you hit the inner to do a 64 by 64 matrix multiply, I then do another
1 test_transcripts_before/30s/out088.txt Normal file
@@ -0,0 +1 @@
three nested loops. You end up with six nested loops. OK? And so you're basically, you know, busting it like this. And there's a tuning parameter, of course, which is, you know, how big do I make my tile size? You know, if it's s by s. What should I do at the least there? Should it be 64? Should it be 128? Should it be, what number should I use there? How do we find the right value of, um, how do we find the right value of s, this tuning parameter? OK?
1 test_transcripts_before/30s/out089.txt Normal file
@@ -0,0 +1 @@
ideas are how we might find it. Yeah. You could do that. You might get a number. But who knows what else is going on in the cash while you're doing this? Yeah, test a bunch of them. Experiment. Try them. See which one gives you good numbers. And when you do that, it turns out that 32 gives you the best performance. For this particular problem.
1 test_transcripts_before/30s/out090.txt Normal file
@@ -0,0 +1 @@
So you can block it and then you can get faster and when you do that you now end up with that gave us a Speed of about 1.7 Okay, so we're now up to what we're almost 10% of peak okay and The other thing is that if you use cash grinder or a similar tool you can figure out how many cash references there are and so forth And you can see that the in fact it
1 test_transcripts_before/30s/out091.txt Normal file
@@ -0,0 +1 @@
dropped quite considerably when you do blocked the the tiling versus just the straight parallel loops. Okay, so once again you can use tools to help you figure this out and to understand the cause of what's going on. Well, it turns out that our chips don't have just one cache. They've got three levels of caches, okay? There's L1 cache, okay? And there's data and instructions. So we're thinking about data here.
1 test_transcripts_before/30s/out092.txt Normal file
@@ -0,0 +1 @@
of the data for the matrix. And it's got an L2 cache, which is also private to the processor, and then a shared L3 cache. And then you go out to the DRAM. You also can go to your neighboring processors and such. And there are different size. And you can see they grow in size 32 to 252 kilobytes, 256 kilobytes to 25 megabytes to main memory, which is 60 gigabytes. So what you can do is, if you want.
1 test_transcripts_before/30s/out093.txt Normal file
@@ -0,0 +1 @@
to do two-level tiling, you can have two tuning parameters, S and T. And now you get to do, you can't do binary search to find it, unfortunately, because it's multidimensional. You kind of have to do it exhaustively. And when you do that, you end up with nine nested loops. But of course, we don't really want to do it. We have three levels of caching.
1 test_transcripts_before/30s/out094.txt Normal file
@@ -0,0 +1 @@
Okay, can anybody figure out the inductive number? How many, for three levels of caching? How many levels of tiling do we have to do? This is a, this is a gimme, right? 12, 12, okay? Yeah, it's in do 12, okay. That really, and man, you know, when I say the code gets ugly when you start making things go fast, okay, right? This is like, ooh.
1 test_transcripts_before/30s/out095.txt Normal file
@@ -0,0 +1 @@
OK. OK. But it turns out there's a trick. You can tile for every pyro of two simultaneously by just solving the problem recursively. So the idea is that you do divide and conquer, divide each of the matrices into four submatrices. And then if you look at the calculations you need to do, you have to solve eight subproblems of half the size and then do a...
1 test_transcripts_before/30s/out096.txt Normal file
@@ -0,0 +1 @@
and then do an addition. Okay? And so you have eight multiplications of size n over 2 by n over 2 and one addition of n by n matrices and that gives you your answer. But then of course what you're gonna do is solve each of those recursively. Okay? And that's gonna give you essentially the same type of performance. Here's the code. I don't expect that you understand this, but we've written this using in parallel because it turns out you can do four of them in parallel and the silk spawn.
1 test_transcripts_before/30s/out097.txt Normal file
@@ -0,0 +1 @@
here says, go and do this subroutine, which is basically a subproblem. And then while you're doing that, you're allowed to go and execute the next statement, which will do another spawn and another spawn and finally this. And then this statement says, but don't start the next phase until you finish the first phase. And we'll learn about this stuff. When we do that, we get a running time of about 93 seconds,
1 test_transcripts_before/30s/out098.txt Normal file
@@ -0,0 +1 @@
which is about 50 times slower than the last version. We're using cash much better, but it turns out, nothing is free, nothing is easy, and typically in performance engineering, you have to be clever. Why, what happened here? Why did this get worse even though? Turns out if you actually look at the caching numbers, you're getting great hits on cache. I mean, you have very few cache misses, lots of hits.
Some files were not shown because too many files have changed in this diff.