Model for predicting the number of YouTube views of Gangnam Style


73

PSY's music video "Gangnam Style" is popular: a little more than 2 months after release it has about 540 million viewers. I learned this from my family at dinner last week, and the discussion soon turned to whether it is possible to make some kind of prediction of how many viewers there will be in 10-12 days, and when (or if) the song will pass 800 million viewers or 1 billion viewers.

The number of viewers since it was posted looks like this: [figure: PSY OGS]

Here are the viewer counts for the No. 1 "Justin Bieber - Baby" and No. 2 "Eminem - Love the Way You Lie" music videos, both of which have been around much longer. [figures: Justin, Eminem]

My first attempt at reasoning about a model was that it should be an S-curve, but that does not seem to fit the No. 1 and No. 2 songs. It also does not fit the fact that there is no limit on how many views a music video can have, only slower growth.

So my question is: what kind of model should I use to predict the number of viewers of a music video?


21
+1 for steering dinner table conversation from Gangnam to statistics. We need people like you!
Stephan Kolassa

4
One thing I can add to this discussion, which I hope will be useful to gui11aume or others writing equations to build this model, is that in this Kona example geographic clustering was an important aspect of the viral spread. The fact that PSY is a Korean, and then at first an Asian, phenomenon is an important part of the story. I am not sure how this would be modelled, but it might be a clue.

Data about the video's views, comments, likes and dislikes during November 2012: docs.google.com/spreadsheet/…
FredrikD

Answers:


38

Aha, excellent question!!

I would also have naively proposed the S-shaped logistic curve, but it is obviously a poor fit. As far as I know, the constant increase is an approximation because YouTube counts unique views (one per IP address), so there cannot be more views than computers.

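Consider an epidemic-style model with two groups of differing susceptibility. Let $x(t)$ and $y(t)$ be the numbers of "infected" individuals in the high-risk and low-risk groups at time $t$, and let $X$ and $Y$ be the (unknown) sizes of those groups. The model is

$$\dot{x}(t) = r_1 \big(x(t) + y(t)\big)\big(X - x(t)\big)$$
$$\dot{y}(t) = r_2 \big(x(t) + y(t)\big)\big(Y - y(t)\big),$$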

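where $r_1 > r_2$. Assuming that $Y$ is very large compared with $y$, so that $Y - y(t)$ is roughly constant, and that infection is driven mainly by the high-risk group, the system simplifies to

$$\dot{x}(t) = r_1 x(t)\big(X - x(t)\big)$$
$$\dot{y}(t) = r_2 x(t),$$

which predicts linear growth once the high-risk group is completely infected. Note that with this model there is no reason to assume that $r_1 > r_2$; quite the contrary, because the large term $Y - y(t)$ is now absorbed into $r_2$.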

This system solves to

$$x(t) = X \frac{C_1 e^{X r_1 t}}{1 + C_1 e^{X r_1 t}}$$
$$y(t) = r_2 \int x(t)\,dt + C_2 = \frac{r_2}{r_1} \log\left(1 + C_1 e^{X r_1 t}\right) + C_2,$$

where $C_1$ and $C_2$ are integration constants. The total "infected" population is then $x(t) + y(t)$, which has 3 parameters and 2 integration constants (initial conditions). I don't know how easy it would be to fit...
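As a sanity check on the closed form of the simplified system, here is a minimal R sketch that evaluates $x(t)$ and $y(t)$ and compares a finite-difference derivative with the right-hand sides $r_1 x (X - x)$ and $r_2 x$. All parameter values (`X`, `r1`, `r2`, `C1`, `C2`) are illustrative assumptions, not fitted values, and the time unit is assumed to be days.

```r
# Illustrative (assumed) parameters; none of these are fitted to data.
X  <- 6e8          # size of the high-risk group
r1 <- 3.667e-10    # transmission rate within the high-risk group
r2 <- 1 / 600      # rate at which the low-risk group gets infected
C1 <- 1 / (X - 1)  # chosen so that x(0) = 1
C2 <- 0            # integration constant for y(t)

# Closed-form solution of x' = r1 * x * (X - x), y' = r2 * x.
x_fun <- function(t) X * C1 * exp(X * r1 * t) / (1 + C1 * exp(X * r1 * t))
y_fun <- function(t) (r2 / r1) * log1p(C1 * exp(X * r1 * t)) + C2

# Central finite differences approximate the time derivatives.
t <- seq(10, 200, by = 10)
h <- 1e-3
dx_num <- (x_fun(t + h) - x_fun(t - h)) / (2 * h)
dy_num <- (y_fun(t + h) - y_fun(t - h)) / (2 * h)

# Right-hand sides of the simplified system.
dx_ode <- r1 * x_fun(t) * (X - x_fun(t))
dy_ode <- r2 * x_fun(t)

# Largest relative mismatch; small values mean the closed form is consistent.
max_rel_err_x <- max(abs(dx_num - dx_ode) / (abs(dx_ode) + 1))
max_rel_err_y <- max(abs(dy_num - dy_ode) / (abs(dy_ode) + 1))

total <- x_fun(t) + y_fun(t)   # total "infected" population
```

Once $x(t)$ has saturated at $X$, `total` grows by roughly `r2 * X` per time unit, which is the linear regime described above.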

Update: Playing with the parameters, I could not reproduce the shape of the top curve with this model; the transition from $0$ to $600{,}000{,}000$ is always too sharp. Continuing with the same idea, we could again assume that there are two kinds of Internet users: the "sharers" $x(t)$ and the "loners" $y(t)$. The sharers infect each other; the loners bump into the video by chance. The model is

$$\dot{x}(t) = r_1 x(t)\big(X - x(t)\big)$$
$$\dot{y}(t) = r_2,$$

and solves to

$$x(t) = X \frac{C_1 e^{X r_1 t}}{1 + C_1 e^{X r_1 t}}$$
$$y(t) = r_2 t + C_2.$$

We could assume that $x(0) = 1$, i.e. that there is only patient 0 at $t = 0$, which yields $C_1 = \frac{1}{X-1} \approx \frac{1}{X}$ because $X$ is a large number. $C_2 = y(0)$, so we can assume that $C_2 = 0$. Now only the 3 parameters $X$, $r_1$ and $r_2$ determine the dynamics.

Even with this model, the inflection seems very sharp. It is not a good fit, so the model must be wrong. That makes the problem very interesting, actually. As an example, the figure below was built with $X = 600{,}000{,}000$, $r_1 = 3.667 \cdot 10^{-10}$ and $r_2 = 1{,}000{,}000$.

growth model of Gangnam style
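For readers who want to reproduce a curve of this kind, here is a minimal R sketch of the sharers/loners solution using the parameter values quoted above. The time axis in days and the variable names are my assumptions; this is not necessarily the code that produced the original figure.

```r
# Parameter values quoted in the text for the figure.
X  <- 6e8
r1 <- 3.667e-10
r2 <- 1e6
C1 <- 1 / (X - 1)   # from x(0) = 1
C2 <- 0             # from y(0) = 0

# Logistic growth for the sharers, linear growth for the loners.
sharers <- function(t) X * C1 * exp(X * r1 * t) / (1 + C1 * exp(X * r1 * t))
loners  <- function(t) r2 * t + C2

days  <- 0:365                       # assumed time axis: days since upload
total <- sharers(days) + loners(days)

# The logistic part is steepest where C1 * exp(X * r1 * t) = 1.
t_inflection <- -log(C1) / (X * r1)

plot(days, total, type = "l", lwd = 2,
     xlab = "days since upload", ylab = "modelled views")
abline(v = t_inflection, lty = 3)    # dotted line at the inflection
```

The sharp bend discussed above shows up around `t_inflection`, after which only the linear `loners` term keeps growing.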

Update: From the comments I gathered that Youtube counts views (in its secret way) and not unique IPs, which makes a big difference. Back to the drawing board.

To keep it simple, let's assume that the viewers are "infected" by the video. They come back to watch it regularly, until they clear the infection. One of the simplest models is the SIR (Susceptible-Infected-Resistant) which is the following:

$$\dot{S}(t) = -\alpha S(t) I(t)$$
$$\dot{I}(t) = \alpha S(t) I(t) - \beta I(t)$$
$$\dot{R}(t) = \beta I(t)$$

where $\alpha$ is the rate of infection and $\beta$ is the rate of clearance. The total view count $x(t)$ is such that $\dot{x}(t) = k I(t)$, where $k$ is the average views per day per infected individual.

In this model, the view count starts increasing abruptly some time after the onset of the infection, which is not the case in the original data, perhaps because videos also spread in a non viral (or meme) way. I am no expert in estimating the parameters of the SIR model. Just playing with different values, here is what I came up with (in R).

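# SIR parameters: initial susceptibles, infection rate, clearance rate,
# and average views per day per infected individual.
S0 <- 1e7; a <- 5e-8; b <- 0.01; k <- 1.2
views <- numeric(366); S <- S0; I <- 1
# Extrapolate 1 year after the onset (Euler steps of one day).
for (i in 1:365) {
   dS <- -a*I*S
   dI <- a*I*S - b*I
   S <- S + dS
   I <- I + dI
   views[i+1] <- views[i] + k*I
}
par(mfrow=c(2,1))
# Top: first 95 days. Bottom: same range (solid) plus extrapolation (dashed).
plot(views[1:95], type='l', lwd=2, ylim=c(0,6e8))
plot(views, type='n', lwd=2)
lines(views[1:95], type='l', lwd=2)
lines(96:365, views[96:365], type='l', lty=2)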

Extrapolation of the views of the Gangnam style Youtube video

The model is obviously not perfect, and could be complemented in many sound ways. This very rough sketch predicts a billion views somewhere around March 2013, let's see...
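The billion-view estimate can be read off the simulation directly. Here is a minimal sketch that wraps the same Euler-step SIR scheme in a function and finds the first day on which the modelled count reaches one billion. The function name `simulate_views`, the two-year horizon and the upload date used as day 0 are my assumptions.

```r
# Same Euler-step SIR scheme as in the loop above, wrapped in a function.
simulate_views <- function(days, S0 = 1e7, a = 5e-8, b = 0.01, k = 1.2) {
  views <- numeric(days + 1)   # views[1] is day 0
  S <- S0
  I <- 1
  for (i in seq_len(days)) {
    dS <- -a * I * S
    dI <- a * I * S - b * I
    S <- S + dS
    I <- I + dI
    views[i + 1] <- views[i] + k * I
  }
  views
}

views <- simulate_views(730)              # two years, well past the crossing
day_1e9 <- which(views >= 1e9)[1] - 1     # index 1 corresponds to day 0
upload_date <- as.Date("2012-07-15")      # assumed day 0 (upload date)
date_1e9 <- upload_date + day_1e9
print(date_1e9)
```

With these parameters the modelled total levels off near `k * S0 / b`, so the crossing date is quite sensitive to `k` and `b`.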


5
(+1) As a first approach. Note that YouTube's policy for counting views is not well understood, given that they have not made their algorithm public. They only say: "A view is counted whenever someone watches a video on YouTube. We do not get more specific than this to avoid attempts at artificially inflating view counts".

3
@FredrikD thanks. You can still remove the 'accept' in March 2013 if I got it wrong :D
gui11aume

2
SIR model parameter estimation, see rsfs.royalsocietypublishing.org/content/2/2/156.full
FredrikD

1
It seems I am going to lose this one! They may hit the billion even before 2013...
gui11aume

2
engadget.com/2012/12/21/gangnam-style-one-billion-views So the world didn't end but 1 Billion views was hit today.
DanTheMan

5

Probably the most common model for forecasting new product adoption is the Bass diffusion model, which - similar to @gui11aume's answer - models interactions between current and potential users. New product adoption is a pretty hot topic in forecasting, searching for this term should yield tons of info (which I unfortunately don't have the time to expand on here...).
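To make the suggestion concrete, here is a minimal R sketch of the standard Bass cumulative-adoption curve. The parameter values `m`, `p` and `q` are hypothetical and chosen only for illustration. Note that the curve counts adopters (each person once), not repeated views.

```r
# Bass diffusion: cumulative adopters m * F(t), where
# F(t) = (1 - exp(-(p + q) * t)) / (1 + (q / p) * exp(-(p + q) * t)).
bass_cumulative <- function(t, m, p, q) {
  e <- exp(-(p + q) * t)
  m * (1 - e) / (1 + (q / p) * e)
}

# Hypothetical parameter values, for illustration only.
m <- 1e9      # market potential (eventual number of adopters)
p <- 0.001    # coefficient of innovation (external influence)
q <- 0.08     # coefficient of imitation (word of mouth)

days <- 0:365
adopters <- bass_cumulative(days, m, p, q)

# For q > p the adoption rate peaks at t* = log(q / p) / (p + q).
t_peak <- log(q / p) / (p + q)

plot(days, adopters, type = "l", lwd = 2,
     xlab = "days", ylab = "cumulative adopters")
abline(v = t_peak, lty = 3)
```

Here `p` plays a role similar to the "loners" who find the video by chance, and `q` a role similar to the "sharers" who pass it on.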


Yes, that is also a candidate model. However, it seems to assume that you can only be a user once. Here, you view the video a number of times if you are "infected".
FredrikD

1
@FredrikD: point taken. (Though I personally didn't manage to sit even through a single "use" of this "product"...) There should be generalizations of Bass to deal with this. (Shameless plug:) Next year's International Symposium of Forecasting is in Seoul, so anyone should consider presenting his/her favorite Gangnam forecasting model there! ;-)
Stephan Kolassa

4

I would look at the Gompertz growth curve.

The Gompertz curve is a 3-parameter (a,b,c) double-exponential formula with time, t, as the independent variable.

R code:

# a: upper asymptote; b and c are negative for growth towards a
gompertz_growth <- function(a, b, c, t) { a*exp(b*exp(c*t)) }

The Gompertz growth formula is known to be good at describing many life-cycle phenomena where growth first accelerates and then tapers off, resulting in an asymmetric sigmoid curve whose derivative is steeper on the left than on the right of the peak. For example, the total number of articles on Wikipedia, which is also viral in nature, has been following a Gompertz growth curve (with certain a,b,c parameters) for many years with great accuracy.

Chart of the Gompertz curves: total size and its growth rate derivative
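A chart of this kind can be sketched in a few lines of R. The parameter values `A`, `B`, `C` below are hypothetical and only meant to show the shape. The sketch also marks the point where the growth rate peaks, which for this formula is where `b * exp(c * t) = -1`, at height `a / e`.

```r
# Gompertz curve and its analytic derivative (growth rate).
gompertz_growth <- function(a, b, c, t) a * exp(b * exp(c * t))
gompertz_rate <- function(a, b, c, t) {
  a * b * c * exp(c * t) * exp(b * exp(c * t))
}

# Hypothetical parameters: B < 0 and C < 0 give growth towards the plateau A.
A <- 1e9; B <- -5; C <- -0.05

# The growth rate peaks where B * exp(C * t) = -1.
t_peak <- log(-1 / B) / C

t <- seq(0, 200, by = 1)
par(mfrow = c(2, 1))
plot(t, gompertz_growth(A, B, C, t), type = "l", lwd = 2, ylab = "total size")
abline(v = t_peak, lty = 3)
plot(t, gompertz_rate(A, B, C, t), type = "l", lwd = 2, ylab = "growth rate")
abline(v = t_peak, lty = 3)
```

Because the peak sits at `A / e` (about 37% of the plateau) rather than at half of it, the rate curve rises faster than it falls, which is the asymmetry described above.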

Edit: If the Gompertz curve isn't enough to approximate the shape you're looking for, you may want to add parameters d & θ as described in The Exponentiated Generalized Weibull Gompertz Distribution. Note that this paper uses x instead of t for the independent time parameter. Interestingly, Wikipedia also modified their best approximation by adding a single 4th parameter d, to account for a divergence of the prediction from the actual value after 2012. The modified 4-parameter Gompertz curve formula is:

gompertz_2 <- function(a, b, c, d, t) { a * exp(b * exp(c*t) + d*t) }

The Gompertz function is named after Benjamin Gompertz (1779-1865), a Gauss contemporary (just 2 years Gauss' junior), the first mathematician to describe it.


Good point! However, what challenges the model is that there doesn't seem to be a limit (see the No1 and No2). That is, the factor a in the model is also increasing over time.
FredrikD

I would challenge the "There doesn't seem to be a limit." Can Gangnam Style reach 1B? 10B? 100B views? Eventually the growth rate gets to near zero and the curve plateaus. This is hard to see when you're in the high-growth phase, like we are now with Gangnam, but just wait a few years and you'll see Gompertz win :) The trick is, of course, to figure out the right (a,b,c) parameters for this specific case.
arielf

2
Here is a reference for estimating the parameters of the Gompertz model, see weibull.com/RelGrowthWeb/…
FredrikD

3

I think you need to separate phenomena like Gangnam Style, which owes much of its views to being a meme/viral thing, from Justin Bieber and Eminem, who are big artists in their own right and who would also spread widely in a traditional setting. JB or Eminem would sell a lot of singles too; I'm not sure that PSY would.


Good point. After reading & listening to interviews with PSY and the team behind "OGS" (Oppa Gangnam Style), it is clear that they are well aware of which buttons to press to create a viral thing. From some image analysis of the views picture above, it seems that the number of views is linear up to about 90 days after launch; then PSY appears at the Korean Grand Prix and the number of views per time unit increases.
FredrikD

- and how do these two classes differ from "classics" - songs that were presumably well-known when they were first uploaded on YouTube (I'm thinking David Bowie)?
abaumann

2

5
Welcome to the site, @ProfRoy47. Would you mind elaborating on this post somewhat? It's not clear that this is actually an answer to the OP's question yet / that it quite stands on its own. OTOH, it wouldn't fit as a comment, & I think it has the makings of a helpful contribution to this thread. Our FAQ has some discussion re providing answers on CV, which may be helpful to you.
gung

1

"The model is obviously not perfect, and could be complemented in many sound ways. This very rough sketch predicts a billion views somewhere around March 2013, let's see..."

Looking at the slowdown in views over the past week, the Mar-13 date looks like a decent bet. The majority of the new views appear to be already infected users that return multiple times per day.

With regards to complementing your model, one method that researchers use to track a virus' spread is to monitor its genome mutations - when and where it mutated can show researchers how fast a virus is transmitted and spread (see tracking West Nile Virus in USA).

In a practical sense, videos like Gangnam Style and Party Rock Anthem (by the group LMFAO) are more likely to 'mutate' into parodies, flash mobs, wedding dances, remixes and other video responses than say, Justin Bieber's Baby or Eminem's songs.

Researchers could analyse the number of video responses (and parodies in particular) as a proxy for mutations. Measuring the frequency and popularity of these mutations early in the life of the video could be useful in modelling its lifetime YouTube views.


Welcome to the site, @lucasng. CV is intended for serious, factual answers to substantive questions (you may want to read our faq), & I think the OP has asked w/ this in mind. Your answer is on the borderline here; I think it should stay based on its ideas about mutations etc, but note that opinions about the merits of the videos aren't really germane.
gung

I think the idea is good. @gung True that it is not an answer to the OP, but the second answer also isn't.
gui11aume

@gung: (A Google search suggests that) lucasng was not stating an opinion in the part you redacted but rather citing the name of the group that performs the song!
cardinal

1
@cardinal, thanks for the heads up. Lucasng, sorry about the confusion; I have put the group name back.
gung
Licensed under cc by-sa 3.0 with attribution required.