1st SAPO Data Challenge – Traffic Prediction

10 March 2011

Challenge Rules

1. Introduction

SAPO is the largest Portuguese web portal, being owned by PT Comunicações, S.A. (hereinafter “PT Comunicações”). Its home page, http://www.sapo.pt, receives hundreds of thousands of visits every day. The home page has several sections that link to different types of contents, such as news, videos, opinion articles, blog previews, etc.

Those contents are linked from specific sections of the home page. Sapo.pt editors decide which contents are potentially interesting to the users, and select the most appropriate section of the home page to post a link to them. Links remain on the home page for several hours (e.g. during the morning), and are later replaced by others pointing to more recent contents. Decisions regarding which contents to link and where to place the links in the home page are not easy, and are mostly based on experience and intuition.

Considering the above, PT Comunicações decided to propose a challenge to the data-mining community by launching the “1st Sapo Data Challenge – Traffic Prediction” (hereinafter the “Challenge”), aimed at predicting the number of visits that a given linked content displayed on the SAPO home page will receive.

The Challenge is governed by the rules set forth herein (hereinafter “Challenge Rules”). Please read the Challenge Rules before entering this Challenge. Please also note that in order to enter the Challenge you will be asked to sign a Non-Disclosure and Acceptance Agreement. By signing such agreement, you agree to be bound by these Challenge Rules and represent that you satisfy all of the requirements set forth below.


2. Purpose and Scope

2.1.   This Challenge is aimed at encouraging the participants to predict the number of visits that certain linked contents (news) displayed on the SAPO home page have received.

2.2.   For this purpose, and subject to the rules and conditions set forth herein, the participants will have to submit to PT Comunicações predictions concerning said number of visits, taking into account namely:

  1. the place in the home page where the link is placed;
  2. SAPO’s thematic channel that produced the content;
  3. the period of the day in which the link is active; and
  4. the title of the content.

2.3.   The participants may generate a method or system (e.g. a computer program or software) in order to produce the predictions referred to in this clause.

2.4.   There will be three winners in this Challenge, corresponding to the best ranked participants in what concerns the predictions referred to in this clause. Subject to the rules and conditions set forth herein, PT Comunicações will award a prize to each of the three winners of the Challenge.


3. Data Set

3.1.   Upon receipt of the Non-Disclosure and Acceptance Agreement duly signed by the participant in accordance with clause 6.3, PT Comunicações will provide each participant with a set of real data extracted from the access logs of the home page. These data concern the number of hits received by contents linked from the SAPO home page and are intended to help the participants produce the predictions referred to in clause 2.

3.2.   In particular, PT Comunicações will provide each participant with a number of data set lines.  Each line in the data set to be provided by PT Comunicações contains information about the number of hits that a given news item has received during one hour. Each link (pointing to a news item) is placed in a specific section and subsection of the news area.

More specifically, the data set contains 13140 lines, each containing the following 8 fields (separated by tab):

  1. Line number: a sequential integer for identifying the log line.
  2. Date + Time information: the date and daytime at which the hits took place. We keep track of the number of hits with hourly precision, so a value of “2011-03-08 23:00:00” means that the hit took place between 23:00:00 and 23:59:59 of 2011-03-08.
  3. Channel ID: an integer identifying Sapo’s channel that produced the content being hit. In this data set there are contents from 18 different channels (numbered sequentially from 1 to 18).
  4. Section: the name of the top placeholder inside the area dedicated to news in which the link was placed. There are 5 possible sections: “geral”, “desporto”, “economia”, “tecnologia” and “vida”. These sections can be interpreted as the “high-level topic” of the news item. Note that there are topics that are much more popular than others.
  5. Subsection: each placeholder is further divided into five subsections: “manchete”, “headlines”, “related”, “footer” and “null”. This is an important parameter because certain subsections are visually larger than others when rendered in the user’s browser. For example “manchete”, which holds only one link to a news item (and a photo), is about as large as the other two subsections, which contain several links to various news items simultaneously.
  6. News ID: an integer identifying the content. There are 1217 distinct news items numbered sequentially from 1 to 1217.
  7. Number of hits: the number of hits that the linked content received, during one hour (see field 2 above). The link is placed in the previously identified section and subsection.
  8. Title: the title of the news item at stake.

The following line exemplifies the fields just described:

[13116] [2011-03-08 23:00:00] [2] [geral] [manchete] [1214] [401] [Barcelona segue para os quartos-de-final]
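The field layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes the fields are separated by tabs as stated in clause 3.2, and it tolerates (but does not require) square brackets around each field, since the example line shows them. The names `LogLine`, `parse_line` and `hour_end` are illustrative and are not part of the Challenge materials.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class LogLine:
    line_number: int
    hour_start: datetime   # start of the one-hour bucket (field 2)
    channel_id: int
    section: str
    subsection: str
    news_id: int
    hits: Optional[int]    # None when the value is withheld ("?")
    title: str


def _unwrap(field: str) -> str:
    """Drop one pair of surrounding brackets, if present (assumed layout)."""
    field = field.strip()
    if field.startswith("[") and field.endswith("]"):
        return field[1:-1]
    return field


def parse_line(raw: str) -> LogLine:
    """Parse one tab-separated data set line into a LogLine."""
    # Split at most 7 times so that the title stays in one piece.
    fields = [_unwrap(f) for f in raw.rstrip("\n").split("\t", 7)]
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    number, stamp, channel, section, subsection, news, hits, title = fields
    return LogLine(
        line_number=int(number),
        hour_start=datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        channel_id=int(channel),
        section=section,
        subsection=subsection,
        news_id=int(news),
        hits=None if hits == "?" else int(hits),
        title=title,
    )


def hour_end(line: LogLine) -> datetime:
    """Exclusive end of the hourly bucket: hits fall in [hour_start, hour_end)."""
    return line.hour_start + timedelta(hours=1)
```

Keeping `hits` as `None` for withheld values makes it easy to separate the lines to be predicted from the lines that can be used for fitting a model.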

3.3.   PT Comunicações will restrict the challenge to links placed inside a specific area of the home page that is dedicated to news. Thus, all the contents pointed by these links are news items. Also, all the contents pointed by these links are owned by PT Comunicações (we will not provide information concerning links that point to contents owned by other entities, such as PT Comunicações’ partners). The information to be provided covers 15 consecutive days.

3.4.   In no event will PT Comunicações provide the participants with personal data, namely traffic personal data concerning the users of the web portal SAPO.


4. Methodology and Output

4.1.   Along with the data set referred to in clause 3, PT Comunicações will provide each of the participants with a number of data set lines in which the information regarding the number of hits (field 7 of clause 3.2.) has been removed. These data set lines correspond to about 5% of the total number of data set lines provided to the participant. For every news item selected in this way, PT Comunicações removed the number of hits from all corresponding entries, i.e. for each hour during which the link pointing to that news item was online (links can remain on the home page for several hours). For example:

[285] [2011-02-23 07:00:00] [2] [desporto] [headlines] [57] [?] [Armindo Araújo estreia-se ao volante do Mini]

The number of hits (field 7) was replaced by “?”. All other lines concerning the same news item (news_id = 57) also had field 7 replaced by “?”.
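One practical consequence is that a withheld news item should never appear among the lines with known hits, so a model cannot lean on that item's own traffic in other hours. The sketch below separates the two groups and checks this property. The `Row` tuple layout and the function names are illustrative assumptions, and apart from line 285 (taken from the example above) the sample rows are made up.

```python
from typing import List, Set, Tuple

# (line number, news ID, raw content of field 7); illustrative layout only
Row = Tuple[int, int, str]


def split_known_withheld(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate lines with a known hit count from lines marked with '?'."""
    known = [r for r in rows if r[2] != "?"]
    withheld = [r for r in rows if r[2] == "?"]
    return known, withheld


def overlapping_news_ids(known: List[Row], withheld: List[Row]) -> Set[int]:
    """News IDs found in both groups; expected to be empty per clause 4.1."""
    return {r[1] for r in known} & {r[1] for r in withheld}


# Hypothetical sample: only line 285 / news item 57 comes from the rules.
sample_rows: List[Row] = [
    (284, 56, "130"),
    (285, 57, "?"),
    (286, 57, "?"),
    (287, 58, "42"),
]
known_rows, withheld_rows = split_known_withheld(sample_rows)
```

If `overlapping_news_ids` ever returns a non-empty set on the real data, the reading of clause 4.1 used here would need to be revisited.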

4.2.   Based on the information available, participants will have to predict the missing value for each data set line in which field 7 has been replaced by “?”, taking into account the period during which the link was active on the home page. Participants will further have to submit to PT Comunicações a file containing the prediction for each missing value in the following format (one per line):

[Line Number] [hit prediction]

For example, for the previous missing line, a valid prediction would be:



4.3.   A prediction is expected for each missing value. Even if the participant is not confident, a prediction must be generated.

4.4.   Each participant can submit up to three runs, which should be named participants_first_name_last_name_run#.tsv (e.g. “luis_sarmento_2.tsv”).
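Clauses 4.2 to 4.4 can be sketched as a small helper that writes one run. Several details here are assumptions rather than rules: the two fields are separated by a tab (suggested only by the .tsv extension), predictions are rounded to non-negative integers, and a fallback value (for instance the median of the known hit counts) is used whenever the model produced nothing, so that every missing value still gets a prediction. The function names are illustrative.

```python
from typing import Dict, Iterable


def format_run(withheld_lines: Iterable[int],
               predictions: Dict[int, float],
               fallback: float) -> str:
    """Build the run file body: one 'line number<TAB>prediction' per line."""
    rows = []
    for line_number in withheld_lines:
        value = predictions.get(line_number)
        if value is None:
            # Clause 4.3: a prediction is required even without confidence.
            value = fallback
        # Assumption: hit counts are reported as non-negative integers.
        rows.append(f"{line_number}\t{max(0, round(value))}")
    return "\n".join(rows) + "\n"


def run_filename(first_name: str, last_name: str, run: int) -> str:
    """File name following the pattern of clause 4.4."""
    if run not in (1, 2, 3):
        raise ValueError("each participant can submit up to three runs")
    return f"{first_name}_{last_name}_{run}.tsv".lower()
```

For example, `run_filename("Luis", "Sarmento", 2)` yields the name given in clause 4.4, and the string returned by `format_run` can be written to that file with UTF-8 encoding.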

4.5.   All runs must be submitted by e-mail to: las@co.sapo.pt (cc:junior@co.sapo.pt) by 23:59 (PT time) on June 12th, 2011.


5. Evaluation and Prize

5.1.   The predictions generated by the participants will be evaluated and valued by PT Comunicações in accordance with the procedure described in Appendix I, which forms part of the Challenge Rules.

5.2.   Considering the procedure described in Appendix I, PT Comunicações will award a prize to the following participants (in this order):

  1. the best ranked participant in the Cumulative Absolute Error Ranking (first winner);
  2. the best ranked participant in the Cumulative Relative Error Ranking (second winner); and
  3. the best ranked participant in the Combined Ranking (third winner).

5.3.   Each participant is only eligible to receive one prize (even if he/she ranks first in all three rankings).

5.4.   Rankings will be considered sequentially by the order referred to in clause 5.2.

5.5.   When a participant is considered a winner, all the corresponding runs will be removed from the subsequent rankings. Thus, all runs of the top ranked user of the Cumulative Absolute Error Ranking (the first winner) will be excluded both from the Cumulative Relative Error Ranking and from the Combined Ranking. The same applies from the Cumulative Relative Error Ranking to the Combined Ranking.
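The sequential procedure of clauses 5.2 to 5.5 can be sketched as follows. Each ranking is assumed to be a list of (participant, run number) pairs ordered from best to worst, and removing a winner's runs is modelled by simply skipping them. Whether the Combined Ranking is recomputed after a removal is not stated in these rules, so this sketch does not recompute it. The names used are illustrative.

```python
from typing import List, Sequence, Tuple

Run = Tuple[str, int]  # (participant, run number); illustrative layout


def pick_winners(rankings: Sequence[Sequence[Run]]) -> List[str]:
    """Pick one winner per ranking, in order, skipping earlier winners.

    `rankings` is expected to hold the Cumulative Absolute Error Ranking,
    the Cumulative Relative Error Ranking and the Combined Ranking, in
    that order, each sorted from best to worst.
    """
    winners: List[str] = []
    for ranking in rankings:
        for participant, _run in ranking:
            if participant not in winners:
                # Best remaining participant wins this ranking.
                winners.append(participant)
                break
    return winners


# Hypothetical participants and rankings, for illustration only.
absolute = [("ana", 1), ("rui", 2), ("ana", 2)]
relative = [("ana", 2), ("rui", 1), ("eva", 1)]
combined = [("ana", 1), ("rui", 2), ("eva", 1)]
example_winners = pick_winners([absolute, relative, combined])
```

In this made-up example the same participant tops all three rankings but receives only the first prize, and the second and third prizes pass to the next eligible participants.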

5.6.   The prize to be awarded by PT Comunicações consists of a free registration to the LXMLS – Lisbon Machine Learning Summer School that will take place in Lisbon from July 20 to 25, 2011. Each of the three winners will receive a free registration to this course.


6. Registration

6.1.   Registrations must be submitted by the participant, by e-mail with subject “SAPO Data Challenge” to las@co.sapo.pt and cc to junior@co.sapo.pt from April 11th to April 22nd, 2011. Registrations must be made before 23:59 PT time.

6.2.   The email referred to in the previous number must contain the following information concerning the participant:

  1. Complete name;
  2. Age;
  3. Nationality;
  4. Degree and University;
  5. Email;
  6. Website (if applicable);
  7. A few words about yourself (the participant).

6.3.   Upon receipt of the registration email, PT Comunicações will send the participant a Non-Disclosure and Acceptance Agreement  form. The participant must read and sign the form and send a copy of such form duly signed to PT Comunicações by email, to the email addresses referred to in clause 6.1.

6.4.   After receiving the copy of the Non-Disclosure and Acceptance Agreement signed by the participant, PT Comunicações will provide him/her with the data set referred to in clause 3, so he/she can start experimenting with prediction algorithms.



7. General Conditions

7.1.   Participants represent and warrant that the predictions submitted to PT Comunicações and the method/software used to generate them (if applicable) are the participant’s own original creation and do not infringe any third party intellectual property right or any other third party rights.

7.2.   In no event shall the participation in the Challenge imply the collection and/or any form of processing of personal data by the participants, namely personal data (including traffic data) concerning the users of the web portal SAPO.

7.3.   Participants’ email addresses must be kept active and up to date during the period of the Challenge.

7.4.   PT Comunicações may not be held liable for any possible loss or non-receipt of registration emails or other communications caused by network or systems failures and/or any other causes that are beyond its control.


8. Participants’ Personal Data

8.1.   Participants’ personal data provided to PT Comunicações in the scope of the Challenge will be processed by PT Comunicações (the controller of such data) for the purpose of conducting the Challenge.

8.2.   The provision of the data referred to in clause 6.2. is mandatory. Failure to provide such data means that the participant may not enter the Challenge.

8.3.   The participants may access and rectify the data provided to PT Comunicações, by sending an email to las@co.sapo.pt (cc to junior@co.sapo.pt).

8.4.   PT Comunicações will disclose the winners’ names to the public by displaying the results of the Challenge on SAPO Labs official blog: http://labs.sapo.pt.

8.5.   By signing the Non-Disclosure and Acceptance Agreement the participant gives PT Comunicações his/her consent to the processing of his/her personal data as described in these Challenge Rules.


9. Winners’ Announcement

9.1.   The correct answer set concerning the missing values in the data set lines provided to the participants will be released on June 13th, 2011.

9.2.   The winners will be announced on June 17th, 2011, both by e-mail and by posting the rankings on the SAPO Labs blog (http://labs.sapo.pt).

9.3.   PT Comunicações will contact the winners by email in order to inform them on the results of the Challenge.


10. Important Dates

10.1. Registration: From April 11th to April 22nd 2011.

10.2. Submissions: all runs (up to three per participant) need to be submitted by e-mail to: las@co.sapo.pt (cc:junior@co.sapo.pt) before June 12th, 2011, 23:59 PT time.

10.3.    Release of Answer Set: the answer set will be released on June 13th, 2011.

10.4.    Announcement of the Winners: the winners will be announced on June 17th 2011.

10.5.    LXMLS: Lisbon Machine Learning Summer School will take place 20-25 July 2011.


11. Law and Jurisdiction

11.1. The Challenge and the Challenge Rules are governed by Portuguese law.

11.2. By registering into the Challenge, the participant submits to the exclusive jurisdiction of the Portuguese courts.



12. Miscellaneous

12.1. PT Comunicações reserves the right, acting reasonably and in accordance with all relevant legislation, to modify the terms and conditions of this Challenge.

12.2. For any clarification concerning the Challenge and/or the Challenge Rules the participant may send an email to las@co.sapo.pt (cc:junior@co.sapo.pt).


Appendix I

Evaluation Procedure

The following Appendix describes the evaluation procedure concerning the predictions to be submitted by each participant.

Taking into account the data set lines that are provided to the participants, let us assume that the prediction for line i is p(i) and the corresponding true value is t(i). For each prediction, i.e. each line, two values will be computed:

1) Absolute Error: AE(i) = ABS(p(i) – t(i))
2) Relative Error: RE(i) = AE(i) / t(i)

Thus, for each run (i.e. for all predictions submitted in one run) we will compute two performance figures:

1) Cumulative Absolute Error = SUM(AE(i))
2) Cumulative Relative Error = SUM(RE(i))

For each measure, we will rank all runs (the two rankings will not necessarily match). A third ranking will be computed by summing the positions of each run in both rankings and sorting the runs in ascending order of this sum. This will be called the Combined Ranking.
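The evaluation procedure can be sketched in Python as below. Two points are assumptions, because this Appendix does not cover them: a true value of zero would make the relative error undefined (this sketch lets the division fail), and ties are broken by the order in which runs are listed. The function names and the sample numbers are illustrative only.

```python
from typing import Dict, List, Tuple


def cumulative_errors(predictions: Dict[int, float],
                      truths: Dict[int, float]) -> Tuple[float, float]:
    """Return (Cumulative Absolute Error, Cumulative Relative Error)."""
    cae = 0.0
    cre = 0.0
    for line, t in truths.items():
        ae = abs(predictions[line] - t)   # AE(i) = ABS(p(i) - t(i))
        cae += ae
        cre += ae / t                     # RE(i) = AE(i) / t(i); fails if t == 0
    return cae, cre


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each run to its 1-based position, lowest error first."""
    ordered = sorted(scores, key=lambda run: scores[run])
    return {run: pos for pos, run in enumerate(ordered, start=1)}


def combined_ranking(cae: Dict[str, float], cre: Dict[str, float]) -> List[str]:
    """Order runs by the sum of their positions in the two rankings."""
    pos_abs = rank_positions(cae)
    pos_rel = rank_positions(cre)
    return sorted(cae, key=lambda run: pos_abs[run] + pos_rel[run])


# Made-up true values and runs, for illustration only.
truths = {1: 100, 2: 10}
runs = {
    "a": {1: 110, 2: 20},
    "b": {1: 150, 2: 11},
    "c": {1: 200, 2: 30},
}
errors = {name: cumulative_errors(p, truths) for name, p in runs.items()}
cae_scores = {name: e[0] for name, e in errors.items()}
cre_scores = {name: e[1] for name, e in errors.items()}
final_order = combined_ranking(cae_scores, cre_scores)
```

In this made-up example run "a" has the lowest Cumulative Absolute Error while run "b" has the lowest Cumulative Relative Error, which shows why the two rankings need not match: the relative measure penalises errors on low-traffic lines much more heavily.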
