IPTC is looking for software developers to design, develop, document and test EXTRA, an open source rules-based classification engine for news. First preference will be given to applications received by 21st October 2016, and review will continue until the positions are filled.
“Classification” means assigning one or more categories to the text of a news document. Rules-based classifiers use a set of Boolean rules, rather than machine-learning or statistical techniques, to determine which categories to apply.
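To make the idea of Boolean rules concrete, here is a minimal sketch of a rules-based classifier. It is not the EXTRA engine or its rule language; the category names, the rule helpers (`term`, `all_of`, `any_of`, `none_of`) and the sample rules are all assumptions made up for illustration.

```python
import re
from typing import Callable, Dict, List, Set

# A rule is any function that inspects a document's tokens and answers yes/no.
Rule = Callable[[Set[str]], bool]


def term(word: str) -> Rule:
    """True when the word occurs in the document."""
    return lambda tokens: word.lower() in tokens


def all_of(*rules: Rule) -> Rule:
    """Boolean AND over sub-rules."""
    return lambda tokens: all(rule(tokens) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """Boolean OR over sub-rules."""
    return lambda tokens: any(rule(tokens) for rule in rules)


def none_of(*rules: Rule) -> Rule:
    """Boolean NOT: true when no sub-rule matches."""
    return lambda tokens: not any(rule(tokens) for rule in rules)


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


# Hypothetical rules: category names and trigger words are invented.
RULES: Dict[str, Rule] = {
    "sport/football": all_of(
        any_of(term("football"), term("goalkeeper")),
        none_of(term("election")),
    ),
    "politics/election": any_of(term("election"), term("ballot")),
}


def classify(text: str, rules: Dict[str, Rule] = RULES) -> List[str]:
    """Return every category whose rule matches the text, sorted by name."""
    tokens = tokenize(text)
    return sorted(category for category, rule in rules.items() if rule(tokens))


print(classify("The goalkeeper saved a late penalty."))
```

Because each rule is an explicit Boolean expression, a linguist can read why a category was or was not assigned, which is the main contrast with a statistical classifier.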
EXTRA is the EXTraction Rules Apparatus, a multilingual open-source platform for rules-based classification of news content. IPTC was awarded a grant of €50,000 from the first round of Google’s Digital News Initiative Innovation Fund to build and freely distribute the initial version of EXTRA.
We are working with news providers to supply sets of news documents and with linguists to write rules to classify the documents. IPTC is looking for qualified developers to create the rules engine to accurately and efficiently categorize the documents using the rules.
Please consult this page for more information and to let us know if you’re interested in being considered.
The IPTC NewsCodes family of controlled vocabularies has a new member: Product Genre.
The Product Genre vocabulary was developed at the request of the broadcast industry. A broad category of terms was needed – one that specifies the kind of content by media product type – in addition to metadata that describes the content. The Product Genre scheme includes terms such as comedy, drama, entertainment, travel and sport.
NewsCodes are sets of concepts, also known as controlled vocabularies or taxonomies, created and maintained by the IPTC. They are assigned as metadata values to news objects like text, photographs, graphics, audio and video files and streams. This allows for a consistent coding of news metadata across news providers and over the course of time.
The Product Genre vocabulary was an idea initiated by Andy Read, IPTC delegate and BBC’s Service Development and Delivery Manager for News, who has worked with IPTC for more than 20 years. This was based on feedback from broadcast members that highlighted the value of the forum engagements in driving the progression of the data set.
“There was a need to extend the breadth of the controlled vocabularies,” said Read. “The new Product Genre vocabulary codes describe the type of program itself, and help to broaden the program to a wider audience and general TV/broadcast industry.”
NewsCodes vocabularies can be very specific. A broader category like Product Genre allows identification of an entire broadcast program or package – not just smaller segments. For example, a longer 60-minute program overview about Syria’s war can be coded according to Product Genre – supplemented by metadata specific to a minute-long clip about a possible chemical attack, in the context of the larger news program.
“The Product Genre needed to be added to help facilitate use of these codes with IPTC’s NewsML-G2 standards,” said Read.
The new Product Genre vocabulary is also beneficial on the business side, said Jennifer Parrucci, senior taxonomist for the New York Times.
“Advertising is often sold based on the type of program – not necessarily subject tags or more specific terms,” Parrucci said. “The Product Genre vocabulary identifies advertising opportunities at a more comprehensive level.”
The IPTC NewsCodes Working Group, chaired by Parrucci, collaborated to define the vocabulary terms, based on concrete examples and actual TV programs. For each Concept identifier and name, a definition is listed. The notes section gives an example of what that Concept describes, for clarity and accurate use.
Any NewsCode provided by the IPTC can be used at any stage of a news workflow, without any royalty fee. But if IPTC NewsCodes are incorporated into an application, the intellectual property and the copyright of the IPTC must be explicitly attributed.
Interesting stats and info about the International Press Telecommunications Council’s technical standards for exchange of news information:
1.) The International Press Telecommunications Council publishes 14+ technical standards that are intended for the business-to-business exchange of news among news agencies, other news providers and publishers.
2.) At least one IPTC standard is in use at virtually every newspaper and news web site in the world.
Publishers use IPTC standards to save money and improve the ability of their news products to be used by customers.
3.) IPTC standards for news exchange are available for downloading at no cost – and there are no royalties or fees.
The only source of income for IPTC is membership dues. Membership currently consists of more than 50 organizations and individuals worldwide.
4.) All IPTC standards are designed to be independent of any specific language.
Although our publications are written in English and meetings are conducted in English, every recent standard is usable by any written language that is supported by Unicode.
5.) More than 70 software applications support IPTC Standards.
Software developers seamlessly integrate IPTC standards into their products – often in subtle ways that are not obvious to customers.
It’s an Olympic year for IPTC’s SportsML 3.0 standard, the recently released update to the most comprehensive tech-industry XML format for sports data.
“We figured, why not use the latest technology available?” said Trond Husø, system developer for NTB, who worked on the standard’s update, released in July. “SportsML 3.0’s use of controlled vocabularies for sport competitions and other subjects now provides many benefits, including more flexibility. Storing results is also more convenient.”
SportsML 3.0 is the ideal structure and back-end solution used by many major news organizations because it is the only open global standard for scores, schedules, standings and statistics. “It saves the time and cost of developing an in-house structure,” said Husø, also a member of IPTC’s Sports Content Working Party.
The Rio Games, which will host about 10,500 athletes from 206 countries, for 17 days and 306 events, are revolutionary for big data and new approaches for managing it. For the first time, the International Olympic Committee (IOC) used cloud-based solutions for work processes including volunteer recruitment and accreditation.
And consider the experimental technologies and apps launched by key broadcasters and Olympic Broadcasting Services, the Olympic committee responsible for coordinating TV coverage of the Games: virtual reality footage, online streaming, automated reporting, drone cameras, and Super-High Vision, which is supposedly 16 times clearer than HD.
Billions of Olympic spectators worldwide have naturally come to expect real-time results and accurate scores to be delivered to them, with a side of historical perspective. All with little thought as to how the information reaches the public, be it via tickers on websites, graphic stats on TV screens, or factoids offered by commentators.
Schedules, competitors’ names, bio information, times, rankings, medalists – how does all of this data get served up so quickly and uniformly among networks and news services? And how does it get integrated into existing news systems, namely SportsML 3.0?
It starts with the IOC – the non-profit, non-governmental body that organizes the Olympic Games and Youth Olympic Games. It acts as a catalyst for collaboration for all parties involved, from athletes, organising committees, and IT, to broadcast partners and United Nations agencies. The IOC generates revenue for the Olympic Movement through several major marketing efforts, including the sale of broadcast rights.
The IOC produces the Olympic Data Feed (ODF), the repository of live data about past and current games. The IOC is responsible for communicating the official results, and it publishes them in the ODF format.
Paying media partners sign a licensing agreement to use ODF, to report on results through their own channels, and build new apps, services and analysis tools.
The goal of ODF is to define a unified set of messages valid for all sports and several different news systems – so that all partners are receiving the same data, at the same time. It was introduced for the Vancouver Games in 2010 and is an ongoing development effort.
According to the IOC’s website, ODF plays the part of messenger. From a technical standpoint, the data is machine-readable. ODF sends sports information from the moment it is generated to its final destination via Extensible Markup Language (XML). XML, a markup language for encoding structured documents and data, is a flexible means to electronically share structured data via the Internet, as well as via corporate networks.
IPTC’s SportsML 3.0 easily imports data from ODF. Using SportsML to structure ODF data offers a single, comprehensive approach to all sports and competitions worldwide. ODF has identifiers for the sports contested and the awards (gold, silver and bronze medals) given at the Olympic Games; sports outside of ODF are identified by SportsML vocabulary terms.
“SportsML 3.0 provides one structure for the data for developers to work in,” said Husø. “The structure will be the same, even if there are changes to ODF in future Olympic Games; the import and export process of the data will not change.”
Content providers that use SportsML (various versions) include NTB (Norway), AP Mobile (USA), BBC (UK), ESPN (USA), PA – Press Association (UK), Univision (USA, Mexico), Yahoo! Sports (USA), Austria Presse Agentur (APA) (Austria), and XML Team Solutions (Canada).
SportsML 3.0 is based on its parent standard, NewsML-G2, the backbone of many news systems, and a single format for exchanging text, images, video, audio news and event or sports data – and packages thereof. SportsML 3.0 is fully compatible with IPTC G2 structures.
Media Topics is an IPTC standard – a 1,100-term taxonomy with a focus on categorizing text. Released in 2010 as a development based on the IPTC Subject Codes, Media Topics is free to use and available in different formats. The terms can be viewed on the IPTC Controlled Vocabulary server, or in a user-friendly tree hierarchy tool.
IPTC creates and maintains taxonomies and controlled vocabularies – to assign terms as metadata values to news objects like text, photographs, graphics, audio and video files and streams. This allows for a consistent coding of news metadata across news providers, over the course of time.
“The idea of semantic mapping and being involved in a linked data initiative like Wikidata is a natural step for IPTC,” said Jennifer Parrucci, chair of the IPTC NewsCodes Working Group and senior taxonomist for The New York Times. “When linking an existing taxonomy to another, Wikidata serves as a central point of reference.”
Wikidata is a free, collaborative, multilingual knowledge base that can be read and edited by both humans and machines. It provides centralized storage of, and access to, structured data for all Wikimedia projects, as well as for use on external websites.
In total about 100 mappings from Media Topics to Wikidata have been manually applied. The mappings use SKOS mapping relationships.
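As a rough illustration of what a SKOS mapping relationship looks like, the sketch below emits one mapping as an N-Triples statement. The five mapping properties are those defined by SKOS; the helper name `mapping_triple` is hypothetical, and the specific Media Topic / Wikidata pairing shown is an illustrative assumption, not taken from the published mappings.

```python
# Namespaces: SKOS core, the IPTC Media Topics scheme, and Wikidata entities.
SKOS = "http://www.w3.org/2004/02/skos/core#"
MEDIATOPIC = "http://cv.iptc.org/newscodes/mediatopic/"
WIKIDATA = "http://www.wikidata.org/entity/"

# The mapping properties defined by SKOS for linking concepts across schemes.
MAPPING_PROPERTIES = {
    "exactMatch",
    "closeMatch",
    "broadMatch",
    "narrowMatch",
    "relatedMatch",
}


def mapping_triple(topic_id: str, relation: str, wikidata_id: str) -> str:
    """Build one N-Triples line linking a Media Topic to a Wikidata item."""
    if relation not in MAPPING_PROPERTIES:
        raise ValueError(f"not a SKOS mapping property: {relation}")
    return (
        f"<{MEDIATOPIC}{topic_id}> "
        f"<{SKOS}{relation}> "
        f"<{WIKIDATA}{wikidata_id}> ."
    )


# Illustrative pairing only; check the published mappings for real values.
print(mapping_triple("15000000", "exactMatch", "Q349"))
```

Choosing between `exactMatch`, `closeMatch` and the broader or narrower variants lets a mapping state how closely the two concepts correspond, rather than treating every link as an identity.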
Media Topics began with the Subject Codes vocabulary, reusing the same 17 top-level terms and extending the tree from 3 to 5 levels. The lower-level terms have been revised and rearranged. Each Media Topic provides a mapping back to one of the Subject Codes.
Mittmedia and Journalism++ Stockholm, two news organizations in Sweden, are successfully developing and incorporating APIs, automation tools and robots into workflows to enhance the capabilities of newsrooms, as reported at the International Press Telecommunications Council’s (IPTC) Spring Meeting 2016.
News organizations continue to experiment with bots as part of a frontier in automation journalism, as publishers draw on the benefits of the massive amounts of data available to newsrooms, including information about their own audiences. Despite some apprehension, the benefits of automating parts of the publishing process are many: aiding journalists in storytelling with the ability to sift through big data, refining workflows and reducing workloads, and more precise and faster content delivery to customers.
Mittmedia began their automation efforts in 2015 with a weather forecast text bot, which pulls data from the Swedish Meteorological and Hydrological Institute.
Set up initially as a testing tool based on a simple minimum viable product (MVP), it now delivers daily forecasts for 42 municipalities, soon to be 63.
Mittmedia’s next project was Rosalinda, a sports robot that transforms data into text for immediate publishing. Data is pulled from the API of the Swedish website Everysport, giving developers access to information on 90,000 teams and 1,500,000 matches. Rosalinda now reports all football, ice hockey and floorball matches played in Sweden, which filled a need in the market. United Media, owned by Mittmedia and two other companies, developed the tool.
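Turning match data into a publishable sentence can be approximated with simple templates. The sketch below is not how Rosalinda works and does not use the Everysport API; the `MatchResult` fields, the wording rules and the team names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatchResult:
    """Hypothetical record for one finished match."""
    home: str
    away: str
    home_goals: int
    away_goals: int
    sport: str


def describe(match: MatchResult) -> str:
    """Render a one-sentence report, choosing wording from the score."""
    if match.home_goals == match.away_goals:
        return (
            f"{match.home} and {match.away} drew "
            f"{match.home_goals}-{match.away_goals} in {match.sport}."
        )
    if match.home_goals > match.away_goals:
        winner, loser = match.home, match.away
        high, low = match.home_goals, match.away_goals
    else:
        winner, loser = match.away, match.home
        high, low = match.away_goals, match.home_goals
    # A one-goal margin gets a different verb than a clearer win.
    verb = "edged" if high - low == 1 else "beat"
    return f"{winner} {verb} {loser} {high}-{low} in {match.sport}."


print(describe(MatchResult("Team A", "Team B", 3, 1, "floorball")))
```

Even a template this small shows the trade-off: the output is immediate and consistent, but variety in the text has to be designed in rule by rule.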
Mittmedia has adopted a data-driven mindset and work process to gain a competitive edge over other local news sources. “We aim to deliver more content – faster, and provide it to the right person, at the right time and at the right place,” said Mikael Tjernström, Mittmedia API Editor.
Faster publication and more personal and relevant content were also among the reasons for Journalism++’s development of the automated news service Marple, which focuses on story finding and investigation, rather than text generation, according to Jens Finnäs, the organization’s founder.
One of four Swedish projects to receive funding from Google’s Digital News Initiative (DNI) this year, Marple is used for finding targeted local stories in public data. For example, Marple has analyzed monthly crime statistics and found a wave of bike thefts in Gothenburg and a record number of reported narcotics offences in Sollefteå.
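One simple way to surface a "record number" lead in monthly statistics is to check whether the latest month exceeds every earlier month. The sketch below illustrates that idea only; it is an assumption about one possible approach, not a description of Marple's method, and the place names and counts are invented.

```python
from typing import Dict, List, Tuple


def record_highs(
    series: Dict[str, List[int]], min_history: int = 12
) -> List[Tuple[str, int, int]]:
    """Return (place, latest, previous_max) where the latest month is a record.

    Each list holds monthly counts in chronological order. Places with fewer
    than min_history earlier months are skipped to avoid trivial "records".
    """
    leads = []
    for place, counts in series.items():
        if len(counts) <= min_history:
            continue
        *history, latest = counts
        previous_max = max(history)
        if latest > previous_max:
            leads.append((place, latest, previous_max))
    return sorted(leads)


# Invented monthly counts for hypothetical municipalities.
monthly_reports = {
    "Town A": [4, 5, 3, 9],
    "Town B": [10, 12, 11, 9],
    "Town C": [1, 2],
}
print(record_highs(monthly_reports, min_history=3))
```

A flagged place is only a lead for a journalist to investigate; small counts, seasonal patterns and reporting changes can all produce a record that is not a story.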
“Open data has been a highly underutilized resource in journalism. We are hoping to change that,” Finnäs said. “We don’t think the robots will replace journalists, but we are positive that automation can make journalism smarter and more efficient, and that there are thousands of untold stories to be found.”
The grant from Google’s DNI gives Journalism++ a unique opportunity to test Marple and possibly turn it into a commercially viable product, Finnäs said.
Jens Finnäs: firstname.lastname@example.org Twitter @jensfinnas
Mikael Tjernström: Twitter @micketjernstrom
Photo by CC/FLICKR/Peyri_Herrera.
Join us for IPTC’s Autumn Meeting 2016 in Berlin! Anyone interested in IPTC’s work can attend our face-to-face meetings, held three times a year, or take part in regular conference call sessions as a guest. Our meetings are the perfect opportunity to network with industry peers, learn about emerging industry topics from leading professionals, and simplify product development with technical standards.
The venue for the Autumn 2016 Meeting is dpa Headquarter Berlin, Markgrafenstraße 20, 10969 Berlin. Please contact us for hotel accommodations and conference registration information.
The agenda will include Video Day on 25 October: IPTC will release and introduce its new Video Metadata Hub Recommendation. Speakers from video makers, video suppliers, video publishers and system vendors will discuss how video workflows can be improved.
Additionally, the IPTC membership will hold its Annual General Meeting. Locations for IPTC’s three face-to-face meetings per year are rotated worldwide, with at least one meeting held in Europe annually.
Interested in attending? Contact Us, please.
For the new version 2.23 of NewsML-G2, the specification part of the Annual Release is now available and can be downloaded from the NewsML-G2 Release Section of the IPTC Developer Site.
The NewsML-G2 standard provides state-of-the-art XML format metadata to combine rich functionality, ease of use, compactness and compatibility with the Semantic Web. It is a single format for exchanging text, images, video, audio news and event or sports data – and packages thereof.
This specification part of the Annual Release of NewsML-G2 v2.23 includes the XML Schemas and the Structure Matrix document. The updated Quick Start Guides, Implementation Guidelines and full specifications will be released in October. This is part of an ongoing incremental development of NewsML-G2, as providers expand their content use-cases.
Three changes/improvements in version 2.23 are:
- It allows the addition of further Rights Expression properties <rightsInfo> and now covers these Rights cases:
– Allows embedding or referencing a rights expression
– Allows use of XML or JSON as format for embedding
- It allows address properties (locality, area, country, etc.) to include a World Region.
- It allows the addition of facets to the Item Class property, which provides more flexibility.
Please visit the IPTC Standards Page for a full list of the available IPTC standards.
The International Press Telecommunications Council (IPTC) released SportsML 3.0, the recently approved comprehensive update of the open and highly flexible standard for the interchange of sports data.
Developed by the Sports Content Working Party of IPTC, which includes organisations from eight different countries, SportsML 3.0 is designed to be easy to understand and implement, and covers the full gamut of sports events. Sports Markup Language is the tech-industry standard XML vocabulary for sports scores, lineups, schedules, standings and statistics.
SportsML has been adopted by many international news organizations, including the BBC (UK), NTB (Norway), TT (Sweden), APA (Austria), AP (USA), and more. It has been applied to results from the Olympics, European football competitions, as well as the major North American sports leagues, for team, individual and head-to-head sports.
“We’ve had 12 years of input from sports experts at news organizations since SportsML 1.0,” said Paul Kelly, Chairman of the Sports Content Working Party and Director of Software Development at XML Team Solutions, a sports-focused agency. “SportsML 3.0 addresses the requirements of anyone handling sports results and statistics and will save the time and cost of developing an in-house format. Companies can also defend against vendor lock-in caused by adopting proprietary formats.”
Highlights of the new SportsML version include the public release of 113 sports-related controlled vocabularies (CVs). “The most important thing we did was design SportsML to play well with the current generation of semantic technologies,” says Kelly. These CVs cover the statistical properties, player positions, on-field actions, infractions, etc., of 11 sports plus 37 CVs that cover all sports. These will be available publicly as a package.
These terms can be combined with SportsML 3’s new generic stat structure to incorporate both IPTC and external properties, such as those published by the IOC or any other vendor. “You can easily add new properties and continue to process gracefully using the powerful vocabulary-management the IPTC has devised,” says Kelly. “That’s usually missing from even the most prominent sports formats.”
The specification and documentation can be downloaded from https://iptc.org/standards/sportsml-g2/. Additionally, the IPTC Developer Site provides technical information about SportsML, and the SportsML Users Forum is used to share experiences and raise questions, and also connects companies, organizations and vendors.