Guidelines for the 2011 TREC Medical Records Track
The goal of the Medical Records track is to foster research on providing content-based access to the free-text fields of electronic medical records. In this initial year, the track will focus on a task that models the real-world problem of finding a population over which comparative effectiveness studies can be done. The task is described in detail below.
The test document collection for the Medical Records track is a set of de-identified medical records made available for research use through the University of Pittsburgh BLULab NLP Repository. Participants must obtain the data set directly from the University of Pittsburgh after first getting a "Letter of Participation" from NIST. Look here for details about how to obtain the collection.
Each report in the dataset has a report ID, called the "reports_checksum". Most reports are associated with a "visit" identified by a "visitreports_visitid". (A small percentage of the reports have no associated visit because the data linking the record to a visit has been lost. These reports have a NULL visitreports_visitid.) The repository contains a simple ASCII table called the Report-to-Visit Mapping Key that specifies which reports belong to the same visit. A visit may contain one or more reports: in the UPMC data set, the number of reports per visit varies between 1 and 415, with a median of 3 reports per visit and very few visits with more than 100 reports.
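To make the report-to-visit mapping concrete, here is a minimal sketch (in Python) of loading the Report-to-Visit Mapping Key and grouping report IDs by visit. The delimiter, column names, and file name used below are assumptions for illustration only; adjust them to match the file actually distributed with the repository.

import csv
from collections import defaultdict

def load_visit_mapping(path):
    # Group reports_checksum values by visitreports_visitid, skipping reports
    # whose visit id is NULL (those reports are not part of any visit).
    # Assumes a tab-delimited file with a header row naming these two columns.
    visits = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            visit_id = row["visitreports_visitid"]
            if visit_id and visit_id.upper() != "NULL":
                visits[visit_id].append(row["reports_checksum"])
    return visits

visits = load_visit_mapping("report_to_visit_mapping.txt")   # hypothetical file name
sizes = sorted(len(reports) for reports in visits.values())
print("visits:", len(visits), "reports per visit: min", sizes[0],
      "median", sizes[len(sizes) // 2], "max", sizes[-1])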
The Medical Records track will use the *visit* as the response unit. That is, your retrieval system must return visitreports_visitid's, and relevance judgments will be based on the visit as a whole. Note that the use of visits as the retrieval unit means those reports that are not associated with any visit are effectively removed from the collection.
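Because the response unit is the visit, a system that indexes and scores individual reports needs some way to fold report-level scores into a single visit-level ranking. The track does not prescribe how to do this; the sketch below merely illustrates one simple possibility (taking the maximum report score per visit) and assumes hypothetical score and mapping dictionaries built elsewhere.

def reports_to_visit_ranking(report_scores, report_to_visit, top_k=1000):
    # report_scores:   reports_checksum -> retrieval score for one topic
    # report_to_visit: reports_checksum -> visitreports_visitid (reports with a
    #                  NULL visit id are simply absent from this mapping)
    # Max-score fusion is used purely as an example of one aggregation choice.
    visit_scores = {}
    for report_id, score in report_scores.items():
        visit_id = report_to_visit.get(report_id)
        if visit_id is None:
            continue  # report not associated with any visit
        if visit_id not in visit_scores or score > visit_scores[visit_id]:
            visit_scores[visit_id] = score
    ranked = sorted(visit_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]  # at most 1000 visits may be submitted per topic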
The retrieval task for the track is an ad hoc search task as might be used to identify cohorts for comparative effectiveness research. Topics will specify a particular disease/condition set and a particular treatment/intervention set and your system should return a list of visits ranked by decreasing likelihood that the visit satisfies the specification. For example, a topic might be "find patients with gastroesophageal reflux disease who had an upper endoscopy".
Topics will be developed by physicians who are also students at Oregon Health &amp; Science University. Students from the same program will also do the relevance judging, and we will call both groups "assessors". Assessors will devise topics using a list of priority areas for comparative effectiveness research issued by the US Institute of Medicine of the National Academies as inspiration.
Topic development assessors have been instructed that we desire topics that exploit information from the text fields --- in other words, topics that are not answerable solely by the diagnostic codes contained in the records. However, this does not rule out the possibility that the diagnostic codes contribute to a visit being a match. We will not intentionally create topics for which the diagnostic codes are "gotchas", but as always in TREC, relevance will be in the eyes of the assessor.
NIST and OHSU have produced four sample topics with a few corresponding relevance judgments. *The primary purpose of these example topics is to illustrate the syntactic format of the test topics.* They will also be suggestive of the type of language use that might be expected in the test topics. They are explicitly not guaranteed to be representative of anything else.
The test set of 35 topics will be posted to the Tracks web page on June 15. Results of running your system on those topics (a "run") will be due August 9.
Your runs may be created completely automatically or with some level of manual intervention. Automatic methods are those in which there is no human intervention at any stage---the system takes the topic statement as input and produces a ranked list of visit ids as output with no human in the loop. Manual methods are everything else. The manual category encompasses a wide variety of approaches. There are intentionally few restrictions on what is permitted so as to accommodate as many experiments as possible. In general, the ranking submitted for a topic is expected to reflect a ranking that your system could actually produce --- the result of a single query in your system (granting that the query might be quite complex and the end result of many iterations of query refinement) or the automatic fusion of different queries' results. However, it is permissible to submit a ranking produced in some other way, provided the ranking supports some specific hypothesis that is being tested and the conference paper gives explicit details regarding how the ranking was constructed.
You may not change your system once you have looked at the test set of topics. This precludes any possibility of tweaking the system to benefit test topics. Working on your system after the test topics have been posted but before you fetch them is fine. TREC purposely allows a long time-window between topic release and run submission to accommodate as many participants' schedules as possible and to allow time for manual runs.
Submitting runs
Runs are submitted through an automatic run submission system hosted at NIST. This submission system will perform sanity-checking on the submission file and reject any runs that do not pass the checks. Runs that have been rejected are not counted as submitted runs. NIST will not accept emailed submissions; in particular, runs that are emailed because they do not pass the sanity checking in the submission system will simply be discarded. The script that is used to do the sanity checking will be made available to participants once the submission system is open. You are very strongly encouraged to check for errors yourself prior to submitting a run.
When you submit your run, you will be asked to specify the run's features on the submission form. These features will include at least whether the run is a manual or automatic submission; the judging priority of the run (see below); and a short textual description of the run. Other features may be added and will be announced on the mailing list.
In TREC tradition, a deadline of August 9 officially means runs must be submitted by 11:59pm EDT on August 9. In practice, it means runs must be submitted before NIST personnel disable the submission system on the morning of Aug 10; this generally means the effective submission deadline is about 8:00am EDT on August 10.
Format of a Submission
The Medical Records track will use the standard TREC submission format for ad hoc runs. A submission consists of a single file that contains retrieval results for all test topics. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.
20 Q0 E6t97jn7a1sA 1 4238 prise1
20 Q0 78CrJwsWXvYq 2 4223 prise1
20 Q0 NoXWN9vdBXTO 3 4207 prise1
20 Q0 yZZStV5RDJPP 4 4194 prise1
20 Q0 TwI3ghHE0JEk 5 4189 prise1
etc.

where:
* the first column is the topic id;
* the second column is the literal 'Q0';
* the third column is an official visitreports_visitid;
* the fourth column is the rank at which the visit is retrieved, and the fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order and is important to include so that we can handle tied scores (for a given run) in a uniform fashion (the evaluation routines rank documents from these scores, not from your ranks). If you want the precise ranking you submit to be evaluated, the SCORES must reflect that ranking;
* the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run must have a different tag and that tag should identify the group and the method that produced the run. Run tags must contain 12 or fewer characters and may not contain whitespace or a colon (:).

Each topic must have at least one visit retrieved for it and no more than 1000. Provided you have at least one visit, you may return fewer than 1000 visits for a topic, though note that the standard ad hoc retrieval evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 1000 visits per topic.
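Because rejected runs do not count as submitted and emailed runs are not accepted, it is worth pre-checking a run file locally before submitting. The sketch below is not the NIST checking script; it simply mirrors the constraints stated above (six whitespace-separated columns, the literal 'Q0', run tags of at most 12 characters with no whitespace or colon, non-increasing scores within a topic, and at most 1000 visits per topic).

import re
import sys
from collections import defaultdict

MAX_PER_TOPIC = 1000
TAG_RE = re.compile(r"^[^\s:]{1,12}$")   # <= 12 chars, no whitespace or colon

def check_run(path):
    errors = []
    per_topic = defaultdict(int)
    last_score = {}
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            cols = line.split()
            if len(cols) != 6:
                errors.append(f"line {lineno}: expected 6 columns, got {len(cols)}")
                continue
            topic, q0, visit_id, rank, score, tag = cols
            if q0 != "Q0":
                errors.append(f"line {lineno}: second column must be the literal 'Q0'")
            if not TAG_RE.match(tag):
                errors.append(f"line {lineno}: bad run tag {tag!r}")
            try:
                score = float(score)
            except ValueError:
                errors.append(f"line {lineno}: score {score!r} is not a number")
                continue
            # Assumes all lines for a topic are contiguous and listed in rank order.
            if topic in last_score and score > last_score[topic]:
                errors.append(f"line {lineno}: scores for topic {topic} are not non-increasing")
            last_score[topic] = score
            per_topic[topic] += 1
    for topic, n in per_topic.items():
        if n > MAX_PER_TOPIC:
            errors.append(f"topic {topic}: {n} visits retrieved (max {MAX_PER_TOPIC})")
    return errors

if __name__ == "__main__":
    for problem in check_run(sys.argv[1]):
        print(problem)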
Judging
Groups may submit up to four runs to the track. Judging will be done on pools created from a subset of the runs. The number of visits per topic per run that are added to the pool ("pool depth") will be determined after submissions are complete such that the final pool sizes are within the bounds that assessors can handle. We are targeting a pool size of roughly 500 visits per topic. Assessors will have access to all of the reports that constitute a visit at the time of judgment. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run(s) to assess arbitrarily. Judgments will be binary: the visit either satisfies the query's specification or it does not, in the opinion of the relevance assessor.
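For reference, depth pooling simply takes the union, per topic, of the top-ranked visits from each run selected for judging; the depth itself will be chosen by NIST only after submissions are in. A minimal sketch of the idea, using hypothetical data structures:

def build_pools(selected_runs, depth):
    # selected_runs: iterable of runs, each a dict topic_id -> ranked list of visit ids
    # Returns topic_id -> set of visit ids to be shown to the assessors.
    pools = {}
    for run in selected_runs:
        for topic, ranked_visits in run.items():
            pools.setdefault(topic, set()).update(ranked_visits[:depth])
    return pools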
Scoring
NIST will score all submitted runs using the relevance judgments produced by the assessors. The primary measure for the track will be mean average precision, though all of the various trec_eval measures will be reported.
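For local sanity checks before the official results are released, mean average precision can be computed directly from a run and a set of binary judgments. The sketch below follows the standard definition; trec_eval remains the authoritative implementation and may differ in edge cases (for example, topics with no relevant visits).

def average_precision(ranked_visits, relevant):
    # ranked_visits: list of visit ids in submitted order for one topic
    # relevant:      set of visit ids judged relevant for that topic
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, visit_id in enumerate(ranked_visits, 1):
        if visit_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(run, qrels):
    # run:   topic_id -> ranked list of visit ids
    # qrels: topic_id -> set of relevant visit ids
    scores = [average_precision(run.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0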
Timetable
Documents available: now
Sample topics available: June 6, 2011
Test set of topics available: June 15, 2011
Results due at NIST: Aug 9, 2011
Qrels for topics available: target of Oct 1, 2011
Conference notebook papers due: late October, 2011
TREC 2011 conference: November 15--18, 2011
Last updated: Tuesday, 07-Jun-2011 09:48:02 EDT
Date created: Monday, May 23, 2011
For further information contact Ellen Voorhees.