You said:
COMP3009J Information Retrieval Semester 2 2015_16.pdf
PDF
Translate this exam paper into Chinese, translating only the questions.
a) Explain what stopword removal is. Under what circumstances might it harm retrieval performance?
[6 marks]
b) Describe what an information need is and how it differs from a query submitted to an information retrieval system.
[6 marks]
c) Briefly explain why database overlap is important in the field of fusion.
[6 marks]
d) Explain what adversarial information retrieval is. Give two examples in the context of web search.
[6 marks]
e) Most information retrieval systems typically have three key stages. Briefly describe these three stages.
[6 marks]
[Total: 30 marks]
a) The Boolean model uses the query operators AND, OR, and NOT. Explain how these operators work and how they affect the number of documents returned by an information retrieval system. Also show how these operators can be implemented using set-theoretic operations.
[6 marks]
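As an illustration of the set-theoretic view asked about in a), here is a minimal sketch in Python. The document IDs and the two terms are hypothetical and are not taken from the exam paper. Each term maps to the set of documents that contain it, so AND becomes intersection, OR becomes union, and NOT becomes the complement with respect to the whole collection.

```python
# Hypothetical collection: five document IDs and the postings for two terms.
all_docs = {1, 2, 3, 4, 5}
postings = {
    "lion": {1, 3, 5},   # documents containing "lion"
    "pride": {1, 2, 3},  # documents containing "pride"
}


def boolean_and(a, b):
    """AND: documents that contain both terms (set intersection)."""
    return a & b


def boolean_or(a, b):
    """OR: documents that contain at least one term (set union)."""
    return a | b


def boolean_not(a, universe):
    """NOT: documents that do not contain the term (set complement)."""
    return universe - a


and_result = boolean_and(postings["lion"], postings["pride"])
or_result = boolean_or(postings["lion"], postings["pride"])
not_result = boolean_not(postings["lion"], all_docs)
```

In this sketch AND returns fewer documents than either term alone, OR returns at least as many as either term alone, and NOT returns everything the term does not match, which is why a bare NOT tends to produce very large result sets.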
b) TF-IDF is a method for calculating term weights in an information retrieval system.
(i) Why are term weights important in an information retrieval system?
(ii) When using TF-IDF, what causes a term to receive a high weight?
[5 marks]
c) Below is a small collection of three documents. Answer the following questions, showing your calculations for each.
Stop words: a, but, he, is, of, the
Document 1: A crowd of lions is called a pride
Document 2: The crowd of injured people called the hospital
Document 3: He suffered nothing but injured pride
(i) Calculate a vector for each document using the TF-IDF weighting scheme. You should use the stopword list provided, but you are not required to perform stemming.
(ii) Calculate the vector for the query "injured people hurt".
(iii) Calculate the cosine similarity between the query vector and each document vector, and show the ranked list of documents for this query.
[15 marks]
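The calculation in c) can be sketched in Python as follows. The exam text does not say which TF-IDF variant to use, so this sketch assumes raw term counts for TF and log2(N/df) for IDF; other variants (for example, TF normalised by the most frequent term, or a different log base) change the numbers. It also assumes that the query term "hurt", which does not occur in the collection, is simply ignored. All function and variable names are illustrative.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "but", "he", "is", "of", "the"}
DOCS = {
    "Document 1": "A crowd of lions is called a pride",
    "Document 2": "The crowd of injured people called the hospital",
    "Document 3": "He suffered nothing but injured pride",
}
QUERY = "injured people hurt"


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


doc_counts = {name: Counter(tokenize(text)) for name, text in DOCS.items()}
vocabulary = sorted(set().union(*doc_counts.values()))

# Assumption: idf(t) = log2(N / df(t)), where df(t) is the number of
# documents containing term t.
n_docs = len(DOCS)
idf = {
    t: math.log2(n_docs / sum(1 for c in doc_counts.values() if t in c))
    for t in vocabulary
}


def tfidf_vector(counts):
    """Assumption: weight = raw term count * idf. Terms outside the
    vocabulary (such as "hurt" in the query) get no dimension."""
    return [counts[t] * idf[t] for t in vocabulary]


def cosine(u, v):
    """Cosine similarity; defined as 0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


doc_vectors = {name: tfidf_vector(c) for name, c in doc_counts.items()}
query_vector = tfidf_vector(Counter(tokenize(QUERY)))
scores = {name: cosine(query_vector, vec) for name, vec in doc_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions the ranking is Document 2, Document 3, Document 1, with cosine similarities of roughly 0.687, 0.085, and 0. Document 1 scores 0 because it shares no non-stopword term with the query.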
d) The probabilistic model of information retrieval uses two probabilities associated with each query term. One is the probability that a relevant document contains the term; the other is the probability that a non-relevant document contains the term. However, these probabilities cannot be calculated directly and must be estimated.
(i) Briefly describe how initial values for these probabilities can be generated.
[3 marks]
(ii) Explain how these initial estimates can be improved through user feedback.
[6 marks]
[Total: 35 marks]
a) Using an example with at least 4 documents and at least 6 links, show how PageRank scores are calculated. Use a damping factor of d = 0.8 and perform at least 3 iterations.
[12 marks]
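One way to check a hand calculation for a) is the short Python sketch below. The graph (four pages A to D with six links) is a hypothetical example invented for illustration. The sketch assumes the normalised form of the formula, PR(p) = (1 - d)/N + d * sum(PR(q)/L(q)) over pages q that link to p, with every score starting at 1/N; some course notes use (1 - d) instead of (1 - d)/N, which gives different numbers. It also assumes every page has at least one outgoing link.

```python
# Hypothetical link graph: page -> pages it links to (6 links in total).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
}


def pagerank(links, d=0.8, iterations=3):
    """Return the list of score dictionaries, one per iteration,
    starting with the uniform initial scores."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    history = [ranks]
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Each linking page q passes on its score split evenly
            # across its outgoing links.
            incoming = sum(
                ranks[q] / len(out) for q, out in links.items() if page in out
            )
            new_ranks[page] = (1 - d) / n + d * incoming
        ranks = new_ranks
        history.append(ranks)
    return history


history = pagerank(links)
final = history[-1]
```

For this particular graph the scores after three iterations are A = 0.334, B = 0.222, C = 0.394, and D = 0.05. Page D has no incoming links, so it keeps only the (1 - d)/N share.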
b) Below is a set of results returned by a search engine in response to a query.
Retrieved documents = d0, d7, d19, d1, d12, d18, d6, d16, d10, d9, d8, d13
Below are the relevance judgments for the same query:
Relevant = {d0, d1, d12, d16, d17}
Non-relevant = {d6, d8, d9, d10, d18, d19}
For the above query, calculate the MAP and bpref scores.
[11 marks]
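The two measures in b) can be sketched in Python using the data from the question. Some assumptions are built in: with a single query, MAP equals the average precision of that query; unjudged documents (d7 and d13) count as non-relevant for average precision but are skipped entirely by bpref; and bpref uses min(R, N) as the denominator, where R and N are the numbers of judged relevant and judged non-relevant documents. Course notes may define bpref with R as the denominator instead; for this data both give the same result.

```python
retrieved = ["d0", "d7", "d19", "d1", "d12", "d18",
             "d6", "d16", "d10", "d9", "d8", "d13"]
relevant = {"d0", "d1", "d12", "d16", "d17"}
nonrelevant = {"d6", "d8", "d9", "d10", "d18", "d19"}


def average_precision(retrieved, relevant):
    """Mean of precision at each rank where a relevant document appears,
    divided by the total number of relevant documents, so relevant
    documents that were never retrieved contribute 0."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def bpref(retrieved, relevant, nonrelevant):
    """For each retrieved relevant document, penalise by the fraction of
    judged non-relevant documents ranked above it. Unjudged documents
    are ignored."""
    r = len(relevant)
    denom = min(r, len(nonrelevant))
    nonrel_seen = 0
    total = 0.0
    for doc in retrieved:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            penalty = min(nonrel_seen, denom) / denom if denom else 0.0
            total += 1 - penalty
    return total / r


map_score = average_precision(retrieved, relevant)
bpref_score = bpref(retrieved, relevant, nonrelevant)
```

Under these assumptions the relevant documents appear at ranks 1, 4, 5, and 8, giving MAP = (1/1 + 2/4 + 3/5 + 4/8) / 5 = 0.52, and bpref = (1 + 0.8 + 0.8 + 0.4 + 0) / 5 = 0.6, where the final 0 is for d17, which was not retrieved.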
c) Evaluation in information retrieval has traditionally been based on the Cranfield model.
(i) Describe in detail how evaluation is carried out using this model and how precision and recall are calculated.
[6 marks]
(ii) Precision and recall are generally not used to evaluate modern information retrieval systems. What characteristics of modern systems make these metrics unsuitable? How do other evaluation metrics address these problems?
[6 marks]
[Total: 35 marks]
a) Fusion algorithms may make use of three effects.
(i) Briefly describe these three effects.
[6 marks]
(ii) Describe how the ProbFuse algorithm makes use of these effects.
[3 marks]
b) Using an example, explain how interleaving works as a fusion technique. Identify one modification that might improve the performance of this algorithm.
[9 marks]
c) The table below shows the results returned by three search engines for the same query. Each result set consists of unique document identifiers (DocIDs) and the scores used for ranking. Complete the following tasks, showing your calculations for each.
(i) Normalise the scores in each ranked list.
[6 marks]
(ii) Fuse the results using the CombSUM algorithm.
[4 marks]
(iii) Fuse the results using the CombMNZ algorithm.
[4 marks]
(iv) Explain why normalisation is necessary for score-based fusion.
[3 marks]
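The table of scores from the exam paper is not reproduced in this conversation, so the Python sketch below uses hypothetical result lists purely to illustrate the steps in c). It assumes min-max normalisation (the top score in each list becomes 1 and the bottom score becomes 0), CombSUM as the sum of a document's normalised scores, and CombMNZ as the CombSUM score multiplied by the number of lists that returned the document.

```python
# Hypothetical result lists (DocID -> raw score); NOT the exam's table.
engine_results = {
    "Engine A": {"d1": 10, "d2": 6, "d3": 2},
    "Engine B": {"d2": 9, "d4": 5, "d1": 1},
    "Engine C": {"d3": 50, "d2": 40, "d5": 30},
}


def min_max_normalise(scores):
    """Map raw scores onto [0, 1] so lists on different scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


normalised = {name: min_max_normalise(s) for name, s in engine_results.items()}

comb_sum = {}    # DocID -> sum of normalised scores
appearances = {}  # DocID -> number of lists that returned it
for scores in normalised.values():
    for doc, score in scores.items():
        comb_sum[doc] = comb_sum.get(doc, 0.0) + score
        appearances[doc] = appearances.get(doc, 0) + 1

comb_mnz = {doc: comb_sum[doc] * appearances[doc] for doc in comb_sum}

comb_sum_ranking = sorted(comb_sum, key=comb_sum.get, reverse=True)
comb_mnz_ranking = sorted(comb_mnz, key=comb_mnz.get, reverse=True)
```

Without the normalisation step, Engine C's raw scores (30 to 50) would swamp those of the other two engines in the sum, which is the concern behind part (iv). With this hypothetical data, d2 is ranked first by both methods because all three engines return it.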
[Total: 35 marks]
Stopword removal is a preprocessing technique in information retrieval systems that removes common words (such as "of", "and", and "is") which, in most cases, do little to distinguish one document from another. Because these words are considered to contribute little to a document's main content and topic, they are usually removed when the document index is built.
However, in some situations stopword removal can harm retrieval performance. For example:
An information need is the user's requirement for specific information, that is, the knowledge or data the user wants to obtain from an information system. It usually arises from a problem or uncertainty the user faces in a particular situation.
A query is the specific words or phrases the user enters into the information retrieval system; it is the expression the user employs to try to satisfy the information need. A query is usually a simplified statement of that need. The difference between the two is that the information need exists in the user's mind, whereas the query is the text entered into the retrieval system in order to satisfy it. Because information needs are complex and language is limited in what it can express, a query may not fully reflect what the user actually needs.
Database overlap refers to the fact that, when results from multiple databases or search engines are fused, the same or similar documents may be present in more than one database. This is very important for fusion techniques because:
Adversarial information retrieval refers to techniques or behaviour that work against an information retrieval system, usually in order to manipulate or mislead the system for illegitimate gain. In the context of web search, this behaviour typically takes the form of: